
A Comprehensive Analysis of Superpage Management Mechanisms and Policies

Weixi Zhu, Alan L. Cox, and Scott Rixner, Rice University
https://www.usenix.org/conference/atc20/presentation/zhu-weixi

This paper is included in the Proceedings of the 2020 USENIX Annual Technical Conference.
July 15–17, 2020
ISBN 978-1-939133-14-4

Open access to the Proceedings of the 2020 USENIX Annual Technical Conference is sponsored by USENIX.
A Comprehensive Analysis of Superpage Management Mechanisms and Policies

Weixi Zhu, Alan L. Cox and Scott Rixner


Rice University
{wxzhu, alc, rixner}@rice.edu

Abstract

Superpages (2MB pages) can reduce the address translation overhead for large-memory workloads in modern computer systems. This paper clearly outlines the sequence of events in the life of a superpage and explores the design space of when and how to trigger and respond to those events. This provides a framework that enables better understanding of superpage management and the trade-offs involved in different design decisions. Under this framework, this paper discusses why state-of-the-art designs exhibit different performance characteristics in terms of runtime, latency and memory consumption. This paper illuminates the root causes of latency spikes and memory bloat and introduces Quicksilver, a novel superpage management design that addresses these issues while maintaining address translation performance.

1 Introduction

The physical memory size of modern computers continues to grow at a rapid pace. Furthermore, there is an ever expanding class of "large-memory" data-oriented applications — including databases, data analysis tools, and scientific computations — that can productively utilize all of this memory. While some of these applications expect the entirety of their data to reside within physical memory, others process data at a scale that far exceeds its size. These others either use out-of-core computation frameworks or implement schemes for caching data from secondary storage that avoid swapping by the virtual memory system. In either case, these applications have large memory footprints, so the cost of virtual-to-physical address translation significantly impacts their performance.

The use of superpages, or "huge pages", can reduce the cost of virtual-to-physical address translation. For example, the x86-64 architecture supports 2MB superpages. Using these superpages (1) eliminates one level from the hierarchical page table, thereby reducing the expected number of memory accesses to resolve a TLB miss, and (2) enables more efficient use of the TLB's limited number of entries. Intel's recent processors can store up to 1536 4KB or 2MB page mappings in their TLBs. Superpages can therefore increase these TLBs' coverage from around 6MB (0.009% of the physical memory in a computer with 64GB of DRAM) to 3GB. While this is still a small fraction of the computer's physical memory, it is far more likely to capture an application's short-term working set. The benefits of this increased coverage are obvious. The challenge, however, is for the operating system (OS) to transparently manage superpages in an effective manner.

This paper first defines the five distinct events in the life cycle of a transparently managed superpage, and then it analyzes the various state-of-the-art approaches to handling each event. Briefly, the events are as follows. First, a physical superpage must be allocated. Throughout this paper, unless stated otherwise, "superpage" refers to a 2MB page, so this is the act of acquiring a contiguous, aligned 2MB region from the physical memory allocator. Second, the physical superpage must be prepared. For anonymous memory, the entire 2MB region must be zeroed. For file-backed memory, the entire 2MB region must be read from secondary storage. Third, a superpage mapping from a 2MB-aligned virtual memory region to the physical superpage must be created. Fourth, the mapping must be destroyed. Finally, the physical memory must be deallocated. FreeBSD, Linux's Transparent Huge Pages (THP), and two recently proposed systems (Ingens [20] and HawkEye [24]) differ in when these events are triggered (for instance, these events can be independent, grouped, synchronous, asynchronous, etc.) and the granularity of the operations (for instance, some operations can be performed incrementally or all at once). This classification of the events enables a more principled comparison of the policies, behaviors, and performance of these different systems.

This paper also presents several new observations about transparent superpage management. First, coupling physical allocation, preparation, and mapping of superpages, as is done in Linux's THP, leads to memory bloat and fewer superpage mappings. Second, while alleviating tail latency problems in server workloads, state-of-the-art asynchronous, "out-of-place" promotion delays physical superpage allocation and reduces address translation benefits. Third, speculatively allocating physical superpages enables "in-place" promotion and obviates the need for asynchronous, out-of-place promotion. Fourth, in combination, reserving physical superpages and delaying partial deallocation of those superpages as long as possible fights fragmentation, leading to more superpage usage and address translation benefits. Finally, bulk zeroing is more efficient on modern processors than repeated 4KB zeroing. These observations are supported by evidence presented throughout the paper.

Finally, this paper introduces Quicksilver (https://github.com/rice-systems/quicksilver), an innovative transparent superpage management system that is based on FreeBSD's reservation-based physical memory allocator. Quicksilver achieves the benefits of aggressive superpage allocation, but mitigates the memory bloat and fragmentation issues that arise from underutilized superpages. Quicksilver is able to match or beat the performance of existing systems in scenarios with either lightly or heavily fragmented memory. For example, when using synchronous preparation, on a heavily fragmented system it achieves a 2x speedup over Linux for GraphChi performing PageRank on a dataset that exceeds the physical memory size. Furthermore, on Redis, Quicksilver is able to maintain the same throughput and tail latency as fragmentation increases, whereas the throughput of other systems degrades and tail latency increases. Finally, Quicksilver is able to limit memory bloat as well as Ingens [20], which is a recent research prototype specifically designed to combat memory bloat.
2 Transparent Superpage Management

Managing superpages transparently to the application involves five distinct events: physical superpage allocation, physical superpage preparation, superpage mapping creation, superpage mapping destruction, and physical superpage deallocation. Figure 1 illustrates the life cycle of a superpage in terms of these events. This section discusses the trade-offs between the possible choices, including those made by production and prototype systems [5, 16, 20, 23, 24], for when to trigger and how to handle these events.

2.1 Physical Superpage Allocation

The OS can choose to allocate a physical superpage to back any 2MB-aligned virtual memory region. A physical superpage could be allocated synchronously upon a page fault, or asynchronously via a background task. If there are free physical superpages, synchronous allocation is a relatively inexpensive operation given the widespread use of buddy allocators for physical memory management.

However, in order to allocate a physical superpage, the physical memory allocator must have an available, aligned, 2MB region. Under severe memory fragmentation, such regions may not be available. A memory manager could attempt to keep as many such regions available as possible (or create them when needed) using smart allocation policies or memory migration. If no such region is available or can be created, then the system must fall back to allocating 4KB pages.

Even after 4KB pages have been allocated for a virtual memory region, it is still possible to allocate a physical superpage for that region asynchronously. In the background, the OS can use migration to create free physical superpages or wait for them to be freed by applications. Once a free physical superpage exists, it could be allocated to a previously accessed virtual memory region. At that point, all previously allocated 4KB pages would need to be migrated into the newly acquired physical superpage.
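To make the synchronous trigger concrete, the following minimal user-space sketch shows the application-side view of such an allocation. It is illustrative, not part of any system described here: MADV_HUGEPAGE is Linux's THP hint (FreeBSD needs no hint, since its allocator reserves a superpage on the first fault automatically), and the over-allocate-and-align trick merely stands in for the kernel's internal alignment handling.

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>

    #define SP_SIZE (2UL << 20)   /* one 2MB superpage */

    int main(void) {
        /* Over-allocate so that a 2MB-aligned subregion is guaranteed. */
        char *raw = mmap(NULL, 2 * SP_SIZE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (raw == MAP_FAILED) { perror("mmap"); return 1; }
        char *sp = (char *)(((uintptr_t)raw + SP_SIZE - 1) & ~(SP_SIZE - 1));

    #ifdef MADV_HUGEPAGE
        madvise(sp, SP_SIZE, MADV_HUGEPAGE);  /* Linux: mark as a THP candidate */
    #endif

        /* First touch: on this page fault a synchronous policy would
         * acquire an aligned 2MB physical region from the allocator. */
        sp[0] = 1;
        return 0;
    }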
2.2 Physical Superpage Preparation

Once a physical superpage has been allocated, it must be prepared with its initial data before it can be mapped. A physical superpage can be prepared in one of three ways. First, if the virtual memory region is anonymous, i.e., not backed by a file, then the page simply needs to be zeroed. Second, if the virtual memory region is a memory-mapped file, then the data must be read from the file. Finally, if the virtual memory region is currently mapped to 4KB pages, then the contents of those existing pages must be copied into the physical superpage. Note that any constituent pages that were not already mapped would need to be prepared appropriately, either via zeroing or reading from the backing file.

Physical superpages can be prepared all at once or incrementally. Furthermore, as they are prepared, they can have some, or all, of their constituent pages mapped as 4KB pages (each constituent page that is mapped must have already been prepared). At a minimum, on a page fault, the 4KB page that triggered the fault must be prepared immediately in order to allow the application to resume. However, upon a page fault, the OS can choose to prepare the entire physical superpage, only prepare the required 4KB page, or prepare the required 4KB page, allow the application to resume, and prepare the remaining 4KB pages later (either asynchronously or when they are accessed).

The three types of preparation — zeroing, copying, and file reading — have different costs, and so may impact the choice of when and how much of a physical superpage to prepare. Incremental preparation decreases page fault latency and minimizes unnecessary preparation for 4KB pages that may ultimately never get accessed. However, as the page is incrementally prepared, the constituent pages will be using 4KB mappings. In contrast, all-at-once preparation eliminates future page faults to the virtual memory region and allows for the immediate creation of a superpage mapping.
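The three preparation paths can be summarized in one compact helper. This is an illustrative sketch, not kernel code: prepare_superpage() and its arguments are hypothetical, and it prepares the whole 2MB in one call, whereas a real kernel copies only the constituent pages that were actually mapped and zeroes or reads the rest.

    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define SP_SIZE (2UL << 20)

    enum prep_kind { PREP_ZERO, PREP_COPY, PREP_READ };

    /* Prepare one physical superpage 'dst' before it can be mapped. */
    static int prepare_superpage(void *dst, enum prep_kind kind,
                                 const void *old_4kb, int fd, off_t off) {
        switch (kind) {
        case PREP_ZERO:        /* anonymous memory: zero the region */
            memset(dst, 0, SP_SIZE);
            return 0;
        case PREP_COPY:        /* existing 4KB pages being consolidated */
            memcpy(dst, old_4kb, SP_SIZE);
            return 0;
        case PREP_READ:        /* file-backed memory: read from storage */
            return pread(fd, dst, SP_SIZE, off) == (ssize_t)SP_SIZE ? 0 : -1;
        }
        return -1;
    }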


[Figure 1 depicts the five events in the life cycle of a superpage (SP): physical allocation, physical preparation (by page zeroing, disk reads, or migration), mapping creation, mapping destruction, and physical deallocation. Its annotations note that allocation may fail because of memory fragmentation; that an incrementally prepared SP can be mapped as 4KB pages; that TLB coverage increases when created SP mappings are cached; that 4KB mappings can be created for some or all constituent 4KB pages after destruction; and that deallocated 4KB pages may be kept in anticipation of a future virtual SP or used individually for other purposes.]

Figure 1: The five events in the life cycle of a superpage (SP).

2.3 Superpage Mapping Creation

Once a physical superpage has been fully prepared, it must then be mapped as such in order to achieve address translation benefits. Before the superpage is mapped, the physical memory can still be accessed via 4KB mappings; afterwards, the OS loses the ability to track accesses and modifications at a 4KB granularity. Therefore, an OS may delay the creation of a superpage mapping if only some of the constituent pages are dirty in order to avoid unnecessary future I/O.

A superpage mapping is typically created upon a page fault, on either the initial fault to the memory region or a subsequent fault after the entire superpage has been prepared. However, if the physical superpage preparation is asynchronous, then its superpage mapping may also be created asynchronously. Note that on some architectures, e.g., ARM, any 4KB mappings that were previously created must first be destroyed.

2.4 Superpage Mapping Destruction

Superpage mappings can be destroyed at any time, but must be destroyed whenever any part of the virtual superpage is freed or has its protection changed. After the superpage mapping is destroyed, 4KB mappings must be recreated for any constituent pages that have not been freed.

With superpage mappings, the OS cannot track whether constituent pages are accessed or modified. Therefore, in some scenarios, the OS may choose to preemptively destroy a superpage mapping and substitute 512 4KB mappings for it to enable finer-grained memory management. For example, when a clean superpage is first modified, the OS could choose to destroy the superpage mapping in order to only mark the single modified 4KB page as dirty, potentially reducing future I/O operations. This would require the OS to make a read-only superpage mapping and use the page fault caused by the write access to destroy the mapping and replace it with 4KB mappings. Similarly, the OS could choose to destroy a superpage mapping when under memory pressure to enable swapping pages at a finer granularity.

2.5 Physical Superpage Deallocation

Generally, a physical superpage is deallocated when an application frees some or all of the virtual superpage, when an application terminates, or when the OS needs to reclaim memory. If a superpage mapping exists, it must be destroyed before the physical superpage can be deallocated. Then, either the entire 2MB can be returned to the physical memory allocator or the physical superpage can be "broken" into 4KB pages. If the physical superpage is broken into its constituent 4KB pages, the OS can return a subset of those pages to the physical memory allocator. However, returning only a subset of the constituent pages increases memory fragmentation, decreasing the likelihood of future physical superpage allocations.

Before part or all of a physical superpage is returned to the physical memory allocator, any constituent pages that have been prepared but not freed must be preserved. Preservation typically happens in one of three ways. In-use pages can be kept rather than returned to the allocator, and 4KB mappings can be created to those pages. Alternatively, the in-use pages can be copied to other physical pages, allowing the entire physical superpage to be returned. The last option is to write the in-use pages to secondary storage before returning them.

3 State-of-the-art Designs

This section compares the state-of-the-art designs for transparent superpage management in FreeBSD, Linux, and recent research prototypes (Ingens [20] and HawkEye [24]), with a particular focus on how they manage the events described in the previous section.

3.1 FreeBSD

FreeBSD supports transparent superpages for all kinds of memory, including memory-mapped files and executables. It decouples physical superpage allocation from preparation by using a reservation-based memory allocator [23, 29]. FreeBSD tries to allocate ("reserves") a physical superpage upon the first page fault to any aligned 2MB region. If physical superpages are available, they are allocated for any memory-mapped file exceeding 2MB in size. Anonymous memory always uses superpages if available, regardless of size, as anonymous memory is expected to grow.

Once a physical superpage is allocated for anonymous memory, only the 4KB page that caused the page fault is prepared, and a reservation entry is created to track all of the constituent pages. Any subsequent page fault to that 2MB region skips page allocation and simply prepares one additional 4KB page of the physical superpage. The physical superpage's preparation finishes once all of its constituents have been prepared. For file-backed memory, the process is the same, except memory is prepared in 64KB batches to minimize I/O overhead.
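A minimal model of this reservation bookkeeping is sketched below. The field and function names are invented for illustration (FreeBSD's actual structure is vm_reserv in sys/vm/vm_reserv.c, which additionally links reservations into partially populated queues): a population bitmap records which 4KB constituents have been prepared, and the reservation becomes eligible for a superpage mapping once the count reaches 512.

    #include <stdbool.h>
    #include <stdint.h>

    #define PAGES_PER_SP 512                    /* 2MB / 4KB */

    struct reservation {
        uintptr_t va_base;                      /* 2MB-aligned virtual start */
        uint64_t  popmap[PAGES_PER_SP / 64];    /* 1 bit per 4KB constituent */
        int       popcount;                     /* constituents prepared so far */
    };

    /* Record that the 4KB page covering 'fault_va' has been prepared; later
     * faults on this region skip allocation because the frame already exists. */
    static void reserv_populate(struct reservation *rv, uintptr_t fault_va) {
        int idx = (int)((fault_va - rv->va_base) >> 12);
        if (!(rv->popmap[idx / 64] & (1ULL << (idx % 64)))) {
            rv->popmap[idx / 64] |= 1ULL << (idx % 64);
            rv->popcount++;
        }
    }

    static bool reserv_fully_populated(const struct reservation *rv) {
        return rv->popcount == PAGES_PER_SP;    /* ready for a 2MB mapping */
    }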
FreeBSD creates superpage mappings synchronously during page faults. FreeBSD only creates a superpage mapping if the characteristics (e.g., protection and modified state) of all the constituent 4KB mappings are the same. Identical protections are required for correctness; identical dirty states ensure that FreeBSD will not do unnecessary I/O to preserve the contents of the page when it is later deallocated.

Superpage mappings are destroyed on partial memory protection changes and partial unmappings. FreeBSD also preemptively destroys clean superpage mappings before modification. As a result, only one 4KB mapping is marked as dirty, instead of the entire superpage. Once the last clean 4KB page is modified, a dirty superpage mapping gets created.

FreeBSD defers physical superpage deallocation as long as possible in order to minimize memory fragmentation and preserve the availability of free physical superpages. However, under memory pressure, FreeBSD looks for a partially prepared physical superpage and breaks the corresponding reservation to allow the unused memory within that 2MB physical memory region to be reclaimed for other uses.

3.2 Linux

Linux's THP only uses superpages for anonymous memory and tries to allocate a physical superpage on the first page fault to a 2MB-aligned virtual memory region. If allocation fails and defragmentation is enabled (the default), it immediately does memory compaction via page migration to create a free physical superpage. This blocks the faulting process, increasing page fault latency. Under severe fragmentation, migration may still fail to create a free physical superpage.

Linux does all-at-once physical superpage preparation: the entire physical superpage is always zeroed right after being allocated. This increases the initial page fault latency, but may reduce the average latency [24]. After this preparation, a superpage mapping is immediately created. The superpage mapping will be destroyed if some or all of the superpage is unmapped or has its protection settings changed. Once some or all of the superpage has been freed, the physical superpage is deallocated and free memory is immediately reclaimed.

This "first-touch" superpage policy only allocates physical superpages at the time of the first page fault. However, Linux also includes a kernel daemon called "khugepaged", which asynchronously scans the system page tables. When it finds an aligned 2MB anonymous virtual memory region that contains at least one dirty 4KB mapping, khugepaged tries to allocate a physical superpage. If a free physical superpage exists, it acquires it; otherwise, it calls Linux's memory compaction to reclaim one by migrating pages.

Before preparing this physical superpage, khugepaged blocks accesses to the virtual 2MB region by blocking page faults within the region and destroying the existing 4KB mappings. It then prepares all of the physical superpage's constituent 4KB pages, one at a time. For a previously mapped page, the contents are copied. Previously unmapped pages are zeroed. Finally, it installs a superpage mapping.

Khugepaged's preparation is more costly than the first-touch preparation that occurs in a page fault. It blocks accesses to the 2MB region, causes TLB shootdowns, and pollutes CPU caches. As a result, it is allowed by default to allocate at most 8 superpages every 10 seconds (1.6 MB/s).

When an application partially frees memory within a superpage without unmapping the virtual memory (e.g., via MADV_DONTNEED), it triggers the destruction of the superpage mapping and the deallocation of the physical superpage. The remaining in-use memory then gets mapped as 4KB pages. However, when khugepaged scans this 2MB region, it will unnecessarily migrate the mapped memory into another allocated superpage and effectively reallocate the freed memory. It is precisely this behavior of khugepaged that has led to the severe memory bloating reported in recent work [20, 24].
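This trigger is easy to reproduce from user space. The sketch below (Linux-specific and illustrative; the alignment trick and sizes are arbitrary choices) fully touches an aligned 2MB region, then releases half of it without unmapping, which is precisely the pattern that destroys the superpage mapping, frees the physical superpage, and later baits khugepaged into migrating the region back into a fresh superpage.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #define SP_SIZE (2UL << 20)

    int main(void) {
        char *raw = mmap(NULL, 2 * SP_SIZE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (raw == MAP_FAILED) { perror("mmap"); return 1; }
        char *sp = (char *)(((uintptr_t)raw + SP_SIZE - 1) & ~(SP_SIZE - 1));

        memset(sp, 1, SP_SIZE);   /* fully populate: THP can map a superpage */

        /* Partially free the region without unmapping it. Under THP this
         * splits the superpage: the 2MB mapping is destroyed, the physical
         * superpage is deallocated, and the live half is remapped as 4KB
         * pages -- the state khugepaged will later "repair" by migrating
         * the region into a new superpage, reallocating the freed half. */
        if (madvise(sp, SP_SIZE / 2, MADV_DONTNEED) != 0)
            perror("madvise");
        return 0;
    }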
3.3 Ingens and HawkEye

Recent state-of-the-art prototypes (Ingens [20] and HawkEye [24]) attempt to mitigate the page fault latency spikes incurred by Linux's first-touch superpage policy as well as the memory bloat incurred by khugepaged — behaviors which have led many to suggest that Linux's transparent superpage support be disabled for best performance.

Both systems disable Linux's first-touch policy, instead allocating, preparing, and mapping only a single 4KB page on a page fault. They then effectively modify khugepaged to more aggressively manage superpages.

Khugepaged's behavior differs in default Linux, Ingens, and HawkEye in terms of the order, threshold, and rate of superpage creation. To prevent excessive memory bloat, Ingens raises the threshold for triggering the creation of a superpage from one in-use 4KB page to 90% in-use, meaning there must be at least 460 4KB mappings in a 2MB region in order to create a superpage for that region.

Ingens maintains a list of candidate 2MB-aligned regions on page faults. As long as the list is not empty, Ingens keeps creating superpage mappings. However, asynchronous superpage creation introduces a fairness problem: the scanning order of the page tables can lead to long delays for some processes. To alleviate this, Ingens prioritizes processes with fewer superpages. In addition, Ingens actively compacts non-referenced memory at an aggressive rate.

HawkEye uses the same threshold as default khugepaged: one 4KB page. Under memory pressure, it scans mapped superpages and makes their zero-filled 4KB pages copy-on-write to a single zero-filled page to reclaim memory.

HawkEye also maintains a list of candidate 2MB-aligned regions, but further weights them by their memory utilization, the process's resident size, sampled access frequency and TLB overheads. HawkEye then creates a superpage mapping for the candidate with the most weight, i.e., the one believed to incur the highest TLB overhead — called fine-grained superpage management in the paper [24]. It attempts to obtain considerable address translation benefits with fewer superpages.

HawkEye's fine-grained superpage management consumes additional CPU resources beyond the migration-based superpage mapping creations. To avoid interference with running processes, it uses the same promotion rate (1.6MB/s) as Linux's default khugepaged.

4 Analysis of Existing Designs

This section analyzes the designs for transparent superpage management described in the previous section and presents several novel observations about them. These observations motivate the design of Quicksilver.

Platforms. All designs were evaluated on an Intel E3-1245 v6-based server with maximum turbo performance and hyper-threading enabled. This server has 4 physical cores, 32GB DDR4 2400 ECC RAM, and a 256GB NVMe SSD. Linux version 4.3 was used, as both Ingens and HawkEye are based on that version. FreeBSD version 11.2 was used, upon which Quicksilver is built. Swapping is disabled under every OS.

Benchmarks. A large variety of benchmarks are evaluated. GUPS performs 2^32 serial random memory accesses to 2^30 64-bit integers (8GB) [13]. Graphchi-PR, BlockSVM and ANN use out-of-core implementations to solve big-data tasks [21, 32]. Graphchi-PR computes 3 iterations of PageRank on the preprocessed Twitter-2010 dataset [19]. BlockSVM trains a classification model on the kdd2010-bridge dataset [28]. ANN randomly queries nearest neighbors on 2GB of preprocessed hash tables. XSBench is a parallel computation kernel of the Monte Carlo neutron transport algorithm [30]. Canneal and freqmine are PARSEC benchmarks with large memory footprints [10]. Gcc, mcf, DSjeng and XZ are SPEC CPU2017 benchmarks with large memory footprints [11]. Buildkernel compiles the FreeBSD 11.2 kernel.

Graphchi-PR, XSBench, canneal and Buildkernel are multi-threaded to fully utilize CPU resources. Cold and Warm are Redis workloads benchmarking throughput and tail latency from a separate client machine with 8 threads and 16 request pipelines. The Cold workload populates an empty Redis instance with 16GB of 4KB objects. The Warm workload queries the fully populated 16GB Redis instance with a set/get ratio of 5:5 using 4KB objects. Del-70, Del-50, Range-S and Range-XL are Redis workloads benchmarking memory consumption. Del-70 and Del-50 insert 2 million 8KB objects and randomly delete 70% and 50% of them, respectively. Range-S and Range-XL insert randomly sized objects from small and large size ranges, respectively. Detailed benchmark settings and scripts can be found in the Quicksilver repository.

Observation 1: Coupling physical allocation, preparation, and mapping of superpages leads to memory bloat and fewer superpage mappings. It also is not compatible with transparent use of multiple superpage sizes.

Linux's first-touch policy couples physical superpage allocation, preparation and superpage mapping creation together. As a result, it enjoys two obvious benefits. First, it provides immediate address translation benefits, including shorter page walk time and increased TLB efficiency. Second, it eliminates a large number of page faults for a heavily utilized superpage. Therefore, it is usually the best mapping policy when there is abundant contiguous free memory.

However, this coupled policy has several drawbacks. First, it can easily bloat memory and waste time preparing underutilized superpages. In a microbenchmark that sparsely touches 30GB of anonymous memory, Linux's first-touch policy takes 1.4s to run and consumes 30GB, compared to 0.06s and 0.2GB when transparent huge pages are disabled. While such a corner case is rare when applications use malloc to dynamically allocate memory, it may still happen in a long-running server, e.g., Redis. Table 1 shows that Linux's first-touch policy bloats memory by 78% compared to Linux-4KB on the workload Range-XL, which inserts objects of random sizes ranging from 256B to 1MB.
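The sketch below is one plausible reconstruction of that microbenchmark (the size parameter and one-byte stride are assumptions, not the paper's exact code): each write touches one byte per 2MB region, so a first-touch superpage policy zeroes and maps the entire mapping, while a 4KB policy instantiates only a single 4KB page per region.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define SP_SIZE (2UL << 20)

    int main(int argc, char **argv) {
        size_t gb = argc > 1 ? strtoul(argv[1], NULL, 10) : 30;
        size_t len = gb << 30;   /* the paper's run used 30GB */

        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        /* One byte per 2MB: under a coupled first-touch policy, every
         * write pays for allocating and zeroing a whole 2MB superpage,
         * ballooning the resident set to the full mapping size. */
        for (size_t off = 0; off < len; off += SP_SIZE)
            buf[off] = 1;
        return 0;
    }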


Workload   Linux-4KB  Linux-noKhugepaged  Linux
Del-70     11.6 GB    11.7 GB             19.8 GB
Range-XL   14.4 GB    25.7 GB             30.7 GB

Table 1: Redis memory consumption. Linux-noKhugepaged disables khugepaged.

Second, it misses chances to create superpage mappings when virtual memory grows. During a page fault, Linux cannot create a superpage mapping beyond the heap's end, so it installs a 4KB page which later prevents the creation of a superpage mapping when the heap grows. Figure 2 shows such behavior for gcc [11], which includes three compilations. Linux's first-touch policy creates a few superpage mappings early in each compilation, but fails to create more as the heap grows. Instead, promotion-based policies can create more superpages, e.g., FreeBSD and Linux's khugepaged.

Figure 2: Linux's first touch policy fails to create superpages.

Third, it cannot be extended to larger anonymous or file-backed superpages. Table 2 estimates the page fault latency for both 1GB anonymous superpages and 2MB/1GB file-backed superpages. Faulting a 2MB file-backed superpage on the NVMe disk costs 1.7ms, and faulting a 1GB anonymous superpage takes 46ms. These numbers may cause latency spikes in server applications. Furthermore, the OS cannot easily determine which page size to use on first touch. This is arguably more of an immediate problem on ARM processors, which support both 64KB and 2MB superpages.

Page Size  Anonymous  NVMe Disk  Spinning Disk
2MB        91 us      1.7 ms     11 ms
1GB        46 ms      0.9 s      7.7 s

Table 2: Page fault latency. Bold numbers are estimations.

Observation 2: Asynchronous, out-of-place promotion alleviates latency spikes but delays physical superpage allocations.

Promotion-based policies can use 4KB mappings and later replace them with a superpage mapping. This allows for potentially better informed decisions about superpage mapping creation and can easily be extended to support multiple sizes of superpages. Specifically, there are two kinds of promotion policies, called out-of-place promotion and in-place promotion. They differ in whether previously prepared 4KB pages require migration when preparing a physical superpage.

Under out-of-place promotion, a physical superpage is not allocated in advance; on a page fault, a 4KB physical page is allocated that may be neither contiguous nor aligned with its neighbors. When the OS decides to create a superpage mapping, it must allocate a physical superpage, migrate the mapped 4KB physical pages, and zero the remaining ones. At this time, previously created 4KB mappings are no longer valid.

Linux and recent prototypes [20, 24] perform asynchronous, out-of-place promotion to hide the cost of page migration. As discussed in Section 3.2, Linux includes khugepaged as a supplement to create superpage mappings for growing heaps. The steady, slow increase of Linux's superpages in Figure 2 is from khugepaged's out-of-place promotions. However, khugepaged can easily bloat memory. Table 1 shows a memory bloat from 11.6GB to 19.8GB on workload Del-70, which randomly deletes 70% of the objects. On workload Range-XL, it bloats memory from 25.7GB to 30.7GB.

Ingens and HawkEye [20, 24] disable Linux's first-touch policy and instead improve the behavior and functionality of khugepaged, motivated by avoiding latency spikes in server workloads. Under memory fragmentation, Linux tries to compact memory when it fails to allocate superpages, which blocks the ongoing page fault and leads to latency spikes. Ingens and HawkEye enhanced khugepaged and offloaded superpage allocations from the critical path, alleviating such latency spikes. Khugepaged therefore works as their primary superpage management mechanism.

However, out-of-place promotion delays physical superpage allocations and ultimately superpage mapping creation, because the OS must scan page tables to find candidate 2MB regions and schedule the background tasks to promote them. Table 3 compares in-place promotion (FreeBSD) with out-of-place promotion (Ingens and HawkEye) on applications where superpage creation speed is critical. While GUPS only involves random accesses, both Graphchi-PR and BlockSVM [21, 32] represent important real-life applications: they use fast algorithms to process big data that cannot fit in memory. To better illustrate the problem, Ingens* and HawkEye* were tuned to be more aggressive, so that all 2MB regions containing at least one dirty 4KB mapping are candidates for promotion. Specifically, Ingens* uses a 0% utilization threshold instead of 90%; HawkEye* uses a 100% maximum CPU budget to promote superpages. However, Table 3 shows that FreeBSD consistently and significantly outperforms both of them. In other words, the most conservative in-place promotion policy creates superpage mappings faster than the most aggressive out-of-place promotion policy.

Workloads    Ingens  Ingens*  HawkEye  HawkEye*  FreeBSD
GUPS         0.87    0.84     0.28     0.88      0.96
Graphchi-PR  0.58    0.58     0.53     0.60      0.77
BlockSVM     0.81    0.79     0.73     0.81      0.96

Table 3: Speedup over Linux with unfragmented memory. All systems have worse performance than Linux.

Observation 3: Reservation-based policies enable speculative physical page allocation, which enables the use of multiple page sizes, in-place promotion, and obviates the need for asynchronous, out-of-place promotion.

In-place promotion does not require page migration. It creates a physical superpage on the first touch, then incrementally prepares and maps its constituent 4KB pages without further page allocation. Therefore, the allocation of a physical superpage is immediate, but its superpage mapping creation is delayed. To bypass 4KB page allocations, it requires a bookkeeping system to track allocated physical superpages, e.g., FreeBSD's reservation system. On x86-64, after it substitutes a superpage mapping for the 4KB mappings, it need not flush the previous 4KB mappings from the TLB.
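The promotion test itself is cheap. The following sketch shows the shape of the check, in the spirit of FreeBSD's pmap_promote_pde() on x86-64 but with a simplified, abbreviated flag set: all 512 PTEs must be valid, map physically contiguous frames within one aligned 2MB region, and agree on the attributes that the single 2MB mapping will carry.

    #include <stdbool.h>
    #include <stdint.h>

    #define SP_SIZE      (2UL << 20)
    #define PAGES_PER_SP 512
    #define PTE_VALID    0x1ULL
    #define PTE_RW       0x2ULL
    #define PTE_DIRTY    0x40ULL
    #define PTE_ATTRS    (PTE_VALID | PTE_RW | PTE_DIRTY)  /* abbreviated set */
    #define PTE_PFN_MASK (~0xFFFULL)

    static bool can_promote(const uint64_t pte[PAGES_PER_SP]) {
        uint64_t base = pte[0] & PTE_PFN_MASK;
        if (!(pte[0] & PTE_VALID) || (base & (SP_SIZE - 1)) != 0)
            return false;                     /* frame 0 must be 2MB-aligned */
        for (int i = 1; i < PAGES_PER_SP; i++) {
            if ((pte[i] & PTE_PFN_MASK) != base + (uint64_t)i * 4096)
                return false;                 /* frames must be contiguous */
            if ((pte[i] & PTE_ATTRS) != (pte[0] & PTE_ATTRS))
                return false;                 /* attributes must agree */
        }
        return true;                          /* safe to install one 2MB PDE */
    }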
FreeBSD implements an in-place promotion policy based on its reservation system, as described in Section 3.1. It conservatively creates superpage mappings to avoid making performance worse. Navarro et al. reported negligible overheads from the reservation system [23].

FreeBSD immediately allocates physical superpages but delays superpage mapping creation, sacrificing some address translation benefits. Table 3 shows that Linux consistently outperforms FreeBSD when memory is unfragmented, though they created similar numbers of anonymous superpage mappings.
However, FreeBSD aggressively allocates physical superpages for anonymous memory. Upon a page fault on anonymous memory, it always speculatively allocates a physical superpage, expecting the heap to grow. This eliminates one of the primary needs for khugepaged in Linux. In Figure 2, FreeBSD has most of the memory quickly mapped as superpages, because most speculatively allocated physical superpages end up as fully prepared pages.

Observation 4: Reservations and delaying partial deallocation of physical superpages fight fragmentation.

Superpages are easily fragmented on a long-running server. A few in-use 4KB pages can consume a physical superpage that provides little benefit if mapped as a superpage. Existing systems deal with memory fragmentation in three ways.

Linux compacts memory immediately when it fails to allocate a superpage. It tries to greedily use superpages, but risks blocking a page fault. Table 4 evaluates the performance of Redis. Under fragmentation, Linux obtains slightly higher throughput but much higher tail latency than Linux-4KB.

          Linux-4KB            Linux
Frag-0    1.04 GB/s (5.6 ms)   1.34 GB/s (4.1 ms)
Frag-50   1.04 GB/s (5.7 ms)   0.92 GB/s (10.2 ms)

Table 4: Mean throughput and 95th-percentile latency of the Redis Cold workload.

FreeBSD delays the partial deallocation of a physical superpage to increase the likelihood of reclaiming a free physical superpage. When individual 4KB pages get freed sooner, they land in a lower-ordered buddy queue and are more likely to be quickly reallocated for other purposes. Therefore, performing partial deallocations only when necessary due to memory pressure decreases fragmentation.

Ingens actively defragments memory in the background to avoid blocking page faults. It preferentially migrates non-referenced memory, so that it minimizes the interference with running applications. As a result, Ingens generates fewer latency spikes compared with Linux [20]. These migrations, however, do consume processor and memory resources.
Observation 5: Bulk zeroing is more efficient on modern processors than repeated 4KB zeroing.

Modern OSes have abandoned asynchronous page zeroing because it usually degrades performance in a multiprocess situation. Furthermore, the introduction of ERMS (Enhanced REP MOVSB/STOSB) has accelerated page zeroing. However, existing OSes fail to fully exploit the benefits of ERMS support, because they still zero pages 4KB at a time. Modern CPUs can zero a 2MB page much faster with bulk zeroing, which invokes the assembly-language page zeroing code at a size larger than 4KB. Table 5 compares 2MB zeroing speed on five modern machines. Existing OSes take 84–409us to zero a 2MB superpage. Using a larger bulk size improves that range to 67–334us. Furthermore, these machines have a consistently short non-temporal (movnti or clzero) bulk zeroing latency (53–106us). The AMD Ryzen 7 2700X CPU achieves 53us with the highest CPU and DRAM frequency and its specific clzero implementation.

                   DRAM    temporal (bulk size)   non-temporal (bulk size)
CPU (GHz)          (MHz)   4KB    32KB   2MB      4KB    32KB   2MB
E3-1231v3 (3.40)   1600    92     88     87       114    99     97
E3-1245v6 (3.70)   2400    84     67     65       92     74     71
E5-2640v3 (2.60)   1866    355    287    280      154    112    106
E5-2640v4 (2.40)   2133    409    334    325      163    113    106
R7-2700X (4.30)    2666    185    183    159      99     60     53

Table 5: 2MB page zeroing time (us) drops consistently using a larger bulk size.
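The non-temporal variant measured in Table 5 can be written with SSE2 intrinsics. A minimal sketch (x86-64 only; it assumes the destination is a 2MB page, which is trivially 16-byte aligned) is shown below, next to the temporal bulk alternative:

    #include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_sfence */
    #include <stddef.h>
    #include <string.h>

    #define SP_SIZE (2UL << 20)

    /* Zero one 2MB page with non-temporal (streaming) stores: the zeros
     * bypass the cache hierarchy, so the 2MB write neither evicts the
     * application's working set nor stalls on cache-line fills. */
    static void zero_superpage_nt(void *page) {
        const __m128i z = _mm_setzero_si128();
        char *p = page;
        for (size_t i = 0; i < SP_SIZE; i += 64) {
            _mm_stream_si128((__m128i *)(p + i +  0), z);
            _mm_stream_si128((__m128i *)(p + i + 16), z);
            _mm_stream_si128((__m128i *)(p + i + 32), z);
            _mm_stream_si128((__m128i *)(p + i + 48), z);
        }
        _mm_sfence();   /* order the streaming stores before the page is used */
    }

    /* The temporal alternative: one 2MB bulk call lets an ERMS rep-stos
     * run at full speed instead of restarting every 4KB. */
    static void zero_superpage_bulk(void *page) {
        memset(page, 0, SP_SIZE);
    }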


5 Design and Implementation

This section describes Quicksilver, an improved transparent superpage management system based upon the observations from the previous section. To benefit from in-place promotions, Quicksilver is built upon FreeBSD's reservation-based superpage management strategy.

5.1 Design

Aggressive Physical Superpage Allocation. Section 4 shows that aggressive allocation on first touch (as done by Linux and FreeBSD) is effective. Moreover, Observation 3 shows that FreeBSD's reservation system allows speculative physical allocation for anonymous memory and creates even more superpages than Linux, as shown in Figure 2. Since it also supports multiple superpage sizes and avoids memory bloating, Quicksilver retains FreeBSD's reservation system: allocating physical superpages when virtual memory regions that may use superpages are first accessed. Allocation is performed synchronously to avoid page migrations.

The drawbacks of FreeBSD's use of reservations are twofold. First, FreeBSD delays preparation and mapping of superpages, resulting in lower performance than Linux in some scenarios, as shown in Table 3. However, this is not inherent in the use of reservations for allocation, but rather should be addressed via preparation and mapping policies. Second, holding underutilized physical superpages in reservations can prevent future superpage allocations. However, this is better resolved via deallocation policies that recognize and recover from such situations.

Hybrid Physical Superpage Preparation. Quicksilver strikes a balance between incremental and all-at-once preparation. Reservations are initially prepared incrementally. This minimizes the initial page fault latency, but loses prompt address translation benefits. Therefore, Quicksilver has an additional threshold, t. Once t 4KB pages get prepared, it prepares the remainder of the superpage all at once.

This reduces bloat, as discussed in Observation 1, because it does not immediately prepare and map the superpage. However, it enables address translation benefits sooner than waiting for the entire superpage to be accessed. The use of a threshold is further based on previous work showing that the utilization of physical superpages is largely bimodal [34]. Once more than about 64 4KB pages have been accessed, it is very likely that the physical superpage will eventually be fully populated (or very nearly so). Therefore, at that point, it is very likely to be beneficial to prepare the remainder of the page and create a superpage mapping for it. Motivated by Observation 5, bulk zeroing is used to accelerate page zeroing when zero-filling the remainder of the superpage.
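Combining the threshold with the reservation bookkeeping sketched in Section 3.1, the synchronous fault path looks roughly like the sketch below. All helper functions here are hypothetical stand-ins for the real reservation and pmap operations; only the threshold logic reflects the design described above.

    #include <stdint.h>

    #define PREP_THRESHOLD 64   /* t; the paper's Sync-64 configuration */

    struct reservation;                                   /* see Sec. 3.1 sketch */
    int  reserv_popcount(struct reservation *rv);         /* hypothetical */
    void prepare_4kb(struct reservation *rv, uintptr_t va);      /* hypothetical */
    void map_4kb(struct reservation *rv, uintptr_t va);          /* hypothetical */
    void zero_remaining_nt(struct reservation *rv);              /* hypothetical */
    void promote_to_superpage(struct reservation *rv);           /* hypothetical */

    void sync_t_page_fault(struct reservation *rv, uintptr_t fault_va) {
        prepare_4kb(rv, fault_va);          /* zero only the faulting page */

        if (reserv_popcount(rv) >= PREP_THRESHOLD) {
            /* Population crossed t: bulk-zero whatever is still unprepared
             * and install the 2MB mapping before returning to the app. */
            zero_remaining_nt(rv);
            promote_to_superpage(rv);
        } else {
            map_4kb(rv, fault_va);          /* stay at 4KB granularity */
        }
    }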
Relaxed Superpage Mapping Creation. Once an entire physical superpage has been prepared, there is little downside to immediately creating a superpage mapping for anonymous memory, which is rarely, if ever, swapped in modern systems. Therefore, Quicksilver relaxes FreeBSD's current design — which does not create a superpage mapping if the accessed or modified states of the constituent pages differ — to always create a mapping once the physical superpage has been fully prepared, as do Linux, Ingens, and HawkEye.

For file-backed superpages, Quicksilver retains FreeBSD's write-protection mechanism to avoid extra disk I/O, but no longer examines whether all constituent pages are accessed. Because memory-mapped files are usually prefetched 64KB at a time, file-backed superpages may not be fully accessed when they get fully prepared. By allowing different access bits, more file-backed superpage mappings can be created. Note that Linux and its variants do not use superpages at all for files.

On-demand Superpage Mapping Destruction. There is no reason to destroy a superpage mapping unless some or all of the memory within the superpage is freed, its protection is changed, or the physical superpage must be deallocated to reclaim memory. Therefore, Quicksilver maintains FreeBSD's policy of only destroying mappings in the aforementioned situations.

Preemptive Physical Superpage Deallocation. As discussed in Observation 4, delaying partial deallocation of physical superpages effectively limits fragmentation. However, to maximize the effectiveness of synchronous physical superpage allocation, there must be available superpages to allocate. Superpage availability can have a considerable impact on performance, as was shown in Table 4. Therefore, Quicksilver maintains a target number of free physical superpages.

Underutilized reservations that are inactive for a long period are preemptively deallocated. These partially prepared physical superpages are not yet mapped as superpages, so the deallocation reduces memory bloat and recovers memory contiguity. Such preemptive deallocation copes well with hybrid preparation under a population threshold t. As a result, preemptive deallocation usually evacuates underutilized and less frequently accessed superpages.

This approach has three advantages. First, fewer pages are migrated. Second, the preemptive migration happens in the background, so it does not happen on the critical path of any OS function executed by the application. Finally, it is likely to have minimal impact on running processes, as it is operating on pages that come from less frequently accessed superpages.

5.2 Implementation

Quicksilver was implemented within FreeBSD 11.2. Quicksilver focuses on anonymous memory, with FreeBSD's superpage support for file-backed memory slightly improved (access bit equivalence is no longer required for promotion). This section further describes the page zeroing mechanism and the migration/deallocation daemon.

Hybrid Preparation. A physical superpage is incrementally prepared until it reaches a population threshold, t. Then the remainder of the physical superpage is prepared by zero-filling it. The system can do this either synchronously or asynchronously; these variants are named Sync-t and Async-t. Specifically, Async-t periodically scans the linked list of partially populated physical reservations and starts zero-filling from the most active ones that have reached the population threshold t. Therefore, it incurs no fairness issue, because the order is determined by physical allocation activity, not process IDs.

In both cases, zero-filling uses non-temporal stores. When using Sync-t, pages are zeroed using the largest bulk size possible, as motivated by Observation 5. Since zeroing is done by the page fault handler, the page fault handler can create a superpage mapping immediately after zeroing is complete. When using Async-t, 4KB pages are zeroed individually. While this yields lower zeroing performance, it reduces lock contention when operating on pages. Since zeroing is done asynchronously and independently of any process's virtual address space, a superpage mapping is not created until the first soft page fault after all pages have been zeroed.

Relaxed Mapping Creation. For anonymous memory, the superpage mapping creation condition is relaxed to ignore checking for dirty and access bits. This allows a superpage mapping to be created immediately after Sync-t completes the zero-filling (these pages are clean). For file-backed memory, superpage mappings are created on a soft page fault of a file-backed physical superpage. Default FreeBSD skips mapping creation because the access bits are inferred not to all be set when prefaulting the prefetched disk data. After relaxing the access bit checking, Quicksilver tries to create a superpage mapping at that point.


Preemptive Deallocation. Physical superpages are often underutilized [20, 34]. Given Observation 4, the system delays partial deallocation of physical superpages. However, to ensure that there are sufficient free physical superpages for future allocations, Quicksilver uses an evacuation daemon to reclaim free physical superpages by preemptively deallocating underutilized physical superpages.

The daemon maintains a target number of free physical superpages. It periodically scans the list of partially populated reservations and examines their inactive time, during which they are neither populated nor deallocated. If they have been inactive for a long time, e.g., 5 seconds, the daemon reclaims a free physical superpage by migrating out its constituent 4KB pages. To avoid contention with running applications, the daemon is restricted to a maximum memory bandwidth of 1GB/s, which is less than 5% of the evaluated machine's memory bandwidth.
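One pass of such a daemon might look like the sketch below, under stated assumptions: the list handling, the free-page counter, the target, and the migration helper are all hypothetical, while the 5-second inactivity window and the 1GB/s cap are the settings described above.

    #include <time.h>

    #define INACTIVE_SECS  5        /* inactivity window from the text above */
    #define TARGET_FREE_SP 1024     /* hypothetical free-superpage target */

    struct resv {
        struct resv *next;          /* list of partially populated reservations */
        time_t       last_activity; /* last time a page was populated or freed */
    };

    long free_superpage_count(void);                /* hypothetical */
    void migrate_out_and_release(struct resv *rv);  /* hypothetical; 1GB/s cap */

    void evacuation_pass(struct resv *partially_populated) {
        time_t now = time(NULL);
        struct resv *rv = partially_populated;

        while (rv != NULL && free_superpage_count() < TARGET_FREE_SP) {
            struct resv *next = rv->next;   /* rv may be freed by the release */
            /* Inactive and underutilized: migrate the constituent 4KB pages
             * out and reclaim the whole 2MB frame as a free superpage. */
            if (now - rv->last_activity >= INACTIVE_SECS)
                migrate_out_and_release(rv);
            rv = next;
        }
    }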
6 Methodology

Fragmentation. Three fragmentation levels are modeled to mimic long-running servers, named Frag-0, Frag-50 and Frag-100. They represent situations from non-fragmented to severely fragmented. Specifically, Frag-50 leaves 50% of the application's maximum resident memory as free superpages.

The three fragmentation levels are crafted by a user-space tool which works under a first-touch physical superpage allocation policy (available in both Linux and FreeBSD). It first fragments superpages until there is memory pressure, then starts over and fragments a target number of superpages. Unlike under a previous memory fragmentation method [24] that only performs the latter step, Linux's memory compaction usually fails to undo this fragmentation, whether invoked from page faults or from khugepaged. To fragment a superpage, the tool touches part of a 2MB-aligned virtual region and unmaps the untouched part. This triggers a physical superpage allocation and forces a partial deallocation, fragmenting that physical superpage.
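The per-superpage step of such a tool is simple to sketch; the touched fraction (one quarter) and the iteration count below are arbitrary choices, not the paper's exact parameters. Touching the head of an aligned 2MB region forces a first-touch physical superpage allocation, and unmapping the untouched tail forces the partial deallocation that leaves the frame broken into 4KB pieces.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #define SP_SIZE (2UL << 20)

    /* Consume one free physical superpage and give back only 4KB pieces. */
    static int fragment_one(void) {
        char *raw = mmap(NULL, 2 * SP_SIZE, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (raw == MAP_FAILED) return -1;
        char *sp = (char *)(((uintptr_t)raw + SP_SIZE - 1) & ~(SP_SIZE - 1));

        memset(sp, 1, SP_SIZE / 4);  /* touch: triggers superpage allocation */
        /* Unmap the untouched part: forces a partial deallocation, so most
         * of the 2MB frame returns to the allocator as scattered 4KB pages.
         * The touched head stays mapped to keep the frame pinned in pieces. */
        return munmap(sp + SP_SIZE / 4, SP_SIZE - SP_SIZE / 4);
    }

    int main(void) {
        for (int i = 0; i < 4096; i++)   /* fragment a target number of SPs */
            if (fragment_one() != 0) { perror("fragment_one"); return 1; }
        return 0;
    }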
Library Differences. FreeBSD dynamically links executable files with its natively shipped libc, while Linux uses GNU libc. This makes any performance comparison between FreeBSD and Linux unfair, because a different implementation of a standard function may change performance significantly. For example, the libc string library in FreeBSD 11.2 does not use ERMS optimizations, so memory-copy-intensive applications have worse performance. To remove this difference, applications were compiled and statically linked on Linux and then run on FreeBSD using FreeBSD's Linux system call emulation. Table 6 shows that natively compiled canneal on FreeBSD runs slower than emulated canneal, because of slower memory copying in dynamic array resizing. Although a libc library with ERMS optimizations could be ported from FreeBSD 12.0, this methodology ensures that the exact same binaries are run on all systems, eliminating any possible library differences.

          Linux                   FreeBSD
Threads   default  aggressive     default  emulate Linux ELF
1         1.05     1.19           1.15     1.16
8         1.07     0.91           1.11     1.18

Table 6: Canneal performance speedup. Only bold numbers are comparable.

There are three exceptions. GraphChi-PR uses dlopen to dynamically link the OpenMP library, so it cannot be statically compiled. Redis calls gettimeofday() very frequently, causing huge emulation overhead. Therefore, these two are compiled natively on FreeBSD-based systems after porting the libc library from FreeBSD 12.0; they may consequently have minor library-induced performance differences between the Linux-based and FreeBSD-based systems. Lastly, FreeBSD's Linux emulation caused significant performance degradation on GUPS, because of cache misses resulting from an unaligned dynamically allocated data structure. To fix this, GUPS was modified to use malloc_aligned.

System Tuning. When there are idle CPUs, tuning Linux's khugepaged to be more aggressive can obtain better performance. Table 6 shows this in a single-threaded case. This tuning also yields higher throughput for single-threaded Redis, as shown in Table 7. However, performance degrades when the application uses all CPUs and competes with khugepaged, so Linux remains unchanged for the remainder of the evaluation.

FreeBSD 11.2 has suboptimal Redis performance for three reasons. First, it uses a network socket buffer size suitable for 1Gbps NICs. Second, its libc has no ERMS optimizations, while memory copying dominates Redis's performance. Third, it is unlikely to repromote superpages after MADV_FREE (Redis uses MADV_FREE on FreeBSD to save page faults). Therefore, FreeBSD was tuned to use the correct buffer size for a 40Gbps NIC, and the libc library was ported from FreeBSD 12.0. Additionally, a recent patch [1] to FreeBSD was applied to increase the likelihood of superpage repromotion, creating 1.2K more superpage mappings in Table 7. The dirty bit requirement for anonymous memory was relaxed to match Linux's performance, creating 8.2K superpage mappings.

           Linux                   FreeBSD
           default  aggressive     default  patched [1]  match Linux
Speedup    1.01     1.24           1.02     1.07         1.19
Mappings   0.4 K    8.2 K          0.0 K    1.2 K        8.2 K

Table 7: Throughput speedup and number of created superpage mappings of a Redis server populated by Del-70. Only bold numbers are comparable.

Ingens and HawkEye are evaluated with their default settings. Ingens promotes superpages with a 90%-utilization threshold. HawkEye promotes superpages at the speed of 1.6MB/s, guided by performance counters. Ingens* and HawkEye* are aggressively tuned variants. Specifically, Ingens* uses a utilization threshold of 0% instead of 90% and enables 1GB/s proactive memory compaction. HawkEye* uses a 100% CPU maximum with a promotion threshold of 1.


a 100% CPU maximum with a promotion threshold of 1. XSBench that were reported in the original paper [24], be-
cause in these application runs, most of its memory com-
7 Evaluation paction fails and its important data was not allocated at the
high end of the address space.
Four variants of Quicksilver are considered, named Sync-
The four variants of Quicksilver all consistently perform
1, Sync-64, Async-64 and Async-256. They all handle the
well on both non-server and server workloads, because their
five superpage events similarly except for superpage prepara-
background defragmentation not only avoids increasing page
tion. Therefore, for clarity they are named after their prepara-
fault latency, but also succeeds in recovering unfragmented
tion policies. These four variants represent reasonable design
performance. Specifically, on the Redis Cold workload, Sync-
points in the Sync-t and Async-t space, and use the same
1 maintained the highest throughput (1.31 GB/s) while pro-
1GB/s active defragmentation daemon. They share the same
viding low (4.5 ms) tail latency under Frag-100. However, the
library and system tunings with FreeBSD. All performance
per-second background scanning of the evacuation daemon
numbers are the mean of three runs.
may fail to improve performance when applications quickly
touch all of their memory in the beginning (e.g. GUPS and
7.1 Non-fragmented (Frag-0) Performance ANN). As a result, there is high performance variation on
Sync-1 vs. Linux. Sync-1 uses the same superpage prepara- GUPS and ANN performance is not improved over other
tion and mapping policy for anonymous memory as Linux. With no fragmentation, Tables 8 and 9 show that they perform similarly. However, there are two notable differences. First, Sync-1 speculatively allocates superpages for growing heaps, which allows it to outperform Linux on canneal and gcc. Their similar speedups on reservation-based systems validate Observation 3. Second, Sync-1 creates file-backed superpages and outperforms Linux on ANN and Graphchi-PR.

Promotion Speed. Under Frag-0, FreeBSD often outperforms Ingens, HawkEye, and their aggressively tuned variants, as shown in Table 8. This validates Observation 2: the issue is that out-of-place promotion has a slower promotion speed. Furthermore, as shown in Table 9, on the Redis Cold workload, Ingens, HawkEye, and their aggressively tuned variants even show a slight degradation compared to Linux-4KB. These systems introduce noticeable interference with running applications when they manage superpages in the background.

Sync-64 mostly outperforms Async-64, because Async-64 zeros pages in the background, which can cause interference. The comparable performance of Sync-64 and Sync-1 shows that less aggressive preparation and mapping policies can achieve results comparable to immediately mapping superpages on first touch.

7.2 Performance Under Fragmentation

Table 9 shows that Linux obtains a much higher tail latency on the Redis Cold workload under Frag-50/100 than Linux-4KB, because its on-allocation defragmentation significantly increases page fault latency. In contrast, FreeBSD does not actively defragment memory, so it generates no latency spikes.

Ingens and HawkEye offload superpage allocation from page faults and compact memory in the background, so they reduce interference and generate few latency spikes on the Redis Cold workload. Furthermore, as shown in Table 8, their speedup over Linux increases as fragmentation increases. However, as Table 8 also shows, HawkEye does not achieve the same speedups as the other systems.

Graphchi-PR. On all applications in Table 8, the Sync-t and Async-t systems all match or outperform Linux. Because Graphchi-PR is an important, real-world task, it is selected to illustrate how the design choices described in Section 5 contribute to the 2.18 speedup of Sync-1 under Frag-100.

Under Frag-100, Async-64 obtains a speedup of 1.68, which is higher than the 1.15 speedup obtained by Ingens* on Graphchi-PR. When Graphchi-PR terminated, Ingens* had a total of 1,926 (mean of 3 runs) free physical superpages, while Async-64 had 11,955. Although they have the same memory bandwidth budget (1GB/s) for active defragmentation, Quicksilver's evacuation daemon defragments memory more efficiently by identifying inactive fragmented superpages. In-place promotions further contribute to the higher speedup of Async-64. When memory is not fragmented, Async-64 obtains a speedup of 0.83, higher than all other non-Quicksilver systems.

Sync-64 obtains an even higher speedup of 2.11. The shared evacuation daemon allows both Async-64 and Sync-64 to allocate a similar number of superpages, but the synchronous, all-at-once preparation, implemented by bulk zeroing in Sync-64, removes the delay of creating superpages. With the same number of superpages, Sync-64 reduces page walk pending cycles by 76%. The highest speedup is obtained by Sync-1, which uses a more aggressive promotion threshold.

7.3 Memory Bloat

All systems suffer less than 1% memory bloat compared to Linux-4KB on the applications shown in Table 8. However, long-running servers may still suffer from memory bloat. When applications frequently allocate and deallocate memory, an aggressive superpage preparation policy may preemptively prepare a superpage and sacrifice free memory for minor address translation benefits, ultimately creating false memory pressure.
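To make the preparation and mapping trade-off concrete, the following user-space sketch illustrates the Sync-t idea: base pages of a 2MB reservation are prepared one at a time until a promotion threshold t is reached, at which point the rest of the reservation is zeroed in bulk and the whole region is mapped as a superpage. The reservation structure and bookkeeping are illustrative assumptions for exposition, not Quicksilver's actual kernel implementation.

/*
 * Illustrative sketch of a Sync-t preparation policy. The data structures
 * are assumptions for exposition, not Quicksilver's kernel code.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SUPERPAGE_SIZE (2UL << 20)                       /* 2MB */
#define BASE_PAGE_SIZE (4UL << 10)                       /* 4KB */
#define PAGES_PER_SP   (SUPERPAGE_SIZE / BASE_PAGE_SIZE) /* 512 */

struct reservation {
    uint8_t *frame;                  /* contiguous 2MB physical frame */
    bool prepared[PAGES_PER_SP];     /* which 4KB pages have been zeroed */
    unsigned touched;                /* distinct base pages faulted on */
    bool mapped_as_superpage;
};

/* Fault on base page idx; t = 1 models Sync-1 and t = 64 models Sync-64. */
static void fault(struct reservation *r, unsigned idx, unsigned t)
{
    if (r->mapped_as_superpage || r->prepared[idx])
        return;
    /* Prepare (zero) only the faulting 4KB page. */
    memset(r->frame + idx * BASE_PAGE_SIZE, 0, BASE_PAGE_SIZE);
    r->prepared[idx] = true;
    if (++r->touched < t)
        return;
    /* Threshold reached: bulk-zero the remainder and map one 2MB page. */
    for (unsigned i = 0; i < PAGES_PER_SP; i++) {
        if (!r->prepared[i]) {
            memset(r->frame + i * BASE_PAGE_SIZE, 0, BASE_PAGE_SIZE);
            r->prepared[i] = true;
        }
    }
    r->mapped_as_superpage = true;   /* one 2MB mapping replaces 512 PTEs */
}

int main(void)
{
    static uint8_t frame[SUPERPAGE_SIZE];
    struct reservation r = { .frame = frame };
    for (unsigned i = 0; i < 64; i++)
        fault(&r, i, 64);            /* with t = 64, the 64th touch promotes */
    printf("mapped as superpage: %s\n", r.mapped_as_superpage ? "yes" : "no");
    return 0;
}

With t = 1, the first touch zeroes and maps the entire 2MB region; larger thresholds defer that cost until the reservation has shown some utilization.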



Frag-0 GUPS Graphchi-PR BlockSVM XSBench ANN canneal freqmine gcc mcf DSjeng XZ
Ingens 0.87 0.58 0.81 0.98 1.00 0.95 0.99 1.00 0.99 0.99 0.96
Ingens* 0.84 0.58 0.79 0.97 0.97 0.92 0.99 1.01 0.96 0.99 0.92
HawkEye 0.28 0.53 0.73 0.88 1.00 0.95 0.99 0.99 0.94 0.86 0.90
HawkEye* 0.88 0.60 0.81 0.98 1.00 0.97 1.00 0.99 0.97 0.99 0.94
FreeBSD 0.96 0.77 0.96 0.99 0.98 1.14 1.00 1.05 0.99 1.00 0.99
Sync-1 0.99 1.07 1.00 1.00 1.07 1.14 0.99 1.05 1.00 1.00 1.00
Sync-64 0.98 1.05 1.00 1.00 1.08 1.14 0.99 1.05 1.00 1.00 1.00
Async-64 0.96 0.83 0.97 0.99 1.08 1.14 1.00 1.05 1.00 1.00 0.99
Async-256 0.96 0.82 0.97 0.99 1.08 1.14 0.99 1.05 0.99 1.00 0.99
Frag-50 GUPS Graphchi-PR BlockSVM XSBench ANN canneal freqmine gcc mcf DSjeng XZ
Ingens 0.98 0.71 0.82 1.01 1.00 0.99 1.00 1.00 1.00 1.00 0.99
Ingens* 1.24 0.73 0.86 1.00 0.98 1.00 0.99 1.02 0.99 1.04 0.97
HawkEye 0.62 0.68 0.77 0.91 1.00 0.96 1.00 0.99 0.97 0.92 0.94
HawkEye* 0.89 0.68 0.80 1.00 0.99 0.99 1.00 0.99 0.99 0.98 0.99
FreeBSD 0.98 0.94 0.89 1.02 0.97 1.01 1.00 1.05 1.01 1.02 1.01
Sync-1 2.04(0.08) 1.37 1.04 1.03 1.04 1.17 1.00 1.05 1.03 1.05 1.05
Sync-64 2.01 1.32 1.06 1.03 1.04 1.18 1.00 1.05 1.03 1.06 1.05
Async-64 2.11 1.06 1.02 1.03 1.03 1.17 1.00 1.05 1.03 1.06 1.04
Async-256 2.11 1.05 1.02 1.03 1.03 1.17 1.00 1.05 1.03 1.06 1.04
Frag-100 GUPS Graphchi-PR BlockSVM XSBench ANN canneal freqmine gcc mcf DSjeng XZ
Ingens 1.02 1.13 0.86 1.04 1.00 1.00 1.00 1.00 1.01 1.01 1.02
Ingens* 1.30 1.15 0.88 1.13 0.99 1.06 1.00 1.02 1.03 1.08 1.06
HawkEye 0.97 1.11 0.85 1.03 1.00 1.01 1.00 1.00 0.99 0.97 1.02
HawkEye* 0.96 1.11 0.84 1.03 1.00 1.01 1.00 0.99 0.99 0.97 1.01
FreeBSD 0.96 1.10 0.85 1.04 0.98 1.05 1.00 1.00 1.00 1.04 1.02
Sync-1 2.35(0.30) 2.18 1.12 1.07 1.04 1.12 1.00 1.05 1.02 1.10 1.14
Sync-64 2.29(0.14) 2.11 1.13 1.07 1.01 1.12 1.00 1.05 1.05 1.11 1.14
Async-64 1.91(0.21) 1.68 1.11 1.06 0.98 1.12 1.00 1.05 1.05 1.11 1.13
Async-256 2.10(0.22) 1.65 1.10 1.08 0.98 1.16 1.00 1.06 1.05 1.08 1.13

Table 8: Performance speedup over Linux under three fragmentation levels. Speedups below 1.00 indicate that the system performs worse than Linux on that application. The normalized standard deviation of runtime is no greater than 5% unless specified in parentheses.

Table 10 compares the memory consumption of four Redis workloads. Among these workloads, Linux bloats memory the most, consistent with previous findings [20]. However, Sync-1 exhibits lower memory consumption than Linux despite similar policies. In fact, it is khugepaged that bloats memory. When a partially deallocated superpage is scanned, it allocates the memory back to recreate a superpage, undermining the application's efforts to free and defragment memory.
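This interaction can be reproduced from user space with standard Linux interfaces. The sketch below is illustrative (the region size and access pattern are arbitrary) and shows the madvise-based opt-out that prevents khugepaged from re-collapsing a partially freed region:

/* Illustrative sketch of the khugepaged interaction described above,
 * using standard Linux madvise flags. */
#include <string.h>
#include <sys/mman.h>

#define SP (2UL << 20)   /* one 2MB superpage */

int main(void)
{
    void *buf = mmap(NULL, SP, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;
    madvise(buf, SP, MADV_HUGEPAGE);   /* make the region eligible for THP */
    memset(buf, 1, SP);                /* touch: likely backed by a huge page */

    /* Free the second half: the kernel splits the huge page and reclaims
     * the constituent 4KB pages. */
    madvise((char *)buf + SP / 2, SP / 2, MADV_DONTNEED);

    /* A later khugepaged scan may allocate the freed half back in order to
     * re-collapse the region into a huge page, undoing the deallocation.
     * Opting the region out of THP prevents that: */
    madvise(buf, SP, MADV_NOHUGEPAGE);
    return 0;
}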
All systems other than Linux limit memory consumption for the first three workloads; they only really differ on Range-XL. HawkEye, FreeBSD, and Async-256 exhibit the lowest memory consumption on Range-XL, whereas the other systems bloat memory by 40–60%. HawkEye stops allocating superpages when the TLB overhead is minor, FreeBSD only promotes fully utilized superpages, and Async-256 has a conservative promotion threshold.

Sync-1 vs. Sync-64. Besides bloating memory, aggressive preparation policies may cause excessive creation of superpages. This is common when many small processes are forked. For example, Table 11 shows what happens in a 9-threaded compilation of the FreeBSD kernel. Sync-1 creates more than 200k superpages, while the less aggressive Sync-64 only creates around 100k. Over half of the superpages created by Sync-1 had less than 13% utilization. Consequently, Sync-1 spends 13.9% more system time preparing them, which outweighs their benefits. On a long-running server, an aggressive policy like Sync-1 could waste both power and memory contiguity by creating underutilized superpages. In contrast, Sync-64 avoids such cases and suffers less performance degradation than Sync-1 in both Table 8 and Table 9. It is therefore preferable for long-running servers.

8 Related Work

Direct segments have been proposed as a supplement to existing page-based address translation for large-memory applications [9, 14, 18]. While they are effective at reducing the cost of address translation, they are limited to systems that allocate nearly all of the system memory to a single application with the same access rights. While these ideas can be generalized to some degree, they ultimately limit the flexibility of the OS to allocate and use physical memory.
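For context, a direct segment translates any virtual address that falls within one contiguous segment using a single base/limit/offset check, bypassing the page walk entirely. The sketch below paraphrases that mechanism as described in [9]; the structure and function names are illustrative:

#include <stdbool.h>
#include <stdint.h>

/* Per-process direct-segment registers (names are illustrative). */
struct direct_segment {
    uint64_t base, limit, offset;
};

/* If base <= va < limit, the physical address is va + offset and no TLB
 * or page-table access is needed; otherwise fall back to paging. */
static bool ds_translate(const struct direct_segment *ds,
                         uint64_t va, uint64_t *pa)
{
    if (va >= ds->base && va < ds->limit) {
        *pa = va + ds->offset;
        return true;
    }
    return false;   /* handled by the conventional page walk */
}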



Cold Linux-4KB Linux Ingens Ingens* HawkEye HawkEye* FreeBSD Sync-1 Sync-64 Async-64 Async-256
Frag-0 1.04(5.6) 1.34(4.1) 1.00(5.9) 0.98(6.3) 1.00(5.9) 1.00(5.8) 1.11(6.1) 1.26(4.5) 1.20(4.8) 1.10(6.0) 1.11(6.0)
Frag-50 1.04(5.7) 0.92(10.2) 0.95(5.9) 0.97(5.9) 1.02(5.9) 1.03(5.8) 1.04(6.2) 1.25(4.5) 1.27(4.7) 1.09(6.0) 1.09(6.0)
Frag-100 1.07(5.6) 0.81(9.9) 0.94(6.1) 0.97(6.1) 1.00(5.8) 1.02(5.8) 0.98(6.5) 1.31(4.5) 1.26(4.6) 1.14(5.9) 1.08(5.9)
Warm Linux-4KB Linux Ingens Ingens* HawkEye HawkEye* FreeBSD Sync-1 Sync-64 Async-64 Async-256
Frag-0 1.06(6.5) 1.32(5.2) 1.23(5.7) 1.21(5.8) 1.03(6.7) 1.06(6.5) 1.30(5.6) 1.32(5.5) 1.31(5.5) 1.31(5.5) 1.30(5.6)
Frag-50 1.07(6.5) 1.17(5.9) 1.09(6.4) 1.19(5.8) 1.03(6.7) 1.05(6.7) 1.18(6.1) 1.32(5.5) 1.32(5.5) 1.31(5.5) 1.31(5.5)
Frag-100 1.07(6.5) 1.16(5.9) 1.01(6.9) 1.09(6.4) 1.05(6.6) 1.07(6.5) 1.10(6.6) 1.33(5.4) 1.34(5.5) 1.33(5.4) 1.31(5.5)

Table 9: Redis throughput (GB/s) and 95th-percentile latency (ms) for the Cold and Warm workloads. Numbers in parentheses are 95th-percentile latencies. The maximum standard deviation is 0.04 GB/s for throughput and 0.57 ms for 95th-percentile latency.

Workload Linux-4KB Linux Ingens Ingens* HawkEye HawkEye* FreeBSD Sync-1 Sync-64 Async-64 Async-256
Del-70 11.6 19.8 11.6 11.7 11.6 11.6 11.6 11.6 11.6 11.6 11.6
Del-50 16.7 19.8 16.8 16.8 16.7 16.9 16.7 16.8 16.8 16.8 16.8
Range-S 14.3 15.6 16.0 15.6 14.9 14.5 14.3 15.6 15.6 15.3 15.1
Range-XL 14.4 30.7 22.7 23.3 15.7 20.6 14.9 23.1 20.9 19.5 15.9

Table 10: Redis memory consumption (GB) of four workloads. Khugepaged further bloats memory in Linux.
Buildkernel real user sys # SP # PF
Sync-1 197.7 1409.4 89.4 200.5 K 5.3 M
Sync-64 196.9 1408.8 78.5 99.6 K 10.3 M
FreeBSD 203.7 1436.7 98.0 36.9 K 30.2 M

Table 11: Runtime (seconds: real, user, sys), number of superpages created (# SP), and number of page faults (# PF) for compiling the FreeBSD 11.2 kernel.

Automatic TLB entry coalescing to increase the effective reach of the TLB has been proposed and implemented [26, 27]. Essentially, a page walk will load multiple 4KB mappings found in the same cache line. If these mappings refer to contiguous pages and have identical access privileges, then they are merged into a single TLB entry. Although TLB entry coalescing occurs automatically in hardware, it nonetheless requires the OS to allocate physically contiguous memory. AMD Ryzen processors perform such coalescing [12].
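The eligibility test that such hardware applies can be stated compactly. The following sketch checks one 64-byte cache line of eight 8-byte PTEs, assuming an illustrative PTE layout (page frame number in the upper bits, permission bits in the low 12); it models the published technique, not any processor's exact logic:

#include <stdbool.h>
#include <stdint.h>

#define PTE_PFN(pte)   ((pte) >> 12)       /* assumed PTE layout */
#define PTE_PERMS(pte) ((pte) & 0xfffULL)

/* Eight PTEs share a 64-byte cache line; they can be covered by a single
 * coalesced TLB entry only if they map physically contiguous pages with
 * identical permission bits. */
static bool coalescable(const uint64_t pte[8])
{
    for (int i = 1; i < 8; i++) {
        if (PTE_PFN(pte[i]) != PTE_PFN(pte[0]) + (uint64_t)i)
            return false;   /* physical pages are not contiguous */
        if (PTE_PERMS(pte[i]) != PTE_PERMS(pte[0]))
            return false;   /* access privileges differ */
    }
    return true;
}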
A large body of work has shown that using superpages can reduce the cost of address translation. Originally, OS support for superpages required the administrator to manually control their use. For example, Linux has long supported persistent huge pages [4]. A huge page pool with a static number of huge pages must be allocated by the administrator before running applications. The persistent huge pages are pinned in memory and can only be used via specific flags to mmap system calls. Superpage support in Windows and OS X is similar to Linux's persistent huge pages [3, 6].
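For example, on Linux a page from the persistent pool can only be obtained by passing explicit flags to mmap. A minimal sketch, assuming the administrator has already sized the pool (e.g., by writing to /proc/sys/vm/nr_hugepages):

#include <stdio.h>
#include <sys/mman.h>

#define SP (2UL << 20)

int main(void)
{
    /* Request one page from the preallocated huge page pool; this fails
     * if the administrator has not reserved any huge pages. */
    void *p = mmap(NULL, SP, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }
    /* ... use p; the huge page is pinned and never paged out ... */
    munmap(p, SP);
    return 0;
}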
To eliminate the need for manual control, FreeBSD, Linux, and many research prototypes have explored transparent superpage support, as described in Section 3. This support has been extensively described and studied [16, 17, 20, 23, 24, 29]. As this transparent support for superpages has become widely available in production OSes, many have argued that effectively handling all of the issues that can arise still requires further improvements to OS memory management support [15–17, 20, 22, 24, 25]. For example, some have worked to improve Linux's superpage management by decreasing memory fragmentation and more carefully allocating physical superpages using Linux's idle page tracking mechanisms [20, 22, 24, 25, 31]. Others have shown that it is beneficial to decrease memory fragmentation and increase the contiguity of physical memory. To achieve this, several efforts have focused on minimizing migration and reducing its performance impact, while still attempting to reduce fragmentation and increase contiguity [7, 8, 22, 25, 31]. The deprecated lumpy reclaim from Linux was also developed to increase contiguity [2]. It reclaims a 2MB superpage by finding an inactive 4KB page and swapping out all dirty 4KB pages inside the 2MB block. Because these dirty 4KB pages may also contain active ones, swapping them out may instead hurt performance. Besides these efforts on anonymous superpages, Zhou et al. augmented FreeBSD to synchronously page in code and pad code sections to create more code superpages [33].

9 Conclusions

This paper has performed a comprehensive analysis of superpage management mechanisms and policies. The explicit enumeration of the five events involved in the life of a superpage provides a framework around which to compare and contrast superpage management policies. This framework and analysis yielded five key observations about superpage management that motivated Quicksilver's innovative design.

Quicksilver achieves the benefits of aggressive superpage allocation, while mitigating the memory bloat and fragmentation issues that arise from underutilized superpages. Both the Sync-1 and Sync-64 variants of Quicksilver are able to match or beat the performance of existing systems in both lightly and heavily fragmented scenarios, in terms of application performance, tail latency, and memory bloat. However, Sync-64 is preferable for long-running servers, as it does not aggressively create underutilized superpages.



References

[1] FreeBSD MADV_FREE heuristics. https://svnweb.freebsd.org/base?view=revision&revision=350463. Viewed 2020-05-31.

[2] Linux's lumpy reclaim. https://lkml.org/lkml/2012/3/28/323. Viewed 2020-05-31.

[3] OS X superpage support. https://www.unix.com/man-page/osx/2/mmap/. Viewed 2020-05-31.

[4] Persistent huge pages in Linux. https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt. Viewed 2020-05-31.

[5] Transparent huge pages in Linux. https://www.kernel.org/doc/Documentation/vm/transhuge.txt. Viewed 2020-05-31.

[6] Windows large page support. https://docs.microsoft.com/en-us/windows/desktop/memory/large-page-support. Viewed 2020-05-31.

[7] Neha Agarwal and Thomas F. Wenisch. Thermostat: Application-transparent page management for two-tiered main memory. In ACM SIGARCH Computer Architecture News, volume 45, pages 631–644. ACM, 2017.

[8] Rachata Ausavarungnirun, Joshua Landgraf, Vance Miller, Saugata Ghose, Jayneel Gandhi, Christopher J. Rossbach, and Onur Mutlu. Mosaic: Enabling application-transparent support for multiple page sizes in throughput processors. ACM SIGOPS Operating Systems Review, 51(1):27–44, 2018.

[9] Arkaprava Basu, Jayneel Gandhi, Jichuan Chang, Mark D. Hill, and Michael M. Swift. Efficient virtual memory for big memory servers. In The 40th Annual International Symposium on Computer Architecture, ISCA '13, Tel-Aviv, Israel, June 23-27, 2013, pages 237–248, 2013.

[10] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. The PARSEC benchmark suite: Characterization and architectural implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, pages 72–81. ACM, 2008.

[11] James Bucek, Klaus-Dieter Lange, et al. SPEC CPU2017: Next-generation compute benchmark. In Companion of the 2018 ACM/SPEC International Conference on Performance Engineering, pages 41–42. ACM, 2018.

[12] Mike Clark. A new x86 core architecture for the next generation of computing. In Hot Chips 28 Symposium (HCS), pages 1–19. IEEE, 2016.

[13] Earl Joseph II. GUPS (giga-updates per second) benchmark. http://www.dgate.org/~brg/files/dis/gups, 2000.

[14] Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, and Michael M. Swift. Efficient memory virtualization: Reducing dimensionality of nested page walks. In 47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2014, Cambridge, United Kingdom, December 13-17, 2014, pages 178–189, 2014.

[15] Fabien Gaud, Baptiste Lepers, Jeremie Decouchant, Justin Funston, Alexandra Fedorova, and Vivien Quéma. Large pages may be harmful on NUMA systems. In 2014 USENIX Annual Technical Conference (USENIX ATC 14), pages 231–242, 2014.

[16] Mel Gorman and Patrick Healy. Supporting superpage allocation without additional hardware support. In Proceedings of the 7th International Symposium on Memory Management, pages 41–50. ACM, 2008.

[17] Mel Gorman and Patrick Healy. Performance characteristics of explicit superpage support. In International Symposium on Computer Architecture, pages 293–310. Springer, 2010.

[18] Vasileios Karakostas, Jayneel Gandhi, Furkan Ayar, Adrián Cristal, Mark D. Hill, Kathryn S. McKinley, Mario Nemirovsky, Michael M. Swift, and Osman S. Unsal. Redundant memory mappings for fast access to large memories. In Proceedings of the 42nd Annual International Symposium on Computer Architecture, Portland, OR, USA, June 13-17, 2015, pages 66–78, 2015.

[19] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. What is Twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web, pages 591–600. ACM, 2010.

[20] Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J. Rossbach, and Emmett Witchel. Coordinated and efficient huge page management with Ingens. In 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2-4, 2016, pages 705–721, 2016.

[21] Aapo Kyrola, Guy E. Blelloch, and Carlos Guestrin. GraphChi: Large-scale graph computation on just a PC. In 10th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2012, Hollywood, CA, USA, October 8-10, 2012, pages 31–46, 2012.



[22] Theodore Michailidis, Alex Delis, and Mema Roussopoulos. MEGA: Overcoming traditional problems with OS huge page management. In Proceedings of the 12th ACM International Conference on Systems and Storage, pages 121–131. ACM, 2019.

[23] Juan Navarro, Sitaram Iyer, Peter Druschel, and Alan L. Cox. Practical, transparent operating system support for superpages. In 5th Symposium on Operating System Design and Implementation (OSDI 2002), Boston, Massachusetts, USA, December 9-11, 2002.

[24] Ashish Panwar, Sorav Bansal, and K. Gopinath. HawkEye: Efficient fine-grained OS support for huge pages. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 347–360. ACM, 2019.

[25] Ashish Panwar, Aravinda Prasad, and K. Gopinath. Making huge pages actually useful. In ACM SIGPLAN Notices, volume 53, pages 679–692. ACM, 2018.

[26] Binh Pham, Abhishek Bhattacharjee, Yasuko Eckert, and Gabriel H. Loh. Increasing TLB reach by exploiting clustering in page translations. In 20th IEEE International Symposium on High Performance Computer Architecture, HPCA 2014, Orlando, FL, USA, February 15-19, 2014, pages 558–567, 2014.

[27] Binh Pham, Viswanathan Vaidyanathan, Aamer Jaleel, and Abhishek Bhattacharjee. CoLT: Coalesced large-reach TLBs. In 45th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2012, Vancouver, BC, Canada, December 1-5, 2012, pages 258–269, 2012.

[28] J. Stamper, A. Niculescu-Mizil, S. Ritter, G. J. Gordon, and K. R. Koedinger. Bridge to Algebra 2008–2009. Challenge data set from KDD Cup, 2010.

[29] Madhusudhan Talluri and Mark D. Hill. Surpassing the TLB performance of superpages with less operating system support. In ASPLOS-VI: Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, California, USA, October 4-7, 1994, pages 171–182, 1994.

[30] John R. Tramm, Andrew R. Siegel, Tanzima Islam, and Martin Schulz. XSBench: The development and verification of a performance abstraction for Monte Carlo reactor analysis.

[31] Zi Yan, Daniel Lustig, David Nellans, and Abhishek Bhattacharjee. Translation Ranger: Operating system support for contiguity-aware TLBs. In Proceedings of the 46th International Symposium on Computer Architecture, ISCA 2019, Phoenix, AZ, USA, June 22-26, 2019, pages 698–710, 2019.

[32] Hsiang-Fu Yu, Cho-Jui Hsieh, Kai-Wei Chang, and Chih-Jen Lin. Large linear classification when data cannot fit in memory. In Twenty-Second International Joint Conference on Artificial Intelligence, 2011.

[33] Yufeng Zhou, Xiaowan Dong, Alan L. Cox, and Sandhya Dwarkadas. On the impact of instruction address translation overhead. In 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 106–116. IEEE, 2019.

[34] Weixi Zhu. Exploring superpage promotion policies for efficient address translation. Master's thesis, Rice University, 2019.

