A Comprehensive Analysis of Superpage Management Mechanisms and Policies
[Figure: The life of a physical superpage, from allocation through preparation (page-fault-driven disk reads, zeroing of free physical pages, asynchronous preparation, and migration) to the creation of a superpage mapping, for fully and partially prepared regions. Annotations: allocation may fail because of memory fragmentation; an incrementally prepared SP can be mapped as 4KB pages; TLB coverage increases when creating SP mappings; 4KB mappings can be created for some or all constituent 4KB pages; a fully prepared region is expected to become a virtual SP in the future, while a partially prepared region may be used individually for other purposes.]
tion benefits. Before the superpage is mapped, the physical memory can still be accessed via 4KB mappings; afterwards, the OS loses the ability to track accesses and modifications at a 4KB granularity. Therefore, an OS may delay the creation of a superpage mapping if only some of the constituent pages are dirty in order to avoid unnecessary future I/O.

A superpage mapping is typically created upon a page fault, on either the initial fault to the memory region or a subsequent fault after the entire superpage has been prepared. However, if the physical superpage preparation is asynchronous, then its superpage mapping may also be created asynchronously. Note that on some architectures, e.g., ARM, any 4KB mappings that were previously created must first be destroyed.

2.4 Superpage Mapping Destruction

Superpage mappings can be destroyed at any time, but must be destroyed whenever any part of the virtual superpage is freed or has its protection changed. After the superpage mapping is destroyed, 4KB mappings must be recreated for any constituent pages that have not been freed.

With superpage mappings, the OS cannot track whether constituent pages are accessed or modified. Therefore, in some scenarios, the OS may choose to preemptively destroy a superpage mapping and substitute 512 4KB mappings for it to enable finer-grained memory management. For example, when a clean superpage is first modified, the OS could choose to destroy the superpage mapping in order to only mark the single modified 4KB page as dirty, potentially reducing future I/O operations. This would require the OS to make a read-only superpage mapping and use the page fault caused by the write access to destroy the mapping and replace it with 4KB mappings. Similarly, the OS could choose to destroy a superpage mapping when under memory pressure to enable swapping pages at a finer granularity.

2.5 Physical Superpage Deallocation

Generally, a physical superpage is deallocated when an application frees some or all of the virtual superpage, when an application terminates, or when the OS needs to reclaim memory. If a superpage mapping exists, it must be destroyed before the physical superpage can be deallocated. Then, either the entire 2MB can be returned to the physical memory allocator or the physical superpage can be "broken" into 4KB pages. If the physical superpage is broken into its constituent 4KB pages, the OS can return a subset of those pages to the physical memory allocator. However, returning only a subset of the constituent pages increases memory fragmentation, decreasing the likelihood of future physical superpage allocations.

Before part or all of a physical superpage is returned to the physical memory allocator, any constituent pages that have been prepared but not freed must be preserved. Preservation typically happens in one of three ways. In-use pages can be kept rather than returned to the allocator, and 4KB mappings can be created to those pages. Alternatively, the in-use pages can be copied to other physical pages, allowing the entire physical superpage to be returned. The last option is to write the in-use pages to secondary storage before returning them.

3 State-of-the-art Designs

This section compares the state-of-the-art designs for transparent superpage management in FreeBSD, Linux, and recent research prototypes (Ingens [20] and HawkEye [24]), with a particular focus on how they manage the events described in the previous section.

3.1 FreeBSD

FreeBSD supports transparent superpages for all kinds of memory, including memory-mapped files and executables. It decouples physical superpage allocation from preparation by using a reservation-based memory allocator [23, 29]. FreeBSD tries to allocate ("reserves") a physical superpage upon the first page fault to any aligned 2MB region. If physical superpages are available, they are allocated for any memory-mapped file exceeding 2MB in size. Anonymous memory always uses superpages if available, regardless of size, as anonymous memory is expected to grow.

Once a physical superpage is allocated for anonymous memory, only the 4KB page that caused the page fault is prepared, and a reservation entry is created to track all of the constituent pages. Any subsequent page fault to that 2MB region skips page allocation and simply prepares one additional 4KB page of the physical superpage. The physical superpage
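The reservation-based scheme described above can be illustrated with a minimal Python sketch. This is a hypothetical model, not FreeBSD's actual implementation (the real reservation allocator, vm_reserv, also handles file-backed memory, reservation reclamation, and fallback to plain 4KB allocation, all of which are omitted here); the names Reservation and ReservationAllocator are invented for illustration.

```python
PAGES_PER_SP = 512          # 2MB superpage / 4KB base pages
SP_SIZE = 2 * 1024 * 1024
PAGE_SIZE = 4096

class Reservation:
    """Tracks which constituent 4KB pages of a reserved 2MB physical
    superpage have been prepared so far."""
    def __init__(self, sp_base):
        self.sp_base = sp_base   # base of the reserved 2MB physical region
        self.prepared = set()    # indices of prepared 4KB constituent pages

    def fully_prepared(self):
        return len(self.prepared) == PAGES_PER_SP

class ReservationAllocator:
    def __init__(self):
        self.reservations = {}   # aligned 2MB virtual base -> Reservation

    def page_fault(self, vaddr):
        """Handle a fault on anonymous memory. Returns True once the 2MB
        region is fully prepared, i.e., eligible for a superpage mapping."""
        region = vaddr & ~(SP_SIZE - 1)          # align down to 2MB
        res = self.reservations.get(region)
        if res is None:
            # First fault to this region: reserve a whole physical
            # superpage (allocation may fail under fragmentation; the
            # 4KB-page fallback path is omitted in this sketch).
            res = self.reservations[region] = Reservation(sp_base=region)
        # Prepare only the single 4KB page that actually faulted.
        res.prepared.add((vaddr - region) // PAGE_SIZE)
        return res.fully_prepared()
```

In this model, allocation happens once per 2MB region while preparation proceeds one 4KB page per fault, mirroring how FreeBSD defers superpage mapping creation until the reservation is fully populated.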
Table 8: Performance speedup over Linux under three fragmentation levels. Red boxes indicate that the system performs worse
than Linux on that application. The normalized standard deviation of runtime is no greater than 5% unless specified in parentheses.
Table 10 compares the memory consumption of four Redis workloads. Among these workloads, Linux bloats memory the most, consistent with previous findings [20]. However, Sync-1 exhibits lower memory consumption than Linux despite similar policies. In fact, it is khugepaged that bloats memory. When a partially deallocated superpage is scanned, it allocates the memory back to recreate a superpage, undermining the application's efforts to free and defragment memory.

All systems other than Linux limit memory consumption for the first three workloads; they only really differ on Range-XL. HawkEye, FreeBSD, and Async-256 exhibit the lowest memory consumption on Range-XL, whereas the other systems bloat memory by 40–60%. HawkEye stops allocating superpages when the TLB overhead is minor, FreeBSD only promotes fully utilized superpages, and Async-256 has a conservative promotion threshold.

Sync-1 vs. Sync-64 Besides bloating memory, aggressive preparation policies may cause excessive creation of superpages. This is common when many small processes get forked. For example, Table 11 shows what happens in a 9-threaded compilation of the FreeBSD kernel. Sync-1 creates more than 200k superpages, while the less aggressive Sync-64 only creates around 100k. Over half of the superpages created by Sync-1 had less than 13% utilization. Consequently, Sync-1 spends 13.9% more system time preparing them, which outweighs their benefits. In a long-running server, using an aggressive policy like Sync-1 could waste both power and memory contiguity by creating underutilized superpages. In contrast, Sync-64 avoids such cases and suffers less performance degradation than Sync-1 in both Table 8 and Table 9. Therefore, it is preferable for long-running servers.

Table 9: Redis throughput (GB/s) and 95th latency (ms) of workloads Cold and Warm. Numbers in parentheses are 95th latencies. The maximum standard deviation is 0.04 GB/s for throughput and 0.57 ms for 95th latency.

Workload   Linux-4KB  Linux  Ingens  Ingens*  HawkEye  HawkEye*  FreeBSD  Sync-1  Sync-64  Async-64  Async-256
Del-70     11.6       19.8   11.6    11.7     11.6     11.6      11.6     11.6    11.6     11.6      11.6
Del-50     16.7       19.8   16.8    16.8     16.7     16.9      16.7     16.8    16.8     16.8      16.8
Range-S    14.3       15.6   16.0    15.6     14.9     14.5      14.3     15.6    15.6     15.3      15.1
Range-XL   14.4       30.7   22.7    23.3     15.7     20.6      14.9     23.1    20.9     19.5      15.9

Table 10: Redis memory consumption (GB) of four workloads. Khugepaged further bloats memory in Linux.

Table 11: Runtime (seconds) and numbers of superpages and page faults of compiling the FreeBSD 11.2 kernel.

Buildkernel   real    user     sys    # SP      # PF
Sync-1        197.7   1409.4   89.4   200.5 K   5.3 M
Sync-64       196.9   1408.8   78.5   99.6 K    10.3 M
FreeBSD       203.7   1436.7   98.0   36.9 K    30.2 M

8 Related Work

Direct segments have been proposed as a supplement to existing page-based address translation for large-memory applications [9, 14, 18]. While they are effective at reducing the cost of address translation, they are limited to systems that allocate nearly all of the system memory to a single application with the same access rights. While these ideas can be generalized to some degree, they ultimately limit the flexibility of the OS to allocate and use physical memory.

Automatic TLB entry coalescing to increase the effective reach of the TLB has been proposed and implemented [26, 27]. Essentially, a page walk will load multiple 4KB mappings found in the same cache line. If these mappings refer to contiguous pages and have identical access privileges, then they are merged into a single TLB entry. Although TLB entry coalescing occurs automatically in hardware, it nonetheless requires the OS to allocate physically contiguous memory. AMD Ryzen processors do such coalescing [12].

A large body of work has shown that using superpages can reduce the cost of address translation. Originally, OS support for superpages required the administrator to manually control the use of superpages. For example, Linux has long supported persistent huge pages [4]. A huge page pool with a static number of huge pages must be allocated by the administrator before running applications. The persistent huge pages are pinned in memory and can only be used via specific flags to mmap system calls. Superpage support in Windows and OS X is similar to Linux persistent huge pages [3, 6].

To eliminate the need for manual control, FreeBSD, Linux, and many research prototypes have explored transparent superpage support, as described in Section 3. This support has been extensively described and studied [16, 17, 20, 23, 24, 29]. As this transparent support for superpages has become widely available in production OSes, many people have argued that effectively handling all of the issues that can arise still requires further improvements to OS memory management support [15–17, 20, 22, 24, 25]. For example, some of these people have worked to improve Linux's superpage management by decreasing memory fragmentation and more carefully allocating physical superpages using Linux's idle page tracking mechanisms [20, 22, 24, 25, 31]. Others have shown that it is beneficial to decrease memory fragmentation and increase the contiguity of physical memory. To achieve this, several efforts have focused on minimizing migration and reducing its performance impact, while still attempting to reduce fragmentation and increase contiguity [7, 8, 22, 25, 31]. The deprecated lumpy reclaim from Linux was also developed to increase contiguity [2]. It reclaims a 2MB superpage by finding an inactive 4KB page and swapping out all dirty 4KB pages inside the 2MB block. Because these dirty 4KB pages may also contain active ones, swapping them out may hurt performance instead. Besides efforts on anonymous superpages, Zhou et al. augmented FreeBSD to synchronously page in code and pad code sections to create more code superpages [33].

9 Conclusions

This paper has performed a comprehensive analysis of superpage management mechanisms and policies. The explicit enumeration of the five events involved in the life of a superpage provides a framework around which to compare and contrast superpage management policies. This framework and analysis yielded five key observations about superpage management that motivated Quicksilver's innovative design.

Quicksilver achieves the benefits of aggressive superpage allocation, while mitigating the memory bloat and fragmentation issues that arise from underutilized superpages. Both the Sync-1 and Sync-64 variants of Quicksilver are able to match or beat the performance of existing systems in both lightly and heavily fragmented scenarios, in terms of application performance, tail latency, and memory bloat. However, Sync-64 is preferable for long-running servers, as it does not aggressively create underutilized superpages.
[10] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. The PARSEC benchmark suite: characterization and architectural implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, pages 72–81. ACM, 2008.

[11] James Bucek, Klaus-Dieter Lange, et al. SPEC CPU2017: next-generation compute benchmark. In Companion of the 2018 ACM/SPEC International Conference on Performance Engineering, pages 41–42. ACM, 2018.

[20] Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J. Rossbach, and Emmett Witchel. Coordinated and efficient huge page management with Ingens. In 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2-4, 2016, pages 705–721, 2016.

[21] Aapo Kyrola, Guy E. Blelloch, and Carlos Guestrin. GraphChi: Large-scale graph computation on just a PC. In 10th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2012, Hollywood, CA, USA, October 8-10, 2012, pages 31–46, 2012.

[28] J. Stamper, A. Niculescu-Mizil, S. Ritter, G. J. Gordon, and K. R. Koedinger. Bridge to Algebra 2008–2009. Challenge data set from KDD Cup, 2010.

[34] Weixi Zhu. Exploring superpage promotion policies for efficient address translation. Master's thesis, Rice University, Houston, TX, 2019.