An Analysis of Performance Evolution of Linux's Core Operations
[Figure 1(a): heatmap of per-test latency changes relative to the 4.0 kernel across Linux kernel versions 3.0-4.20; rows include small/med/big-read, small/med/big-write, mmap*, small/med/big-munmap, fork, big-fork, thrcreate, send & recv*, small-pagefault, and big-pagefault. Figure 1(b): timeline of enabled changes across the same versions (Spectre patch, Meltdown patch, hardened usercopy, randomized SLAB freelist, userspace page fault handling, fault around, TLB layout specification, forced context tracking, hugepages disabled, missing CPU idle states, and the cgroup memory controller). X-axis: Linux kernel versions.]

Figure 1. Main result. (a) shows the latency trend for each test across all kernels, relative to the 4.0 kernel. (We use the 4.0 kernel as a baseline to better highlight performance degradations in later kernels.) (b) shows the timeline of each performance-affecting change. Each value in (a) indicates the percentage change in latency of a test relative to the same test on the 4.0 kernel. Therefore, positive and negative values indicate worse and better performance, respectively. *: for brevity, we show the averaged trend of related tests with extremely similar trends, including the average of all mmap tests, the send and recv test, and the big-send and big-recv test.
(67%) slow down by at least 50% and some by 100% over the last seven years (e.g., mmap, poll & select, send & recv). Performance has also fluctuated significantly over the years.

Drilling down on these performance fluctuations, we observe that a total of 11 root causes are responsible for the major slowdowns. These root causes fall into three categories. First, we observe a growing number of (1) security enhancements and (2) new features, like support for containers and virtualization, being added to the kernel. The effect of this trend on kernel performance manifests itself in two ways: a steady creep of slowdown in core operations, and disruptive slowdowns that persist over many versions (e.g., a more than 100% slowdown that persists across six versions). Such significant impacts are introduced by security enhancements and features, which often demand complex and intrusive modifications to central subsystems of the kernel, such as memory management. The last category of root causes is (3) configuration changes, some of which are simple misconfigurations that resulted in severe slowdowns across kernel operations, impacting many users.
While many forms of slowdowns result from fundamental trade-offs between performance and functionality or security, we find a good number could have been avoided or significantly alleviated with more proactive software engineering practices. For example, frequent lightweight testing can easily catch the simple misconfigurations that resulted in widespread slowdowns. The performance of certain kernel functions would also benefit from more eager optimizations and thorough testing: we found some features significantly degraded the performance of core kernel operations in the initial release; only long after having been introduced were they performance-optimized or disabled due to performance complaints. Furthermore, a few other changes that introduced performance slowdowns simply remained unoptimized—we patched two of the security enhancements to eliminate most of their performance overhead without reducing security guarantees. At the same time, we recognize the difficulty of testing and maintaining a generic OS kernel like Linux, which must support a diverse array of hardware and workloads [27], and evolves extremely quickly [52]. On the other hand, the benefit of being a generic OS kernel is that Linux is highly configurable—8 out of the 11 root causes can be easily disabled by reconfiguring the kernel. This creates the potential for Linux users to actively configure their kernels and significantly improve the performance of their custom workloads.

Out of the many performance-critical parts of the kernel, we chose to study core kernel operations since the significance of their performance is likely elevating; recent advances in fast non-volatile memory and network devices together with the flattened curve of microprocessor speed scaling may shift the bottleneck to core kernel operations. We also chose to focus on how the kernel's software design and implementation impact performance. Prior studies on OS performance mostly focused on comparing the implications of different architectures [5, 12, 56, 64]. Those studies occurred during a time of diverse and fast-changing CPUs, but such CPU architectural heterogeneity has largely disappeared in today's server market. Therefore, we focus on software changes to core OS operations introduced over time, making this the first work to systematically perform a longitudinal study on the performance of core OS operations.

This paper makes the following contributions. The first is a thorough analysis of the performance evolution of Linux's core kernel operations and the root causes for significant performance trends. We also show that it is possible to mitigate the performance overhead of two of the security enhancements. Our second contribution is LEBench, a microbenchmark that is collected from representative workloads together with a regression testing framework capable of evaluating the performance of an array of Linux versions. The benchmark suite and a framework for automatically testing multiple kernel versions are available at https://github.com/LinuxPerfStudy/LEBench. Finally, we evaluate the impact of the 11 identified root causes on three real-world applications and show that they can cause slowdowns as high as 56%, 33%, and 34% on the Redis key-value store, Apache HTTP server, and Nginx web server, respectively.

The rest of the paper is organized as follows. §2 describes LEBench and the methodology we used to drive our analysis. We summarize our main findings in §3 before zooming into each change that caused significant performance fluctuations in §4. §5 discusses the performance implications of core kernel operations on three real-world applications. §6 validates LEBench's results on a different hardware setup. We discuss the challenges of Linux performance tuning in §7, and we survey related work in §9 before concluding.

Application                           Workload                                      % System Time
Apache Spark v2.2.1                   spark-bench's minimal example                 3%
Redis v4.0.8                          redis-benchmark with 100K requests            41%
PostgreSQL v9.5                       pgbench with scale factor 100                 17%
Chromium browser v59.0.3071.109       Watching a video and reading a news article   29%
Build toolchain (make 4.1, gcc 5.3)   Compiling the 4.15.10 Linux kernel            7%

Table 1. Applications and respective workloads used to choose core kernel operations, and each workload's approximate execution time spent in the kernel.

2 Methodology

Our experiments focus on system calls, thread creation, page faults, and context switching. To determine which system calls are frequently exercised, we use our best efforts to select a set of representative application workloads. Table 1 lists the applications and the workloads we ran. We include workloads from three popular server-side applications: Spark, a distributed computing framework, Redis, a key-value store, and PostgreSQL, a relational database. In addition, we include an interactive user workload—web browsing through the Chromium browser—and a software development workload—building the Linux kernel. The chosen workloads exercise the kernel with varying intensities, as shown in Table 1.

We used strace to measure CPU time and the call frequency of each system call used by the workloads. We then selected those system calls which took up the most time across all workloads. wait-related system calls were excluded as their sole purpose is to block the process. Table 2 lists each of the microbenchmarks. Where applicable, we vary the input sizes to account for a variety of usage patterns.

Our process for running each microbenchmark is as follows. Latency is measured by collecting a timestamp immediately before and after invoking a kernel operation. For system calls, the benchmark bypasses the libc wrapper whenever possible to expose the true kernel performance. We repeat
each measurement 10,000 times and report the value calculated using the K-best method with K set to 5 and tolerance set to 5% [59]. To do this, we order all measured values numerically, and select the lowest from the first series of five values where no two adjacent values differ by more than 5%. Selecting lower values filters the interference from background workloads, and setting K to 5 and tolerance to 5% is considered effective in ensuring consistent and accurate results across runs [59].
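The selection rule can be illustrated with the following sketch (an illustration only, not the measurement harness we used):

    /* A minimal illustration of K-best selection (K = 5, tolerance = 5%):
     * sort the samples and return the smallest value of the first run of K
     * values in which adjacent values differ by no more than the tolerance. */
    #include <stdlib.h>

    static int cmp_double(const void *a, const void *b) {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    double k_best(double *samples, size_t n, size_t k, double tolerance) {
        qsort(samples, n, sizeof(double), cmp_double);
        for (size_t i = 0; i + k <= n; i++) {
            int consistent = 1;
            for (size_t j = i; j + 1 < i + k; j++) {
                if (samples[j + 1] > samples[j] * (1.0 + tolerance)) {
                    consistent = 0;
                    break;
                }
            }
            if (consistent)
                return samples[i];   /* lowest value of the first consistent run */
        }
        return samples[0];           /* no consistent run: fall back to the minimum */
    }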
We run the microbenchmarks on each major version of Linux released in the past seven years. This includes versions 3.0 to 3.19 and versions 4.0 to 4.20. For every major version, we select the latest minor version (the y in v.x.y) released before the next major version. This is to avoid testing changes that were backported from a subsequent major version. For example, for major version 3.0, we tested minor version 3.0.7 (released just before the release of 3.1.0) since 3.0.8 may contain some changes that were introduced in 3.1.0. We only tested versions that were released. Linux distributions such as Ubuntu [68] or Arch Linux [33] typically configure the kernel differently from Linux's default configuration. We use Ubuntu's Linux distribution because, at least for web servers, Ubuntu is the most widely used Linux distribution [70]. For example, Netflix hosts its services on Ubuntu kernels [4].

We carried out the tests on an HP DL160 G9 server with a 2.40GHz Intel Xeon E5-2630 v3 processor, 512KB L1 cache, 2MB L2 cache, and 20MB L3 cache. The server also has 128GB of 1866MHz DDR4 memory and a 960GB SSD for persistent storage. To understand how different hardware setups affect the results, we repeated the tests on a Lenovo laptop with an Intel i7 processor and analyze the differences between the two sets of results in §6.

When interpreting results from the microbenchmarks, we treat a flat latency trend as expected and analyze any increase or decrease that may signify a performance regression or improvement, respectively. We extract the causes of these performance changes iteratively: for each test, we first identify the root cause of the most significant performance change; we then disable the root cause and repeat the process to identify the root cause of the next most significant performance change. We repeat this until the difference between the slowest and fastest kernel versions is no more than 10% for the target test.

3 Overview of Results

We overview the results of our analysis in this section and make a few key observations before detailing each root cause in §4.

Figure 1 displays the latency evolution of each test across all kernels, relative to the 4.0 kernel. Only isolated tests experience performance improvements over time; the majority of tests display worsening performance trends and frequently suffer prolonged episodes of severe performance degradation. These episodes result in significant performance fluctuations across multiple core kernel operations. For example, send and recv's performance degraded by 135% from version 3.9 to 3.10, improved by 137% in 3.11, and then degraded again by 150% in 3.12. We also observe that the benchmark's overall performance degraded by 55% going from version 4.13 to 4.15.
The sudden and significant nature of these performance degradations suggests they are caused by intrusive changes to the kernel.

We identified 11 kernel changes that explain the significant performance fluctuations as well as more steady sources of overhead. These are categorized and summarized in Table 3, and their impact on LEBench's performance is overviewed in Figure 2. The 11 changes fall into three categories: security enhancements (4/11), new features (4/11), and configuration changes (3/11).

[Figure 2: rows of bar charts, one per root cause (e.g., Spectre patch, hugepages disabled, cgroup memory controller, Meltdown patch, randomized SLAB freelist, missing CPU idle states, hardened usercopy, TLB layout specification, userspace page fault handling), showing the maximum slowdown of each LEBench test.]

Figure 2. Impact of the 11 identified root causes on the performance of LEBench tests. For every root cause, we display the maximum slowdown across all kernels for each test. Note that the Y-axis scales are different for each row of subgraphs: root causes with the highest possible impacts on LEBench are ordered first.

Overall, Linux users are paying a hefty performance tax for security enhancements. The cost of accommodating security enhancements is high because many of them demand significant changes to the kernel. For example, the mitigation for Meltdown (§4.1.1) requires maintaining a separate page table for userspace and kernel execution, fundamentally modifying some of the core designs of memory management. Similarly, SLAB freelist randomization (§4.1.3) alters dynamic memory allocation behaviours in the kernel.

Interestingly, several security features introduce overhead by attempting to defend against untrusted code in the kernel itself. For example, the hardened usercopy feature (§4.1.4) is used to defend against bugs in kernel code that might copy too much data between userspace and the kernel. However, we note that it can be redundant with other kernel code that already carefully validates pointers. Similarly, SLAB freelist randomization (§4.1.3) attempts to protect against buffer overflow attacks that exploit buggy kernel code. However, the randomization introduces overhead for all uses of the SLAB freelist, including correct kernel code. This suggests a trust issue that is fundamentally rooted in the monolithic kernel design [8].

Similar to the security enhancements, supporting many new features demands complex and intricate changes to the core kernel logic. For example, the control group memory controller feature (§4.2.2), which supports containerization, requires tracking every page allocation and deallocation; in an early unoptimized version, it slowed down the big-pagefault and big-munmap tests by as much as 26% and 81%, respectively.

While the complexity of certain features may increase the difficulty of performance optimization, simple misconfigurations have also significantly impacted kernel performance. For example, mistakenly turning on forced context tracking (§4.3.1) caused all the benchmark tests to slow down by an average of 50%.

Two aforementioned changes (forced context tracking and the control group memory controller) were significantly optimized or disabled entirely reactively, i.e., only after performance degradations were observed in released kernels, instead of proactively. Forced context tracking (§4.3.1) was only disabled after plaguing five versions for more than 11 months, and has become a well-known cause of performance troubles for Linux users [46, 48, 57]; the control group memory controller (§4.2.2) remained unoptimized for 6.5 years, and continues to cause significant performance degradation in real workloads [49, 69]. Both cases are clearly captured by LEBench, suggesting that more frequent and thorough testing, as well as more proactive performance optimizations, would have avoided these impacts on users.

As another example where Linux performance would benefit from more proactive optimization, we were able to easily optimize two other security enhancements, namely avoiding indirect jump speculation (§4.1.2) and hardened usercopy (§4.1.4), largely eliminating their slowdowns without sacrificing security guarantees.

Finally, with little effort, Linux users can avoid most of the performance degradation from the identified root causes by actively reconfiguring their systems. In fact, 8 out of 11 root causes can be disabled through configuration, and the other 3 can be disabled through simple patches. Users that do not require the new functionalities or security guarantees can disable them to avoid paying unnecessary performance penalties. In addition, our findings also point to the fact that Linux is shipped with static configurations that cannot adapt
4.1.2 Avoiding Indirect Branch Speculation

Introduced in version 4.14, the Retpoline patch [67] mitigates the second variant (V2) of the Spectre attacks [35] by bypassing the processor's speculative execution of indirect branches. The patch slows down half of the tests by more than 10% and causes severe degradation to the select, poll, and epoll tests, resulting in an average slowdown of 66%. In particular, poll and epoll slow down by 89% and 72%, respectively.

An indirect branch is a jump or call instruction whose target is not determined statically—it is only resolved at runtime. An example is jmp [rax], which jumps to an address that is stored in the rax register. Modern processors use the indirect branch predictor to speculatively execute the instructions at the predicted target. However, Intel and AMD processors do not completely eliminate all side effects of an incorrect speculation, e.g., by leaving data in the cache as described in §4.1.1 [1, 22]. Attackers can exploit such "side-channels" by carefully polluting the indirect branch target history, hence tricking the processor into speculatively executing the desired branch target.

Retpoline mitigates Spectre v2 by replacing each indirect branch with a sequence of instructions—called a "thunk"—during compilation. Figure 4 shows the thunk that replaces jmp [rax]. The thunk starts with a call, which pushes the return address (line 4) onto the stack, before jumping to line 7. Line 7, however, replaces the return address with the original jump destination, stored in rax, by moving it onto the stack. This causes the ret at line 8 to jump to the original jump destination, [rax], instead of line 4. Thus, the thunk achieves the same behavior as jmp [rax] without using indirect branches.

A careful reader would have noticed that even without lines 4 and 5, the speculative path would still fall into an infinite loop at lines 7 and 8. What makes lines 4–5 necessary is that repeatedly executing line 8, even speculatively, greatly perturbs a separate return address speculator, resulting in high overhead. In addition, the pause instruction at line 4 provides a hint to the CPU that the two lines are a spin-loop, allowing the CPU to optimize for power consumption [2, 24].
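The sketch below reconstructs such a thunk from the description above (the line numbers in the comments match the in-text references; the label names are ours and this is a reconstruction, not the kernel's exact retpoline code):

    /* Sketch of a thunk replacing "jmp [rax]"; rax holds the original target. */
    __asm__(
        "thunk:               \n"  /* line 2 */
        "    call load_target \n"  /* line 3: push the address of line 4, jump to line 7 */
        "spec_trap:           \n"
        "    pause            \n"  /* line 4: lines 4-5 form a spin-loop that traps speculation */
        "    jmp spec_trap    \n"  /* line 5 */
        "load_target:         \n"  /* line 6 */
        "    mov %rax, (%rsp) \n"  /* line 7: overwrite the saved return address with the real target */
        "    ret              \n"  /* line 8: 'return' to [rax] without an indirect branch */
    );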
The slowdown caused by Retpoline is proportional to the number of indirect jumps and calls in the test. The penalty for each such instruction is similar to that of a branch misprediction. We further investigate the effects of Retpoline on the select test. Without Retpoline, the select test executes an average of 31 indirect branches, all of which are indirect calls; the misprediction rate of these is less than 1 in 30,000. Further analysis shows that 95% of these indirect calls are from just three program locations that use function pointers to invoke the handler of a specific resource type. Figure 5 shows one of the program locations, which is also on the critical path of poll and epoll. The poll function pointer is invoked repeatedly inside select's main loop, and the actual target is decided by the file type (a socket, in our case).

With Retpoline, all of the 31 indirect branches executed by select are replaced with the thunk, and the ret in the thunk always causes a return address misprediction that has 30–35 cycles of penalty, resulting in a total slowdown of 68% for the test.

We alleviated the performance degradation by turning each indirect call into a switch statement, i.e., a direct conditional branch, which Spectre-V2 cannot exploit. Figure 6 shows our patch on the program location shown in Figure 5. It directly invokes the specific target after matching for the type of the resource. It reduced select's slowdown from 68% to 5.7%, and big-select's slowdown from 55% to 2.5%, respectively. This patch also reduces Retpoline's overhead on poll and epoll.
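The sketch below illustrates the shape of this change (the types and handler names are placeholders; it is not the kernel's code or our exact patch):

    /* Before: the poll handler is reached through a function pointer, i.e.,
     * an indirect call that Retpoline turns into a costly thunk. */
    typedef int (*poll_fn)(void *file);

    static int sock_poll(void *file) { return 0; }   /* stand-ins for real handlers */
    static int pipe_poll(void *file) { return 0; }

    static int do_poll_indirect(poll_fn fn, void *file) {
        return fn(file);                              /* indirect call */
    }

    /* After: match on the resource type and invoke the target directly;
     * only unknown types fall back to the indirect call. */
    enum ftype { FT_SOCKET, FT_PIPE, FT_OTHER };

    static int do_poll_direct(enum ftype type, poll_fn fn, void *file) {
        switch (type) {
        case FT_SOCKET: return sock_poll(file);       /* direct, conditional branches */
        case FT_PIPE:   return pipe_poll(file);
        default:        return fn(file);
        }
    }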
4.1.3 SLAB Freelist Randomization

Introduced since version 4.7, SLAB freelist randomization increases the difficulty of exploiting buffer overflow bugs in the kernel [66]. A SLAB is a chunk of contiguous memory for storing equally-sized objects [9, 39]. It is used by the kernel to allocate kernel objects. A group of SLABs for a particular type or size-class is called a cache. For example, fork uses the kernel's SLAB allocator to allocate mm_structs from SLABs in the mm_struct cache. The allocator keeps track of free spaces for objects in a SLAB using a "freelist," which is a linked list connecting adjacent object spaces in memory. As a result, objects allocated one after another will be adjacent in memory. This predictability can be exploited by an attacker to perform a buffer overflow attack. Oberheide [28] describes an example of an attack that has occurred in practice.

The SLAB freelist randomization feature randomizes the order of free spaces for objects in a SLAB's freelist such that consecutive objects in the list are not reliably adjacent in memory. During initialization, the feature generates an array of random numbers for each cache. Then for every new SLAB, the freelist is constructed in the order of the corresponding random number array.
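The following sketch illustrates this construction for a single SLAB (a simplification, not the kernel's implementation; it assumes objects are linked through their first word and at most 64 objects per SLAB):

    #include <stddef.h>
    #include <stdlib.h>

    /* Build a SLAB's freelist in a shuffled order so that consecutively
     * allocated objects are no longer adjacent in memory. */
    void *build_random_freelist(char *slab, size_t obj_size, size_t n) {
        size_t order[64];
        for (size_t i = 0; i < n; i++)
            order[i] = i;
        for (size_t i = n - 1; i > 0; i--) {      /* shuffle: stands in for the per-cache */
            size_t j = (size_t)rand() % (i + 1);  /* precomputed random number array      */
            size_t tmp = order[i];
            order[i] = order[j];
            order[j] = tmp;
        }
        void *head = NULL;
        for (size_t i = 0; i < n; i++) {          /* link free slots in the shuffled order */
            void *obj = slab + order[i] * obj_size;
            *(void **)obj = head;
            head = obj;
        }
        return head;                              /* allocations pop from this list */
    }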
This patch resulted in notable overhead on tests that sequentially access a large amount of memory. It caused big-fork to slow down by 37%, and the set of tests—big-select, big-poll, and big-epoll—to slow down by an average of 41%. The slowdown comes from two sources. The first is the time spent randomizing the freelist during its initialization. In particular, big-fork spent roughly 6% of its execution time just randomizing the freelist since it needs to allocate several SLABs for the new process. The second and more significant source of slowdown is poor locality caused by turning sequential object access patterns into random access patterns. For example, big-fork's L3 cache misses increased by around 13%.

4.1.4 Hardened Usercopy

Introduced since version 4.8, the hardened usercopy patch validates kernel pointers used when copying data between userspace and the kernel [26]. Without this patch, bugs in the kernel could be exploited to either cause buffer overflow attacks when too much data is copied from userspace, or to leak data when too much is copied to userspace. This patch protects against such bugs by performing a series of sanity checks on kernel pointers during every copy operation. However, this adds unnecessary overhead to kernel code that already validates pointers.

For example, consider select, which takes a set of file descriptors for every type of event the user wants to watch for. When invoked, the kernel copies the set from userspace, modifies it to indicate which events occurred, and then copies the set back to userspace. During this operation, the kernel already checks that kernel memory was allocated correctly and only copies as many bytes as were allocated. However, the hardened usercopy patch adds several redundant sanity checks to this process. These include checking that i) the kernel pointer is not null, ii) the kernel region involved does not overlap the text segment, and iii) the object's size does not exceed the size limit of its SLAB if it is allocated using the SLAB allocator. To evaluate the cost of these redundant checks, we carefully patched the kernel to remove them.
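A simplified sketch of these checks follows (the helper functions are placeholders for the corresponding kernel queries, not real kernel APIs):

    #include <stddef.h>

    int overlaps_kernel_text(const void *p, size_t len);   /* placeholder helpers */
    int is_slab_object(const void *p);
    size_t slab_object_size(const void *p);

    /* Sanity checks performed on the kernel-side pointer of every user copy. */
    int hardened_usercopy_check(const void *kptr, size_t len) {
        if (kptr == NULL)
            return -1;                                      /* (i) null kernel pointer */
        if (overlaps_kernel_text(kptr, len))
            return -1;                                      /* (ii) overlaps the text segment */
        if (is_slab_object(kptr) && len > slab_object_size(kptr))
            return -1;                                      /* (iii) exceeds the SLAB object size */
        return 0;                                           /* copy may proceed */
    }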
The cost of hardened usercopy depends on the type of data being copied and the amount. For select, the cost of checking adds 30ns of overhead. This slows down the test by a maximum of 18%. poll operates similarly to select and also has to copy file descriptors and events to and from userspace. Interestingly, epoll does not experience the same degree of slowdown since it copies less data; the list of events to watch for is kept in the kernel, and only the events which have occurred are copied to userspace. In contrast, the read tests copy one page to userspace at a time, but the page does not belong to a SLAB. As a result, only basic checks such as checking for a valid address are performed, costing only around 5ns for each page copied. This source of overhead is not significant even for big-read, which copies 10,000 pages.

4.2 New Features

Next we describe the root causes that are new kernel features. One of them, namely fault around (§4.2.1), is, in fact, an optimization. It improves performance for workloads with certain characteristics at the cost of others. Disabling transparent huge pages (§4.2.3) can also improve performance for certain workloads. However, these features also impose non-trivial overhead on LEBench's microbenchmarks. The other two features are new kernel functionalities mostly intended for virtualization or containerization needs.

4.2.1 Fault Around

Introduced in version 3.15, the fault around feature ("fault-around") is an optimization that aims to reduce the number of minor page faults [34]. A minor page fault occurs when no page table entry (PTE) exists for the required page, but
the page is resident in the page cache. On a page fault, fault-around not only attempts to establish the mapping for the faulting page, but also for the surrounding pages. Assuming the workload has good locality and several of the pages adjacent to the required page are resident in the page cache, fault-around will reduce the number of subsequent minor page faults. However, if these assumptions do not hold, fault-around can introduce overhead. For example, Roselli et al. studied several file system workloads and found that larger files tend to be accessed randomly, which renders prefetching unhelpful [63].

The big-pagefault test experiences a 54% slowdown as a result of fault-around. big-pagefault triggers a page fault by accessing a single page within a larger memory-mapped region. When handling this page fault, the fault-around feature searches the page cache for surrounding pages and establishes their mappings, leading to additional overhead.
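The mechanism can be sketched as follows (the window size and helper functions are illustrative placeholders, not the kernel's; bounds checks are omitted):

    #define PAGE_SIZE          4096UL
    #define FAULT_AROUND_PAGES 16UL

    int page_in_page_cache(unsigned long addr);   /* placeholder helpers */
    int pte_present(unsigned long addr);
    void map_page(unsigned long addr);

    /* On a minor fault, also map the surrounding pages that are already
     * resident in the page cache. */
    void handle_minor_fault(unsigned long fault_addr) {
        unsigned long base  = fault_addr & ~(PAGE_SIZE - 1);
        unsigned long start = base - (FAULT_AROUND_PAGES / 2) * PAGE_SIZE;
        for (unsigned long i = 0; i < FAULT_AROUND_PAGES; i++) {
            unsigned long addr = start + i * PAGE_SIZE;
            /* the extra work: probing the page cache for neighbouring pages */
            if (addr == base || (!pte_present(addr) && page_in_page_cache(addr)))
                map_page(addr);
        }
    }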
4.2.2 Control Groups Memory Controller

Introduced in version 2.6, the control group (cgroup) memory controller records and limits the memory usage of different control groups [42]. Control groups allow a user to isolate the resource usage of different groups of processes. They are a building block of containerization technologies like Docker [16] and Linux Containers (LXC) [43]. This feature is tightly coupled with the kernel's core memory controller so it can credit every page deallocation or debit every page allocation to a certain cgroup. It introduces overhead on tests that heavily exercise the kernel memory controller, even though they do not use the cgroup feature.

The munmap tests experienced the most significant slowdown due to the added overhead during page deallocation. In particular, big-munmap and med-munmap experienced an 81% and 48% slowdown, respectively, in kernels earlier than version 3.17.

Interestingly, the kernel developers only began to optimize cgroup's overhead since version 3.17, 6.5 years after cgroups was first introduced [29]. During munmap, the memory controller needs to "uncharge" the memory usage from the cgroup. Before version 3.17, the uncharging was done once for every page that was deallocated. It also required synchronization to keep the uncharging and the actual page deallocation atomic. Since version 3.17, uncharging is batched, i.e., it is done only once for all the removed mappings. It also occurs at a later stage when the mappings are invalidated from the TLB, so it no longer requires synchronization. Consequently, after kernel version 3.17, the slowdowns of big-munmap and med-munmap are reduced to 9% and 5%, respectively.
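The difference can be sketched as follows (a simplified illustration with placeholder types and helpers, not the kernel's code):

    #include <stddef.h>

    struct page;
    struct cgroup;
    void lock(struct cgroup *cg);
    void unlock(struct cgroup *cg);
    void uncharge(struct cgroup *cg, size_t npages);
    void free_page(struct page *p);
    void flush_tlb_range(struct page **pages, size_t n);

    /* Before v3.17: uncharge each page as it is deallocated, holding a lock
     * so the uncharge and the deallocation stay atomic. */
    void unmap_region_old(struct page **pages, size_t n, struct cgroup *cg) {
        for (size_t i = 0; i < n; i++) {
            lock(cg);
            uncharge(cg, 1);              /* one "uncharge" per page */
            free_page(pages[i]);
            unlock(cg);
        }
    }

    /* Since v3.17: free the pages first, then uncharge once for the whole
     * batch after the mappings are invalidated from the TLB (no lock needed). */
    void unmap_region_new(struct page **pages, size_t n, struct cgroup *cg) {
        for (size_t i = 0; i < n; i++)
            free_page(pages[i]);
        flush_tlb_range(pages, n);
        uncharge(cg, n);                  /* a single batched "uncharge" */
    }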
In contrast, the memory controller only adds 2.7% of overhead for the page fault tests. When handling a page fault, the memory controller first ensures that the cgroup's memory usage will stay within its limit following the page allocation, then "charges" the cgroup for the page. Here we do not see as significant a slowdown as in the case of munmap, because during each page fault, only one page is "charged"—the memory controller's overhead is still dwarfed by the cost of handling the page fault itself. In contrast, munmap often unmaps multiple pages together, aggregating the cost of the inefficient "uncharging." Note that mmap is generally unaffected by this change because each mmapped page is allocated on demand when it is later accessed. In addition, the read and write tests are not affected since they use pre-allocated pages from the page cache.

4.2.3 Transparent Huge Pages

Enabled from version 3.13 to 4.6, and again from 4.8 to 4.11, the transparent huge pages (THP) feature automatically adjusts the default page size [38]. It allocates 2MB pages (huge pages), and it also has a background thread that periodically promotes memory regions initially allocated with base pages (4KB) into huge pages. Under memory pressure, THP may decide to fall back to 4KB pages or free up more memory through compaction. THP can decrease the page table size and reduce the number of page faults; it also increases "TLB reach," so the number of TLB misses is reduced.

However, THP can also negatively impact performance. It could lead to internal fragmentation within huge pages. (Unlike FreeBSD [53], Linux could promote a 2MB region that has unallocated base pages into using a huge page [36].) Furthermore, the background thread can also introduce overhead [36]. Given this trade-off, kernel developers have been going back and forth on whether to enable THP by default. From version 4.8 to the present, THP is disabled by default.

In general, THP has positive effects on tests that access a large amount of memory. In particular, huge-read slows down by as much as 83% on versions with THP disabled. It is worth noting that THP also diminishes the slowdowns caused by other root causes. For example, THP reduces the impact of Kernel Page Table Isolation (§4.1.1), since KPTI adds overhead on every kernel trap whereas THP reduces the number of page faults.

4.2.4 Userspace Page Fault Handling

Enabled in versions 4.6, 4.8, and later versions, userspace page fault handling allows a userspace process to handle page faults for a specified memory region [30]. This is useful for a userspace virtual machine monitor (VMM) to better manage memory. A VMM could inform the kernel to deliver page faults within the guest's memory range to the VMM. One use of this is for virtual machine migration so that the pages can be migrated on-demand. When the guest VM page faults, the fault will be delivered to the VMM, where the VMM can then communicate with a remote VMM to fetch the page.

Overall, userspace page fault handling introduced negligible overhead except for the big-fork test which was slowed down by 4% on average. This is because fork must check
each memory region in the parent process for associated userspace page fault handling information and copy this to the child if necessary. When the parent has a large number of pages that are mapped, this check becomes expensive.

4.3 Configuration Changes

Three of the root causes are non-optimal configurations. Forced context tracking (§4.3.1) is a misconfiguration by the kernel and Ubuntu developers and causes the biggest slowdown in this category. The other two are the consequences of older kernel versions lacking specifications for the newer hardware used in our experiments, thus leading to non-optimal decisions being made. While this reflects a limitation of our methodology (i.e., running old kernels on new hardware), these misconfigurations could impact real Linux users. First, kernel patches on hardware specifications may not be released in a timely manner: the release of the (simple) patch that specifies the size of the second-level TLB did not take place until six months after the release of the Haswell processors, during which time users of the new hardware could suffer a 50% slowdown on certain workloads (§4.3.2). This misconfiguration could impact any modern processor with a second level of TLB. Furthermore, hardware specifications for the popular family of Haswell processors are not back-ported to older kernel versions that still claim to be actively supported (§4.3.3).

4.3.1 Forced Context Tracking

Released into the kernel by mistake in versions 3.10 and 3.12–15, forced context tracking (FCT) is a debugging feature that was used in the development of another feature, reduced scheduling-clock ticks [40]. Nonetheless, FCT was enabled in several Ubuntu release kernels due to misconfigurations. This caused a minimum of approximately 200–300ns overhead in every trip to and from the kernel, thus significantly affecting all of our tests (see Figure 1). On average, FCT slows down each of the 28 tests by 50%, out of which 7 slow down by more than 100% and another 8 by 25–100%.

The reduced scheduling-clock ticks (RSCT) feature allows the kernel to disable the delivery of timer interrupts to idle CPU cores or cores running only one task. This reduces power consumption for idle cores and interruptions for cores running a single compute-intensive task. However, work normally done during these timer interrupts must now be done during other user-kernel mode transitions like system calls. Such work is referred to as context tracking.

Context tracking involves two tasks—CPU usage tracking and participation in the read-copy update (RCU) algorithm. Tracking how much time is spent in userspace and the kernel is usually performed by counting the number of timer interrupts. Without timer interrupts, this must be done on other kernel entries and exits instead. Context tracking also participates in RCU, a kernel subsystem that provides lockless synchronization. Conceptually, under RCU, each object is immutable; when writing to the object, it is copied and updated, resulting in a new version of the object. Because the write does not perturb existing reads, it can be carried out at any time. However, deleting the old version of the object can only be done when it is no longer being read. Therefore, each write also sets a callback to be invoked later to delete the old version of the object when it is safe to do so. The readers cooperate by actively informing RCU when they start and finish reading an object. Normally, RCU checks for ready callbacks and invokes them at each timer interrupt; but under RSCT, this has to be performed at other kernel entries and exits.

FCT performs context tracking on every user-kernel mode transition for every core, even on the ones without RSCT enabled. FCT was initially introduced by the Linux developers to test context tracking before RSCT was ready, and is automatically enabled with RSCT. The Ubuntu developers mistakenly enabled RSCT in a release version, hence inadvertently enabling FCT. When this was reported as a performance problem [14], the Ubuntu developers disabled RSCT. However, this still failed to disable FCT, as the Linux developers accidentally left FCT enabled even after RSCT was working. This was only fixed in later Ubuntu distributions as a result of another bug report [17], 11 months after the initial misconfiguration.

4.3.2 TLB Layout Change

Introduced in kernel version 3.14, this patch improves performance by enabling Linux to recognize the size of the second-level TLB (STLB) on newer Intel processors. Knowing the TLB's size is important for deciding how to invalidate TLB entries during munmap. There are two options: one is to shoot down (i.e., invalidate) individual entries, and the other is to flush the entire TLB. Shoot-down should be used when the number of mappings to remove is small relative to the TLB's capacity, whereas TLB flushing is better when the number of entries to invalidate is comparable to the TLB's capacity.

Before this patch was introduced, Linux used the size of the first-level data and instruction TLB (64 entries on our test machines) as the TLB's size, and was not aware of the larger second-level TLB with 1024 entries. This resulted in incorrect TLB invalidation decisions: for a TLB capacity of 64, Linux calculates the flushing threshold to be 64/64 = 1. This means that, without the patch, invalidating more than just one entry will cause a full TLB flush. As a result, the med-munmap test, which removes 10 entries, suffers as much as a 50% slowdown on a subsequent read of a memory-mapped file of 1024 pages due to the increased TLB misses. With the patch, the TLB flush threshold is increased to 16 (1024/64) on our processor, so med-munmap no longer induces a full flush. However, this patch was only released six months after the earliest version of the Haswell family of processors was released. Note that small-munmap and big-munmap were
not affected because the kernel still made the right decision by invalidating a single entry in small-munmap and flushing the entire TLB in big-munmap.
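The decision can be sketched as follows (a simplified illustration; the helper functions are placeholders): with only the first-level size of 64 entries the threshold is 64/64 = 1, while recognizing the 1024-entry STLB raises it to 1024/64 = 16.

    void invalidate_tlb_entry(unsigned long addr);   /* placeholder helpers */
    void flush_entire_tlb(void);

    /* Choose between per-entry shoot-down and a full TLB flush when unmapping. */
    void flush_after_munmap(unsigned long start, unsigned long npages,
                            unsigned long page_size, unsigned long tlb_entries) {
        unsigned long threshold = tlb_entries / 64;  /* 64 -> 1 without the patch, 1024 -> 16 with it */
        if (npages > threshold) {
            flush_entire_tlb();                      /* worthwhile only when many entries must go */
        } else {
            for (unsigned long i = 0; i < npages; i++)
                invalidate_tlb_entry(start + i * page_size);
        }
    }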
4.3.3 CPU Idle Power-State Support

Introduced in kernel version 3.9, this patch specifies the fine-grained idle power saving modes of the Intel processor with [...] patch speeds up LEBench by 21%, with the CPU intensive select test achieving the most significant speedup of 31%.

While this patch was released in advance of the release of the Xeon processors, it was not backported to the LTS kernel lines which were still supported at the time, including 3.0, 3.2, and 3.4. This means that in order to achieve the best performance for newer hardware, a user might be forced to adopt the newer kernel lines at the cost of potentially unstable features.

[Figure 7: latency and throughput trends across Linux kernel versions 3.0-4.19; the panels visible in the extraction include Redis SPOP, RPUSH, and RPOP. X-axis: Linux kernel versions.]

Figure 7. Latency and throughput trends of the Apache Benchmark and selected Redis Benchmark tests (3 write tests and 2 read tests with highest system times).
5 Macrobenchmark

To understand how the 11 identified root causes affect real-world workloads, we evaluate the Redis key-value store [62], Apache HTTP Server [7], and Nginx web server [55],3 across the Linux kernel versions on which we tested LEBench. Redis' workload was used to build LEBench, while workloads from the other two applications serve as validation. We use Redis' and Apache's built-in benchmarks—Redis Benchmark [61] and ApacheBench [6]—respectively; we also use ApacheBench to evaluate Nginx. Each benchmark is configured to issue 100,000 requests through 50 (for Redis) or 100 (for Apache and Nginx) concurrent connections.

All three applications spend significant time in the kernel and exhibit performance trends (shown in Figure 7) similar to those observed from LEBench. For each test, the throughput trend tends to be the inverse of the latency trend. For brevity, we only display Redis Benchmark's three most kernel-intensive write tests, responsible for inserting (RPUSH) or deleting (SPOP, RPOP) records from the key-value store, and the two most kernel-intensive read tests, responsible for returning the value of a key (GET) and returning a range of values for a key (LRANGE) [60].

We disable the 11 root causes on the kernels and evaluate their impact on the applications. Overall, disabling the 11 root causes brings significant speedup for all three applications, improving the performance of Redis, Apache, and Nginx by a maximum of 56%, 33%, and 34%, and an average of 19%, 6.5%, and 10%, respectively, across all kernels. Four changes—forced context tracking (§4.3.1), kernel page table isolation (§4.1.1), missing CPU idle power states (§4.3.3), and avoiding indirect jump speculation (§4.1.2)—account for 88% of the slowdown across all applications. This is not surprising given that these four changes also resulted in the most significant and widespread impact on LEBench tests, as evident in Figure 2. The rest of the performance-impacting changes create more tolerable and steady sources of overhead: across all kernels, they cause an average combined slowdown of 4.2% for Redis, and 3.2% for Apache and Nginx; this observation is again consistent with the results obtained from LEBench, where these changes cause an average slowdown of 2.6% across the tests. It is worth noting that these changes could cause more significant individual fluctuations—if we only count the worst kernels, on average, each change can cause as much as a 5.8%, 11.5%, and 12.2% slowdown for Redis, Apache, and Nginx, respectively.

3 In 2019, Redis is the most popular key value store [15]. Apache and Nginx rank first and third in web server market share, respectively, and together account for more than half of all market share [54].
[Figure 8: two heatmaps of per-test latency changes across kernel versions 3.7-4.15: (a) % change in latency relative to v4.0 on the E5-2630 v3, and (b) % change in latency relative to v4.0 on the i7-4810MQ. X-axis: Linux kernel versions.]

Figure 8. Comparing the results of LEBench on two machines. For brevity, we only show results after v3.7 and before v4.15.
average user, it may be more economical to pay for a Red Hat Enterprise Linux (RHEL) licence, or they may have to compensate for the lack of performance tuning by investing in hardware (i.e., purchasing more powerful servers or scaling their server pool) to make up for slower kernels. All of these facts point to the importance of kernel performance, whose optimization remains a difficult challenge.

8 Limitations

We restrict the scope of our study due to practical limitations. First, while LEBench tests are obtained from profiling a set of popular workloads, we omitted many other types of popular Linux workloads, for example, HPC or virtualization workloads [3]. Second, we only used two machine setups in our study, and both use Intel processors. A more comprehensive study should sample other types of processor architectures, for example, those used in embedded devices on which Linux is widely deployed. Finally, our study focuses on Linux, and our results may not be general to other OSes.

In 2007, developers at Intel introduced a Linux regression testing framework using a suite of micro- and macrobenchmarks [21], which caught a number of performance regressions in release candidates [13]. In contrast, our study focuses on performance changes in stable versions that persist over many versions, which are more likely to impact real users.

Additional studies have analyzed other aspects of OS performance. Boyd-Wickizer et al. [10] analyzed Linux's scalability and found that the traditional kernel design can be adapted to scale without architectural changes. Lozi et al. [45] discovered Linux kernel bugs that resulted in leaving cores idle even when runnable tasks exist. Pillai et al. [58] discovered Linux file systems often trade crash consistency guarantees for good performance.

Finally, Heiser and Elphinstone [20] examined the evolution of the L4 microkernel for the past 20 years and found that many design and implementation choices have been phased out because they either are too complex or inflexible, or complicate verification.
References

[1] Advanced Micro Devices. 2018. "Speculative Store Bypass" Vulnerability Mitigations for AMD Platforms. https://www.amd.com/en/corporate/security-updates.
[2] Advanced Micro Devices. 2019. AMD64 Architecture Programmer's Manual. Vol. 3. Chapter 3, 262.
[3] Al Gillen and Gary Chen. 2011. The Value of Linux in Today's Fast-Changing Computing Environments.
[4] Amazon Web Services. 2017. AWS re:Invent 2017: How Netflix Tunes Amazon EC2 Instances for Performance (CMP325). https://www.youtube.com/watch?v=89fYOo1V2pA.
[5] Thomas E. Anderson, Henry M. Levy, Brian N. Bershad, and Edward D. Lazowska. 1991. The Interaction of Architecture and Operating System Design. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IV). ACM, 108–120.
[6] Apache. 2018. ab - Apache HTTP Server Benchmarking Tool. https://httpd.apache.org/docs/2.4/programs/ab.html.
[7] Apache. 2018. Apache HTTP Server Project. https://httpd.apache.org/.
[8] Simon Biggs, Damon Lee, and Gernot Heiser. 2018. The Jury Is In: Monolithic OS Design Is Flawed: Microkernel-based Designs Improve Security. In Proceedings of the 9th Asia-Pacific Workshop on Systems (APSys '18). ACM, Article 16, 7 pages.
[9] Jeff Bonwick. 1994. The Slab Allocator: An Object-caching Kernel Memory Allocator. In Proceedings of the 1994 USENIX Summer Technical Conference (USTC '94). USENIX Association, 87–98.
[10] Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. 2010. An Analysis of Linux Scalability to Many Cores. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (OSDI '10). USENIX Association, 1–16.
[11] Aaron B. Brown and Margo I. Seltzer. 1997. Operating System Benchmarking in the Wake of Lmbench: A Case Study of the Performance of NetBSD on the Intel x86 Architecture. In Proceedings of the 1997 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '97). ACM, 214–224.
[12] Peter M. Chen and David A. Patterson. 1993. A New Approach to I/O Performance Evaluation: Self-scaling I/O Benchmarks, Predicted I/O Performance. In Proceedings of the 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '93). ACM, 1–12.
[13] Tim Chen, Leonid I. Ananiev, and Alexander V. Tikhonov. 2007. Keeping Kernel Performance from Regressions. In Proceedings of the Linux Symposium, Vol. 1. 93–102.
[14] Colin Ian King. 2013. Context Switching on 3.11 Kernel Costing CPU and Power. https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1233681.
[15] DB-Engines. 2019. DB-engines Ranking. https://db-engines.com/en/ranking.
[16] Docker. 2018. Docker. https://www.docker.com/.
[17] George Greer. 2014. getitimer Returns it_value=0 Erroneously. https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1349028.
[18] Graz University of Technology. 2018. Meltdown and Spectre. https://meltdownattack.com/.
[19] Greg Kroah-Hartman. 2017. Linux Kernel Release Model. http://kroah.com/log/blog/2018/02/05/linux-kernel-release-model/.
[20] Gernot Heiser and Kevin Elphinstone. 2016. L4 Microkernels: The Lessons from 20 Years of Research and Deployment. ACM Transactions on Computer Systems 34, 1, Article 1 (April 2016), 29 pages.
[21] Intel Corporation. 2017. Linux Kernel Performance. https://01.org/lkp.
[22] Intel Corporation. 2018. Speculative Execution and Indirect Branch Prediction Side Channel Analysis Method. https://www.intel.com/content/www/us/en/security-center/advisory/intel-sa-00088.html.
[23] Intel Corporation. 2019. Intel® 64 and IA-32 Architectures Software Developer's Manual. Vol. 3A. Chapter 4.10.1.
[24] Intel Corporation. 2019. Intel® 64 and IA-32 Architectures Software Developer's Manual. Vol. 1. Chapter 11.4.4.4.
[25] Intel Corporation. 2019. Intel® 64 and IA-32 Architectures Software Developer's Manual. Vol. 3. Chapter 14.5.
[26] Jake Edge. 2016. Hardened Usercopy. https://lwn.net/Articles/695991/.
[27] Jake Edge. 2017. Testing Kernels. https://lwn.net/Articles/734016/.
[28] Jon Oberheide. 2010. Linux Kernel CAN SLUB Overflow. https://jon.oberheide.org/blog/2010/09/10/linux-kernel-can-slub-overflow/.
[29] Jonathan Corbet. 2007. Notes from a Container. https://lwn.net/Articles/256389/.
[30] Jonathan Corbet. 2015. User-space Page Fault Handling. https://lwn.net/Articles/636226/.
[31] Jonathan Corbet. 2017. The Current State of Kernel Page-table Isolation. https://lwn.net/Articles/741878/.
[32] Jonathan Corbet and Greg Kroah-Hartman. 2017. 2017 State of Linux Kernel Development. https://www.linuxfoundation.org/2017-linux-kernel-report-landing-page/.
[33] Judd Vinet and Aaron Griffin. 2018. Arch Linux. https://www.archlinux.org/.
[34] Kirill A. Shutemov. 2014. mm: Map Few Pages Around Fault Address if They are in Page Cache. https://lwn.net/Articles/588802/.
[35] Paul Kocher, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. 2018. Spectre Attacks: Exploiting Speculative Execution. (Jan. 2018). arXiv:1801.01203.
[36] Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J. Rossbach, and Emmett Witchel. 2016. Coordinated and Efficient Huge Page Management with Ingens. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI '16). USENIX Association, 705–721.
[37] Kevin Lai and Mary Baker. 1996. A Performance Comparison of UNIX Operating Systems on the Pentium. In Proceedings of the 1996 USENIX Annual Technical Conference (ATC '96). USENIX Association, 265–277.
[38] Linux. 2017. Transparent Hugepage Support. https://www.kernel.org/doc/Documentation/vm/transhuge.txt.
[39] Linux. 2017. Short Users Guide for SLUB. https://www.kernel.org/doc/Documentation/vm/slub.txt.
[40] Linux. 2018. NO_HZ: Reducing Scheduling-Clock Ticks. https://www.kernel.org/doc/Documentation/timers/NO_HZ.txt.
[41] Linux. 2018. Page Table Isolation. https://www.kernel.org/doc/Documentation/x86/pti.txt.
[42] Linux. 2019. Memory Resource Controller. https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt.
[43] Linux Containers. 2018. Linux Containers. https://linuxcontainers.org/.
[44] Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval Yarom, and Mike Hamburg. 2018. Meltdown. (Jan. 2018). arXiv:1801.01207.
[45] Jean-Pierre Lozi, Baptiste Lepers, Justin Funston, Fabien Gaud, Vivien Quéma, and Alexandra Fedorova. 2016. The Linux Scheduler: A Decade of Wasted Cores. In Proceedings of the 11th European Conference on Computer Systems (EuroSys '16). ACM, Article 1, 16 pages.
[46] Markus Podar. 2014. Current Ubuntu 14.04 Uses Kernel with Degraded Disk Performance in SMP Environment. https://github.com/jedi4ever/veewee/issues/1015.
[47] Larry McVoy and Carl Staelin. 1996. Lmbench: Portable Tools for Performance Analysis. In Proceedings of the 1996 USENIX Annual Technical Conference (ATC '96). USENIX Association, 279–294.
[48] Michael Dale Long. 2016. Unnaccounted for High CPU Usage While Idle. https://bugzilla.kernel.org/show_bug.cgi?id=150311.
[49] Michael Kerrisk. 2012. KS2012: memcg/mm: Improving Memory cgroups Performance for Non-users. https://lwn.net/Articles/516533/.
[50] Michael Larabel. 2010. Five Years of Linux Kernel Benchmarks: 2.6.12 Through 2.6.37. https://www.phoronix.com/scan.php?page=article&item=linux_2612_2637.
[51] Michael Larabel. 2016. Linux 3.5 Through Linux 4.4 Kernel Benchmarks: A 19-Way Kernel Showdown Shows Some Regressions. https://www.phoronix.com/scan.php?page=article&item=linux-44-19way.
[52] Michael Larabel. 2017. The Linux Kernel Gained 2.5 Million Lines of Code, 71k Commits in 2017. https://www.phoronix.com/scan.php?page=news_item&px=Linux-Kernel-Commits-2017.
[53] Juan Navarro, Sitaram Iyer, Peter Druschel, and Alan Cox. 2002. Practical, Transparent Operating System Support for Superpages. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI '02). USENIX Association, 89–104.
[54] Netcraft. 2019. March 2019 Web Server Survey. https://news.netcraft.com/archives/2019/03/28/march-2019-web-server-survey.html.
[55] Nginx. 2019. NGINX | High Performance Load Balancer, Web Server, & Reverse Proxy. https://www.nginx.com/.
[56] John K. Ousterhout. 1990. Why Aren't Operating Systems Getting Faster As Fast as Hardware?. In Proceedings of the 1990 USENIX Summer Technical Conference (USTC '90). USENIX Association, 247–256.
[57] Philippe Gerum. 2018. Troubleshooting Guide. https://gitlab.denx.de/Xenomai/xenomai/wikis/Troubleshooting.
[58] Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnatthan Alagappan, Samer Al-Kiswany, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2014. All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-consistent Applications. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI '14). USENIX Association, 433–448.
[59] Randal E. Bryant and David R. O'Hallaron. 2002. Computer Systems: A Programmer's Perspective (1 ed.). Prentice Hall, 467–470.
[60] Redis. 2018. Command Reference — Redis. https://redis.io/commands.
[61] Redis. 2018. How Fast is Redis? https://redis.io/topics/benchmarks.
[62] Redis. 2018. Redis. https://redis.io/.
[63] Drew Roselli, Jacob R. Lorch, and Thomas E. Anderson. 2000. A Comparison of File System Workloads. In Proceedings of the 2000 USENIX Annual Technical Conference (ATC '00). USENIX Association, 41–54.
[64] M. Rosenblum, E. Bugnion, S. A. Herrod, E. Witchel, and A. Gupta. 1995. The Impact of Architectural Trends on Operating System Performance. In Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP '95). ACM, 285–298.
[65] Theodore Y. Ts'o. 2019. Personal Communication.
[66] Thomas Garnier. 2016. mm: SLAB Freelist Randomization. https://lwn.net/Articles/682814/.
[67] Thomas Gleixner. 2018. x86/retpoline: Add Initial Retpoline Support. https://patchwork.kernel.org/patch/10152669/.
[68] Ubuntu. 2018. Ubuntu. https://www.ubuntu.com/.
[69] Vlad Frolov. 2016. [REGRESSION] Intensive Memory CGroup Removal Leads to High Load Average 10+. https://bugzilla.kernel.org/show_bug.cgi?id=190841.
[70] W3Techs. 2018. Usage Statistics and Market Share of Linux for Websites. https://w3techs.com/technologies/details/os-linux/all/all.