
An Analysis of Performance Evolution of Linux's Core Operations

Xiang (Jenny) Ren, Kirk Rodrigues, Luyuan Chen, Camilo Vega, Michael Stumm, and Ding Yuan
University of Toronto

Abstract

This paper presents an analysis of how Linux's performance has evolved over the past seven years. Unlike recent works that focus on OS performance in terms of scalability or service of a particular workload, this study goes back to basics: the latency of core kernel operations (e.g., system calls, context switching, etc.). To our surprise, the study shows that the performance of many core operations has worsened or fluctuated significantly over the years. For example, the select system call is 100% slower than it was just two years ago. An in-depth analysis shows that over the past seven years, core kernel subsystems have been forced to accommodate an increasing number of security enhancements and new features. These additions steadily add overhead to core kernel operations but also frequently introduce extreme slowdowns of more than 100%. In addition, simple misconfigurations have also severely impacted kernel performance. Overall, we find most of the slowdowns can be attributed to 11 changes.

Some forms of slowdown are avoidable with more proactive engineering. We show that it is possible to patch two security enhancements (from the 11 changes) to eliminate most of their overheads. In fact, several features have been introduced to the kernel unoptimized or insufficiently tested and then improved or disabled long after their release.

Our findings also highlight both the feasibility and importance for Linux users to actively configure their systems to achieve an optimal balance between performance, functionality, and security: we discover that 8 out of the 11 changes can be avoided by reconfiguring the kernel, and the other 3 can be disabled through simple patches. By disabling the 11 changes with the goal of optimizing performance, we speed up Redis, Apache, and Nginx benchmark workloads by as much as 56%, 33%, and 34%, respectively.

CCS Concepts • Software and its engineering → Software performance; Operating systems; • Social and professional topics → History of software.

Keywords Performance evolution, operating systems, Linux

ACM Reference Format:
Xiang (Jenny) Ren, Kirk Rodrigues, Luyuan Chen, Camilo Vega, Michael Stumm, and Ding Yuan. 2019. An Analysis of Performance Evolution of Linux's Core Operations. In ACM SIGOPS 27th Symposium on Operating Systems Principles (SOSP '19), October 27–30, 2019, Huntsville, ON, Canada. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3341301.3359640

1 Introduction

In the early days of operating systems (OS) research, the performance of core OS kernel operations – in particular, system call latency – was put under the microscope [5, 11, 37, 47, 56]. However, over the past decade or two, interest in core kernel performance has waned. Researchers have seemingly shifted focus to other aspects of OS performance such as multicore scalability [10], performance under specific workloads or on new hardware, and scheduling [45], to name just a few. Indeed, the most recent comprehensive analysis of OS system call performance dates back to 1996, when McVoy and Staelin [47] studied OS system call latencies using lmbench, with follow-up work from Brown and Seltzer [11] in 1997 that extended lmbench. This begs the question: are core OS kernel operations getting slower or faster?

This paper presents an analysis of how the latencies of Linux's core operations have evolved over the past seven years. We use the term "kernel operations" to encompass both system calls and kernel functions like context switching. This work first introduces LEBench, a microbenchmark suite that measures the performance of the 13 kernel operations that most significantly impact a variety of popular applications. We test LEBench on 36 Linux release versions, from 3.0 to 4.20 (the most recent), running on a single Intel Xeon server. Figure 1 shows the results. All kernel operations are slower than they were four years ago (version 4.0), except for big-write and big-munmap. The majority (75%) of the kernel operations are slower than seven years ago (version 3.0). Many of the slowdowns are substantial: the majority (67%) slow down by at least 50% and some by 100% over the last seven years (e.g., mmap, poll & select, send & recv).

[Figure 1: (a) A heatmap showing, for each LEBench test (context switch; small/med/big read and write; mmap; small/med/big munmap; fork and big-fork; thread create; send & recv; select, poll, and epoll with small and big inputs; small/big page fault), the percentage change in latency relative to version 4.0 across Linux kernel versions 3.0–4.20. (b) A timeline of when each performance-affecting change (Spectre patch, Meltdown patch, hardened usercopy, randomized SLAB freelist, userspace page fault handling, fault around, TLB layout specification, forced context tracking, hugepages disabled, missing CPU idle states, cgroup memory controller) was enabled or disabled in each kernel version.]

Figure 1. Main result. (a) shows the latency trend for each test across all kernels, relative to the 4.0 kernel. (We use the 4.0 kernel as a baseline to better highlight performance degradations in later kernels.) (b) shows the timeline of each performance-affecting change. Each value in (a) indicates the percentage change in latency of a test relative to the same test on the 4.0 kernel. Therefore, positive and negative values indicate worse and better performance, respectively. *: for brevity, we show the averaged trend of related tests with extremely similar trends, including the average of all mmap tests, the send and recv test, and the big-send and big-recv test.

Performance has also fluctuated significantly over the years.

Drilling down on these performance fluctuations, we observe that a total of 11 root causes are responsible for the major slowdowns. These root causes fall into three categories. First, we observe a growing number of (1) security enhancements and (2) new features, like support for containers and virtualization, being added to the kernel. The effect of this trend on kernel performance manifests itself in two ways: a steady creep of slowdown in core operations, and disruptive slowdowns that persist over many versions (e.g., a more than 100% slowdown that persists across six versions). Such significant impacts are introduced by security enhancements and features, which often demand complex and intrusive modifications to central subsystems of the kernel, such as memory management. The last category of root causes is (3) configuration changes, some of which are simple misconfigurations that resulted in severe slowdowns across kernel operations, impacting many users.

While many forms of slowdowns result from fundamental trade-offs between performance and functionality or security, we find a good number could have been avoided or significantly alleviated with more proactive software engineering practices. For example, frequent lightweight testing can easily catch the simple misconfigurations that resulted in widespread slowdowns. The performance of certain kernel functions would also benefit from more eager optimizations and thorough testing: we found some features significantly degraded the performance of core kernel operations in the initial release; only long after having been introduced were they performance-optimized or disabled due to performance complaints. Furthermore, a few other changes that introduced performance slowdowns simply remained unoptimized—we patched two of the security enhancements to eliminate most of their performance overhead without reducing security guarantees. At the same time, we recognize the difficulty of testing and maintaining a generic OS kernel like Linux, which must support a diverse array of hardware and workloads [27], and evolves extremely quickly [52]. On the other hand, the benefit of being a generic OS kernel is that Linux is highly configurable—8 out of the 11 root causes can be easily disabled by reconfiguring the kernel. This creates the potential for Linux users to actively configure their kernels and significantly improve the performance of their custom workloads.

Out of the many performance-critical parts of the kernel, we chose to study core kernel operations since the significance of their performance is likely elevating; recent advances in fast non-volatile memory and network devices together with the flattened curve of microprocessor speed scaling may shift the bottleneck to core kernel operations. We also chose to focus on how the kernel's software design and implementation impact performance. Prior studies on OS performance mostly focused on comparing the implications of different architectures [5, 12, 56, 64]. Those studies occurred during a time of diverse and fast-changing CPUs, but such CPU architectural heterogeneity has largely disappeared in today's server market. Therefore, we focus on software changes to core OS operations introduced over time, making this the first work to systematically perform a longitudinal study on the performance of core OS operations.

This paper makes the following contributions. The first is a thorough analysis of the performance evolution of Linux's core kernel operations and the root causes for significant performance trends. We also show that it is possible to mitigate the performance overhead of two of the security enhancements. Our second contribution is LEBench, a microbenchmark that is collected from representative workloads together with a regression testing framework capable of evaluating the performance of an array of Linux versions. The benchmark suite and a framework for automatically testing multiple kernel versions are available at https://github.com/LinuxPerfStudy/LEBench. Finally, we evaluate the impact of the 11 identified root causes on three real-world applications and show that they can cause slowdowns as high as 56%, 33%, and 34% on the Redis key-value store, Apache HTTP server, and Nginx web server, respectively.

The rest of the paper is organized as follows. §2 describes LEBench and the methodology we used to drive our analysis. We summarize our main findings in §3 before zooming into each change that caused significant performance fluctuations in §4. §5 discusses the performance implications of core kernel operations on three real-world applications. §6 validates LEBench's results on a different hardware setup. We discuss the challenges of Linux performance tuning in §7, and we survey related work in §9 before concluding.

2 Methodology

Our experiments focus on system calls, thread creation, page faults, and context switching. To determine which system calls are frequently exercised, we use our best efforts to select a set of representative application workloads. Table 1 lists the applications and the workloads we ran. We include workloads from three popular server-side applications: Spark, a distributed computing framework, Redis, a key-value store, and PostgreSQL, a relational database. In addition, we include an interactive user workload—web browsing through the Chromium browser—and a software development workload—building the Linux kernel. The chosen workloads exercise the kernel with varying intensities, as shown in Table 1.

Application                           Workload                                       % System Time
Apache Spark v2.2.1                   spark-bench's minimal example                  3%
Redis v4.0.8                          redis-benchmark with 100K requests             41%
PostgreSQL v9.5                       pgbench with scale factor 100                  17%
Chromium browser v59.0.3071.109       Watching a video and reading a news article    29%
Build toolchain (make 4.1, gcc 5.3)   Compiling the 4.15.10 Linux kernel             7%

Table 1. Applications and respective workloads used to choose core kernel operations, and each workload's approximate execution time spent in the kernel.

We used strace to measure the CPU time and call frequency of each system call used by the workloads. We then selected those system calls which took up the most time across all workloads. wait-related system calls were excluded as their sole purpose is to block the process. Table 2 lists each of the microbenchmarks. Where applicable, we vary the input sizes to account for a variety of usage patterns.

Our process for running each microbenchmark is as follows. Latency is measured by collecting a timestamp immediately before and after invoking a kernel operation. For system calls, the benchmark bypasses the libc wrapper whenever possible to expose the true kernel performance.

Test Name          Description

Context switch     Forces context switching by having two processes repeatedly pass one byte through two pipes.

Thread create      Measure the time immediately before and after the thread creation, both in the child and the parent. The shorter latency of the two is used to eliminate variations introduced by scheduling.

fork               Measure the time immediately before and after the fork, both in the child and the parent. The shorter latency of the two is used. To stress test fork, 12,000 writable pages are mapped into the parent before forking; to understand the minimum forking overhead, 0 pages are mapped.

read, write        Sequentially read or write the entire content of a file. A one-page file is used to understand the bare minimum overhead. Sizes of 10 and 10,000 pages are used to test how performance changes with increasing sizes. For read tests, the page cache is warmed up by running the tests before taking measurements.

mmap, munmap       Map a number of consecutive file-backed read-only pages from a file into memory, or unmap a number of consecutive file-backed writable pages into a file. We use three file sizes: 1, 10, and 10,000 pages.

Page fault         Reads one byte from the first page of a number of newly mapped pages to trigger page faults. The test is run for 1 and 10,000 contiguous, mapped file-backed pages. The size of the mapped region affects the behavior of page fault handling under the "fault-around" patch.

send, recv         Creates a TCP connection between two processes on the same machine using a UNIX socket as the underlying communication channel. Each process repeatedly sends/receives a message to/from the other. The test is run for two message sizes: 1 byte and 96,000 bytes.

select, poll,      Performs select, poll, or epoll on a number of socket file descriptors. The socket file descriptors become
epoll              ready upon having enough memory for each socket. The test is run for 10 and 1,000 file descriptors.

Table 2. A description of the tests in LEBench including their usage patterns. (The size of a page is 4KB in this table.)
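For concreteness, the structure of the context switch test can be sketched as follows (a simplification; error handling and the timing code are omitted):

#define _GNU_SOURCE
#include <unistd.h>

/* Sketch of the context switch test: two processes force context
 * switches by passing one byte back and forth through two pipes.
 * Each blocking read yields the CPU to the other process. */
int main(void)
{
    int ping[2], pong[2];
    char b = 'x';
    pipe(ping);
    pipe(pong);
    if (fork() == 0) {                 /* child: echo the byte back */
        for (int i = 0; i < 10000; i++) {
            read(ping[0], &b, 1);
            write(pong[1], &b, 1);
        }
        _exit(0);
    }
    for (int i = 0; i < 10000; i++) {  /* parent: send, then wait */
        write(ping[1], &b, 1);
        read(pong[0], &b, 1);          /* blocks -> context switch */
    }
    return 0;
}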

We repeat each measurement 10,000 times and report the value calculated using the K-best method with K set to 5 and tolerance set to 5% [59]. To do this, we order all measured values numerically, and select the lowest from the first series of five values where no two adjacent values differ by more than 5%. Selecting lower values filters the interference from background workloads, and setting K to 5 and tolerance to 5% is considered effective in ensuring consistent and accurate results across runs [59].
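The selection step can be sketched in a few lines of C (assuming the samples are already sorted ascending and that the 5% tolerance is relative to the smaller of each adjacent pair):

/* Sketch of K-best selection [59]: scan sorted samples for the
 * first run of k consecutive values in which adjacent values
 * differ by no more than `tolerance` (e.g., k = 5, tolerance =
 * 0.05), and report the smallest value of that run. Returns -1.0
 * if no such run exists. */
double k_best(const double *sorted, int n, int k, double tolerance)
{
    for (int i = 0; i + k <= n; i++) {
        int ok = 1;
        for (int j = i; j < i + k - 1; j++)
            if (sorted[j + 1] - sorted[j] > tolerance * sorted[j])
                ok = 0;
        if (ok)
            return sorted[i];   /* lowest value of the series */
    }
    return -1.0;
}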
We run the microbenchmarks on each major version of Linux released in the past seven years. This includes versions 3.0 to 3.19 and versions 4.0 to 4.20. For every major version, we select the latest minor version (the y in v.x.y) released before the next major version. This is to avoid testing changes that were backported from a subsequent major version. For example, for major version 3.0, we tested minor version 3.0.7 (released just before the release of 3.1.0) since 3.0.8 may contain some changes that were introduced in 3.1.0. We only tested versions that were released. Linux distributions such as Ubuntu [68] or Arch Linux [33] typically configure the kernel differently from Linux's default configuration. We use Ubuntu's Linux distribution because, at least for web servers, Ubuntu is the most widely used Linux distribution [70]. For example, Netflix hosts its services on Ubuntu kernels [4].

We carried out the tests on an HP DL160 G9 server with a 2.40GHz Intel Xeon E5-2630 v3 processor, 512KB L1 cache, 2MB L2 cache, and 20MB L3 cache. The server also has 128GB of 1866MHz DDR4 memory and a 960GB SSD for persistent storage. To understand how different hardware setups affect the results, we repeated the tests on a Lenovo laptop with an Intel i7 processor and analyze the differences between the two sets of results in §6.

When interpreting results from the microbenchmarks, we treat a flat latency trend as expected and analyze any increase or decrease that may signify a performance regression or improvement, respectively. We extract the causes of these performance changes iteratively: for each test, we first identify the root cause of the most significant performance change; we then disable the root cause and repeat the process to identify the root cause of the next most significant performance change. We repeat this until the difference between the slowest and fastest kernel versions is no more than 10% for the target test.

3 Overview of Results

We overview the results of our analysis in this section and make a few key observations before detailing each root cause in §4.

Figure 1 displays the latency evolution of each test across all kernels, relative to the 4.0 kernel. Only isolated tests experience performance improvements over time; the majority of tests display worsening performance trends and frequently suffer prolonged episodes of severe performance degradation. These episodes result in significant performance fluctuations across multiple core kernel operations. For example, send and recv's performance degraded by 135% from version 3.9 to 3.10, improved by 137% in 3.11, and then degraded again by 150% in 3.12. We also observe that the benchmark's overall performance degraded by 55% going from version 4.13 to 4.15. The sudden and significant nature of these performance degradations suggests they are caused by intrusive changes to the kernel.

Security Enhancements (max combined slowdown: 146%, poll)
- Kernel page-table isolation (KPTI) (§4.1.1). Removes kernel memory mappings from the page table upon entering userspace to mitigate Meltdown. A kernel entry/exit now swaps the page table pointer, which further leads to subsequent TLB misses. Impact: recv 63%, small-read 60%.
- Avoid indirect branch speculation (§4.1.2). Mitigates Spectre (v2) by avoiding speculation on indirect jumps and calls. Adds around 30 cycles to each indirect jump or call. Impact: poll 89%, recv 25%.
- SLAB freelist randomization (§4.1.3). Randomizes the order of objects in the SLAB freelist. Destroys spatial locality, which leads to increased L3 cache misses. Impact: big-select 45%.
- Hardened usercopy (§4.1.4). Adds sanity checks on data-copy operations between userspace and the kernel. Constant cost for system calls that copy data to and from userspace. Impact: send 18%, select 18%.

New Features (max combined slowdown: 167%, big-pagefault)
- Fault around (§4.2.1). Pre-establishes mappings for pages surrounding a faulting page, if they are available in the file cache. Adds constant overhead for page faults on read-only file-backed pages. Impact: big-pagefault 167%.
- Control group memory controller (§4.2.2). Accounts and limits memory usage per control group. Adds overhead when establishing or destroying page mappings. Impact: big-munmap 81%.
- Disabling transparent huge pages (§4.2.3). Disables the transparent use of 2MB pages. More page faults when sequentially accessing large memory regions. Impact: big-read 83%.
- Userspace page fault handling (§4.2.4). Enables userspace to provide the mapping for page faults in an address range. Slows down fork, which checks each copied memory area for userspace mappings. Impact: big-fork 12%.

Configuration Changes (max combined slowdown: 171%, small-read)
- Forced context tracking (§4.3.1). A misconfiguration forces unnecessary CPU time accounting and RCU handling on every kernel entry and exit. Adds overhead on each kernel entry and exit. Impact: small-read 171%, recv 149%.
- TLB layout specification (§4.3.2). A hardcoded and outdated TLB capacity in older kernels causes munmap to flush the TLB when it should invalidate individual TLB entries. Increases TLB misses due to flushes. Impact: 50% on read after munmap.
- Missing CPU power-saving states (§4.3.3). Missing power-saving states in older kernels result in decreased effective frequency. Slows down CPU-bound tests. Impact: select 31%, send 26%.

Table 3. Summary of root causes causing performance fluctuations across kernel versions. For each root cause, we report examples of significant slowdowns from highly impacted tests across all kernel versions.

We identified 11 kernel changes that explain the significant performance fluctuations as well as more steady sources of overhead. These are categorized and summarized in Table 3, and their impact on LEBench's performance is overviewed in Figure 2. The 11 changes fall into three categories: security enhancements (4/11), new features (4/11), and configuration changes (3/11).

Overall, Linux users are paying a hefty performance tax for security enhancements. The cost of accommodating security enhancements is high because many of them demand significant changes to the kernel. For example, the mitigation for Meltdown (§4.1.1) requires maintaining a separate page table for userspace and kernel execution, fundamentally modifying some of the core designs of memory management. Similarly, SLAB freelist randomization (§4.1.3) alters dynamic memory allocation behaviours in the kernel.

Interestingly, several security features introduce overhead by attempting to defend against untrusted code in the kernel itself. For example, the hardened usercopy feature (§4.1.4) is used to defend against bugs in kernel code that might copy too much data between userspace and the kernel. However, we note that it can be redundant with other kernel code that already carefully validates pointers. Similarly, SLAB freelist randomization (§4.1.3) attempts to protect against buffer overflow attacks that exploit buggy kernel code. However, the randomization introduces overhead for all uses of the SLAB freelist, including correct kernel code. This suggests a trust issue that is fundamentally rooted in the monolithic kernel design [8].

[Figure 2: Rows of bar charts, one per root cause (forced context tracking, fault around, Spectre patch, hugepages disabled, cgroup memory controller, Meltdown patch, randomized SLAB freelist, missing CPU idle states, hardened usercopy, TLB layout specification, userspace page fault handling), showing each root cause's maximum slowdown (%) on every LEBench test.]

Figure 2. Impact of the 11 identified root causes on the performance of LEBench tests. For every root cause, we display the maximum slowdown across all kernels for each test. Note that the Y-axis scales are different for each row of subgraphs: root causes with the highest possible impacts on LEBench are ordered first.

Similar to the security enhancements, supporting many new features demands complex and intricate changes to the core kernel logic. For example, the control group memory controller feature (§4.2.2), which supports containerization, requires tracking every page allocation and deallocation; in an early unoptimized version, it slowed down the big-pagefault and big-munmap tests by as much as 26% and 81%, respectively.

While the complexity of certain features may increase the difficulty of performance optimization, simple misconfigurations have also significantly impacted kernel performance. For example, mistakenly turning on forced context tracking (§4.3.1) caused all the benchmark tests to slow down by an average of 50%.

Two aforementioned changes (forced context tracking and the control group memory controller) were significantly optimized or disabled entirely reactively, i.e., only after performance degradations were observed in released kernels, instead of proactively. Forced context tracking (§4.3.1) was only disabled after plaguing five versions for more than 11 months, and has become a well-known cause of performance troubles for Linux users [46, 48, 57]; the control group memory controller (§4.2.2) remained unoptimized for 6.5 years, and continues to cause significant performance degradation in real workloads [49, 69]. Both cases are clearly captured by LEBench, suggesting that more frequent and thorough testing, as well as more proactive performance optimizations, would have avoided these impacts on users.

As another example where Linux performance would benefit from more proactive optimization, we were able to easily optimize two other security enhancements, namely avoiding indirect jump speculation (§4.1.2) and hardened usercopy (§4.1.4), largely eliminating their slowdowns without sacrificing security guarantees.

Finally, with little effort, Linux users can avoid most of the performance degradation from the identified root causes by actively reconfiguring their systems. In fact, 8 out of the 11 root causes can be disabled through configuration, and the other 3 can be disabled through simple patches. Users that do not require the new functionalities or security guarantees can disable them to avoid paying unnecessary performance penalties. In addition, our findings also point to the fact that Linux is shipped with static configurations that cannot adapt

to workloads with diverse characteristics. This suggests that Linux users should pay close attention to performance when their workload's characteristics change or when updating the kernel; in such scenarios, kernel misconfigurations (with respect to the workload) or Linux performance regressions could be avoided by proactive kernel reconfiguration. This practice has the potential to offer significant performance gains.

4 Performance Impacting Root Causes

This section describes the 11 root causes from Table 3. For each root cause, we first explain the background of the change before analyzing its performance impact.

4.1 Security Enhancements

Four security enhancements to the kernel resulted in significant performance slowdowns in LEBench. The first two are a response to recently discovered CPU vulnerabilities, and the last two are meant to protect against buggy kernel code.

4.1.1 Remove Kernel Mappings in Userspace

Introduced after kernel version 4.14, kernel page table isolation (KPTI) [41] is a security patch to mitigate the Meltdown vulnerability [44] that affects several current generation processor architectures, including Intel x86 [18, 22]. The average slowdown caused by KPTI across all microbenchmark tests is 22%; the recv and read tests are affected the most, slowing down by 63% and 59%, respectively.

Meltdown allows a userspace process to read kernel memory. When the attacker performs a read of an unauthorized address, the processor schedules both the read and the privilege check in its instruction pipeline. However, before the privilege check is complete, the value read may have already been returned from memory and loaded into the cache. Once the privilege check fails, the processor does not eliminate all side-effects of the read, and the value remains in the cache. The attacker can exploit this by using a "timing-channel" to leak the value.

KPTI mitigates Meltdown by using a different page table in the kernel than in userspace. Before the patch, kernel and user mode shared the same address space using one shared page table, with kernel memory protected by requiring a higher privilege level for access. However, this protection is ineffective with Meltdown. With KPTI, the kernel-space page table still contains both kernel and user mappings; whereas the userspace page table removes the vast majority of kernel mappings, leaving only the bare-minimum ones necessary to service a trap (e.g., handlers for system calls and interrupts) [31].

The overhead of keeping two separate page tables is minimal. KPTI only needs to keep a separate copy of the top-level page table for both kernel and user page tables; all lower-level page tables in the user page table can be accessed from the kernel page table. The top-level page table contains only 512 entries and is modified very infrequently, requiring very little "synchronization" between the two copies.

KPTI's most serious source of overhead stems from swapping the page table pointer on every kernel entry and exit, each time forcing a TLB flush. This leads to a constant cost of the two writes to the page table pointer register (CR3 on Intel processors) and a variable cost from increased TLB misses. With KPTI, the lower bound of the constant cost is on the order of 400–500 cycles, whereas without KPTI, the kernel entry and exit overhead is less than 100 cycles.¹ The variable cost of TLB misses depends on different workloads' memory access patterns. For example, small-read and big-read spend an additional 700 and 6000 cycles in the TLB miss handler, respectively.

The kernel developers released an optimization with the KPTI patch that avoids the TLB flush on processors with the process-context identifier (PCID) feature [23].² The feature allows tagging each TLB entry with a unique PCID pertaining to an address space, such that only entries with the currently active PCID are used by the CPU. The kernel developers use this feature to assign a separate PCID for the kernel and user TLB entries, hence the kernel no longer needs to flush the TLB on each entry and exit.

The performance improvement is significant. Figure 3 compares KPTI's overhead on the read test with and without the PCID optimization. For the shortest read test with a baseline latency of 344ns, the PCID optimization reduces the slowdown from 113% to 47%. The number of increased cycles in the TLB miss handler is reduced from 700 to just 30. (Figure 3 also shows that tests with short latencies are more sensitive to the overhead caused by KPTI.)

[Figure 3: Read test slowdown from KPTI (latency increase, %) plotted against read test latencies (s), with and without PCID.]

Figure 3. Slowdown of the read test due to KPTI, with increasing baseline latency, with and without the PCID feature enabled.

Despite the TLB flush being avoided, we find the lower bound of the constant cost of KPTI is still 400–500 cycles. This is because the kernel still needs to write to the CR3 register on every entry and exit, since on Intel processors, the active PCID is stored in bits that are a part of CR3. Writing to CR3 is expensive, costing around 200 cycles. This is why, as shown in Figure 3, the shortest read test still experiences a 59% slowdown with PCID-optimized KPTI. The PCID optimization alone has a minimal cost: it results in around two additional instruction TLB misses per round trip to the kernel, compared to pre-KPTI kernels. This is because the optimization requires additional code, for example, to swap the active PCID on kernel entry and exit.

¹ We measure the constant cost by comparing the result of running an empty system call with and without the KPTI patch. We measure the cycles spent in the MMU's TLB miss handler. The constant cost is estimated by subtracting the increase in cycles spent in the TLB miss handler from the overall increase in latency.
² The results in Figure 1 and Table 3 are obtained with PCID enabled.
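The per-trap constant cost described in footnote 1 can be roughly approximated from userspace; the following is a minimal sketch (x86-only, unserialized TSC reads, and a cheap syscall standing in for an "empty" one), where comparing the output on kernels with and without KPTI estimates KPTI's fixed entry/exit overhead:

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <x86intrin.h>
#include <sys/syscall.h>

/* Sketch: estimate the round-trip cost of entering and exiting
 * the kernel by timing a near-empty system call with the TSC.
 * syscall(SYS_getpid) avoids libc's cached getpid path. */
int main(void)
{
    const int iters = 100000;
    unsigned long long start = __rdtsc();
    for (int i = 0; i < iters; i++)
        syscall(SYS_getpid);     /* stand-in for an empty syscall */
    unsigned long long cycles = __rdtsc() - start;
    printf("~%llu cycles per kernel round trip\n", cycles / iters);
    return 0;
}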

Interestingly, the PCID optimization benefits all tests with the exception of the med-munmap test, whose slowdown increases from 18% to 53% with PCID enabled. This is because med-munmap shoots down individual TLB entries, and the instruction to invalidate a tagged TLB entry is more expensive.

4.1.2 Avoiding Indirect Branch Speculation

Introduced in version 4.14, the Retpoline patch [67] mitigates the second variant (V2) of the Spectre attacks [35] by bypassing the processor's speculative execution of indirect branches. The patch slows down half of the tests by more than 10% and causes severe degradation to the select, poll, and epoll tests, resulting in an average slowdown of 66%. In particular, poll and epoll slow down by 89% and 72%, respectively.

An indirect branch is a jump or call instruction whose target is not determined statically—it is only resolved at runtime. An example is jmp [rax], which jumps to an address that is stored in the rax register. Modern processors use the indirect branch predictor to speculatively execute the instructions at the predicted target. However, Intel and AMD processors do not completely eliminate all side effects of an incorrect speculation, e.g., by leaving data in the cache as described in §4.1.1 [1, 22]. Attackers can exploit such "side-channels" by carefully polluting the indirect branch target history, hence tricking the processor into speculatively executing the desired branch target.

Retpoline mitigates Spectre v2 by replacing each indirect branch with a sequence of instructions—called a "thunk"—during compilation. Figure 4 shows the thunk that replaces jmp [rax]. The thunk starts with a call, which pushes the return address (line 4) onto the stack, before jumping to line 7. Line 7, however, replaces the return address with the original jump destination, stored in rax, by moving it onto the stack. This causes the ret at line 8 to jump to the original jump destination, [rax], instead of line 4. Thus, the thunk achieves the same behavior as jmp [rax] without using indirect branches.

 1  # normal code
 2  call load_label
 3  capture_ret_spec:
 4    pause ; lfence
 5    jmp capture_ret_spec
 6  load_label:
 7    mov [rsp], rax
 8    ret
 9
10  # rax target
11  ...

Figure 4. An example showing how Retpoline replaces jmp [rax]. The solid lines indicate actual execution paths, whereas the dotted line indicates a speculatively executed path.

A careful reader would have noticed that even without lines 4 and 5, the speculative path would still fall into an infinite loop at lines 7 and 8. What makes lines 4–5 necessary is that repeatedly executing line 8, even speculatively, greatly perturbs a separate return address speculator, resulting in high overhead. In addition, the pause instruction at line 4 provides a hint to the CPU that the two lines are a spin-loop, allowing the CPU to optimize for power consumption [2, 24].

The slowdown caused by Retpoline is proportional to the number of indirect jumps and calls in the test. The penalty for each such instruction is similar to that of a branch misprediction. We further investigate the effects of Retpoline on the select test. Without Retpoline, the select test executes an average of 31 indirect branches, all of which are indirect calls; the misprediction rate of these is less than 1 in 30,000. Further analysis shows that 95% of these indirect calls are from just three program locations that use function pointers to invoke the handler of a specific resource type. Figure 5 shows one of the program locations, which is also on the critical path of poll and epoll. The poll function pointer is invoked repeatedly inside select's main loop, and the actual target is decided by the file type (a socket, in our case).

/* fs/select.c */
int do_select(...) {
    for (;;) {
        ...
        mask = (*f_op->poll)(f.file, wait);
        ...
    }
}

/* net/socket.c */
const struct file_operations socket_file_ops = {
    ...
    .poll = sock_poll,
    ...
};

Figure 5. Top: the indirect branch code snippet used by select, poll, and epoll (fs/select.c). Bottom: assignment of the poll function pointer for sockets (net/socket.c).

With Retpoline, all of the 31 indirect branches executed by select are replaced with the thunk, and the ret in the thunk always causes a return address misprediction that has 30–35 cycles of penalty, resulting in a total slowdown of 68% for the test.

We alleviated the performance degradation by turning each indirect call into a switch statement, i.e., a direct conditional branch, which Spectre-V2 cannot exploit. Figure 6 shows our patch on the program location shown in Figure 5. It directly invokes the specific target after matching for the type of the resource. It reduced select's slowdown from 68% to 5.7%, and big-select's slowdown from 55% to 2.5%, respectively. This patch also reduces Retpoline's overhead on poll and epoll.

  for (;;) {
      ...
-     mask = (*f_op->poll)(f.file, wait);
+     if ((*f_op->poll) == sock_poll)
+         mask = sock_poll(f.file, wait);
+     else if ((*f_op->poll) == pipe_poll)
+         mask = pipe_poll(f.file, wait);
+     else if ((*f_op->poll) == timerfd_poll)
+         mask = timerfd_poll(f.file, wait);
+     else
+         mask = (*f_op->poll)(f.file, wait);
      ...
  }

Figure 6. Our patch to optimize Retpoline's overhead in select, poll, and epoll.

4.1.3 SLAB Freelist Randomization

Introduced since version 4.7, SLAB freelist randomization increases the difficulty of exploiting buffer overflow bugs in the kernel [66]. A SLAB is a chunk of contiguous memory for storing equally-sized objects [9, 39]. It is used by the kernel to allocate kernel objects. A group of SLABs for a particular type or size-class is called a cache. For example, fork uses the kernel's SLAB allocator to allocate mm_structs from SLABs in the mm_struct cache. The allocator keeps track of free spaces for objects in a SLAB using a "freelist," which is a linked list connecting adjacent object spaces in memory. As a result, objects allocated one after another will be adjacent in memory. This predictability can be exploited by an attacker to perform a buffer overflow attack. Oberheide [28] describes an example of an attack that has occurred in practice.

The SLAB freelist randomization feature randomizes the order of free spaces for objects in a SLAB's freelist such that consecutive objects in the list are not reliably adjacent in memory. During initialization, the feature generates an array of random numbers for each cache. Then for every new SLAB, the freelist is constructed in the order of the corresponding random number array.
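The effect of the randomized construction can be sketched as a Fisher-Yates shuffle of slot indices (a simplification of the kernel's actual implementation, which precomputes the random array per cache):

#include <stdlib.h>

/* Sketch: build a randomized freelist order for a SLAB with n
 * object slots. Instead of linking slot 0 -> 1 -> 2 -> ..., the
 * freelist follows a shuffled index array, so consecutively
 * allocated objects are no longer adjacent in memory. */
void randomize_freelist(unsigned int *order, unsigned int n)
{
    for (unsigned int i = 0; i < n; i++)
        order[i] = i;
    for (unsigned int i = n; i > 1; i--) {      /* Fisher-Yates */
        unsigned int j = (unsigned int)rand() % i;
        unsigned int tmp = order[i - 1];
        order[i - 1] = order[j];
        order[j] = tmp;
    }
}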
This patch resulted in notable overhead on tests that sequentially access a large amount of memory. It caused big-fork to slow down by 37%, and the set of tests—big-select, big-poll, and big-epoll—to slow down by an average of 41%. The slowdown comes from two sources. The first is the time spent randomizing the freelist during its initialization. In particular, big-fork spent roughly 6% of its execution time just randomizing the freelist since it needs to allocate several SLABs for the new process. The second and more significant source of slowdown is poor locality caused by turning sequential object access patterns into random access patterns. For example, big-fork's L3 cache misses increased by around 13%.

4.1.4 Hardened Usercopy

Introduced since version 4.8, the hardened usercopy patch validates kernel pointers used when copying data between userspace and the kernel [26]. Without this patch, bugs in the kernel could be exploited to either cause buffer overflow attacks when too much data is copied from userspace, or to leak data when too much is copied to userspace. This patch protects against such bugs by performing a series of sanity checks on kernel pointers during every copy operation. However, this adds unnecessary overhead to kernel code that already validates pointers.

For example, consider select, which takes a set of file descriptors for every type of event the user wants to watch for. When invoked, the kernel copies the set from userspace, modifies it to indicate which events occurred, and then copies the set back to userspace. During this operation, the kernel already checks that kernel memory was allocated correctly and only copies as many bytes as were allocated. However, the hardened usercopy patch adds several redundant sanity checks to this process. These include checking that i) the kernel pointer is not null, ii) the kernel region involved does not overlap the text segment, and iii) the object's size does not exceed the size limit of its SLAB if it is allocated using the SLAB allocator. To evaluate the cost of these redundant checks, we carefully patched the kernel to remove them.

The cost of hardened usercopy depends on the type of data being copied and the amount. For select, the cost of checking adds 30ns of overhead. This slows down the test by a maximum of 18%. poll operates similarly to select and also has to copy file descriptors and events to and from userspace. Interestingly, epoll does not experience the same degree of slowdown since it copies less data; the list of events to watch for is kept in the kernel, and only the events which have occurred are copied to userspace. In contrast, the read tests copy one page to userspace at a time, but the page does not belong to a SLAB. As a result, only basic checks such as checking for a valid address are performed, costing only around 5ns for each page copied. This source of overhead is not significant even for big-read, which copies 10,000 pages.

4.2 New Features

Next we describe the root causes that are new kernel features. One of them, namely fault around (§4.2.1), is, in fact, an optimization. It improves performance for workloads with certain characteristics at the cost of others. Disabling transparent huge pages (§4.2.3) can also improve performance for certain workloads. However, these features also impose non-trivial overhead on LEBench's microbenchmarks. The other two features are new kernel functionalities mostly intended for virtualization or containerization needs.

4.2.1 Fault Around

Introduced in version 3.15, the fault around feature ("fault-around") is an optimization that aims to reduce the number of minor page faults [34]. A minor page fault occurs when no page table entry (PTE) exists for the required page, but

the page is resident in the page cache. On a page fault, fault-around not only attempts to establish the mapping for the faulting page, but also for the surrounding pages. Assuming the workload has good locality and several of the pages adjacent to the required page are resident in the page cache, fault-around will reduce the number of subsequent minor page faults. However, if these assumptions do not hold, fault-around can introduce overhead. For example, Roselli et al. studied several file system workloads and found that larger files tend to be accessed randomly, which renders prefetching unhelpful [63].
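The scenario where fault-around engages can be reproduced with a small sketch like the following (error handling omitted; fd is assumed to refer to a file of at least npages 4KB pages):

#include <sys/mman.h>

/* Sketch: trigger a single minor fault inside a large, newly
 * mapped file-backed region. With fault-around enabled, handling
 * this one fault also searches the page cache for surrounding
 * pages and establishes their mappings. */
char touch_first_page(int fd, size_t npages)
{
    size_t len = npages * 4096;
    char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    char c = p[0];          /* first access -> minor page fault */
    munmap(p, len);
    return c;
}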
The big-pagefault test experiences a 54% slowdown as a result of fault-around. big-pagefault triggers a page fault by accessing a single page within a larger memory-mapped region. When handling this page fault, the fault-around feature searches the page cache for surrounding pages and establishes their mappings, leading to additional overhead.

4.2.2 Control Groups Memory Controller

Introduced in version 2.6, the control group (cgroup) memory controller records and limits the memory usage of different control groups [42]. Control groups allow a user to isolate the resource usage of different groups of processes. They are a building block of containerization technologies like Docker [16] and Linux Containers (LXC) [43]. This feature is tightly coupled with the kernel's core memory controller so it can credit every page deallocation or debit every page allocation to a certain cgroup. It introduces overhead on tests that heavily exercise the kernel memory controller, even though they do not use the cgroup feature.

The munmap tests experienced the most significant slowdown due to the added overhead during page deallocation. In particular, big-munmap and med-munmap experienced an 81% and 48% slowdown, respectively, in kernels earlier than version 3.17.

Interestingly, the kernel developers only began to optimize cgroup's overhead in version 3.17, 6.5 years after cgroups was first introduced [29]. During munmap, the memory controller needs to "uncharge" the memory usage from the cgroup. Before version 3.17, the uncharging was done once for every page that was deallocated. It also required synchronization to keep the uncharging and the actual page deallocation atomic. Since version 3.17, uncharging is batched, i.e., it is done only once for all the removed mappings. It also occurs at a later stage when the mappings are invalidated from the TLB, so it no longer requires synchronization. Consequently, after kernel version 3.17, the slowdowns of big-munmap and med-munmap are reduced to 9% and 5%, respectively.
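The shape of that optimization can be sketched in pseudo-C; all types and helper names below are hypothetical stand-ins for kernel internals, not the kernel's actual code:

struct page { int id; };
static void lock_memcg(void) {}
static void unlock_memcg(void) {}
static void uncharge_one(struct page *p) { (void)p; }
static void uncharge_batch(struct page **p, int n) { (void)p; (void)n; }
static void release_page(struct page *p) { (void)p; }
static void flush_tlb(void) {}

/* Before v3.17 (sketch): one locked "uncharge" per deallocated
 * page, so a large munmap pays the cost n times. */
void unmap_pages_old(struct page **pages, int n)
{
    for (int i = 0; i < n; i++) {
        lock_memcg();
        uncharge_one(pages[i]);
        unlock_memcg();
        release_page(pages[i]);
    }
}

/* Since v3.17 (sketch): uncharging is accumulated and applied
 * once, after the mappings have been invalidated from the TLB,
 * removing the per-page synchronization. */
void unmap_pages_new(struct page **pages, int n)
{
    for (int i = 0; i < n; i++)
        release_page(pages[i]);
    flush_tlb();
    uncharge_batch(pages, n);   /* one batched "uncharge" */
}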
In contrast, the memory controller only adds 2.7% of overhead for the page fault tests. When handling a page fault, the memory controller first ensures that the cgroup's memory usage will stay within its limit following the page allocation, then "charges" the cgroup for the page. Here we do not see as significant a slowdown as in the case of munmap, because during each page fault, only one page is "charged" — the memory controller's overhead is still dwarfed by the cost of handling the page fault itself. In contrast, munmap often unmaps multiple pages together, aggregating the cost of the inefficient "uncharging." Note that mmap is generally unaffected by this change because each mmapped page is allocated on demand when it is later accessed. In addition, the read and write tests are not affected since they use pre-allocated pages from the page cache.

4.2.3 Transparent Huge Pages

Enabled from version 3.13 to 4.6, and again from 4.8 to 4.11, the transparent huge pages (THP) feature automatically adjusts the default page size [38]. It allocates 2MB pages (huge pages), and it also has a background thread that periodically promotes memory regions initially allocated with base pages (4KB) into huge pages. Under memory pressure, THP may decide to fall back to 4KB pages or free up more memory through compaction. THP can decrease the page table size and reduce the number of page faults; it also increases "TLB reach," so the number of TLB misses is reduced.

However, THP can also negatively impact performance. It could lead to internal fragmentation within huge pages. (Unlike FreeBSD [53], Linux could promote a 2MB region that has unallocated base pages into using a huge page [36].) Furthermore, the background thread can also introduce overhead [36]. Given this trade-off, kernel developers have been going back and forth on whether to enable THP by default. From version 4.8 to the present, THP is disabled by default.

In general, THP has positive effects on tests that access a large amount of memory. In particular, big-read slows down by as much as 83% on versions with THP disabled. It is worth noting that THP also diminishes the slowdowns caused by other root causes. For example, THP reduces the impact of Kernel Page Table Isolation (§4.1.1), since KPTI adds overhead on every kernel trap whereas THP reduces the number of page faults.
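For applications where this trade-off matters, THP can also be hinted per memory region rather than system-wide; a minimal sketch using the standard madvise(2) flags (effective only when THP is built into the kernel and not set to "never") is:

#define _GNU_SOURCE
#include <sys/mman.h>

/* Sketch: opt an anonymous region in or out of transparent huge
 * pages regardless of the system-wide default. Error handling
 * (MAP_FAILED, madvise return value) is omitted. */
void *alloc_region(size_t len, int want_huge)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    madvise(p, len, want_huge ? MADV_HUGEPAGE : MADV_NOHUGEPAGE);
    return p;
}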
4.2.4 Userspace Page Fault Handling

Enabled in versions 4.6, 4.8, and later versions, userspace page fault handling allows a userspace process to handle page faults for a specified memory region [30]. This is useful for a userspace virtual machine monitor (VMM) to better manage memory. A VMM could inform the kernel to deliver page faults within the guest's memory range to the VMM. One use of this is for virtual machine migration so that the pages can be migrated on-demand. When the guest VM page faults, the fault will be delivered to the VMM, where the VMM can then communicate with a remote VMM to fetch the page.

Overall, userspace page fault handling introduced negligible overhead except for the big-fork test, which was slowed down by 4% on average. This is because fork must check

each memory region in the parent process for associated userspace page fault handling information and copy this to the child if necessary. When the parent has a large number of pages that are mapped, this check becomes expensive.
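The userspace side of this feature is exposed through the userfaultfd(2) API; a compressed sketch of registering a region for missing-page faults (error handling omitted) looks like:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/userfaultfd.h>

/* Sketch: ask the kernel to deliver page faults for
 * [addr, addr+len) to userspace. A monitor thread would then
 * read struct uffd_msg events from the returned fd and resolve
 * each fault, e.g., with the UFFDIO_COPY ioctl. */
int watch_region(void *addr, unsigned long len)
{
    int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

    struct uffdio_api api = { .api = UFFD_API };
    ioctl(uffd, UFFDIO_API, &api);

    struct uffdio_register reg = {
        .range = { .start = (unsigned long)addr, .len = len },
        .mode  = UFFDIO_REGISTER_MODE_MISSING,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);
    return uffd;   /* poll/read this fd for fault events */
}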
object can only be done when it is no longer being read.
4.3 Configuration Changes Therefore, each write also sets a callback to be invoked later
Three of the root causes are non-optimal configurations. to delete the old version of the object when it is safe to do so.
Forced context tracking (§4.3.1) is a misconfiguration by The readers cooperate by actively informing RCU when they
the kernel and Ubuntu developers and causes the biggest start and finish reading an object. Normally, RCU checks for
slowdown in this category. The other two are the conse- ready callbacks and invokes them at each timer interrupt;
quences of older kernel versions lacking specifications for but under RSCT, this has to be performed at other kernel
the newer hardware used in our experiments, thus leading to entries and exits.
non-optimal decisions being made. While this reflects a limi- FCT performs context tracking on every user-kernel mode
tation of our methodology (i.e., running old kernels on new transition for every core, even on the ones without RSCT
hardware), these misconfigurations could impact real Linux enabled. FCT was initially introduced by the Linux devel-
users. First, kernel patches on hardware specifications may opers to test context tracking before RSCT was ready, and
not be released in a timely manner: the release of the (sim- is automatically enabled with RSCT. The Ubuntu develop-
ple) patch that specifies the size of second-level TLB did not ers mistakenly enabled RSCT in a release version, hence
take place until six months after the release of the Haswell inadvertently enabling FCT. When this was reported as a
processors, during which time users of the new hardware performance problem [14], the Ubuntu developers disabled
could suffer a 50% slowdown on certain workloads (§4.3.2). RSCT. However, this still failed to disable FCT, as the Linux
This misconfiguration could impact any modern processor developers accidentally left FCT enabled even after RSCT
with a second level of TLB. Furthermore, hardware speci- was working. This was only fixed in later Ubuntu distribu-
fications for the popular family of Haswell processors are tions as a result of another bug report [17], 11 months after
not back-ported to older kernel versions that still claim to the initial misconfiguration.
be actively supported (§4.3.3).
4.3.2 TLB Layout Change
4.3.1 Forced Context Tracking Introduced in kernel version 3.14, this patch improves per-
Released into the kernel by mistake in versions 3.10 and 3.12– formance by enabling Linux to recognize the size of the
15, forced context tracking (FCT) is a debugging feature that second-level TLB (STLB) on newer Intel processors. Know-
was used in the development of another feature, reduced ing the TLB’s size is important for deciding how to invalidate
scheduling-clock ticks [40]. Nonetheless, FCT was enabled TLB entries during munmap. There are two options: one is to
in several Ubuntu release kernels due to misconfigurations. shoot down (i.e., invalidate) individual entries, and the other
This caused a minimum of approximately 200–300ns over- is to flush the entire TLB. Shoot-down should be used when
head in every trip to and from the kernel, thus significantly the number of mappings to remove is small relative to the
affecting all of our tests (see Figure 1). On average, FCT slows TLB’s capacity, whereas TLB flushing is better when the
down each of the 28 tests by 50%, out of which 7 slow down number of entries to invalidate is comparable to the TLB’s
by more than 100% and another 8 by 25–100%. capacity.
The reduced scheduling-clock ticks (RSCT) feature allows Before this patch was introduced, Linux used the size of
the kernel to disable the delivery of timer interrupts to idle the first-level data and instruction TLB (64 entries on our
CPU cores or cores running only one task. This reduces test machines) as the TLB’s size, and is not aware of the
power consumption for idle cores and interruptions for cores larger second-level TLB with 1024 entries. This resulted in
running a single compute-intensive task. However, work incorrect TLB invalidation decisions: for a TLB capacity of
normally done during these timer interrupts must now be 64, Linux calculates the flushing threshold to be 64/64 = 1.
done during other user-kernel mode transitions like system This means that, without the patch, invalidating more than
calls. Such work is referred to as context tracking. just one entry will cause a full TLB flush. As a result, the
Context tracking involves two tasks—CPU usage tracking med-munmap test, which removes 10 entries, suffers as much as
and participation in the read-copy update (RCU) algorithm. a 50% slowdown on a subsequent read of a memory-mapped
Tracking how much time is spent in userspace and the ker- file of 1024 pages due to the increased TLB misses. With the
nel is usually performed by counting the number of timer patch, the TLB flush threshold is increased to 16 (1024/64)
interrupts. Without timer interrupts, this must be done on on our processor, so med-munmap no longer induced a full
other kernel entries and exits instead. Context tracking also flush. However, this patch was only released six months
participates in RCU, a kernel subsystem that provides lock- after the earliest version of the Haswell family of processors
less synchronization. Conceptually, under RCU, each object was released. Note that small-munmap and big-munmap were
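To make this pattern concrete, the following is a minimal sketch of RCU's read and update sides written against the kernel's RCU API. The struct config type and its fields are hypothetical, serialization among concurrent writers is assumed to be handled elsewhere, and an initial version is assumed to have already been published:

    #include <linux/kernel.h>
    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    /* Hypothetical RCU-protected object; the embedded rcu_head is
     * what call_rcu() uses to defer freeing the old version. */
    struct config {
        int value;
        struct rcu_head rcu;
    };

    static struct config __rcu *global_config;

    /* Reader: brackets the access so RCU knows the version it read
     * may still be in use. */
    static int read_value(void)
    {
        int v;

        rcu_read_lock();
        v = rcu_dereference(global_config)->value;
        rcu_read_unlock();
        return v;
    }

    static void free_old_version(struct rcu_head *head)
    {
        kfree(container_of(head, struct config, rcu));
    }

    /* Writer: copy and update into a new version, publish it, then
     * register a callback that frees the old version once no reader
     * can still be using it. */
    static void update_value(int v)
    {
        struct config *old, *new;

        new = kmalloc(sizeof(*new), GFP_KERNEL); /* error handling elided */
        new->value = v;
        old = rcu_dereference_protected(global_config, 1);
        rcu_assign_pointer(global_config, new);
        call_rcu(&old->rcu, free_old_version);
    }

The callbacks registered by call_rcu() are the deferred deletions described above; they are normally checked and invoked from the timer interrupt, and under RSCT that check moves to other kernel entries and exits, contributing to the per-transition cost discussed next.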
FCT performs context tracking on every user-kernel mode transition for every core, even on the ones without RSCT enabled. FCT was initially introduced by the Linux developers to test context tracking before RSCT was ready, and is automatically enabled with RSCT. The Ubuntu developers mistakenly enabled RSCT in a release version, hence inadvertently enabling FCT. When this was reported as a performance problem [14], the Ubuntu developers disabled RSCT. However, this still failed to disable FCT, as the Linux developers accidentally left FCT enabled even after RSCT was working. This was only fixed in later Ubuntu distributions as a result of another bug report [17], 11 months after the initial misconfiguration.

4.3.2 TLB Layout Change
Introduced in kernel version 3.14, this patch improves performance by enabling Linux to recognize the size of the second-level TLB (STLB) on newer Intel processors. Knowing the TLB's size is important for deciding how to invalidate TLB entries during munmap. There are two options: one is to shoot down (i.e., invalidate) individual entries, and the other is to flush the entire TLB. Shoot-down should be used when the number of mappings to remove is small relative to the TLB's capacity, whereas TLB flushing is better when the number of entries to invalidate is comparable to the TLB's capacity.

Before this patch was introduced, Linux used the size of the first-level data and instruction TLB (64 entries on our test machines) as the TLB's size, and was not aware of the larger second-level TLB with 1024 entries. This resulted in incorrect TLB invalidation decisions: for a TLB capacity of 64, Linux calculates the flushing threshold to be 64/64 = 1. This means that, without the patch, invalidating more than just one entry will cause a full TLB flush. As a result, the med-munmap test, which removes 10 entries, suffers as much as a 50% slowdown on a subsequent read of a memory-mapped file of 1024 pages due to the increased TLB misses. With the patch, the TLB flush threshold is increased to 16 (1024/64) on our processor, so med-munmap no longer induces a full flush. However, this patch was only released six months after the earliest version of the Haswell family of processors was released. Note that small-munmap and big-munmap were not affected because the kernel still made the right decision by invalidating a single entry in small-munmap and flushing the entire TLB in big-munmap.
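The decision reduces to a simple threshold test. Below is a simplified sketch, not the kernel's actual code; the function and constant names are illustrative, and the divisor of 64 is inferred from the thresholds quoted above (64/64 = 1 without the patch, 1024/64 = 16 with it):

    #include <stdbool.h>

    /* Illustrative divisor implied by the paper's thresholds. */
    #define FLUSHALL_DIVISOR 64

    /* tlb_entries is what the kernel believes the TLB capacity to be:
     * 64 before the patch (first-level TLB only), 1024 after (STLB). */
    static bool flush_entire_tlb(unsigned long entries_to_invalidate,
                                 unsigned long tlb_entries)
    {
        unsigned long threshold = tlb_entries / FLUSHALL_DIVISOR;

        /* Shoot down individual entries only when the unmap is small
         * relative to the TLB's capacity; otherwise flush everything. */
        return entries_to_invalidate > threshold;
    }

    /* med-munmap removes 10 entries:
     *   flush_entire_tlb(10, 64)   == true  -> full flush, extra misses
     *   flush_entire_tlb(10, 1024) == false -> shoot down 10 entries
     */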
[Figure 7. Latency and throughput trends of the Apache Benchmark and selected Redis Benchmark tests (3 write tests and 2 read tests with highest system times). The panels plot Time Per Request (ms) and Requests Per Second for Redis SPOP, RPUSH, RPOP, GET, and LRANGE 100, plus Apache and Nginx, across Linux kernel versions 3.0–4.19.]

4.3.3 CPU Idle Power-State Support
Introduced in kernel version 3.9, this patch specifies the fine-grained idle power saving modes of the Intel processor with the Haswell microarchitecture used by our server. Modern processors save power by idling their components. With more of its components turning idle, the processor is said to enter a deeper idle state and will consume less power. However, deeper idle states also take longer to recover from, and this latency results in a lower overall effective operating frequency [25].

Before this patch, the kernel only recognized coarse-grained power saving states. Therefore, when trying to save power, it always turned the processor to the deepest idle state. With this patch, the kernel's idle driver takes control of the processor's power management and utilizes lighter idle states. This increases the effective frequency by 31%. On average, this patch speeds up LEBench by 21%, with the CPU-intensive select test achieving the most significant speedup of 31%.

While this patch was released in advance of the release of the Xeon processors, it was not backported to the LTS kernel lines which were still supported at the time, including 3.0, 3.2, and 3.4. This means that in order to achieve the best performance for newer hardware, a user might be forced to adopt the newer kernel lines at the cost of potentially unstable features.
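Whether a kernel's idle driver knows about a processor's fine-grained idle states can be inspected from userspace through the standard cpuidle sysfs interface. The following small sketch assumes the usual sysfs layout; the set of states reported depends on the kernel version and the idle driver in use, so a kernel lacking the Haswell idle-state tables would list only coarse states:

    #include <stdio.h>

    /* Lists the idle states the kernel's cpuidle driver exposes for
     * CPU 0, with each state's exit latency (in microseconds). */
    int main(void)
    {
        for (int s = 0; ; s++) {
            char path[128], name[64];
            unsigned latency;

            snprintf(path, sizeof(path),
                     "/sys/devices/system/cpu/cpu0/cpuidle/state%d/name", s);
            FILE *nf = fopen(path, "r");
            if (!nf)
                break; /* no more states */
            snprintf(path, sizeof(path),
                     "/sys/devices/system/cpu/cpu0/cpuidle/state%d/latency", s);
            FILE *lf = fopen(path, "r");
            if (lf && fscanf(nf, "%63s", name) == 1 &&
                fscanf(lf, "%u", &latency) == 1)
                printf("state%d: %-10s exit latency %u us\n", s, name, latency);
            fclose(nf);
            if (lf)
                fclose(lf);
        }
        return 0;
    }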
5 Macrobenchmark
To understand how the 11 identified root causes affect real-world workloads, we evaluate the Redis key-value store [62], Apache HTTP Server [7], and Nginx web server [55],³ across the Linux kernel versions on which we tested LEBench. Redis' workload was used to build LEBench, while workloads from the other two applications serve as validation. We use Redis' and Apache's built-in benchmarks—Redis Benchmark [61] and ApacheBench [6]—respectively; we also use ApacheBench to evaluate Nginx. Each benchmark is configured to issue 100,000 requests through 50 (for Redis) or 100 (for Apache and Nginx) concurrent connections.

All three applications spend significant time in the kernel and exhibit performance trends (shown in Figure 7) similar to those observed from LEBench. For each test, the throughput trend tends to be the inverse of the latency trend. For brevity, we only display Redis Benchmark's three most kernel-intensive write tests, responsible for inserting (RPUSH) or deleting (SPOP, RPOP) records from the key-value store, and the two most kernel-intensive read tests, responsible for returning the value of a key (GET) and returning a range of values for a key (LRANGE) [60].

We disable the 11 root causes on the kernels and evaluate their impact on the applications. Overall, disabling the 11 root causes brings significant speedup for all three applications, improving the performance of Redis, Apache, and Nginx by a maximum of 56%, 33%, and 34%, and an average of 19%, 6.5%, and 10%, respectively, across all kernels. Four changes—forced context tracking (§4.3.1), kernel page table isolation (§4.1.1), missing CPU idle power states (§4.3.3), and avoiding indirect jump speculation (§4.1.2)—account for 88% of the slowdown across all applications. This is not surprising given that these four changes also resulted in the most significant and widespread impact on LEBench tests, as evident in Figure 2. The rest of the performance-impacting changes create more tolerable and steady sources of overhead: across all kernels, they cause an average combined slowdown of 4.2% for Redis, and 3.2% for Apache and Nginx; this observation is again consistent with the results obtained from LEBench, where these changes cause an average slowdown of 2.6% across the tests. It is worth noting that these changes could cause more significant individual fluctuations—if we only count the worst kernels, on average, each change can cause as much as a 5.8%, 11.5%, and 12.2% slowdown for Redis, Apache, and Nginx, respectively.

³In 2019, Redis is the most popular key value store [15]. Apache and Nginx rank first and third in web server market share, respectively, and together account for more than half of all market share [54].
[Figure 8. Comparing the results of LEBench on two machines: (a) % change in latency relative to v4.0 on the E5-2630 v3, and (b) % change in latency relative to v4.0 on the i7-4810MQ. Rows are the LEBench tests and columns are Linux kernel versions; for brevity, we only show results after v3.7 and before v4.15.]

Overall, we find that not only do the macrobenchmarks and LEBench display significant overlap in overall performance trends, they are also impacted by the 11 changes in very similar ways. While it is a limitation that we carried out detailed analysis on the results from a microbenchmark, which does not always exercise kernel operations the same way as a macrobenchmark, we note that even different real-world workloads do not necessarily exercise kernel operations identically. We chose to construct LEBench out of a set of representative real-world workloads, and our evaluation results from the macrobenchmarks confirm the relevance of LEBench.

6 Sensitivity Analysis
To understand how different hardware affects the results from LEBench, we repeat the tests on a laptop with a 2.8GHz Intel i7-4810MQ processor, 32GB of 1600MHz DDR4 memory, and a 512GB SSD. Figure 8 displays a side-by-side comparison of the results.

Out of the 11 changes described in §4, 10 have similar performance impacts on LEBench on the i7 laptop; updating CPU idle states does not impact the i7 processor's frequency. The other 10 changes manifest in differing degrees of performance impact on each machine due to different hardware speeds. For example, the i7 laptop has a faster processor and slower memory. Therefore, the slowdown due to increased L3 misses from randomizing the SLAB freelist gets exaggerated for big-fork (seen after v4.6), likely because the test is memory bound. In addition, we observe more performance variability in the results collected from the laptop, caused by CPU throttling due to over-heating.

7 Discussion
Our findings suggest that kernel performance tuning can play an important role. Unfortunately, thorough performance tuning of the Linux kernel can be extremely expensive. For example, Red Hat and Suse normally require 6-18 months to optimize the performance of an upstream Linux kernel before it can be released as an enterprise distribution [65]. Adding to the difficulty, Linux is a generic OS kernel and thus must support a diverse array of hardware configurations and workloads; many forms of performance optimization do not make sense unless a workload's characteristics are taken into account. For example, Google's data center kernel is carefully performance tuned for their workloads. This task is carried out by a team of over 100 engineers, and for each new kernel, the effort can also take 6-18 months [65].

Unfortunately, this heavyweight performance tuning process cannot catch up with the pace at which Linux is evolving. Our study observes an increasing number of features and security enhancements being added to Linux. In fact, Linux releases a new kernel every 2-3 months, and every release incorporates between 13,000 and 18,000 commits [32]. It is estimated that the mainline Linux kernel accepts 8.5 changes every hour on average [19]. Under such a tight schedule, each release effectively only serves as an integration and stabilization point; therefore, systematic performance tuning is not carried out by the kernel or distribution developers for most kernel releases [27, 65].

Clearly, performance comes at a high cost, and unfortunately, this cost is difficult to get around. Most Linux users cannot afford the amount of resources large enterprises like Google put into custom Linux performance tuning.
For the average user, it may be more economical to pay for a Red Hat Enterprise Linux (RHEL) licence, or they may have to compensate for the lack of performance tuning by investing in hardware (i.e., purchasing more powerful servers or scaling their server pool) to make up for slower kernels. All of these facts point to the importance of kernel performance, whose optimization remains a difficult challenge.

8 Limitations
We restrict the scope of our study due to practical limitations. First, while LEBench tests are obtained from profiling a set of popular workloads, we omitted many other types of popular Linux workloads, for example, HPC or virtualization workloads [3]. Second, we only used two machine setups in our study, and both use Intel processors. A more comprehensive study should sample other types of processor architectures, for example, those used in embedded devices on which Linux is widely deployed. Finally, our study focuses on Linux, and our results may not be general to other OSes.

9 Related Work
Prior works on analyzing core OS operation performance focused either on comparing the same OS on different architectures or different OSes on the same architecture. In comparison, this paper is the first to compare historical versions of the same OS, systematically analyzing the root causes of performance fluctuations over time.

Ousterhout [56] analyzed OS performance on a range of computers and concluded that the OS was not getting faster at the same rate as the processor due to the increasing discrepancy between the speed of the processor and other devices. Anderson et al. [5] further zoomed into processor architectures and provided a detailed analysis on the performance implications of different architecture designs to the OS. Rosenblum et al. [64] evaluated the impact of architectural trends on operating system performance. They found that despite faster processors and bigger caches, OS performance continued to be bottlenecked by disk I/O and by memory on multiprocessors. Chen and Patterson [12] developed a self-scaling I/O benchmark and used it to analyze a number of different systems. McVoy and Staelin [47] developed lmbench, a microbenchmark that measures a variety of OS operations. Brown and Seltzer [11] further extended lmbench. A large number of tests in lmbench were still focused on measuring different hardware speeds. In comparison, we selected tests from workloads commonly used today; therefore, LEBench might be more relevant for modern applications.

Others have evaluated Linux kernel performance using macrobenchmarks. Phoronix [50, 51] studied Linux's performance across multiple versions but focused on macrobenchmarks, many of which are not kernel intensive. Moreover, they do not analyze the causes of these performance changes. In 2007, developers at Intel introduced a Linux regression testing framework using a suite of micro- and macrobenchmarks [21], which caught a number of performance regressions in release candidates [13]. In contrast, our study focuses on performance changes in stable versions that persist over many versions, which are more likely to impact real users.

Additional studies have analyzed other aspects of OS performance. Boyd-Wickizer et al. [10] analyzed Linux's scalability and found that the traditional kernel design can be adapted to scale without architectural changes. Lozi et al. [45] discovered Linux kernel bugs that resulted in leaving cores idle even when runnable tasks exist. Pillai et al. [58] discovered Linux file systems often trade crash consistency guarantees for good performance.

Finally, Heiser and Elphinstone [20] examined the evolution of the L4 microkernel for the past 20 years and found that many design and implementation choices have been phased out because they either are too complex or inflexible, or complicate verification.

10 Concluding Remarks
This paper presents an in-depth analysis of the evolution of core OS operation performance in Linux. Overall, most of the core Linux operations today are much slower than a few years ago, and substantial performance fluctuations are common. We attribute most of the slowdowns to 11 changes grouped into three categories: security enhancements, new features, and misconfigurations. Studying each change in detail, we find that many of the performance-impacting changes are possible to mitigate with more proactive performance testing and optimizations, and most of the performance impact is possible to avoid through custom configuration of the kernel. This highlights the importance of investing more in kernel performance tuning and its potential benefits.

Acknowledgements
We would like to thank our shepherd, Edouard Bugnion, and the anonymous reviewers for their extensive feedback and comments on our work. We thank Theodore Ts'o for explaining the practices of Linux performance tuning used by the kernel developers, distributors like Red Hat and Suse, and users like Google.

We thank Tong Liu for collecting traces of the real-world workloads used to develop LEBench. We thank Serguei Makarov for sharing his experiences with upstream software project development. We also thank David Lion for his feedback on this paper.

This research is supported by an NSERC Discovery grant, a NetApp Faculty Fellowship, a VMware gift, and a Huawei grant. Xiang (Jenny) Ren and Kirk Rodrigues are supported by SOSP 2019 student scholarships from the ACM Special Interest Group in Operating Systems to attend the SOSP conference.
References
[1] Advanced Micro Devices. 2018. “Speculative Store Bypass” Vulnerability Mitigations for AMD Platforms. https://www.amd.com/en/corporate/security-updates.
[2] Advanced Micro Devices. 2019. AMD64 Architecture Programmer's Manual. Vol. 3. Chapter 3, 262.
[3] Al Gillen and Gary Chen. 2011. The Value of Linux in Today's Fast-Changing Computing Environments.
[4] Amazon Web Services. 2017. AWS re:Invent 2017: How Netflix Tunes Amazon EC2 Instances for Performance (CMP325). https://www.youtube.com/watch?v=89fYOo1V2pA.
[5] Thomas E. Anderson, Henry M. Levy, Brian N. Bershad, and Edward D. Lazowska. 1991. The Interaction of Architecture and Operating System Design. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IV). ACM, 108–120.
[6] Apache. 2018. ab - Apache HTTP Server Benchmarking Tool. https://httpd.apache.org/docs/2.4/programs/ab.html.
[7] Apache. 2018. Apache HTTP Server Project. https://httpd.apache.org/.
[8] Simon Biggs, Damon Lee, and Gernot Heiser. 2018. The Jury Is In: Monolithic OS Design Is Flawed: Microkernel-based Designs Improve Security. In Proceedings of the 9th Asia-Pacific Workshop on Systems (APSys '18). ACM, Article 16, 7 pages.
[9] Jeff Bonwick. 1994. The Slab Allocator: An Object-caching Kernel Memory Allocator. In Proceedings of the 1994 USENIX Summer Technical Conference (USTC '94). USENIX Association, 87–98.
[10] Silas Boyd-Wickizer, Austin T. Clements, Yandong Mao, Aleksey Pesterev, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich. 2010. An Analysis of Linux Scalability to Many Cores. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (OSDI '10). USENIX Association, 1–16.
[11] Aaron B. Brown and Margo I. Seltzer. 1997. Operating System Benchmarking in the Wake of Lmbench: A Case Study of the Performance of NetBSD on the Intel x86 Architecture. In Proceedings of the 1997 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '97). ACM, 214–224.
[12] Peter M. Chen and David A. Patterson. 1993. A New Approach to I/O Performance Evaluation: Self-scaling I/O Benchmarks, Predicted I/O Performance. In Proceedings of the 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '93). ACM, 1–12.
[13] Tim Chen, Leonid I. Ananiev, and Alexander V. Tikhonov. 2007. Keeping Kernel Performance from Regressions. In Proceedings of the Linux Symposium, Vol. 1. 93–102.
[14] Colin Ian King. 2013. Context Switching on 3.11 Kernel Costing CPU and Power. https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1233681.
[15] DB-Engines. 2019. DB-engines Ranking. https://db-engines.com/en/ranking.
[16] Docker. 2018. Docker. https://www.docker.com/.
[17] George Greer. 2014. getitimer Returns it_value=0 Erroneously. https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1349028.
[18] Graz University of Technology. 2018. Meltdown and Spectre. https://meltdownattack.com/.
[19] Greg Kroah-Hartman. 2017. Linux Kernel Release Model. http://kroah.com/log/blog/2018/02/05/linux-kernel-release-model/.
[20] Gernot Heiser and Kevin Elphinstone. 2016. L4 Microkernels: The Lessons from 20 Years of Research and Deployment. ACM Transactions on Computer Systems 34, 1, Article 1 (April 2016), 29 pages.
[21] Intel Corporation. 2017. Linux Kernel Performance. https://01.org/lkp.
[22] Intel Corporation. 2018. Speculative Execution and Indirect Branch Prediction Side Channel Analysis Method. https://www.intel.com/content/www/us/en/security-center/advisory/intel-sa-00088.html.
[23] Intel Corporation. 2019. Intel® 64 and IA-32 Architectures Software Developer's Manual. Vol. 3A. Chapter 4.10.1.
[24] Intel Corporation. 2019. Intel® 64 and IA-32 Architectures Software Developer's Manual. Vol. 1. Chapter 11.4.4.4.
[25] Intel Corporation. 2019. Intel® 64 and IA-32 Architectures Software Developer's Manual. Vol. 3. Chapter 14.5.
[26] Jake Edge. 2016. Hardened Usercopy. https://lwn.net/Articles/695991/.
[27] Jake Edge. 2017. Testing Kernels. https://lwn.net/Articles/734016/.
[28] Jon Oberheide. 2010. Linux Kernel CAN SLUB Overflow. https://jon.oberheide.org/blog/2010/09/10/linux-kernel-can-slub-overflow/.
[29] Jonathan Corbet. 2007. Notes from a Container. https://lwn.net/Articles/256389/.
[30] Jonathan Corbet. 2015. User-space Page Fault Handling. https://lwn.net/Articles/636226/.
[31] Jonathan Corbet. 2017. The Current State of Kernel Page-table Isolation. https://lwn.net/Articles/741878/.
[32] Jonathan Corbet and Greg Kroah-Hartman. 2017. 2017 State of Linux Kernel Development. https://www.linuxfoundation.org/2017-linux-kernel-report-landing-page/.
[33] Judd Vinet and Aaron Griffin. 2018. Arch Linux. https://www.archlinux.org/.
[34] Kirill A. Shutemov. 2014. mm: Map Few Pages Around Fault Address if They are in Page Cache. https://lwn.net/Articles/588802/.
[35] Paul Kocher, Daniel Genkin, Daniel Gruss, Werner Haas, Mike Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael Schwarz, and Yuval Yarom. 2018. Spectre Attacks: Exploiting Speculative Execution. (Jan. 2018). arXiv:1801.01203
[36] Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J. Rossbach, and Emmett Witchel. 2016. Coordinated and Efficient Huge Page Management with Ingens. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI '16). USENIX Association, 705–721.
[37] Kevin Lai and Mary Baker. 1996. A Performance Comparison of UNIX Operating Systems on the Pentium. In Proceedings of the 1996 USENIX Annual Technical Conference (ATC '96). USENIX Association, 265–277.
[38] Linux. 2017. Transparent Hugepage Support. https://www.kernel.org/doc/Documentation/vm/transhuge.txt.
[39] Linux. 2017. Short Users Guide for SLUB. https://www.kernel.org/doc/Documentation/vm/slub.txt.
[40] Linux. 2018. NO_HZ: Reducing Scheduling-Clock Ticks. https://www.kernel.org/doc/Documentation/timers/NO_HZ.txt.
[41] Linux. 2018. Page Table Isolation. https://www.kernel.org/doc/Documentation/x86/pti.txt.
[42] Linux. 2019. Memory Resource Controller. https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt.
[43] Linux Containers. 2018. Linux Containers. https://linuxcontainers.org/.
[44] Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher, Werner Haas, Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval Yarom, and Mike Hamburg. 2018. Meltdown. (Jan. 2018). arXiv:1801.01207
[45] Jean-Pierre Lozi, Baptiste Lepers, Justin Funston, Fabien Gaud, Vivien Quéma, and Alexandra Fedorova. 2016. The Linux Scheduler: A Decade of Wasted Cores. In Proceedings of the 11th European Conference on Computer Systems (EuroSys '16). ACM, Article 1, 16 pages.
[46] Markus Podar. 2014. Current Ubuntu 14.04 Uses Kernel with Degraded Disk Performance in SMP Environment. https://github.com/jedi4ever/veewee/issues/1015.
[47] Larry McVoy and Carl Staelin. 1996. Lmbench: Portable Tools for Performance Analysis. In Proceedings of the 1996 USENIX Annual Technical Conference (ATC '96). USENIX Association, 279–294.
[48] Michael Dale Long. 2016. Unnaccounted for High CPU Usage While Idle. https://bugzilla.kernel.org/show_bug.cgi?id=150311.
[49] Michael Kerrisk. 2012. KS2012: memcg/mm: Improving Memory cgroups Performance for Non-users. https://lwn.net/Articles/516533/.
[50] Michael Larabel. 2010. Five Years of Linux Kernel Benchmarks: 2.6.12 Through 2.6.37. https://www.phoronix.com/scan.php?page=article&item=linux_2612_2637.
[51] Michael Larabel. 2016. Linux 3.5 Through Linux 4.4 Kernel Benchmarks: A 19-Way Kernel Showdown Shows Some Regressions. https://www.phoronix.com/scan.php?page=article&item=linux-44-19way.
[52] Michael Larabel. 2017. The Linux Kernel Gained 2.5 Million Lines of Code, 71k Commits in 2017. https://www.phoronix.com/scan.php?page=news_item&px=Linux-Kernel-Commits-2017.
[53] Juan Navarro, Sitaram Iyer, Peter Druschel, and Alan Cox. 2002. Practical, Transparent Operating System Support for Superpages. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI '02). USENIX Association, 89–104.
[54] Netcraft. 2019. March 2019 Web Server Survey | Netcraft. https://news.netcraft.com/archives/2019/03/28/march-2019-web-server-survey.html.
[55] Nginx. 2019. NGINX | High Performance Load Balancer, Web Server, & Reverse Proxy. https://www.nginx.com/.
[56] John K. Ousterhout. 1990. Why Aren't Operating Systems Getting Faster As Fast as Hardware?. In Proceedings of the 1990 USENIX Summer Technical Conference (USTC '90). USENIX Association, 247–256.
[57] Philippe Gerum. 2018. Troubleshooting Guide. https://gitlab.denx.de/Xenomai/xenomai/wikis/Troubleshooting.
[58] Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnatthan Alagappan, Samer Al-Kiswany, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2014. All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-consistent Applications. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI '14). USENIX Association, 433–448.
[59] Randal E. Bryant and David R. O'Hallaron. 2002. Computer Systems: A Programmer's Perspective (1 ed.). Prentice Hall, 467–470.
[60] Redis. 2018. Command Reference — Redis. https://redis.io/commands.
[61] Redis. 2018. How Fast is Redis? https://redis.io/topics/benchmarks.
[62] Redis. 2018. Redis. https://redis.io/.
[63] Drew Roselli, Jacob R. Lorch, and Thomas E. Anderson. 2000. A Comparison of File System Workloads. In Proceedings of the 2000 USENIX Annual Technical Conference (ATC '00). USENIX Association, 41–54.
[64] M. Rosenblum, E. Bugnion, S. A. Herrod, E. Witchel, and A. Gupta. 1995. The Impact of Architectural Trends on Operating System Performance. In Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP '95). ACM, 285–298.
[65] Theodore Y. Ts'o. 2019. Personal Communication.
[66] Thomas Garnier. 2016. mm: SLAB Freelist Randomization. https://lwn.net/Articles/682814/.
[67] Thomas Gleixner. 2018. x86/retpoline: Add Initial Retpoline Support. https://patchwork.kernel.org/patch/10152669/.
[68] Ubuntu. 2018. Ubuntu. https://www.ubuntu.com/.
[69] Vlad Frolov. 2016. [REGRESSION] Intensive Memory CGroup Removal Leads to High Load Average 10+. https://bugzilla.kernel.org/show_bug.cgi?id=190841.
[70] W3Techs. 2018. Usage Statistics and Market Share of Linux for Websites. https://w3techs.com/technologies/details/os-linux/all/all.
