Master Linux Programming and Network
Backward-Edge Control-Flow
Integrity Performance in the
Linux Kernel
Christian Resell
Department of Informatics
Faculty of mathematics and natural sciences
UNIVERSITY OF OSLO
Spring 2020
Forward-Edge and
Backward-Edge
Control-Flow Integrity
Performance in the Linux
Kernel
Christian Resell
© 2020 Christian Resell
http://www.duo.uio.no/
Acknowledgements
First of all I would like to thank my supervisor, Knut Omang, for letting me
choose my own topic for this thesis. His patience and great advice have been
crucial for completing the thesis.
Thanks to Aleksi Luukkonen for letting me use his hardware during the
early stages of writing. This was very helpful for getting started. Thank you to
Christoffer Buen for being a great rubber duck when I was stuck early in the
research process.
I would also like to thank my family for all the support they have given me
through the years. Finally, I would like to thank my girlfriend, Marit Iren Rognli
Tokle, for supporting me and putting up with me while I have been working on
this thesis.
Contents
1 Introduction 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Background and motivation . . . . . . . . . . . . . . . . . . . . . 1
1.3 Terms and definitions . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3.1 Vulnerability and exploit . . . . . . . . . . . . . . . . . . 2
1.3.2 Kernel space and user space . . . . . . . . . . . . . . . . . 2
1.3.3 System calls . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3.4 Inlining . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.5 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Linux kernel 4
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 History and distributions . . . . . . . . . . . . . . . . . . . . . . 4
2.3 Android . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.4 Android IPC — Binder . . . . . . . . . . . . . . . . . . . . . . . 6
2.5 Open source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.6 Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.6.1 Calling conventions . . . . . . . . . . . . . . . . . . . . . . 7
2.6.2 Prologues and epilogues . . . . . . . . . . . . . . . . . . . 12
3 Software vulnerabilities 14
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2 Memory layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.3 Memory corruption bugs . . . . . . . . . . . . . . . . . . . . . . . 17
3.4 Code-reuse attacks . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.4.1 return-to-libc . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.4.2 Return-oriented programming . . . . . . . . . . . . . . . . 22
3.4.3 Jump-oriented programming . . . . . . . . . . . . . . . . 23
3.4.4 Other attacks . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.5 Memory corruption defenses . . . . . . . . . . . . . . . . . . . . . 24
3.5.1 Address space layout randomization . . . . . . . . . . . . 24
3.5.2 Stack protection . . . . . . . . . . . . . . . . . . . . . . . 25
3.6 Other vulnerability classes . . . . . . . . . . . . . . . . . . . . . . 26
3.6.1 Format string vulnerabilities . . . . . . . . . . . . . . . . 26
3.6.2 Use-after-free vulnerabilities . . . . . . . . . . . . . . . . . 28
3.6.3 Type confusion . . . . . . . . . . . . . . . . . . . . . . . . 28
3.6.4 Heap vulnerabilities in general . . . . . . . . . . . . . . . 28
3.6.5 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 28
7 Performance benchmarks 64
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
7.2 Motivation and previous work . . . . . . . . . . . . . . . . . . . . 66
7.3 Binary size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
7.4 Compile time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
7.5 Kernel performance . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.6 WireGuard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
8 Results 71
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
8.2 QEMU results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
8.3 Binary size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
8.4 Compilation time . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
8.5 Micro benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . 73
8.5.1 LMBench . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
8.6 Macro benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . 86
8.6.1 Redis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
8.6.2 ApacheBench . . . . . . . . . . . . . . . . . . . . . . . . . 87
8.6.3 WireGuard . . . . . . . . . . . . . . . . . . . . . . . . . . 92
8.7 Android benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . 96
8.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
8.9 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
A Scripts 99
A.1 Linux kernel installation script . . . . . . . . . . . . . . . . . . . 99
A.2 Linux kernel build script . . . . . . . . . . . . . . . . . . . . . . . 100
A.3 Android Linux kernel build script . . . . . . . . . . . . . . . . . . 102
A.4 Docker . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
A.5 Benchmarking script . . . . . . . . . . . . . . . . . . . . . . . . . 103
A.6 Python benchmarking scripts . . . . . . . . . . . . . . . . . . . . 104
A.7 WireGuard benchmarking . . . . . . . . . . . . . . . . . . . . . . 123
B Code 132
B.1 CFI failure fixes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
B.2 Inline limit patch . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
List of Tables
List of Figures
8.4 LMBench bandwidth read results . . . . . . . . . . . . . . . . . . 76
8.5 LMBench bandwidth read open2close results . . . . . . . . . . . 76
8.6 LMBench bandwidth mmap read results . . . . . . . . . . . . . . 77
8.7 LMBench bandwidth mmap read open2close results . . . . . . . 77
8.8 LMBench bandwidth bzero results . . . . . . . . . . . . . . . . . 78
8.9 LMBench bandwidth memory read results . . . . . . . . . . . . . 78
8.10 LMBench bandwidth memory partial read results . . . . . . . . . 79
8.11 LMBench bandwidth memory write results . . . . . . . . . . . . 79
8.12 LMBench bandwidth memory partial write results . . . . . . . . 80
8.13 LMBench bandwidth memory partial read/write results . . . . . 80
8.14 newstat() definition in the Linux kernel . . . . . . . . . . . . . . 81
8.15 newstat() from gcc kernel . . . . . . . . . . . . . . . . . . . . . 81
8.16 newstat() from clang kernels . . . . . . . . . . . . . . . . . . . . 82
8.17 user path at empty() assembly code from gcc kernel . . . . . . 85
8.18 redis-benchmark results . . . . . . . . . . . . . . . . . . . . . . . 86
8.19 ApacheBench nginx requests/s . . . . . . . . . . . . . . . . . . . 88
8.20 ApacheBench nginx time per request . . . . . . . . . . . . . . . . 89
8.21 ApacheBench apache2 requests/s . . . . . . . . . . . . . . . . . . 90
8.22 ApacheBench apache2 time per request . . . . . . . . . . . . . . 91
8.23 WireGuard UDP send bandwidth . . . . . . . . . . . . . . . . . . 93
8.24 WireGuard TCP send bandwidth . . . . . . . . . . . . . . . . . . 94
8.25 WireGuard TCP receive bandwidth . . . . . . . . . . . . . . . . 95
Chapter 1
Introduction
1.1 Introduction
Software security is a cat-and-mouse game between attackers and defenders. As
more and more sophisticated protections are put in place, attackers come up
with more creative and advanced techniques for circumventing them.
This thesis will look into modern software defenses employed today. We first
go through standard mitigations supported by most modern compilers, and then
look at state-of-the-art protections applied to the Linux kernel. More
specifically, our focus will be on so-called forward-edge and backward-edge
control-flow integrity implementations.
The modern defenses examined require that the Linux kernel be compiled
with clang, which is not the compiler usually used to build the kernel. Thus it
makes sense to examine whether switching to clang can provide a performance
boost, or whether it will slow the kernel down. To support control-flow integrity,
clang uses something called link-time optimization, which is an interprocedural
optimization technique that enables the compiler to apply optimizations across
translation units. Since link-time optimization is a requirement for the chosen
defenses to work, we will examine the impact it has on the kernel.
The security features we will discuss can be quite complex, and the same
applies for the Linux kernel. We will spend some time introducing relevant
background material in the earlier chapters so that the reader is more familiar
with modern software security before delving into more complex topics.
1.3.4 Inlining
Inlining is an optimization compilers can perform by replacing a function call
site with the body of the function. This can eliminate some of the performance
overhead associated with calling a function. However, too much inlining can
have negative effects on performance because of increased code size. The pro-
grammer can also signal to the compiler that a function should be inlined by
using the inline keyword. In the following code snippet the foobar() function
is marked as inline.
inline int foobar(void)
{
return 0x41414141;
}
The compiler is not required to inline the function, however; the keyword is
simply a hint.
1.4 Conventions
For clarity, some conventions used throughout the thesis are written down in this
section. When referring to certain programming terms like variable types, assembly
instructions, and so on, they are written using a monospace font like
this. Function names are written using a monospace font and parentheses at
the end like this: function(). File paths are also written using a monospace
font: /foo/bar/foobar.c.
1.5 Outline
The rest of the thesis is structured as follows:
Chapter 2 will present some background information on the Linux kernel.
First some brief history about Linux, Ubuntu, and Android, followed by an
introduction to the two CPU architectures that are relevant for this thesis,
namely x86 64 and arm64. Finally, Android IPC will be briefly discussed as
this is the target for benchmarks performed on Android.
Chapter 3 will describe software vulnerabilities, defenses, and attacks. The
chapter starts with a general introduction and then moves on to memory corruption
bugs and different techniques used by attackers. Then we will discuss
some general memory corruption defenses and finally other vulnerability classes.
A general understanding of these topics is required for the further chapters.
Chapter 4 will continue the general discussion of software vulnerabilities
with more specific implementation details related to defensive mechanisms. The
chapter will detail how certain defenses are implemented in the gcc and clang
compilers. This will be useful for understanding the later chapters where we
will delve into implementation details in the Linux kernel.
Chapter 5 describes some Linux kernel internals. First we describe the
different versions of the kernel that are available, and then briefly introduce the
build system. Finally we will discuss security features in the kernel.
Chapter 6 presents how forward-edge and backward-edge control-flow in-
tegrity is implemented in the Linux kernel. Since building the kernel with clang
is a requirement for this to work at the moment, we will first discuss how this
can be accomplished, and what work has been done to make it possible.
Chapter 7 describes the different performance benchmarks performed. We
start off with presenting motivation and previous work before delving into the
specific benchmarks chosen for this thesis.
Chapter 8 concludes the thesis with results from the benchmarks and
presents ideas for future work.
Appendix features source code, scripts, and other information.
Chapter 2
Linux kernel
2.1 Introduction
In this chapter we will discuss Linux [45], and more specifically, the Linux kernel.
In section 2.2 we will briefly look at the history of the Linux kernel. Then,
in section 2.3, we will look at Android, which is Google’s mobile operating
system based on Linux. In section 2.4 we will briefly introduce Android’s IPC
mechanism, called Binder. In section 2.5 we will briefly discuss open source and
how the Linux kernel is licensed. Finally, we will dive into details surrounding
the different CPU architectures supported by Linux in section 2.6.
and has support for proprietary drivers, which makes it easier to use with
hardware that has poor or no open source driver support. Ubuntu
is actually based on another distro called Debian [24]. While Ubuntu is a com-
mercially backed distro, Debian is entirely driven by a community of volunteers.
We could also have used Debian, since it is a popular distro that is the foun-
dation for many other distros, like Ubuntu. In addition, it only consists of free
software. Debian has a very long history as a distro, with version 0.01 released
on September 15, 1993.
Debian Jessie and Stretch are two very stable distros that use longterm
support kernels. We will describe the different types of Linux kernel releases in
section 5.2, and the different longterm kernels are listed in table 5.1. Debian
Jessie relies on 3.16, which is supported until 2020, while Debian Stretch relies
on 4.9, which is supported all the way until 2023. Ubuntu uses kernels that are
a bit newer, 4.15 and 5.0 for versions 18.04 and 19.04, respectively.
We could have focused on other distros like Redhat, Fedora, Arch Linux,
and Clear Linux as well, but Ubuntu is a very well tested and widespread distro
that should cover all benchmarking needs. Some distros that rely on a rolling-
release model might support more software that takes advantage of newer kernel
features out-of-the-box, but the benchmarking software used in later chapters
will not have any issues like this.
2.3 Android
The former section described different types of Linux operating systems pri-
marily for servers and desktop computers. Linux is very widespread in the
smartphone market because of Google’s mobile operating system, Android [31],
which is currently the most popular mobile operating system, with 2 billion
monthly active users as of May 2017 according to Google. Android introduces
several modifications to the Linux kernel, and some of them eventually make it
back to the upstream kernel. Binder, which is the inter-process communication
(IPC) subsystem for Android, is an example of such a modification. It was
originally developed by Be Inc., and later Palm, Inc., for BeOS, among others. This
code later became the basis for Binder as we know it on Android today. Binder is
a central component of Android, as the IPC mechanism is used by apps. We
will describe Binder further in section 2.4.
Android updates are released on a yearly basis. All the major releases before
10 are named after different sweets, and as of June 15, 2020, Android 11.0 is
the newest version, although it is currently released as a preview build. Table
2.1 lists all the different versions.
Many different companies like Samsung, Huawei, and Nokia use Android as
the operating system for their smartphones. Google releases Android as the
Android Open Source Project (AOSP), which other manufacturers can then
modify to suit their needs. Many devices might need new drivers for device
specific hardware, or other modifications that are not a part of AOSP.
Google develops their own smartphones, named Pixel. Currently, four
generations of Pixel phones have been released. The source code for the Pixel and
Pixel 3 is open source and part of AOSP. As we will discuss further in chapter 5, Google
has introduced several interesting performance and security enhancements for
their newer Pixel models.
Version Name Released
1.5 Cupcake 2009
1.6 Donut 2009
2.0 Eclair 2009
2.2 Froyo 2010
2.3 Gingerbread 2010
3.0 Honeycomb 2011
4.0 Ice Cream Sandwich 2011
4.1 Jelly Bean 2012
4.4 KitKat 2013
5.0 Lollipop 2014
6.0 Marshmallow 2015
7.0 Nougat 2016
8.0 Oreo 2017
9.0 Pie 2018
10.0 Android 10 2019
11.0 Android 11 2020
products, as long as the source code is made available. Changes have to be
tracked, and the modified source also has to be licensed under GPLv2.
The source code for the Linux kernel can be found on kernel.org or on
GitHub under Linus Torvalds' account, torvalds.
2.6 Architectures
Linux supports many different CPU architectures. The code for these is
located in the arch directory in the top-level Linux kernel directory. As of 4.14,
there are 31 supported architectures. For our purposes, however, we will only
focus on x86 64 and arm64. x86 64 is the most common architecture for desk-
tops and laptops, while arm64 is more common for smartphones and similar
devices. x86 64 is also known as amd64, and arm64 is sometimes referred to
as aarch64. These terms will be used interchangeably throughout the thesis.
The 32-bit version of x86 64 is known as x86 and the 32-bit version of arm64
is known as arm. Although the research is not focused on x86 and arm, these
architectures may be mentioned later, and x86 is used for some examples in
later chapters. Since clang has very good cross-compilation support, porting
this work to other architectures is definitely feasible. However, we will focus on
x86 64 and arm64 here to save time and improve the quality of the research.
x86 64 and arm64 differ a lot in their low-level details: for example, how the
instruction set is encoded, what the assembly code looks like, and which special
CPU registers are available. The rest of this section
will describe the differences we should be aware of when reading the rest of the
chapters in this thesis.
One of the biggest differences is that arm64 is RISC-based, while x86 64
is CISC-based. The gist is that the arm64 instruction set architecture (ISA)
contains fewer and simpler instructions, in contrast with x86 64, which contains
a lot of complex and specialized instructions. For RISC-based systems, each
instruction is usually of the same length. CISC, on the other hand, employs a
variable-length encoding.
Compared to x86 64, arm64 has a lot of registers. Table 2.3 summarizes
the different registers available on x86 64 that are interesting for this thesis
and their purpose. Note that the usage of certain registers may differ between
operating systems, kernel space and user space, and so on. Table 2.4 summarizes
the different registers on arm64 and their purpose. Note that these lists are not
complete. They contain only the registers that are important for understanding
the different code snippets, implementation details, etc. that follow in the next
chapters. Some of these 64-bit registers also have a 32-bit version that allows
access to the lowest 32 bits of the register. For x86 64, these registers have
the same name as they do on x86. Simply replace the r in their names with e.
rax becomes eax, for example. For arm64, replace x with w. Thus, to access
the lower 32-bits of x0, you would use w0.
name purpose
rax return value or system call number
rdi first argument to function
rsi second argument to function
rdx third argument to function
rcx fourth argument to function
r8 fifth argument to function
r9 sixth argument to function
rbp base pointer
rsp stack pointer
rip instruction pointer
name purpose
x0 first argument to function or return value
x1 second argument to function
x2 third argument to function
x3 fourth argument to function
x4 fifth argument to function
x5 sixth argument to function
x8 system call number
sp stack pointer
x29/fp frame pointer
x30/lr link register
pc program counter / instruction pointer
#include <stdio.h>

static int sum(int a, int b)
{
    return a + b;
}

int main(void)
{
    int s = sum(13, 37);
    printf("sum: %d\n", s);
    return 0;
}
the calling convention. In this section, we will describe the calling convention
for x86 64 and arm64 on Linux systems. Calling conventions may differ between
different compilers and operating systems, but these descriptions are valid for
Linux with the gcc and clang compilers. We will start this section by describing
different calling conventions, and then move on to how they are implemented in
low-level assembly code.
On x86 64, the first six function arguments are placed in rdi, rsi, rdx, rcx,
r8, and r9. If the function needs more arguments, they are pushed on the stack.
The return value is placed in rax. When issuing a system call, the arguments
are placed in rdi, rsi, rdx, r10, r8, and r9. The system call number is placed
in rax, and the return value also ends up there after the syscall has completed.
The calling convention for arm64 is easier to remember. The first eight
arguments to a function are placed in x0 to x7, and the first six system call
arguments are placed in x0 to x5. The return value from functions and system
calls is placed in x0, and the system call number is placed in x8.
Consider the simple piece of code in figure 2.1 that calls a function named
sum to add two numbers and then prints the result.
We now know that the function arguments will be placed in rdi and rsi for
x86 64, and x0 and x1 for arm64. Let us take a look at what the assembly code
looks like for these two architectures. The code can be found in figure 2.2, and
it has been cleaned up a bit for clarity.
We will focus on x86 64 first. The first argument, 13, is placed in edi. rdi is
not used here since sum() takes two int arguments. The size of an int on x86 64
is 32 bits, which means that it fits nicely in the lower part of rdi. The second
argument, 37, is placed in esi. call is used to call into the function. Every
call instruction (almost) is paired with a ret, the return instruction. When
call is issued, the CPU actually performs several steps. First, the address of
the next instruction, which is known as the return address, is pushed onto the
stack. Next, the CPU jumps to the target of the call. In this case it is the
sum function. Conversely, ret pops the return address off the stack and jumps
there.
For arm64, the first argument is placed in w0, the second in w1. Here, the
lower 32-bit part of x0 and x1 is used, just like for the x86 64 example. When
calling a function, we now see a different assembly instruction, namely bl. bl is
x86 64:

main:
    mov edi, 13
    mov esi, 37
    call sum

sum:
    push rbp
    mov rbp, rsp
    mov dword ptr [rbp - 4], edi
    mov dword ptr [rbp - 8], esi

arm64:

main:
    mov w0, #13
    mov w1, #37
    bl sum

sum:
    sub sp, sp, #16
    str w0, [sp, #12]
    str w1, [sp, #8]
short for branch with link. This is one point where the two architectures differ
a bit. On arm64, there is a special register called the link register. The x30
register works as the link register, but it is sometimes simply referred to as LR.
Instead of pushing the return address onto the stack, bl stores it in x30. The
ret instruction consults this register to find the correct return address. Return
addresses may also be stored on the stack, but in that case they are moved into
x30 before returning.
One thing to note here is that sum is a so-called leaf function. Leaf functions
do not call any other functions, they are the leaves of the program’s control flow
graph. This means that there is no need to store the return address on the stack
for this function, since it will not call any other functions, thus not touching x30.
The ret instruction can then freely return without the need to fetch the return
address from the stack. To demonstrate the difference between a leaf function
and a non-leaf function, consider the code example in figure 2.3. We have added
a new function to the program in figure 2.1 called do stuff() that calls sum(),
the leaf function. Calling leaf functions and non-leaf functions usually look the
same, unless the compiler has performed some kind of optimization. But the
prologues and epilogues will differ, which will be described in the next section.
In addition to the call and bl instructions, we also have what is known as
indirect calls. These calls go through a register or memory location.
do_stuff:
# make room on the stack
sub sp, sp, #32
# store the old frame pointer and link register
stp x29, x30, [sp, #16]
add x29, sp, #16
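The matching epilogue is not shown in this excerpt; a sketch of what a compiler would typically emit to undo the prologue above:

```
# restore the saved frame pointer and link register
ldp x29, x30, [sp, #16]
# release the 32-byte stack frame
add sp, sp, #32
# return via the address now in x30
ret
```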
2.6.2 Prologues and epilogues
A prologue is the set of instructions executed at the start of a function. Con-
versely, the epilogue is the set of instructions that are executed at the end of a
function. In this section we will refer to frame pointers, which point to stack
frames. A stack frame corresponds to a function call, and contains the parame-
ters, local variables, return address, and so on for the call. The stack frames for
do stuff() and sum() for the code example in figure 2.3 can be seen in figure
2.5. The stack grows upwards in the figure, i.e. towards lower addresses.
If we return back to the assembly code example in figure 2.4, we see that
the first three instructions are there to prepare the stack. First, room is made
on the stack to store local variables and the frame pointer and link register of
the previous function. This function will make room for 32 bytes on the stack.
The frame pointer and link register are stored at offsets 16 and 24 from the new
stack pointer, respectively. x29 is then set to the stack pointer plus 16 so that it
points to the saved x29 value on the stack. The ARMv8 Programmer’s Manual [7] states that “The
frame pointer (X29) should point to the previous frame pointer saved on stack,
with the saved LR (X30) stored after it. The final frame pointer in the chain
should be set to 0.” The stack pointer has to be aligned on a 16-byte boundary,
which explains why the compiler makes more room on the stack than is strictly
necessary.
Chapter 3
Software vulnerabilities
3.1 Introduction
In this chapter we will discuss software vulnerabilities. More specifically, we will
look at memory corruption bugs, which commonly lead to software vulnerabili-
ties in software written in low-level languages like C and C++. Memory safety
issues are still widespread today, even though the problem has been known for
a long time. A well-documented case of a buffer overflow vulnerability being
exploited by an attacker occurred in 1988, when the Morris worm [64] used such
an exploit to spread on the Internet, infecting approximately 6000 computers [28].
Even though these issues have been known for a long time, memory safety is
still a difficult problem. The Chromium project reports that around 70 % of
their serious security bugs are memory safety problems [66].
We start off by looking at the memory layout of a process on Linux in
section 3.2. This section will introduce some important information about
different memory regions and what executable files look like on Linux. Then we
will move on to section 3.3, where we will describe different types of memory
corruption bugs and techniques for exploiting them. To fully understand the
vulnerability mitigation techniques described in coming chapters, it is important
to understand why we need these mitigations, and what sort of benefits they
can provide. In section 3.4 we will look into common exploitation techniques
used by attackers. Then, in section 3.5 we will discuss some common memory
corruption defenses. Finally, we will look at other vulnerability classes that are
relevant in section 3.6. Having some familiarity with exploitation techniques
and vulnerabilities will allow us to better understand the value of mitigations,
and if they are worth applying.
header table, or both. The ELF header starts with the magic values 7f 45 4c
46, where the hex values 45 4c 46 spell out ELF.
The program header table is an array of structures describing a segment or
other information the system needs to load an ELF file. Examples of program
header entries include loadable segments, like the executable code of the pro-
gram and writable memory used to store global variables. Other entries include
dynamic linking information, auxiliary information, and the state of the stack.
The section header table is an array describing all the file’s sections. Some
section types include a symbol table which can be used for dynamic linking.
Another important type of section is the string table, which is used to store
strings used by the ELF file. For example names of functions imported from
shared libraries or names of the different sections, like .text for the executable
code, or .rodata for read-only data. Table 3.1 summarizes the different section
types. Some of the section types, like the GOT and PLT, are further described
in section 4.4.3.
For the rest of the thesis it helps to have a general idea of how an ELF
is mapped to memory when a program is running. Most ELF binaries rely
on a dynamic linker like ld-linux [54] to run. The job of a dynamic linker is
to map the ELF into memory, load any libraries it depends on, perform other
preparations, and finally run the program. The exception is for statically linked
executables, where the binary does not depend on any external libraries. The
.interp section contains the full pathname of the dynamic linker to use for the
program. For statically linked binaries, there is no .interp section.
Linux provides functionality to view memory mappings of running processes
through the /proc filesystem. For each process on a Linux system, a separate
directory exists in /proc named after the process ID (PID) of that process.
There is also a special symbolic link at /proc/self that always redirects to the
current process opening that link. The file /proc/<PID>/maps allows us to view
the memory mapping of a process. Let us look at what the memory mapping for
the cat program looks like:
The numbers in front have been added to make it easier to describe the
different memory regions. 1-3 comprise the cat binary. 1 starts with the ELF
header, but most importantly includes the .text section of the binary, which
contains the executable instructions. Note that this region is marked as read-
able (r) and executable (x). 2 is read-only data, and 3 is data that can be read
and written, for example the .data and .bss sections. Read-only data includes
strings and other data that never changes throughout the lifetime of the pro-
gram. At 4 we have the heap, which is used for dynamic memory allocations,
for example memory returned from the malloc() function call.
The next section, 5, is not very important for our purposes, but for com-
pleteness we describe it here as well. The locale archive contains locales, which
are “collections of language and country specific conventions allowing to adapt
software to the user’s preferences.” [29]. It is loaded and used by the C library.
Sections 6-10 are related to the standard C library, or libc for short. 6
contains the executable code, and 7 is a section with no permissions. This
might be present when loading certain libraries or binaries. Usually the whole
ELF is loaded into memory with the memory protections being applied later by
parsing the program headers of the ELF. If certain sections are not meant to
be mapped, or if there is a gap between them, they end up being inaccessible.
Other times these sections may show up when a program deliberately creates
mappings with no permissions to protect adjacent data. These allocations are
often known as guard pages. Allocating a guard page before a region containing
important data can thwart some buffer overflows since the program will crash as
soon as it tries to read from or write to the guard page. Region 8 contains
read-only data, and region 9 contains writable data. Region 10 is an anonymous
memory mapping allocated using mmap(). An anonymous mapping is not backed by a
file.
Regions 11-16 pertain to the dynamic linker, and they look very similar to the
binary and libc regions, since the linker is also an ELF file. The next few
regions, however, are different from the previous ones. At region 17 we have
the stack, which is used to store local variables and other important
information like return addresses.
The next three regions, 18-20, are related to virtual syscalls [16]. Simply
put, virtual syscalls are faster syscalls implemented in user space for
performance reasons. The syscalls implemented here are read-only syscalls,
meaning that they do not change any structures in kernel space; they simply
read data. The vdso region contains an ELF file that the kernel maps into
processes, and if you dump this memory region to disk you can parse it like
any normal ELF file. Virtual syscalls include gettimeofday(), getcpu(),
time(), and clock_gettime(). These were introduced as virtual syscalls since
the overhead of context switching between user space and kernel space was too
high for processes that frequently checked the current time.
void function(void)
{
    char buf[32];
    gets(buf);
}

Shellcode
Code injected like this is often known as shellcode. Usually the goal is to spawn
a shell, which is where the name comes from. However, shellcode is simply a
sequence of instructions used to carry out any action desired, not just spawning
a shell. When using shellcode in an exploit, depending on the constraints of
the target program, several tricks may be necessary to make the shellcode work
in the target environment. We can imagine attempting to exploit a stack
overflow where the program uses strcpy() to copy data into a vulnerable
buffer. strcpy() expects the data to be a valid C string, which means that it
will stop reading data once a null byte is encountered. To work around this,
the shellcode has to be null free, i.e. no null bytes can occur anywhere in
the shellcode. For certain input functions, like the scanf() family of
functions, reading stops when encountering any whitespace character, for
example tabs, vertical tabs, or spaces. There might also be size constraints
or other illegal bytes that cannot occur in the shellcode.
The shellcode presented later in this section corresponds to the following C
snippet:

char *argv[] = { "/bin//sh", NULL };
execve("/bin//sh", argv, NULL);

Disassembled, the shellcode looks like this:
0000000000000000 <.data>:
0: 48 31 c0 xor rax,rax
3: 50 push rax
4: 48 b9 2f 62 69 6e 2f movabs rcx,0x68732f2f6e69622f
b: 2f 73 68
e: 51 push rcx
f: 48 89 e7 mov rdi,rsp
12: 50 push rax
13: 57 push rdi
14: 48 89 e6 mov rsi,rsp
17: 48 31 d2 xor rdx,rdx
1a: 6a 3b push 0x3b
1c: 58 pop rax
1d: 0f 05 syscall
As with most shellcode, it can be hard to understand what is going on. This
code starts by setting rax to zero by XORing with itself, and pushing it on
the stack. Next, a weird-looking number is pushed onto the stack. This is the
string /bin//sh encoded as a hexadecimal number. To make sure the string is
null-terminated, the zero was pushed onto the stack earlier. Recall that syscalls
on x86_64 require that the syscall number is stored in rax and the first three
arguments are stored in rdi, rsi, and rdx. At address 0xf, the first argument is
set to point to the /bin//sh string, which means the first argument is now set
up correctly. The next argument to execve is a little bit trickier since we have
an array of pointers to C strings. The array has to end with a NULL pointer.
First, rax, which is still zero, is pushed onto the stack again. rdi is then pushed,
which points to /bin//sh. We now have two pointers stored next to each other
on the stack, the first one pointing to our string, the second a NULL pointer. We
can then set rsi to point to the stack. rdx is then set to zero, as we do not care
about this argument. Finally, rax is set to 0x3b, which is the syscall number
for execve, and the syscall is executed, starting a shell.
When faced with NX, attackers had to come up with new and inventive
solutions for bypassing it. The following section will describe code-reuse attacks,
which can be used to defeat the NX protection.
3.4.1 return-to-libc
Return-to-libc [25] [80] [77], also known as ret2libc, is a code-reuse attack tech-
nique. In this technique, an attacker will use a memory corruption vulnerability
to return into functions in libc (or other loaded libraries). This allows an at-
tacker to reuse existing functionality in those libraries to carry out an attack.
In their paper [77], Tran et al. even show that return-to-libc attacks are Turing
complete. A lot of interesting functionality is available in libc which can be
leveraged by an attacker to get full control over an application under attack.
One common target for attackers is the system() function which will execute
any shell command passed to it. On x86 (32-bit), performing a return-to-libc
attack is a little simpler than on x86_64, so we will use that as an example.
Consider the previous example in figure 3.1, where the dangerous gets()
function is used to read data from the user. The stack looks pretty similar on
x86 and x86_64, but the big difference is that arguments are passed on the
stack by default on x86. There are, however, some calling conventions where a
limited number of function arguments are placed in registers on x86, but this
is not the
case when calling functions from libc. Now consider the following C/asm source
example for x86:
int main(void)
{
    system("/bin/sh");
    return 0;
}

(a) C: system("/bin/sh") example

main:
    push ebp
    mov ebp, esp
    push 0xbadc0de ; "/bin/sh"
    call system
out:
    xor eax, eax
    leave
    ret

(b) x86 assembly of the example
The out label has been added to the assembly code for clarity. In this
example, we assume that the string "/bin/sh" is located at address 0xbadc0de.
Right before executing the call instruction, the stack layout can be seen in
figure 3.3.¹
After the call instruction, the address of the next instruction, which is the
xor instruction after the out label, will be pushed on the stack. The layout can
be seen in figure 3.4.
Now that we have an understanding of the stack layout when calling func-
tions on x86, we can revisit the example in figure 3.1. We can imagine that
the stack layout looks something like figure 3.5. When filling buf with 32 bytes
of data, any more written to the buffer will overwrite the saved base pointer
(ebp) and then the saved return address (eip). If the saved return address is
overwritten, control flow will be hijacked when the function returns. From the
system() stack layout example in figure 3.4 we know how the stack should look
when performing a function call. With the correct input, we can cause the
program to call system("/bin/sh") instead of returning to the original
caller. See figure 3.6 for an example of how it may look.
As mentioned in section 2.6.1, the first argument to functions on x86_64 is
stored in rdi. On arm64, x0 is used. This means that a simple return-to-libc
attack as shown here will not work on these architectures since parameters are
not passed on the stack. To work around this challenge, return-oriented
programming can be used. This technique is described in the next section.

¹ The arguments to main() (argc, argv, and envp) have been omitted for simplicity.

Figure 3.6: x86 return-to-libc stack frame
For the attack in figure 3.6 in the previous section to work, we would have to
place a pointer to /bin/sh in rdi/x0 before executing system(). To accomplish
this, we could use a gadget like this:
pop rdi
ret
The pop rdi instruction will fetch the next value on the stack and place it
into rdi. The following figure shows how the previous attack could look when
using ROP to control the argument in rdi. The gadget placement is highlighted
in blue.
Within the target program and the set of loaded libraries, a lot of gadgets
exist, making it very likely that an attacker will succeed in running arbitrary
code as long as the memory address of the code containing the gadgets is known.
As we will see in section 3.5.1, however, knowing the memory layout of a program
might not be straightforward.
a function pointer stored in rbx to return to the dispatch gadget listed above. 8
is then added to ebp and the gadget jumps to gadget 1. Gadget 1 adds 8 to rax,
setting it to 8. The gadget returns by jumping back to the dispatcher gadget.
Next, 8 is added to the dispatch table again, effectively skipping the entry at
offset 12 and fetching the next gadget at offset 16. The dispatcher then jumps
to gadget 2 which sets rcx to 0x10 and then jumps back. Again, 8 is added to
the dispatch table to fetch the final gadget, gadget 3. This gadget multiplies
rax and rcx, storing the result in rax.
space. However, if an attacker guesses wrong, the kernel will probably
crash, making it infeasible for an attacker to actually exploit a vulnerability
where they have a 1/512 chance of success.
#include <stdio.h>
void function(void)
{
char buf[32];
gets(buf);
}
int main(void)
{
function();
return 0;
}
Let us take a look at the x86_64 assembly code when we compile without stack
canaries.
push rbp
mov rbp,rsp
sub rsp,0x20
lea rax,[rbp-0x20]
mov rdi,rax
mov eax,0x0
call gets
leave
ret
And now with stack canaries enabled (using the -fstack-protector flag
during compilation).
6aa: 55 push rbp
6ab: 48 89 e5 mov rbp,rsp
6ae: 48 83 ec 30 sub rsp,0x30
6b2: 64 48 8b 04 25 28 00 mov rax,QWORD PTR fs:0x28
6b9: 00 00
6bb: 48 89 45 f8 mov QWORD PTR [rbp-0x8],rax
6bf: 31 c0 xor eax,eax
6c1: 48 8d 45 d0 lea rax,[rbp-0x30]
6c5: 48 89 c7 mov rdi,rax
6c8: b8 00 00 00 00 mov eax,0x0
6cd: e8 ae fe ff ff call 580 <gets@plt>
6d2: 90 nop
6d3: 48 8b 45 f8 mov rax,QWORD PTR [rbp-0x8]
6d7: 64 48 33 04 25 28 00 xor rax,QWORD PTR fs:0x28
6de: 00 00
6e0: 74 05 je 6e7 <function+0x3d>
6e2: e8 89 fe ff ff call 570 <__stack_chk_fail@plt>
6e7: c9 leave
6e8: c3 ret
First of all, we can see that the code is a bit larger this time. At address 0x6b2,
a stack canary is fetched using the fs register. At 0x6bb, the canary is stored
as a local variable on the stack. This is a part of the function prologue. Then,
at 0x6d3, the canary is fetched from the stack and compared with the original
canary in fs. If the canary has changed, for example as the result of a stack
overflow, the jump at 0x6e0 is not taken, and a call to __stack_chk_fail() is
made. This function will abort the program and print an error message stating
that stack corruption has been detected:
$ ./gets
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
*** stack smashing detected ***: <unknown> terminated
#include <stdio.h>

int main(int argc, char **argv)
{
    printf(argv[1]);
    return 0;
}
The prototype of printf() is int printf(const char *format, ...);. The dots
mean that this function takes a variable number of arguments. This is known as
a variadic function.
When a function takes a variable number of arguments, it is the caller's
responsibility to set up registers and/or the stack properly so the function
sees all the arguments. The callee somehow has to know how many arguments are
passed to it. For printf-style functions, the number of arguments corresponds
to the number of format specifiers in the format string. These specifiers
control how the argument is treated in the output. All format specifiers start
with %,
followed by one or more characters describing the format. A %d specifies that
the next argument should be treated as an int, for example. Other specifiers
include %s that treats the next argument as a string, %lx that treats the next
argument as an unsigned long and displays it as a hexadecimal number, and so
on. A problem arises when an attacker can supply the format string, and treat
variables in registers or on the stack as arguments when they are not actually
passed to the function.
This very easily leads to an information disclosure if the attacker specifies
several %p arguments, for example. %p prints the next argument as a pointer,
e.g. 0xdeadbeef. For an information leak, the goal is to print an "argument"
on the stack, or in a register, that contains a pointer to something that lets the
attacker defeat ASLR. Consider the case where there is a pointer to stdin on
the stack. If an attacker manages to print the address of stdin using a format
string attack, the attacker can defeat ASLR. The reason an attacker can defeat
ASLR in this case is because the stdin symbol is located at the same offset
from the base of libc every time it is loaded. It is only the base address that is
randomized with ASLR.
Let us take a look at an example. The program in figure 3.9 contains a
format string vulnerability. If you try to compile this example with modern
compilers, you will probably get a warning like this: warning: format not
a string literal and no format arguments. This type of bug is not as
common anymore, as compilers will complain if you attempt to pass something
other than a string literal to printf-style functions, making such bugs easier
to spot.
In addition to posing a risk for information leakage, format string bugs are
actually very severe, as they also allow writing data to memory using the %n
format specifier. In short, %n writes the number of characters output so far
to the location pointed to by the next argument. An attacker can use this to
corrupt memory
and gain code execution.
3.6.5 Terminology
Now that we have a basic understanding of different vulnerabilities and how they
can be exploited by an attacker, some more terminology can be introduced.
These terms will be used in later chapters, and are particularly useful when
reading articles on exploitation.
An exploit is often built using primitives, basic building blocks for an at-
tacker. Commonly sought after primitives include arbitrary read and arbitrary
write primitives. These allow an attacker to read arbitrary memory addresses
and write to arbitrary memory addresses, respectively. An arbitrary write prim-
itive is often known as a write-what-where primitive, which allows an attacker
to write any value (what) to any address (where).
Chapter 4
4.1 Introduction
In this chapter we will discuss compilers and linkers, and how the previously
mentioned software defenses are implemented by modern compilers. We start with
a general introduction in section 4.2. Then we briefly discuss gcc in section 4.3,
and then move on to the LLVM project in section 4.4. In section 4.4.1 we will
discuss in detail how different software defenses are implemented. We will then
describe link-time optimization in section 4.5. After that we will dive into one
of the central topics of this thesis, namely control-flow integrity. In section
4.6 we start by describing what control-flow integrity is, and then discuss how
it is implemented. Finally, we will discuss some software defenses that are
implemented in hardware in section 4.7.
uses Portable Executables (PE), and macOS uses the Mach-O format. For our
purposes, ELF files are the most relevant as we are only working with Linux.
4.4 LLVM
LLVM started as a research project at the University of Illinois [44] with the
“goal of providing a modern, SSA-based compilation strategy capable of sup-
porting both static and dynamic compilation of arbitrary programming lan-
guages” [51]. Originally, LLVM was short for Low Level Virtual Machine, but
the acronym is not used today. The LLVM project has grown into an umbrella
project consisting of many sub-projects. These include a C/C++ compiler, a
debugger, a C++ library, and others. See table 4.1 for a full list of projects.
name          description
LLVM core     core libraries, e.g. for code generation
clang         C/C++/Objective-C compiler
lldb          LLVM debugger
libc++ (ABI)  implementation of the C++ Standard Library
compiler-rt   compiler runtime libraries, e.g. builtins and sanitizer runtimes
OpenMP        OpenMP runtime
polly         cache-locality optimizations
libclc        aims to implement the OpenCL standard
klee          symbolic virtual machine
lld           drop-in replacement for system linkers

Table 4.1: LLVM sub-projects
The projects that are interesting for this thesis are the LLVM core libraries,
clang, and LLD. As written in table 4.1, LLD is a drop-in replacement for system
linkers. LLD accepts the same command line arguments and linker scripts as
the standard GNU linkers on Linux. According to the LLVM documentation,
the expectation is that LLD runs more than twice as fast as the GNU gold linker
when linking a large program on a multicore machine. This is great for huge
projects like the Linux kernel, where LLD will hopefully be able to speed up the
linking process.
Clang is just one of many LLVM frontends; other languages that use LLVM
include Rust, Swift, and Haskell. Although gcc has been around for a very long
time, clang has attracted many users. Some examples of large projects that
use clang are the Chrome and Firefox web browsers. FreeBSD has switched
to clang/LLVM for some of its supported architectures. As of June 15, 2020,
amd64, arm64, i386, armv7, and powerpc64 use LLD to link both packages
and the kernel. clang has replaced gcc as the standard compiler on the system
as well.
4.4.2 ASLR/PIE
As we discussed in section 3.5.1, certain memory regions of programs are loaded
at random addresses to thwart certain attacks. Most exploits require that an
attacker knows at least some of the program's layout. The stack, heap, and
shared libraries will be loaded at different locations in memory. For the program
to be loaded at a random location as well, it has to be a position independent
executable (PIE). A PIE is constructed in such a way that it can be loaded
anywhere in memory without relying on hard-coded memory addresses. This is
easier on some architectures like x86 64 where we have what is known as RIP
relative addressing. This means that the program can access memory relative
to the current instruction pointer (rip). x86 (32-bit) does not have this feature
and has to rely on other tricks to make this work.
4.4.3 RELRO
When a program wants to call a function that is located in a library, say for
example printf(), the program has to get the address of this function somehow.
This is handled by the dynamic linker on the system. The job of the dynamic
linker is to load the executable into memory and resolve any dependencies on
other libraries. This includes loading the required libraries into memory and
resolving any function addresses that the program depends on in these libraries.
The process of connecting a function symbol, like printf(), to an address is
known as relocation. For ELF files we have two interesting sections pertaining
to relocations. First we have the global offset table (GOT) and the procedure
linkage table (PLT). There are other sections related to relocations as well, but
these are the most interesting for our purposes. The PLT contains code stubs
that jump to an address defined in the GOT.
For performance reasons, however, the symbols in the GOT table might
not be resolved right away. Instead, the entries will point to a function that
performs the actual resolving at runtime. To support this scheme, the GOT has
to be writable. An attacker that has an arbitrary write primitive will be able to
change GOT entries to point to anything else. A common way to abuse this is
to redirect functions that deal with user-controlled data to system() or other
functions that can be used to execute arbitrary commands. RELRO (relocation
read-only) mitigates this: with full RELRO, all relocations are resolved when
the program is loaded and the GOT is then mapped read-only, at the cost of
slower startup.
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char buf[16];

    if (argc == 2)
        strcpy(buf, argv[1]);
    return 0;
}
4.4.4 NX
The non-executable (NX) bit marks a memory region as non-executable. This
mitigation is usually known as data execution prevention (DEP) [57] on Win-
dows. Traditionally, many memory regions were marked as executable even
though they did not contain any code. For example the stack and the heap
used to be marked as executable. An attacker could abuse the fact that these
regions were both writable and executable at the same time to inject code into
the stack or the heap and then divert execution to this location. The memory
protections for different memory regions are controlled by the ELF program
headers: normal regions of memory, like program code, are described by PT_LOAD
entries. Today, the stack and heap are not executable by default, but the
stack can be made executable using the PT_GNU_STACK program header.
the program to crash if we do not have FORTIFY_SOURCE enabled. However, when
FORTIFY_SOURCE is enabled, we get the following error message:
So how does the program detect this at runtime? Let us look at the control-
flow graph of the program compiled with and without FORTIFY_SOURCE side by
side to see what is different. Consider the two CFGs in figure 4.2. We can see
that there is almost no difference between the two CFGs. However, if you look
closely you can see that strcpy() has been replaced by __strcpy_chk(), and
printf() has been replaced by __printf_chk().
(a) CFG of program without FORTIFY_SOURCE (b) CFG of program with FORTIFY_SOURCE
/* Copy SRC to DEST with checking of destination buffer overflow. */
char *
__strcpy_chk (char *dest, const char *src, size_t destlen)
{
size_t len = strlen (src);
if (len >= destlen)
__chk_fail ();
  return memcpy (dest, src, len + 1);
}
#include <stdio.h>
int main(void)
{
printf("%2$d %1$s %4$x\n", "test", 123, 456, 0xbad);
return 0;
}
² man 3 printf
4.4.7 Summary
As the previous sections show, there are many security mitigations available
in compilers today. These help protect software against increasingly advanced
attacks and mitigate the impact of security vulnerabilities. These mechanisms
do not prevent exploitation of all memory corruption vulnerabilities, but they
make it significantly harder for an attacker to succeed. In many cases an
attacker needs
several vulnerabilities to exploit a system. With ASLR an attacker often needs
an information leak before exploitation of buffer overflows is possible, for
example. In sections 4.6 and 4.6.3 we will look at some relatively new
mechanisms available in modern compilers.
ThinLTO does not load everything into a single monolithic module. C code is
processed as usual by clang and initial optimizations are performed. A summary
of each function is generated, and only this summary information is linked in
a step called the thin-link. Optimizations and code generation are fully
parallel, unlike with full LTO.
to clang when compiling, or one could narrow down the set of schemes by using
-fsanitize or -fno-sanitize to enable or disable some of them, respectively.
Since the Linux kernel is written in C and assembly, the C++ implementation
is not as relevant for this thesis. The only type of function call we are
interested in is indirect function calls, which are performed using function
pointers in C. At each call site, an extra check verifies that the function being
called is a valid function that matches the function pointer’s signature. Consider
the code example in figure 4.4.
In the code we have a struct named ops that contains a single function
pointer. The signature matches the foo() and baz() functions. Note that it
does not match the function signature of bar(), since this function takes an
int argument, while the others do not. When the program is started with no
arguments, the foo() function is run. If there is one argument, bar() is used
instead. Finally, if there are two arguments, baz() will be used. CFI should
be able to catch the function signature mismatch at runtime. The following
snippet shows what happens when the program is run with zero, one, and two
arguments:
$ ./cfi
op.func() = 1337
$ ./cfi a
cfi.c:35:29: runtime error: control flow integrity check for type
'int (void)' failed during indirect function call
(cfi+0x234540): note: bar defined here
$ ./cfi a b
op.func() = -559038737
The functions with correct prototypes succeeded, as expected, and the func-
tion signature mismatch was caught at runtime. We will now discuss how the
CFI checks are performed at runtime. When compiling, clang will turn C code
into something known as LLVM intermediate representation, or IR for short, be-
fore any assembly code is emitted. IR is an assembly-like representation of the
code used internally in the compiler. One advantage of using IR is that
compilers can support many different source languages: as long as they are all
translated to the same IR, the subsequent analysis and transformation from IR
to machine code can be reused.
IR can be emitted for several phases of compilation, which we will discuss
further later in this section. In figure 4.5 we can see the IR³ for the code in
figure 4.4 during the preoptimization phase. In this phase, no optimizations
have been done by the compiler yet. The code has been simplified slightly,
and all omitted code is replaced with [...]. In the IR, indirect function calls
are replaced with the LLVM intrinsic function llvm.type.test. As the name
implies, this intrinsic checks the type of the function.
Following is the type test extracted from the IR:
%25 = call i1 @llvm.type.test(i8* %24, metadata !"_ZTSFivE")
The most interesting part of the type test is the final argument, "_ZTSFivE".
This is the mangled type name the code expects the function pointer signature
³ produced using the -Wl,-save-temps flag to store bitcode during compilation and then dis-
#include <stdio.h>
#include <string.h>
int foo(void)
{
return 1337;
}
int bar(int a)
{
return a * 42;
}
int baz(void)
{
return 0xdeadbeef;
}
struct ops {
int (*func)(void);
};
int main(int argc, char **argv)
{
	struct ops op;

	(void)argv;
memset(&op, 0, sizeof(op));
if (argc == 2)
op.func = (int (*)(void))bar;
else if (argc == 3)
op.func = baz;
else
op.func = foo;
printf("op.func() = %d\n", op.func());
return 0;
}
to match. The name can be decoded using the c++filt tool⁴: int (). That is,
a function that returns int and takes no arguments. This matches the function
pointer signature from the C code:
struct ops {
int (*func)(void);
};
During the IPO phase⁵, LLVM will perform lowering of the llvm.type.test
function into its actual implementation. The implementation looks a bit strange
in assembly code, so it is worth spending some time explaining it here. First
of all, the functions with matching prototypes are laid out after each other in
a jump table. For the previous C code example in figure 4.4, the jump table in
pseudo-C would look something like this:
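The layout can be sketched as follows (our reconstruction, not the original listing; the entry syntax is schematic):

```
/* jump table for signature: int (void) */
jumptable_int_void:
    jmp foo.cfi    /* entry 0: the renamed original foo() */
    jmp baz.cfi    /* entry 1: the renamed original baz() */

/* jump table for signature: int (int) */
jumptable_int_int:
    jmp bar.cfi    /* entry 0: the renamed original bar() */
```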
The jump table corresponding to the int () function signature contains two
entries: foo() and baz(). The jump table for the type int (int) contains only
one entry: bar(). These jump tables make it easier to quickly verify whether a
function belongs to a particular set of function signatures. The compiler makes
sure that only functions with the same signature are laid out consecutively.
The first step when lowering the call is to calculate the offset into the jump
table. Then, two things need to be checked to verify that the call is valid. First,
the offset has to fall within the range of the jump table entries for this function type.
Second, we need to check that the function is properly aligned. Recall that on
x86 64 instructions do not need to be aligned on any particular boundary, as
the instruction length is variable. On arm64, all instructions have to be aligned
on a 4-byte boundary since all instructions are 4 bytes long.
On x86 64, each jump table entry is aligned to 8 bytes, on arm64 they are
aligned to 4 bytes. The reason lies in the implementation of the table entries.
For x86 64 a jump table entry is constructed using a jmp instruction, which is
(usually) 5 bytes in size. The instruction opcode is e9, followed by a 4-byte
offset relative to the next instruction. To align the next entry in the jump table
to 8 bytes, int3 instructions are inserted as padding. In figure 4.6 we have the
jump table entries for foo() and baz() as output by objdump⁶. The int3
instructions will cause a trap signal, effectively aborting the program if program
flow is intentionally or unintentionally redirected to the padding area.
⁴ echo "_ZTSFivE" | c++filt
⁵ See LowerTypeTestsModule::lowerTypeTestCall() in lib/Transforms/IPO/LowerTypeTests.cpp
⁶ objdump -D cfi
; Function Attrs: noinline nounwind optnone uwtable
define hidden i32 @main(i32, i8**) #0 !type !8 !type !9 {
%3 = alloca i32, align 4
%4 = alloca i32, align 4
%5 = alloca i8**, align 8
%6 = alloca %struct.ops, align 8
store i32 0, i32* %3, align 4
store i32 %0, i32* %4, align 4
store i8** %1, i8*** %5, align 8
%7 = load i8**, i8*** %5, align 8
%8 = bitcast %struct.ops* %6 to i8*
call void @llvm.memset.p0i8.i64(i8* align 8 %8, i8 0, i64 8, i1 false)
%9 = load i32, i32* %4, align 4
%10 = icmp eq i32 %9, 2
br i1 %10, label %11, label %13
11: ; preds = %2
%12 = getelementptr inbounds %struct.ops, %struct.ops* %6, i32 0, i32 0
store i32 ()* bitcast (i32 (i32)* @bar to i32 ()*), i32 ()** %12, align 8
br label %21
13: ; preds = %2
%14 = load i32, i32* %4, align 4
%15 = icmp eq i32 %14, 3
br i1 %15, label %16, label %18
0000000000234628 <baz>:
234628: e9 23 ff ff ff jmpq 234550 <baz.cfi>
23462d: cc int3
23462e: cc int3
23462f: cc int3
byte array can be fetched and tested using a bit mask.
The need for bit vectors is usually eliminated for indirect function calls,
which makes the checks somewhat simpler compared to virtual calls. Loading
from and checking the values in the bit vector is thus not included in the code
examples we will be looking at.
The LLVM IR we saw in figure 4.5 corresponded to the preoptimization
phase of compilation. During the next phases, the llvm.type.test intrinsic is
lowered to IR more resembling the final assembly code output. In figure 4.7 we
can see the LLVM IR after lowering the type test.
21:
%22 = getelementptr inbounds %struct.ops, %struct.ops* %6, i32 0, i32 0
%23 = load i32 ()*, i32 ()** %22, align 8
%24 = bitcast i32 ()* %23 to i8*, !nosanitize !10
%25 = ptrtoint i8* %24 to i64
%26 = sub i64 %25, ptrtoint (void ()* @.cfi.jumptable to i64)
%27 = lshr i64 %26, 3
%28 = shl i64 %26, 61
%29 = or i64 %27, %28
%30 = icmp ule i64 %29, 1
br i1 %30, label %33, label %31, !prof !11, !nosanitize !10
Note that the type test is now turned into a sequence of sub, lshr, shl, and
or instructions resembling normal assembly code more closely. The actual
assembly code does not look that different from the LLVM IR. A comparison
between x86_64 and arm64 assembly code can be seen in figure 4.8. The code is
almost exactly the same size, only differing by one instruction. The assembly
code was created using IDA Pro [33] and cleaned up for clarity. Also note that
abort() is not actually called; the real code calls
__ubsan_handle_cfi_check_fail_abort().
This function is part of the compiler-rt project, which contains several runtime
libraries. Looking at the function name we can see that it is part of the UBSan,
or UndefinedBehaviorSanitizer [50] runtime.
The @PAGE/@PAGEOFF syntax may seem a bit strange. This is the way IDA
Pro displays it, which is the tool used to produce these assembly listings. The
adrp instruction fetches the address of a 4KB page at a PC-relative offset. An
add instruction can then be used to add the offset of the wanted symbol to the
page address. This pattern is used since full 4- or 8-byte addresses cannot be
encoded in instructions, as all instructions are only 4 bytes in size on arm64.
The adrp/add instructions are used to work around this limitation.
When LLVM generates the jump tables, their entries use the original symbol
(a) arm64 CFI check:

adrp x8, foo@PAGE
add x8, x8, foo@PAGEOFF
adrp x9, bar@PAGE
add x9, x9, bar@PAGEOFF
str x9, [sp,#8]
ldr x1, [sp,#8]
sub x8, x1, x8
lsr x9, x8, #2
orr x8, x9, x8,lsl#62
cmp x8, #1
b.ls loc_9D0
call abort
loc_9D0:
mov w0, #4
blr x1

(b) x86_64 CFI check:

mov rax, offset bar
mov [rbp-16], rax
mov rsi, [rbp-16]
mov rax, offset foo
mov rcx, rsi
sub rcx, rax
mov rax, rcx
shr rax, 3
shl rcx, 3Dh
or rax, rcx
cmp rax, 1
jbe short loc_2354AD
call abort
loc_2354AD:
mov edi, 4
call rsi

Figure 4.8: CFI check on arm64 and x86_64
names from functions. Thus, the foo() symbol from the previous examples used
to point to the original function, but now points to the jump table instead.
Every function that is redirected through the jump table gets a .cfi suffix;
foo() turns into foo.cfi(), for example.
Since CFI instruments every single indirect function call (unless told not to
do so explicitly⁷), it is interesting to consider how much overhead is generated
for each instrumented call site. In figure 4.9 we see a comparison between a
normal call and a call protected with CFI for arm64.
We can see that a lot of instructions are added to the CFI call. Including
the two instructions not shown in the listing that are used to fetch the argument
to the abort() function, 9 instructions have been added. This results in an
increase of 4 × 9 = 36 bytes. In addition, a jump table entry is constructed,
adding another 4 bytes. In figure 4.10 we have the same comparison for x86_64.
A lot of instructions are added here as well: for this example, 11
instructions are added in the CFI version. The size overhead is not as easy to
calculate here, since instruction size varies on x86_64 and we cannot simply
count the number of instructions added. The example without CFI enabled
consists of 30 bytes, while the example with CFI consists of 76 bytes. This
results in a 153.33 % increase in size, compared to a 166.67 % increase for
arm64. These numbers will not be the same in every situation, depending on
compiler optimizations and other factors, but they highlight the fact that CFI
adds some extra size to the code.
(a) Normal arm64 call:

adrp    x8, bar@PAGE
add     x8, x8, bar@PAGEOFF
str     x8, [sp,#8]
ldr     x8, [sp,#8]
mov     w0, #4
blr     x8

(b) arm64 call with CFI:

adrp    x8, foo@PAGE
add     x8, x8, foo@PAGEOFF
adrp    x9, bar@PAGE
add     x9, x9, bar@PAGEOFF
str     x9, [sp,#8]
ldr     x1, [sp,#8]
sub     x8, x1, x8
lsr     x9, x8, #2
orr     x8, x9, x8,lsl#62
cmp     x8, #1
b.ls    loc_9D0
call    abort
loc_9D0:
mov     w0, #4
blr     x1

Figure 4.9: Comparison between calls with and without CFI on arm64

Figure 4.10: Comparison between calls with and without CFI on x86_64
CFI is enabled, control-flow checking will be enforced across DSO boundaries,
meaning that if you attempt to call a function in another library with the wrong
function signature, the call will be blocked at runtime.
static void __init_shadow_call_stack(pthread_internal_t* thread __unused) {
#ifdef __aarch64__
  // Allocate the stack and the guard region.
  char* scs_guard_region = reinterpret_cast<char*>(
      mmap(nullptr, SCS_GUARD_REGION_SIZE, 0, MAP_PRIVATE | MAP_ANON, -1, 0));
  thread->shadow_call_stack_guard_region = scs_guard_region;

  // The address is aligned to SCS_SIZE so that we only need to store the lower log2(SCS_SIZE) bits
  // in jmp_buf.
  char* scs_aligned_guard_region =
      reinterpret_cast<char*>(align_up(reinterpret_cast<uintptr_t>(scs_guard_region), SCS_SIZE));

  // We need to ensure that [scs_offset,scs_offset+SCS_SIZE) is in the guard region and that there
  // is at least one unmapped page after the shadow call stack (to catch stack overflows). We can't
  // use arc4random_uniform in init because /dev/urandom might not have been created yet.
  size_t scs_offset =
      (getpid() == 1) ? 0 : (arc4random_uniform(SCS_GUARD_REGION_SIZE / SCS_SIZE - 1) * SCS_SIZE);

  // Make the stack readable and writable and store its address in register x18. This is
  // deliberately the only place where the address is stored.
  char* scs = scs_aligned_guard_region + scs_offset;
  mprotect(scs, SCS_SIZE, PROT_READ | PROT_WRITE);
  __asm__ __volatile__("mov x18, %0" ::"r"(scs));
#endif
}
#include <stdio.h>

int foobar(void)
{
        return 0xbadc0de;
}

int main(void)
{
        return foobar() + 0x1337;
}
(a) ShadowCallStack disabled:

main:
sub     sp, sp, #32
stp     x29, x30, [sp, #16]
add     x29, sp, #16
mov     w8, #4919
stur    wzr, [x29, #-4]
str     w8, [sp, #8]
bl      foobar
ldr     w8, [sp, #8]
add     w0, w0, w8
ldp     x29, x30, [sp, #16]
add     sp, sp, #32
ret

(b) ShadowCallStack enabled:

main:
sub     sp, sp, #32
str     x30, [x18], #8
stp     x29, x30, [sp, #16]
add     x29, sp, #16
mov     w8, #4919
stur    wzr, [x29, #-4]
str     w8, [sp, #8]
bl      foobar
ldr     w8, [sp, #8]
add     w0, w0, w8
ldp     x29, x30, [sp, #16]
ldr     x30, [x18, #-8]!
add     sp, sp, #32
ret
4.7.1 CET
CET provides two capabilities according to Intel: shadow stacks and
indirect branch tracking. A shadow stack is a second stack used exclusively
for control transfer operations. When enabled, the call instruction pushes the
return address both to the normal stack and to the shadow stack. When the
function later returns, the ret instruction pops the return address from both
stacks and compares them. If they do not match, an exception
is raised. Shadow stacks should be properly protected so that they are not
easy targets for attackers with an arbitrary read/write primitive. For CET, the
shadow stack is separate from the normal stack, is only used to store control
transfer information, and is protected using page table protections. Thus, the
shadow stack is not directly writable by software; it is only accessed by
control transfer instructions (like call and ret) and shadow stack management
instructions. The other feature, indirect branch tracking, includes a new
instruction called endbranch used to mark valid jump target addresses of indirect
calls and jumps in the program. While ShadowCallStack on arm64 uses the
x18 register as a pointer to the shadow stack, CET introduces a dedicated
shadow stack pointer (SSP) register.
Indirect branch tracking changes the behavior of the jmp and call instructions.
The CPU implements a state machine to track indirect branch instructions. If
the instruction following an indirect jmp or call is not an endbranch
instruction, the control-flow transfer is considered invalid. In other words,
endbranch marks valid indirect call targets in the program. To provide
backward compatibility with older CPUs that do not support CET, the endbranch
instruction is encoded such that it is treated as a nop on these CPUs.
CET is not yet available in hardware, but support for it has been enabled
in GCC 8, binutils 2.31, and glibc 2.28, and was added to LLVM in December 2019.
There is also a patchset available for adding support to the Linux kernel [81],
but it has not been upstreamed yet.
4.7.2 PAC
PAC raises the bar for attackers trying to modify protected pointers in memory.
In this section we will describe how PAC is designed and what protection it
provides.
PAC takes advantage of the fact that the address space in 64-bit architectures
is smaller than 64 bits. The actual number of bits depends on the platform, but
usually only 48 or 55 bits are used. This means that 64-bit pointers have
unused bits in the upper part of the pointer, which can be used to
store a pointer authentication code. The PAC is inserted into each protected
pointer before it is written to memory, and verified before the pointer is used.
Several new instructions are introduced to sign and authenticate pointers.
PAC authenticates pointers using a cryptographically strong algorithm
named QARMA [8]. QARMA is “a new family of lightweight tweakable block
ciphers” [67]. A 128-bit key is used together with a context value to sign
pointers. The context is useful for separating different classes of pointers,
for example stack pointers and function pointers.
In section 3.5.2 we discussed how return addresses on the stack are protected
using a stack cookie. This mitigation adds some overhead to every function
prologue and epilogue, since the cookie first has to be stored on the stack
and then validated before returning from the function. With PAC, a single
instruction can be used to tag and verify the link register (LR), which is where
the return address is stored on arm64. The instruction PACIASP is used to
protect LR, and AUTIASP is used to verify it.
If pointers used for indirect function calls are signed using PAC, a form of
CFI can be implemented that ensures no invalid pointers are used for
function calls. Different contexts can be used to group function pointers together
and provide something similar to what clang does for its CFI implementation,
which we discuss in section 6.3.2.
Basic support for PAC exists in the Linux kernel, but there is no real usage
yet. Apple’s mobile operating system iOS has support for PAC, and it is enabled
on certain iPhones [10]. It will be interesting to see how support for PAC on
Linux develops over the next few years.
Chapter 5
5.1 Introduction
This chapter will go into a bit more detail on the Linux kernel, before delving
into what sort of optimization we can do using standard compilers and linkers.
First, we will talk more about the different releases of the kernel in section 5.2,
then move on to the Linux kernel’s build system in section 5.3. Finally we
will discuss different security features commonly enabled in the Linux kernel in
section 5.4.
Prepatch
These releases are often referred to as ”RC”, or release candidate, kernels. They
are not yet ready for use in stable releases, and must be compiled from source.
This is where new features can be tested before they end up in a normal stable
release and can be used by users who are not developers.
Mainline
Mainline is where all the new features from the rc releases are introduced. New
mainline releases come out every 2-3 months.
Stable
A mainline release is considered stable after it has been released. Bug fixes for
a stable release are backported from the mainline tree.
Longterm
Several ”longterm maintenance” kernel releases are provided for the purpose of
backporting bugfixes to older releases. Longterm releases usually only see
important bugfixes backported to them. However, new minor releases are relatively
frequent, while new major releases are infrequent.
Following is a table, accurate as of January 30th 2020, of the current longterm
release kernels and their maintainers [2].
5.4 Linux kernel security
In section 4.4.1 we discussed different security mitigations provided by compilers.
These protections are common for most software in user space as well as kernel
space. There are some interesting security mitigations that are special for the
Linux kernel, however. In the following subsections we will describe some of
these to highlight some of the steps taken to make the Linux kernel more secure.
These security features are controlled by configuration options and may not be
enabled for certain distros or Android devices. Although most of these are not
directly relevant for CFI, they highlight the need for a diverse set of different
security mechanisms in a modern operating system kernel.
5.4.1 KASLR
As mentioned in section 3.4 we have a protection known as kernel address space
layout randomization (KASLR). This is the kernel version of ASLR, which we
described in section 4.4.2.
5.4.2 SMEP/PXN
Supervisor Mode Execution Prevention (SMEP) and Privileged Execute Never
(PXN) are CPU features, for x86_64 and arm64 respectively, that prevent the
kernel from executing code located in user space. Without these protections,
an attacker can redirect execution from the kernel into user space, where the
attacker potentially has full control over the memory layout. This type of attack
is known as ret2usr [37]. On x86_64, SMEP is enabled by setting bit 20 in the
CR4 register. On arm64, PXN is controlled through translation table entries.
5.4.3 SMAP/PAN
Supervisor Mode Access Prevention (SMAP) and Privileged Access Never (PAN)
are CPU features, for x86_64 and arm64 respectively, that prevent access to
unprivileged data. They make sure that the kernel cannot directly access
memory in user space. To access user-space memory, the kernel has to use
functions designed for this purpose, like copy_from_user(). On x86_64, SMAP
is enabled by setting bit 21 of CR4. To access memory in user space on arm64,
one either has to clear the PAN bit or use the specialized instructions ldt* and
stt* for loading and storing memory.
On arm64, PAN is only supported on devices based on ARMv8.1 and newer.
For older devices there is a software-emulated version.
STRICT_DEVMEM will put restrictions on what memory is available through this
device. If IO_STRICT_DEVMEM is not enabled, user space can access the PCI
space and BIOS code/data regions through /dev/mem. If IO_STRICT_DEVMEM is
enabled, only idle io-memory ranges are accessible.
/*
* Validates that the given object is:
* - not bogus address
* - fully contained by stack (or stack frame, when available)
* - fully within SLAB object (or object whitelist area, when available)
* - not in kernel text
*/
void __check_object_size(const void *ptr, unsigned long n, bool to_user)
{
if (static_branch_unlikely(&bypass_usercopy_checks))
return;
Chapter 6
Forward-edge and
backward-edge control-flow
integrity in the Linux kernel
6.1 Introduction
This chapter describes how support for LTO, CFI, and SCS is implemented
in the Linux kernel. Before we can use these features, however, the kernel has
to be built with clang. In section 6.2, we describe the work that has gone
into making this possible. Then in section 6.3 we describe the actual patches
implementing LTO, CFI, and SCS in the kernel. In section 6.3.3 we describe
what a CFI failure looks like, and how we can fix false positives by patching the
kernel. In section 6.4 we describe the SCS implementation in further detail.
void func(size_t n)
{
        /* VLA */
        char array[n];
}
The code is totally useless, but shows how one could use VLAs in C code.
The array named array has a size determined at runtime by the n parameter
to func(). VLAs have never been popular in the Linux kernel, and Linus
Torvalds has stated his dislike of VLAs on the public kernel mailing list.
With the 4.20 kernel, however, VLAs have been completely removed. The main
reasons were performance and security: compilers generate slow code for VLAs,
and they may lead to security bugs if the array size is not properly checked.
In addition, clang does not support the GNU extension of variable-length arrays
inside structures, so removing VLA usage from the kernel was also necessary to
build it with clang.
-flto for full LTO, or -flto=thin for ThinLTO. If the clang version supports
the flag, -fsplit-lto-unit will be used to enable LTO unit splitting, which
splits bitcode objects into regular and thin LTO halves; this is needed when
CFI is enabled, for example. To speed up incremental builds with ThinLTO, a cache
can be used. The cache directory can be specified with a linker flag 2 .
When using loadable modules, it is important that the module is built for
the exact same kernel it is loaded on. Many things can change between
kernel versions, and by changing configuration options, two kernels with the same
version can vary greatly. Structure layouts may change, functions may become
deprecated, and new functionality could be added or removed. To make sure
that it is safe to load a module, the kernel has an option called MODVERSIONS.
MODVERSIONS stores checksums in a special section of every module, and when
the kernel loads the module it can check whether the module is compatible with the
running kernel.
LTO produces bitcode files instead of object files, so some tricks have to be
used to extract symbols from the bitcode files using the llvm-nm utility instead
of objdump, which is normally used.
To reduce binary size, the flags -mllvm -import-instr-limit=5 are used
to limit inlining across translation unit boundaries. The authors report an
11 % decrease in size for a stripped arm64 defconfig vmlinux binary when switching
from the default value of 100 to 5.
pertaining to CFI. First up, CFI_CLANG enables CFI. This option depends on
LTO. By default, CFI violations result in a kernel panic. This might not be
desirable, especially during development. The option CFI_PERMISSIVE can be
used to enable permissive CFI mode, where a warning is printed instead of
panicking when a violation is caught. CFI_CLANG_SHADOW speeds up cross-module
CFI checks; more on this option later in this section.
All modules now contain a CFI check function. This function is generated
by the compiler and is named __cfi_check(). The prototype for the function
looks like this:
Some parts of the kernel are not instrumented with CFI, for example error
handlers, functions that jump to a physical address, and exception callbacks.
Finally, we will discuss some of the minor changes that have been made to
different parts of the C code. The kernel has something called kallsyms, which
is used for symbol lookup within the kernel. From user space it can be accessed
through the /proc/kallsyms file. When both ThinLTO and CFI are enabled,
LLVM appends a hash to static function names, which might break tools relying
on information from kallsyms. To prevent potential issues, the following code
snippet is added to remove hashes from function names:
When running a kernel with CFI enabled on an Intel NUC, a CFI failure was
reported in a networking driver. Following is the CFI failure message with some
information removed to make it more readable.
The kernel was compiled with permissive CFI, which means that it will
continue running even though a CFI failure is encountered. When permissive
mode is disabled, the kernel will panic instead. The code uses the standard
WARN macro, which will log information like the ID of the current processor, the
PID of the current process, and a stack trace. In addition, the CFI code will
print the indirect function call target that caused the failure.
The target that caused this CFI failure was cfg80211_wext_giwname(). It
was called from ioctl_standard_call(). Looking at the prototype of the called
function we see that it looks like this:
The next step is to look at the call site to see where the error occurs. See
the following code snippet 3 :
3 defined in net/wireless/wext-core.c
1 /*
2 * Wrapper to call a standard Wireless Extension handler.
3 * We do various checks and also take care of moving data between
4 * user space and kernel space.
5 */
6 static int ioctl_standard_call(struct net_device * dev,
7 struct iwreq *iwr,
8 unsigned int cmd,
9 struct iw_request_info *info,
10 iw_handler handler)
11 {
12 const struct iw_ioctl_description * descr;
13 int ret = -EINVAL;
14
41 return ret;
42 }
At line 24 we can see that the handler passed to ioctl_standard_call() is
called. The handler has the type iw_handler, which looks like this:
Comparing iw_handler to the declaration of cfg80211_wext_giwname() we see
that the type of the third parameter does not match. The handler expects an
argument of type union iwreq_data while the function actually takes a char
*. The fix here is to make the two function signatures match: we can change
cfg80211_wext_giwname() to use the expected parameter type instead of
char *. Following is an excerpt of the definition of iwreq_data with comments
removed for brevity:
union iwreq_data {
char name[IFNAMSIZ];
Also see the following code snippet with the definition of cfg80211_wext_giwname():
iwreq_data is a union type, so simply casting it to a char * works fine here.
However, to make the function compatible with the iw_handler type, we have to
change the char * parameter to a union iwreq_data * and copy the name into
the name field of this union instead. A patch to fix this issue can be found in
figure 6.1. Most of the CFI failures can be, and have been, fixed this way.
Fixing these simple CFI failures does not require large changes to the original code.
6.4 ShadowCallStack
In section 4.6.3 we briefly introduced ShadowCallStack, or SCS for short. In this
section we will dive deeper into how SCS is implemented in the Linux kernel.
As of early April 2020, the SCS patchset 4 consists of 12 commits touching 31
files, with 440 additions and 7 deletions in total. Compared with
the patchsets for LTO and CFI, this is a little simpler.
Let us start with the changes to the build system. A new configuration
option, ARCH_SUPPORTS_SHADOW_CALL_STACK, is added so that architectures can
signal that they support SCS. To enable SCS, the SHADOW_CALL_STACK option
is used. The compiler also needs to support SCS, which currently means clang
version 7.0 or newer. As the official documentation for SCS states
[48], the compiler flag -ffixed-x18 has to be enabled on arm64, since the SCS
implementation uses the x18 register to store the shadow stack pointer. To
4 from https://github.com/samitolvanen/linux
---
include/net/cfg80211-wext.h | 2 +-
net/wireless/wext-compat.c | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)
/*
 * Called when gcc's -fstack-protector feature is used, and
 * gcc detects corruption of the on-stack canary value
 */
__visible void __stack_chk_fail(void)
{
        panic("stack-protector: Kernel stack is corrupted in: %pB",
              __builtin_return_address(0));
}
EXPORT_SYMBOL(__stack_chk_fail);
struct thread_info {
        unsigned long           flags;
        mm_segment_t            addr_limit;
#ifdef CONFIG_ARM64_SW_TTBR0_PAN
        u64                     ttbr0;
#endif
        union {
                u64             preempt_count;
                struct {
#ifdef CONFIG_CPU_BIG_ENDIAN
                        u32     need_resched;
                        u32     count;
#else
                        u32     count;
                        u32     need_resched;
#endif
                } preempt;
        };
#ifdef CONFIG_SHADOW_CALL_STACK
        void                    *shadow_call_stack;
#endif
};
The only field added is the shadow call stack pointer at the end of the
struct. Some minor changes have been made to the core arm64 code to support
SCS. Upon entry to the kernel from user space, the shadow stack pointer has to be
loaded from the task's thread_info into x18. Conversely, when returning to user
space from the kernel, the shadow stack pointer is saved. These operations are
performed with the scs_load and scs_save assembly macros, displayed
below:
TSK_TI_SCS_SP is the offset of the shadow call stack member in the thread_info
struct.
Chapter 7
Performance benchmarks
7.1 Introduction
In this chapter we will discuss the chosen performance benchmarks: what
hardware and software was used, why each benchmark was chosen, and how the
benchmarks are performed.
The goal is to measure the performance impact of LTO, CFI, and SCS on the
kernel. To have a baseline to compare against, kernels compiled with
gcc and clang without any special optimizations will be used. Since several
companies and projects are starting to adopt clang/LLVM as their toolchain
of choice, it is also interesting to see whether there are any notable differences in
performance between gcc and clang. The kernel configurations in table 7.1 will
be benchmarked on x86_64. For arm64, the configurations in table 7.2 will be
benchmarked.
In the patchset for LTO support in the kernel, the inline limit is set to 5,
instead of the default of 100. This decreases the size of the resulting kernel
binary at the cost of (potential) performance gains. As the patchset is meant for
Android phones, where size is of greater concern than on laptops and servers,
for example, it makes sense to limit inlining. Since binary size is less
important on x86_64, we will increase the limit to see how much impact it has on
performance and size. Although inlining should increase performance, the
increased size may also negatively impact performance if it leads to worse cache
locality or other similar issues. The patches implementing a configurable limit
can be found in section B.2. This limit is controlled with a linker flag called
ImportInstrLimit().

compiler  LTO  inline limit  CFI  SCS
gcc       no   N/A           no   no
clang     no   N/A           no   no
clang     yes  5             no   no
clang     yes  100           no   no
clang     yes  5             yes  no
clang     yes  100           yes  no
clang     yes  5             yes  yes
clang     yes  100           yes  yes

2 Debian 1:2.11+dfsg-1ubuntu7.21
3 see tools/create-image.sh in the syzkaller sources
For Android, it does not make much sense to run benchmarks for web servers
like nginx and apache2. We could have performed the same benchmarks if we
targeted servers running arm64, for example. Redis is likewise more commonly used
on servers and desktop computers. WireGuard has not made it into the Android
common kernel, so we will skip that as well.
So what should we benchmark on Android? As discussed in section 2.3,
binder is a popular IPC mechanism on Android. Binder is implemented in the
kernel, making it a perfect component to benchmark, since it is heavily used by
apps and will be directly affected by LTO, CFI, and SCS. The benchmarks
chosen are based on the official Android documentation for performance testing
[5]. Throughput and latency tests for binder and hwbinder are included in the
AOSP. To make sure that CPU throttling does not affect the benchmarks, it
has been turned off by setting the Linux CPU scaling governor to performance.
This mode makes sure that the CPU runs at maximum frequency. The script
used for running the benchmarks can be found in appendix A.5.
For Android, the hikey-linaro-android-4.19 kernel was chosen. Kernel
versions hikey-linaro-android-4.14, hikey-linaro-android-4.19, and
common-android-5.4 are supported on the HiKey 960. At the time of
compilation, the newest commit for the 4.19 kernel was 03a6248 4 . For x86_64,
kernel version 5.7.0-rc2 was used.
benchmarks like binary size and compilation time. In the following sections we
will briefly discuss some of the benchmarks and why we perform them, and look
at papers that have inspired the performance benchmarking for this thesis.
However, no previous research on the performance impact of clang, LTO, and
CFI on x86_64 was found.
• Linpack
• IOZone
Just like the paper from Österlund et al. [63], he uses LMBench. The other
tools are not used in that research, however. Cachebench is part of the
LLCbench benchmarking suite and is used to determine parameters of an
architecture's memory subsystem. The goal of the benchmark is to
parameterize the performance of the multiple levels of cache present in the
processor. This benchmark is great for measuring cache performance, but it is
not very relevant to this thesis, since the CPU caches should not be greatly
affected by which kernel is running during the benchmark.
Another benchmarking suite used is Linpack, which “measures the peak
value of floating point computational power, also known as GFLOPS” [62].
While this benchmark is great for testing the performance of virtual machines
or hardware directly, it is another benchmark where the running kernel
should not have any impact.
Lastly, IOZone is used. This is a file system benchmark that generates and
measures a variety of file operations. According to the manual, the benchmark
tests file I/O performance using the following operations: read, write, re-read,
re-write, read backwards, read strided, fread, fwrite, random read/write, and
pread/pwrite variants.
DROP THE ROP. Fine-grained Control-flow Integrity for the Linux
Kernel
In this paper, Moreira et al. [58] present kCFI, a kernel-level fine-grained CFI
mechanism, and measure its performance. They also present a novel
technique called Call Graph Detaching (CGD), which enables the construction
of more precise CFGs. When evaluating the performance of kCFI they use
LMbench for micro benchmarking, just like many of the other previously
mentioned papers. For macro benchmarks they use the following tests from the
Phoronix Test Suite [56]: IOZone, Linux Kernel Unpacking, PostMark, Timed
Linux Kernel Compilation, GnuPG, OpenSSL, PyBench, Apache Benchmark,
PHPBench, Dbench, and PostgreSQL. They ran far more macro benchmarks
than this thesis does, and the Apache Benchmark was the one where they saw
the biggest performance difference. They measured a 2 % increase in code size
when applying kCFI and a 4 % increase when applying kCFI + CGD.
OpenMandriva
OpenMandriva Lx [60] is one Linux distribution where packages are built using
clang with LTO enabled [61]. They have switched to clang as their default
compiler toolchain instead of gcc, which is the norm for most Linux distributions.
In addition, they have started building their Linux kernels with clang as well,
although LTO is not yet supported for the kernel.
requires more than the 16 GB of RAM available on the laptop. Because of
these issues, full LTO was not included.
name                 type
LMBench              micro
redis-benchmark      macro
ApacheBench          macro
WireGuard bandwidth  macro
LMBench has a lot of different benchmarks which are mainly divided into
two categories: latency and bandwidth. See table 7.4 for a description of the
different benchmarks.
name                                 lmbench tool           description
read bandwidth                       bw_file_rd io_only     read a file in 64KB blocks
read open2close bandwidth            bw_file_rd open2close  read a file in 64KB blocks, with open/close
mmap read bandwidth                  bw_mmap_rd mmap_only   map file in memory and read
mmap read open2close bandwidth       bw_mmap_rd open2close  map file in memory and read, with open/close
libc bcopy unaligned                 bw_mem bcopy           measures how fast the system can bcopy data
libc bcopy aligned                   bw_mem bcopy conflict  measures how fast the system can bcopy data
memory bzero bandwidth               bw_mem bzero           measures how fast the system can bzero memory
unrolled bcopy unaligned             bw_mem fcp             measures the time to copy data from one location to another
unrolled partial bcopy unaligned     bw_mem cp              measures the time to copy data from one location to another
memory read bandwidth                bw_mem frd             measures the time to read data into the processor
memory partial read bandwidth        bw_mem rd              measures the time to read data into the processor
memory write bandwidth               bw_mem fwr             measures the time to write data to memory
memory partial write bandwidth       bw_mem wr              measures the time to write data to memory
memory partial read/write bandwidth  bw_mem rdwr            measures the time to read data into memory and then write to the same location
7.6 WireGuard
In addition to the previously mentioned benchmarks, another interesting
benchmark was added to this thesis. In version 5.6 of the Linux kernel, support
for WireGuard was added. WireGuard [26] is a secure network tunnel that aims
to replace IPsec and other solutions like OpenVPN. WireGuard is much less
complex than IPsec and uses modern cryptographic primitives. All the core
functionality is implemented directly in the Linux kernel. Since WireGuard is
part of the Linux kernel, it is interesting to see whether LTO and CFI have
any effect on its performance. The WireGuard source code contains a script
for running tests, but we will be using a slightly modified version of that script
used in the Phoronix Test Suite 5 .
The script used to test WireGuard is located at
tools/testing/selftests/wireguard/netns.sh in the Linux kernel sources.
5 see A.7 for more information
Both of the scripts start by creating a network topology consisting of three
network namespaces. The following diagram shows the topology 6 :
_____________________ _____________________________ ____________________
| $ns1 namespace | | $ns0 namespace | | $ns2 namespace |
| | | | | |
| ________ | | ________ | | _______ |
|| wg0 |__________|___|_________| lo |__________|___|___________| wg0 ||
||________|_________ | | _______|________|________ | | __________|_______||
||192.168.241.1/24 || | |(ns1) (ns2) | | ||192.168.241.2/24 ||
||fd00::1/24 || | |127.0.0.1:1 127.0.0.1:2| | ||fd00::2/24 ||
||__________________|| | |[::]:1 [::]:2 | | ||__________________||
|____________________| | |_________________________| | |____________________|
|_____________________________|
Namespaces [38] [39] [40] [41] [42] [43] [27] are heavily used in container
technology like Docker [35]. They wrap a global resource in an abstraction
that makes the processes within the namespace believe that they have their
own instance of that global resource. Linux provides many different namespaces,
including cgroup, IPC, network, mount, and PID namespaces.
The WireGuard test script creates the three network namespaces named $ns0,
$ns1, and $ns2 shown in the diagram above. $ns1 and $ns2 do not talk directly to
each other, but communicate through $ns0: all the traffic goes through the loopback
interface in $ns0. This is a great benchmark for WireGuard, as it puts a lot of
pressure on the underlying cryptographic protocols used to protect the
communication between two WireGuard peers.
Chapter 8
Results
8.1 Introduction
In the following sections we will discuss the results from benchmarks performed.
First, we will summarize the configurations used. The first and most common
configuration is a kernel compiled using gcc version 9.3.0-10 with a fairly new
standard Linux kernel configuration from Ubuntu (5.3.0-26-generic). Then
we have a kernel compiled with clang 11 1 . The configuration is the same for gcc
and clang. Next up we have two kernels compiled with clang and LTO enabled.
The first one has the inlining limit set to 5, while the other one (LTOv2) has
it set to 100. Finally, we have two kernels with CFI enabled. As with the
LTO kernels, one has an inline limit of 5, while the other (CFIv2) has it set to
100. Note that LTOv2 may be referred to as LTO100, and CFIv2 referred to as
CFI100 in the benchmark results.
In section 8.3 we will look at the binary sizes of the different compiled kernels.
Then in section 8.4 we will look at compilation time. We will then move on to
micro benchmarks with LMBench in section 8.5.1. Finally, we will look at
different macro benchmarks. In section 8.6.1 we will look at redis performance,
followed by nginx/apache2 performance measured using ApacheBench in section
8.6.2. The last macro benchmark is for WireGuard and can be found in section
8.6.3.
After looking at the benchmark results we will summarize our findings in
section 8.8 and finally discuss future work in section 8.9.
8.3 Binary size
In table 8.1 we can see the resulting sizes of the different kernels. Note that all
sizes are displayed in bytes. gcc produces the smallest kernel, followed by clang,
then the two LTO kernels, and finally kernels with LTO and CFI. CFI produces
the largest kernel, which is expected because of all the extra validation code and
jump tables. The percentage increase in table 8.1 is measured from the smallest
size. The CFI kernel is 20.21 % larger in size than the kernel compiled with gcc.
We can see that the increased inlining greatly increases binary size. There is an
11.67 % increase in size from LTO to LTOv2, and a 10.66 % increase from CFI
to CFIv2.
kernel version compiler/feature size percent increase
5.7.0-rc2 gcc 8409920 N/A
5.7.0-rc2 clang 8737856 3.90 %
5.7.0-rc2 LTO 8810432 4.76 %
5.7.0-rc2 LTOv2 9791520 16.43 %
5.7.0-rc2 CFI 10109952 20.21 %
5.7.0-rc2 CFIv2 11005760 30.87 %
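The percentage column above can be reproduced directly from the raw sizes; a small sketch (sizes copied from table 8.1):

```python
# Kernel image sizes in bytes, copied from table 8.1.
sizes = {
    "gcc": 8409920,
    "clang": 8737856,
    "LTO": 8810432,
    "LTOv2": 9791520,
    "CFI": 10109952,
    "CFIv2": 11005760,
}

baseline = min(sizes.values())  # the gcc kernel is the smallest

# Percent increase relative to the smallest kernel.
increase = {name: round((size - baseline) / baseline * 100, 2)
            for name, size in sizes.items()}

print(increase["CFI"])  # 20.21
```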
In table 8.2 we can see the size of the Android kernel images. The size
increases are relatively similar to the x86_64 results. From LTO to LTOv2 there is a 13.20 % increase in size. From CFI to CFIv2 there is a 10.12 % increase.
Enabling SCS increases the size slightly, but it is less than 1 %.
In the following snippet we run the find command to locate all files named
vmlinux in our home folder. To measure the time elapsed we simply put time
in front of the command.
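The same elapsed-time measurement can also be scripted; a minimal Python sketch (a no-op command stands in for the find invocation so the sketch runs anywhere):

```python
import subprocess
import time

start = time.perf_counter()
# "true" is used as a stand-in for `find ~ -name vmlinux`, so the
# sketch runs without scanning the file system.
subprocess.run(["true"], check=True)
elapsed = time.perf_counter() - start
print(f"elapsed: {elapsed:.3f}s")
```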
ccache was not used for any of these benchmarks, and all of the kernels
were built using new output directories so that no old object files could be
reused. Everything was built from scratch. ccache can greatly speed up the
build process after the kernel has been built once, since it will cache object files
created during compilation that can be reused later. ccache will not help speed
up the LTO and linking steps, however. Clang does have support for a ThinLTO
cache which can help speed up incremental compilation.
8.5.1 LMBench
To make the results from LMBench more readable, they have been split into categories based on their execution time. Short benchmarks are grouped together,
and longer ones are grouped together. In figure 8.1, we see the results for the
simple syscall benchmarks, signal handlers, protection fault, pipe latency, and
AF UNIX socket stream latency. In figure 8.2, we see the results for the different
select syscall benchmarks. We can see process benchmarks in figure 8.3. The
bandwidth benchmarks can be found in figures 8.4, 8.5, 8.6, 8.7, 8.8, 8.9, 8.10,
8.11, 8.12, and 8.13.
For the simple LMBench benchmarks the results are pretty even between the
different kernel configurations. The simple syscall benchmark was very unstable
between different runs of the benchmarks, and we can thus ignore it. The gcc
kernel was the fastest one in 7 out of 11 of the simple benchmarks. The CFI
kernels performed worst on average.
[Figure 8.1: LMBench simple benchmark results in microseconds for the gcc, clang, lto, cfi, lto_100, and cfi_100 kernels, including simple read, write, stat, fstat, open/close, simple syscall, protection fault, and pipe latency]
For the select() syscall results in figure 8.2, gcc was the fastest kernel in
every single benchmark. The CFI kernels performed worst on average here as
well.
In the process benchmarks, where the timing of process creation is measured,
the LTOv2 kernel was the fastest in all of the benchmarks.
It is not very surprising that the CFI kernels performed worst, as CFI adds
a lot of overhead to every single indirect function call. There was not a very
big difference in the performance of the different kernels, however, which may
indicate that CFI does not slow down the kernel that much in these micro
benchmarks.
To get a better understanding of why the benchmarks might differ between
the kernel configurations, we will examine a syscall in the kernel images. We will
start by examining the stat syscall, which is implemented in fs/stat.c. The
size of the structure passed to the stat syscall has changed over time, leading
to several versions of the syscall being present. We will focus on the newest
version, which is newstat. In figure 8.14 we see the definition of newstat in the
kernel sources.
[Figure 8.2: LMBench select() benchmark results in microseconds, including select on 10 fd's, 100 fd's, and 100 tcp fd's]
[Figure 8.3: LMBench process benchmark results in microseconds, including process fork+exit and fork+execve]
Figure 8.4: LMBench bandwidth read results
[Figures 8.5 to 8.13: further LMBench bandwidth results in MB/s against transfer size in MB, including memory partial read and memory partial write]
SYSCALL_DEFINE2(newstat, const char __user *, filename,
struct stat __user *, statbuf)
{
struct kstat stat;
int error = vfs_stat(filename, &stat);
if (error)
return error;
return cp_new_stat(&stat, statbuf);
}
The entry point in the kernel images is __x64_sys_newstat(). For 32-bit syscalls, the entry point is __ia32_sys_newstat(). This function is just a thin wrapper around the real implementation, which is named __do_sys_newstat() in the gcc kernel, and __se_sys_newstat() in the kernels compiled with clang.
The implementation examples from the different kernels have been converted
from assembly to C for a more compact representation of the code. In addition,
we have attempted to use code from the kernel sources to make it look more
natural. The code for the gcc and clang kernels can be found in figure 8.15 and
figure 8.16 respectively.
int __se_sys_newstat(const char *filename, struct stat *statbuf)
{
	int error;
	struct kstat stat;
	struct stat tmp;

	error = vfs_stat(filename, &stat);
	if (error)
		return error;

	if (!valid_dev(stat.dev) || !valid_dev(stat.rdev))
		return -EOVERFLOW;
	INIT_STRUCT_STAT_PADDING(tmp);
	tmp.st_dev = encode_dev(stat.dev);
	tmp.st_ino = stat.ino;
	if (sizeof(tmp.st_ino) < sizeof(stat.ino) && tmp.st_ino != stat.ino)
		return -EOVERFLOW;
	tmp.st_mode = stat.mode;
	tmp.st_nlink = stat.nlink;
	if (tmp.st_nlink != stat.nlink)
		return -EOVERFLOW;
	SET_UID(tmp.st_uid, from_kuid_munged(current_user_ns(), stat.uid));
	SET_GID(tmp.st_gid, from_kgid_munged(current_user_ns(), stat.gid));
	tmp.st_rdev = encode_dev(stat.rdev);
	tmp.st_size = stat.size;
	tmp.st_atime = stat.atime.tv_sec;
	tmp.st_mtime = stat.mtime.tv_sec;
	tmp.st_ctime = stat.ctime.tv_sec;
	tmp.st_blocks = stat.blocks;
	tmp.st_blksize = stat.blksize;
	return copy_to_user(statbuf, &tmp, sizeof(tmp)) ? -EFAULT : 0;
}
The first thing to note is that the function is significantly smaller in the gcc kernel. The reason for this is that the call to cp_new_stat() has been inlined in the kernel compiled with clang. Other than this, the clang, LTO, and CFI kernels look the same. However, if we look into vfs_statx() we start to see more differences.
We will not display full code listings here, but rather we will discuss important differences between the different versions of the compiled function. The original code definition of vfs_statx() in the kernel looks like this:
1 int vfs_statx(int dfd, const char __user *filename, int flags,
2 struct kstat *stat, u32 request_mask)
3 {
4 struct path path;
5 int error = -EINVAL;
6 unsigned lookup_flags;
7
8 if (vfs_stat_set_lookup_flags(&lookup_flags, flags))
9 return -EINVAL;
10 retry:
11 error = user_path_at(dfd, filename, lookup_flags, &path);
12 if (error)
13 goto out;
14
In the gcc kernel, no real differences between the binary and source code were found. vfs_stat_set_lookup_flags() on line 8 and user_path_at() on line 11 were both inlined, but these are marked as static inline in the kernel sources, so this is not a huge surprise. Clang inlined the call to vfs_getattr() as well. This is a very small function that is not used in many places in the kernel, however, so it makes sense to inline it. The LTO kernel takes it one step further and inlines user_path_at_empty() as well. In the CFI kernel, however, the call to user_path_at_empty() is not inlined as it goes through a jump table.
user_path_at_empty() is defined in fs/namei.c while vfs_statx() is defined in fs/stat.c, so it makes sense that the clang kernel is not able to inline this call as it exists in another object file. The function is very small, so it makes sense that the LTO kernel inlines it across object files.
Recall that the LTOv2 kernel can inline larger functions across object file boundaries, so we expect to see more aggressive inlining in the LTOv2 kernel compared to the others. This is exactly what we observe, as the LTOv2 kernel inlines the call to path_put() as well. The CFIv2 kernel also inlines the call to path_put(), but the call to user_path_at_empty() still goes through a jump table and thus is not inlined.
Both the LTOv2 and CFIv2 kernels further inline vfs_getattr(). This is how the function looks in C code:
int vfs_getattr(const struct path *path, struct kstat *stat,
u32 request_mask, unsigned int query_flags)
{
int retval;
retval = security_inode_getattr(path);
if (retval)
return retval;
return vfs_getattr_nosec(path, stat, request_mask, query_flags);
}
An increase in code size does not mean that the number of instructions executed when running the function increases, however. When inlining functions,
the function call overhead is removed, as there are no call instructions used,
no need to pass arguments in registers, and no need for function prologues and
epilogues. In addition, the compiler might be able to perform optimizations
that were not possible before inlining the call, like smarter register usage and
so on.
Consider the assembly code listing from the gcc kernel in figure 8.17. The code is from the user_path_at_empty() function, which was inlined in several kernel configurations. When this function is inlined, there is no need for the prologue, which contains roughly 10 instructions. The epilogue is also omitted, saving around 5 instructions. Although modern CPUs can execute billions of instructions per second, the number of instructions can have an impact if the number is big enough. So when the whole kernel is affected by aggressive inlining, the overall number of instructions executed to perform an action may decrease, leading to faster execution times. For these simple results, however, it seems like the performance may be negatively impacted by the increased code size, as the gcc kernel performed well in many of the benchmarks. The difference was minuscule, however, so perhaps LMBench does not produce results that are accurate enough to really show any difference.
1 push rbp
2 mov rbp, rsp
3 push r14
4 mov r14, rcx
5 push r13
6 mov r13d, edi
7 mov rdi, rsi
8 push r12
9 mov r12d, edx
10 mov rdx, r8
11 mov esi, r12d
12 call getname_flags
13 mov rcx, r14
14 mov edx, r12d
15 mov edi, r13d
16 mov rsi, rax
17 xor r8d, r8d
18 call filename_lookup_0
19 pop r12
20 pop r13
21 pop r14
22 pop rbp
23 retn
Figure 8.17: user_path_at_empty() assembly code from gcc kernel
[Figure: grouped bar chart of Requests/s for the gcc, clang, LTO, CFI, LTO100, and CFI100 kernels across the redis benchmarks]
Figure 8.18: redis-benchmark results
8.6.1 Redis
In figure 8.18 we can see the results from running redis’ benchmarking suite on
the different kernels.
The Y axis of the graph measures requests per second, so higher numbers mean better performance. For this benchmark we can clearly see that CFI is the slowest for a majority of the benchmarks. The gcc kernel performs a little better than clang on average, but the LTO kernel performs very well, beating the others in almost every single benchmark. One interesting thing to note is that LTO usually beats LTOv2 in these benchmarks, which may indicate that increased inlining is slower for certain workloads.
kernel PING_INLINE PING_BULK SET GET INCR LPUSH RPUSH LPOP RPOP SADD HSET SPOP LPUSH LRANGE_100 LRANGE_300 LRANGE_500 LRANGE_600 MSET
gcc 3.56 3.23 4.89 2.74 0.0 0.15 0.61 1.90 3.52 5.76 0.99 1.78 0.0 0.0 0.07 0.86 2.71 2.94
clang 1.82 1.91 1.60 0.0 1.12 0.46 0.30 1.29 1.68 1.15 1.15 0.0 0.99 1.30 0.25 0.16 1.60 2.58
lto 0.0 0.0 0.0 0.67 0.15 0.0 0.0 0.0 0.0 0.0 0.0 0.07 0.15 0.58 0.0 0.19 1.08 0.53
cfi 11.43 13.51 13.98 12.81 12.53 9.34 11.80 11.20 13.46 13.20 10.86 11.29 10.86 5.22 2.36 1.31 1.43 8.57
lto100 1.21 3.82 1.53 1.70 2.24 0.76 1.29 2.28 4.20 4.83 1.76 1.63 0.38 0.84 0.25 0.0 0.0 0.0
cfi100 9.24 13.80 11.76 11.41 11.04 8.88 10.06 12.72 12.31 13.12 11.85 11.00 11.02 4.24 1.99 1.22 1.79 6.41
To see the differences between the benchmarks more clearly, we have plotted the difference in percent in table 8.5. The green cells represent the fastest benchmarks, the yellow ones are 0.1 to 4.9 % slower, and the red cells are more than 5 % slower. We can clearly see that the LTO kernel is the fastest overall. Where the LTO kernel is not the fastest, it is within 1 % of the fastest kernel in all benchmarks except one.
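The per-cell percentages in table 8.5 mirror the formula used by the plotting script in appendix A: each cell is the relative difference against the fastest kernel for that benchmark. A small sketch with illustrative (not measured) requests/s numbers:

```python
# Illustrative requests/s for one redis benchmark; these numbers are
# assumptions for the sketch, not measurements from the thesis.
results = {"gcc": 70000.0, "clang": 71000.0, "lto": 72000.0, "cfi": 64000.0}

best = max(results.values())

# 0.0 marks the fastest kernel; every other cell shows the relative
# difference (best - val) / val in percent, as in the appendix script.
diff = {name: round((best - val) / val * 100, 2)
        for name, val in results.items()}

print(diff)
```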
8.6.2 ApacheBench
This section shows the results from running ApacheBench against nginx and
Apache on the different kernels. Figure 8.19 shows the number of requests per
second for nginx. Figure 8.20 shows the average request time for nginx. Figure
8.21 shows the number of requests per second for Apache, and figure 8.22 shows
the average request time for Apache. The benchmarks are run with 1, 10, 20, and 30 concurrent connections, and each of these benchmarks is run 10 times. An average is calculated over these results.
In addition to the figures we have some tables showing the performance
difference between the benchmarks. The best benchmarks are represented by
green cells. The yellow cells represent benchmarks with a decreased performance
of 0.1 to 4.9 %. The red cells represent benchmarks that have a slowdown of
5.0 % or more.
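The thresholds behind the cell colors can be sketched as a small helper (the function name is ours, not from the thesis scripts):

```python
def cell_color(slowdown_percent: float) -> str:
    """Map a relative slowdown in percent to the cell color used in
    the benchmark tables."""
    if slowdown_percent == 0.0:
        return "green"   # fastest kernel for this benchmark
    if slowdown_percent < 5.0:
        return "yellow"  # 0.1 to 4.9 % slower
    return "red"         # 5.0 % or more slower

print(cell_color(17.36))  # red
```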
kernel 1 10 20 30
gcc 4.07 6.35 3.72 5.58
clang 1.52 8.19 3.37 3.09
lto 1.30 4.23 0.0 0.98
cfi 15.62 16.28 14.17 14.68
lto100 0.0 0.0 1.24 0.0
cfi100 14.70 17.11 17.12 17.36
kernel 1 10 20 30
gcc 4.28 6.36 3.91 5.64
clang 1.52 8.90 3.68 3.06
lto 1.41 4.51 0.0 0.71
cfi 15.49 16.42 14.61 14.34
lto100 0.0 0.0 1.84 0.0
cfi100 14.54 17.46 17.49 17.27
kernel 1 10 20 30
gcc 6.50 4.10 5.06 3.67
clang 0.0 2.70 2.69 1.18
lto 2.19 0.0 3.73 1.16
cfi 17.75 13.08 13.93 12.23
lto100 13.80 0.26 0.0 0.0
cfi100 21.60 12.01 13.03 12.33
The redis benchmarks showed that the LTO kernel performed the best, while
these benchmarks show that LTOv2 has the best overall performance. The CFI
kernels see a significant performance drop here as well.
[Figure 8.19: ApacheBench/nginx requests per second for 1, 10, 20, and 30 concurrent connections]
[Figure 8.20: ApacheBench/nginx time per request (ms) for 1, 10, 20, and 30 concurrent connections]
[Figure 8.21: ApacheBench/apache2 requests per second for 1, 10, 20, and 30 concurrent connections]
[Figure 8.22: ApacheBench/apache2 time per request (ms) for 1, 10, 20, and 30 concurrent connections]
kernel 1 10 20 30
gcc 7.24 4.16 5.03 3.57
clang 0.0 3.00 2.76 1.16
lto 2.10 0.0 3.75 1.06
cfi 18.26 13.07 13.82 12.26
lto100 14.19 0.39 0.0 0.0
cfi100 22.16 12.20 12.93 12.16
8.6.3 WireGuard
In this section we will present the results of running the WireGuard benchmark
detailed in section 7.6. In figure 8.23 we have the UDP throughput results for
WireGuard. In figures 8.24 and 8.25 we have the results for TCP send and
receive, respectively.
kernel udp_ipv4_jumbo_ipv4 udp_ipv4_jumbo_ipv6 udp_ipv4_normal_ipv4 udp_ipv4_normal_ipv6 udp_ipv6_jumbo_ipv4 udp_ipv6_jumbo_ipv6 udp_ipv6_normal_ipv4 udp_ipv6_normal_ipv6
gcc 17.73 2.90 11.90 0.04 0.0 0.0 0.0 2.32
clang 9.46 0.0 5.13 2.78 13.85 11.38 2.80 4.19
lto 0.76 6.75 3.59 2.17 4.15 6.79 0.89 0.0
cfi 17.96 10.89 8.36 2.89 15.69 3.91 9.84 6.03
lto100 0.0 0.89 0.0 0.0 0.44 2.44 2.70 4.85
cfi100 14.29 20.66 9.67 5.82 10.42 25.68 8.55 10.16
kernel tcp_ipv4_jumbo_ipv4 tcp_ipv4_jumbo_ipv6 tcp_ipv4_normal_ipv4 tcp_ipv4_normal_ipv6 tcp_ipv6_jumbo_ipv4 tcp_ipv6_jumbo_ipv6 tcp_ipv6_normal_ipv4 tcp_ipv6_normal_ipv6
gcc 0.0 3.62 0.0 0.0 0.67 3.34 0.65 0.79
clang 0.95 3.45 3.01 0.54 3.75 3.40 2.04 1.82
lto 1.69 2.39 3.09 1.16 0.0 0.0 0.0 0.0
cfi 3.22 2.04 3.46 3.72 2.62 7.26 5.65 4.73
lto100 0.15 0.0 0.31 0.57 2.21 5.62 0.13 1.73
cfi100 13.66 16.15 7.68 4.71 13.12 15.56 7.65 6.31
kernel tcp_ipv4_jumbo_ipv4 tcp_ipv4_jumbo_ipv6 tcp_ipv4_normal_ipv4 tcp_ipv4_normal_ipv6 tcp_ipv6_jumbo_ipv4 tcp_ipv6_jumbo_ipv6 tcp_ipv6_normal_ipv4 tcp_ipv6_normal_ipv6
gcc 0.0 3.62 0.0 0.0 0.67 3.35 0.66 0.79
clang 0.95 3.45 3.02 0.54 3.75 3.40 2.04 1.83
lto 1.70 2.39 3.09 1.17 0.0 0.0 0.0 0.0
cfi 3.23 2.04 3.47 3.73 2.62 7.26 5.65 4.74
lto100 0.15 0.0 0.32 0.56 2.21 5.62 0.13 1.73
cfi100 13.66 16.15 7.68 4.71 13.12 15.57 7.65 6.32
Some of the benchmarks are named jumbo, while others are named normal. This refers to the maximum transmission unit (MTU) used when sending the packets. The MTU is the maximum size of a packet sent on the network, so increasing it means that more data can be sent with each network packet. This explains why the jumbo benchmarks achieve a higher throughput than the normal ones.
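As a rough arithmetic illustration of why the MTU matters (the fixed per-packet header overhead below is an assumption for the sketch, not a value from the benchmark setup):

```python
import math

def packets_needed(payload_bytes: int, mtu: int, overhead: int = 60) -> int:
    """Packets required to move payload_bytes at a given MTU, assuming a
    fixed per-packet header overhead (60 bytes is an illustrative guess
    covering WireGuard/UDP/IP headers, not a measured value)."""
    return math.ceil(payload_bytes / (mtu - overhead))

data = 10 * 1024 * 1024  # 10 MiB
print(packets_needed(data, 1500))  # "normal" MTU
print(packets_needed(data, 9000))  # "jumbo" MTU needs far fewer packets
```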
The same results are also plotted as tables like we have done with the other
macro benchmarks. The UDP send bandwidth benchmarks can be found in
table 8.10. The TCP send and receive bandwidth benchmarks can be found in
tables 8.11 and 8.12 respectively.
[Figure 8.23: WireGuard UDP send throughput for each kernel and benchmark]
[Figure 8.24: WireGuard TCP send throughput for each kernel and benchmark]
[Figure 8.25: WireGuard TCP receive throughput for each kernel and benchmark]
The results for the WireGuard benchmarks are less clear than those of the other macro benchmarks we have performed. We could probably have run these benchmarks several times and computed the averages to get more accurate results. We still observe that the CFI kernels perform worse than the other kernels by between 2.04 % and 25.68 %.
8.8 Summary
It is hard to say if clang's CFI implementation is worth using on x86_64 because of the performance impact. Although some exploitation scenarios are made more difficult for attackers, only the forward edge of the control flow graph is protected. Coupled with SCS it would probably be more attractive. When CET-enabled processors from Intel arrive on the market, it would be interesting to compare the overhead of enabling that compared to clang's CFI implementation. The security benefit is not the same, however, since CFI checks the function signature, while CET still allows function pointer hijacking as long as the destination is the start of a valid function. With CET we will also get backward-edge CFI, which, coupled with forward-edge CFI, provides a pretty good defense.
Enabling LTO showed a significant increase in performance on x86_64, and it would be interesting to see if more Linux distributions start picking up clang
and LTO. Although there were some issues with PTI and CFI, no issues were
encountered while using a kernel with LTO enabled. It would be interesting to
test the robustness of the kernel in a production environment to see if it still
holds up, or if there are any unknown issues.
Using LTO and CFI greatly increases compilation time, especially on weaker hardware. These features are thus not very well suited for rapid development. LTO and CFI should probably only be enabled once the kernel is ready for release, when extra compilation time is not detrimental to developer productivity.
Binary size might be an issue in certain environments, like embedded devices.
Here, gcc might be the preferred option as it was able to produce a smaller
kernel. However, on x86_64 the size increase is probably negligible. When using LTO/CFI with an inline limit of 100, the kernel size increased by 30.87 %, from 8 MB to 11 MB. An extra 3 MB is definitely not a problem for a normal laptop or desktop computer, as they usually have several hundred GB of storage. However, the increased kernel size will also affect the memory usage of the system. For embedded devices with limited amounts of RAM this may become an issue.
Overall, the results show that LTO can provide a performance boost for certain workloads. Some workloads saw increased performance when using LTO with an increased inlining limit, while others saw a performance drop compared to less aggressive inlining. There is definitely potential for more research in this area to find a good balance for the inlining limit.
circumstances. There is almost no difference in performance when increasing the LTO inlining limit, so it is probably not worth increasing the limit anyway.
Knowing the impact on the CPU cache is still interesting, and this might vary
a lot between different CPUs as well, making it an interesting research topic.
Appendix A
Scripts
3 set -eux
4
5 KERNEL_VERSION=5.7.0-rc2
6
27 # strip from binutils doesn't really like modules compiled with clang
28 # and especially not modules compiled with LTO, so we have to strip the
29 # modules manually before generating initrd
30 sudo make O=$OUTPUT_DIR modules_install
31 sudo find /lib/modules/$KERNEL_VERSION-$1+ -name \*.ko -exec \
32 llvm-strip-9 --strip-debug {} +
33 sudo make O=$OUTPUT_DIR install
34 sudo depmod
3 set -eux
4
42 mkdir -p $OUTPUT_DIR
43
45 CC=clang-11 HOSTCC=clang-11 LD=ld.lld-11 O=$OUTPUT_DIR
46
95 time make -j8 HOSTCC=clang-11 HOSTLD=ld.lld-11 \
96 CC=clang-11 LD=ld.lld-11 O=$OUTPUT_DIR 2>&1 \
97 | tee build_results_$1.txt
50 -e SHADOW_CALL_STACK \
51 --set-str CONFIG_LOCALVERSION "-scs"
52 (cd ${OUT_DIR} && \
53 make O=${OUT_DIR} $archsubarch CC=${CC} CROSS_COMPILE=${CROSS_COMPILE} olddefconfig)
54 }
A.4 Docker
The following Dockerfile was used to build all the Linux kernel images.
1 FROM clangbuiltlinux/debian
2
16 ENTRYPOINT [ "/bin/bash" ]
The following ccache config file was used for better incremental compilation performance.
cache_dir = /opt/ccache
max_size = 10.0G
1 #!/bin/bash
2
10 sudo lmbench-run
11
12 sudo systemctl start redis-server
13 redis-benchmark > redis_results_$(uname -r).txt
14
3 NAME=$1
4
10 # hwbinder throughput
11 ./libhwbinder_benchmark --benchmark_out_format=json \
12 --benchmark_repetitions=64 \
13 --benchmark_report_aggregates_only=true \
14 --benchmark_out=hwbinder_throughput_$NAME.txt
15
16 # binder throughput
17 ./libbinder_benchmark --benchmark_out_format=json \
18 --benchmark_repetitions=64 \
19 --benchmark_report_aggregates_only=true \
20 --benchmark_out=binder_throughput_$NAME.txt
21
22 # hwbinder latency
23 ./libhwbinder_latency -i 10000 -pair 3 > hwbinder_latency_$NAME.txt
24
25 # binder latency
26 ./schd-dbg -i 10000 -pair 3 > binder_latency_$NAME.txt
1 #!/usr/bin/env python3
2 import os
3 import sys
4 import re
5 import matplotlib.pyplot as plt
6 import numpy as np
7
9 def parse(filename):
10 with open(filename, "r") as f:
11 data = f.readlines()
12 names = []
13 values = []
14
15 for i in range(len(data)):
16 line = data[i]
17 if line == "\n":
18 continue
19
27 for i in range(len(data)):
28 line = data[i]
29 if "requests per second" not in line:
30 continue
31 values.append(float(line.split(" ")[0]))
32
35
36 def main():
37 results_path = "."
38 files = os.listdir(results_path)
39 results = {}
40
41 for f in files:
42 if not "redis_results_5.7.0-rc2" in f or f.startswith("."):
43 continue
44 print("parsing file {}...".format(f))
45 pattern = r"redis_results_5.7.0-rc2-([a-z0-9]+)\+.txt"
46 name = re.match(pattern, f).group(1)
47 results[name] = parse("./" + f)
48
49 gcc = results["gcc"][1]
50 clang = results["clang"][1]
51 lto = results["lto"][1]
52 cfi = results["cfi"][1]
53 lto100 = results["lto100"][1]
54 cfi100 = results["cfi100"][1]
55
56 labels = results["gcc"][0]
57
58 print(labels)
59 print(gcc)
60 out = """\\begin{table*}[!htbp]
61 \\begin{adjustbox}{width=1\\textwidth}
62 \\centering
63 \\begin{tabular}{@{}lllllllllllllllllll@{}}
64 kernel"""
65 for label in labels:
66 out += " & {}".format(label.replace("_", "\\_"))
67 out += " \\\\\n"
68 out += "\\midrule\n"
69
86 for i in range(len(res)):
87 val = res[i]
88 if val == best:
89 res[i] = 0.0
90 else:
91 res[i] = (float(best - val) / val) * 100
92 print(res)
93
111 out += "\\end{adjustbox}\n"
112 out += "\\caption{Redis benchmark results}\n"
113 out += "\\label{fig:redis_results_table}\n"
114 out += "\\end{table*}\n"
115 print(out)
116
117 x = np.arange(len(labels))
118 width = 0.12
119
129 ax.set_ylabel("Requests/s")
130 ax.set_title("Redis")
131 ax.set_xticks(x)
132 ax.set_xticklabels(labels)
133 ax.legend()
134 plt.xticks(rotation=90)
135
138
9 def parse(filename):
10 with open(filename, "r") as f:
11 data = f.read()
12 res = {}
13
18 pattern += r" all concurrent requests\)"
19 tpr = float(re.search(pattern, data).group(1))
20 concurrency = int(re.search(r"Concurrency Level:[\s]+([0-9]+)",
21 data).group(1))
22
29
36
43 if len(values["gcc"]["request"]) != 0:
44 rects1 = ax.bar(x - (width * 2.5), values["gcc"]["request"],
45 width, label="gcc")
46 if len(values["clang"]["request"]) != 0:
47 rects2 = ax.bar(x - (width * 1.5), values["clang"]["request"],
48 width, label="clang")
49 if len(values["lto"]["request"]) != 0:
50 rects3 = ax.bar(x - (width * 0.5), values["lto"]["request"],
51 width, label="LTO")
52 if len(values["cfi"]["request"]) != 0:
53 rects4 = ax.bar(x + (width * 0.5), values["cfi"]["request"],
54 width, label="CFI")
55 if len(values["lto100"]["request"]) != 0:
56 rects5 = ax.bar(x + (width * 1.5), values["lto100"]["request"],
57 width, label="LTO100")
58 if len(values["cfi100"]["request"]) != 0:
59 rects6 = ax.bar(x + (width * 2.5), values["cfi100"]["request"],
60 width, label="CFI100")
61
62 ax.set_ylabel("Requests/s")
63 ax.set_xlabel("Concurrent connections")
64 ax.set_title("ApacheBench/{}".format(name))
65 ax.set_xticks(x)
66 ax.set_xticklabels(labels)
67 ax.legend()
68
69 fig.tight_layout()
70 fig.savefig("ab_results_requests_{}.pdf".format(name),
71 bbox_inches="tight")
72 generate_table(name, values, labels, "request")
73
74
81 if len(values["gcc"]["request"]) != 0:
82 rects1 = ax.bar(x - (width * 2.5), values["gcc"]["time"],
83 width, label="gcc")
84 if len(values["clang"]["request"]) != 0:
85 rects2 = ax.bar(x - (width * 1.5), values["clang"]["time"],
86 width, label="clang")
87 if len(values["lto"]["request"]) != 0:
88 rects3 = ax.bar(x - (width * 0.5), values["lto"]["time"],
89 width, label="LTO")
90 if len(values["cfi"]["request"]) != 0:
91 rects4 = ax.bar(x + (width * 0.5), values["cfi"]["time"],
92 width, label="CFI")
93 if len(values["lto100"]["request"]) != 0:
94 rects5 = ax.bar(x + (width * 1.5), values["lto100"]["time"],
95 width, label="LTO100")
96 if len(values["cfi100"]["request"]) != 0:
97 rects6 = ax.bar(x + (width * 2.5), values["cfi100"]["time"],
98 width, label="CFI100")
99
107 fig.tight_layout()
108 fig.savefig("ab_results_time_{}.pdf".format(name), bbox_inches="tight")
109 generate_table(name, values, labels, "time")
110
111
118 for label in labels:
119 out += " & {}".format(label.replace("_", "\\_"))
120 out += " \\\\\n"
121 out += "\\midrule\n"
122
123 fixed_results = {}
124
168 else:
169 description = "requests/s"
170 out += "\\caption{{ApacheBench {} {}}}\n".format(name, description)
171 if bench_type == "time":
172 suffix = "tps"
173 else:
174 suffix = "rps"
175 out += "\\label{{table:ab_{}_{}}}\n".format(name, suffix)
176 out += "\\end{table*}\n"
177
178 print(out)
179
180
186 if len(sys.argv) != 2:
187 print("Usage: {} [bench|pdf]".format(sys.argv[0]))
188 sys.exit()
189
208 sys.exit()
209
218 for concurrency in [1,10,20,30]:
219 results[webserver][name][concurrency] = list()
220
235 values = {}
236 # now calculate the mean of all the results for each concurrency level
237 for webserver in [ "apache2", "nginx" ]:
238 for name in ["gcc","clang","lto","cfi","lto100","cfi100"]:
239 values[name] = {}
240 values[name]["request"] = list()
241 values[name]["time"] = list()
242 if name not in results[webserver]:
243 continue
244 for concurrency in [1,10,20,30]:
245 res = results[webserver][name][concurrency]
246 if len(res) == 0:
247 break
248 avg_rps = sum(map(lambda x : x[0], res))/len(res)
249 avg_tpr = sum(map(lambda x : x[1], res))/len(res)
250 values[name]["request"].append(avg_rps)
251 values[name]["time"].append(avg_tpr)
252 # plot requests per second and time per request
253 plot_reqs(webserver, values)
254 plot_time(webserver, values)
255
256
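The averaging step in the loop above can be exercised on its own: each entry in `res` is a (requests per second, time per request) pair collected from one ApacheBench run. A minimal sketch with made-up numbers (`mean_rps_tpr` is a hypothetical helper, not part of the thesis script):

```python
def mean_rps_tpr(res):
    # Average a list of (requests_per_second, time_per_request)
    # tuples for one concurrency level, mirroring the sum/len
    # computation in the loop above.
    avg_rps = sum(r[0] for r in res) / len(res)
    avg_tpr = sum(r[1] for r in res) / len(res)
    return avg_rps, avg_tpr

# three hypothetical ApacheBench runs at one concurrency level
print(mean_rps_tpr([(100.0, 10.0), (120.0, 8.0), (110.0, 9.0)]))  # (110.0, 9.0)
```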
8
9 def convert(filename):
10 with open(filename, encoding="utf8", errors="ignore") as f:
11 data = f.read()
12
51
58 m = re.search(r"Protection fault: ([0-9]+\.[0-9]+) microseconds", data)
59 protection_fault = m.group(1)
60 m = re.search(r"Pipe latency: ([0-9]+\.[0-9]+) microseconds", data)
61 pipe_latency = m.group(1)
62 m = re.search(r"AF_UNIX sock stream latency: ([0-9]+\.[0-9]+) microseconds",
63 data)
64 af_unix_stream_lat = m.group(1)
65 m = re.search(r"Process fork\+exit: ([0-9]+\.[0-9]+) microseconds",
66 data)
67 proc_fork_exit = m.group(1)
68 m = re.search(r"Process fork\+execve: ([0-9]+\.[0-9]+) microseconds", data)
69 proc_fork_execve = m.group(1)
70 m = re.search(r"Process fork\+/bin/sh -c: ([0-9]+\.[0-9]+) microseconds",
71 data)
72 proc_fork_bin_sh = m.group(1)
73
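The repeated `re.search` calls above all follow one pattern: pull a "Label: N.NN microseconds" figure out of the LMBench output. That pattern can be factored into a single helper (`parse_latency` is a hypothetical illustration, not part of the thesis script; `re.escape` takes care of labels containing `+` such as "Process fork+exit"):

```python
import re

def parse_latency(data, label):
    # Extract one 'Label: N.NN microseconds' figure from LMBench
    # output; returns a float, or None when the line is absent.
    m = re.search(re.escape(label) + r": ([0-9]+\.[0-9]+) microseconds", data)
    return float(m.group(1)) if m else None

sample = ("Pipe latency: 5.4902 microseconds\n"
          "Process fork+exit: 231.1040 microseconds\n")
print(parse_latency(sample, "Process fork+exit"))  # 231.104
```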
108 "Simple fstat": float(simple_fstat),
109 "Simple open/close": float(simple_open_close),
110 "Signal handler installation": float(signal_handler_install),
111 "Signal handler overhead": float(signal_handler_overhead),
112 "Protection fault": float(protection_fault),
113 "Pipe latency": float(pipe_latency),
114 "AF\\_UNIX sock stream latency": float(af_unix_stream_lat),
115 }, "select": {
116 "Select on 10 fd's": float(select_10),
117 "Select on 100 fd's": float(select_100),
118 "Select on 250 fd's": float(select_250),
119 "Select on 500 fd's": float(select_500),
120 "Select on 10 tcp fd's": float(select_tcp_10),
121 "Select on 100 tcp fd's": float(select_tcp_100),
122 "Select on 250 tcp fd's": float(select_tcp_250),
123 "Select on 500 tcp fd's": float(select_tcp_500),
124 }, "process": {
125 "Process fork+exit": float(proc_fork_exit),
126 "Process fork+execve": float(proc_fork_execve),
127 "Process fork+/bin/sh -c": float(proc_fork_bin_sh),
128 }
129 }
130 }
131
134
158
159 x = np.arange(len(labels))
160 width = 0.12
161
179 ax.set_ylabel("microseconds")
180 ax.set_title("LMBench")
181 ax.set_xticks(x)
182 ax.set_xticklabels(labels)
183 ax.legend()
184 plt.xticks(rotation=90)
185
186 fig.tight_layout()
187
188 fig.savefig("lmbench_results_{}.pdf".format(bench_type),
189 bbox_inches="tight")
190
208
218 ax.set_ylabel("MB/s")
219 ax.set_xlabel("Size in MB")
220 ax.set_title("LMBench bandwidth: {}".format(bench_type))
221 ax.set_xticks(x)
222 ax.set_xticklabels(labels)
223 plt.xticks(rotation=90)
224 ax.legend()
225
226 fig.tight_layout()
227
232
245
8 targets = [ "clang", "lto", "lto100", "cfi", "cfi100", "scs", "scs100" ]
9
10
11 def plot_binder_throughput():
12 results = {}
13 for target in targets:
14 results[target] = {}
15 filename = "binder_throughput_{}.txt".format(target)
16 with open(filename, "r") as f:
17 data = json.load(f)
18 benchmarks = data["benchmarks"]
19
28 # b/ns
29 # for an approximate result, multiply the data transfer
30 # rate value by 1.074
31 throughput = (size * iterations) / time
32 print(name)
33 print("({} * {}) / {}".format(size, iterations, time))
34 print("throughput: {:.2f} b/ns".format(throughput))
35 print("throughput: {:.2f} Gb/s".format(throughput * 1.074))
36 if name not in results[target]:
37 results[target][name] = []
38 results[target][name].append(throughput)
39
40 # calculate average
41 for key, value in results.items():
42 res = results[key]
43
48 clang = results["clang"].values()
49 lto = results["lto"].values()
50 lto100 = results["lto100"].values()
51 cfi = results["cfi"].values()
52 cfi100 = results["cfi100"].values()
53 scs = results["scs"].values()
54 scs100 = results["scs100"].values()
55
58 x = np.arange(len(labels))
59 width = 0.12
60
61 fig, ax = plt.subplots()
62
73 ax.set_ylabel("Bandwidth Gb/s")
74 ax.set_xlabel("BM_sendVec_binder payload size (bytes)")
75 ax.set_title("binder throughput")
76 ax.set_xticks(x)
77 ax.set_xticklabels(labels)
78 ax.set_yscale("log", basey=2)
79 ax.legend()
80 plt.xticks(rotation=90)
81
82 fig.savefig("binder_throughput_results.pdf", bbox_inches="tight")
83
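The throughput computation in `plot_binder_throughput` above divides the total bytes moved by the elapsed nanoseconds to get b/ns, then multiplies by 1.074 for the script's approximate Gb/s figure. A standalone sketch with hypothetical numbers (`binder_throughput` is an illustrative helper, not part of the thesis script):

```python
def binder_throughput(size_bytes, iterations, time_ns):
    # Mirror the listing above: total bytes transferred divided by
    # elapsed nanoseconds gives b/ns; the script's Gb/s estimate
    # multiplies that by 1.074.
    b_per_ns = (size_bytes * iterations) / time_ns
    return b_per_ns, b_per_ns * 1.074

# hypothetical run: 4096-byte payloads, 1000 iterations, 2 ms total
bpns, gbps = binder_throughput(4096, 1000, 2_000_000)
print("{:.3f} b/ns, {:.3f} Gb/s".format(bpns, gbps))  # 2.048 b/ns, 2.200 Gb/s
```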
84
85 def plot_binder_latency():
86 num_pairs = 3
87 results = {}
88 for target in targets:
89 filename = "binder_latency_{}.txt".format(target)
90 with open(filename, "r") as f:
91 data = json.load(f)
92
108 labels = [ "other_ms", "fifo_ms" ]
109 clang = results["clang"]
110 lto = results["lto"]
111 lto100 = results["lto100"]
112 cfi = results["cfi"]
113 cfi100 = results["cfi100"]
114 scs = results["scs"]
115 scs100 = results["scs100"]
116
117 x = np.arange(len(labels))
118 width = 0.12
119
137
158 cfi100 = results["cfi100"][:2]
159 scs = results["scs"][:2]
160 scs100 = results["scs100"][:2]
161
162 x = np.arange(len(labels))
163 width = 0.12
164
182
199 # b/ns
200 # for an approximate result, multiply the data
201 # transfer rate value by 1.074
202 throughput = (size * iterations) / time
203 print(name)
204 print("({} * {}) / {}".format(size, iterations, time))
205 print("throughput: {:.2f} b/ns".format(throughput))
206 print("throughput: {:.2f} Gb/s".format(throughput * 1.074))
207 if name not in results[target]:
208 results[target][name] = []
209 results[target][name].append(throughput)
210
234 x = np.arange(len(labels))
235 width = 0.12
236
258
265
1 #!/bin/bash
2 # SPDX-License-Identifier: GPL-2.0
3 #
4 # Copyright (C) 2015-2020 Jason A. Donenfeld <[email protected]>.
5 # All Rights Reserved.
6 #
7 # NOTE: the original comment describing the network topology has been
8 # removed because its UTF-8 characters are not handled well by LaTeX
9 set -e
10
11 exec 3>&1
12 export LANG=C
13 netns0="wg-test-$$-0"
14 netns1="wg-test-$$-1"
15 netns2="wg-test-$$-2"
16 pretty() { echo -e "\x1b[32m\x1b[1m[+] ${1:+NS$1: }${2}\x1b[0m" >&3; }
17 pp() { pretty "" "$*"; "$@"; }
18 maybe_exec() { if [[ $BASHPID -eq $$ ]]; then "$@"; else exec "$@"; fi; }
19 n0() { pretty 0 "$*"; maybe_exec ip netns exec $netns0 "$@"; }
20 n1() { pretty 1 "$*"; maybe_exec ip netns exec $netns1 "$@"; }
21 n2() { pretty 2 "$*"; maybe_exec ip netns exec $netns2 "$@"; }
22 ip0() { pretty 0 "ip $*"; ip -n $netns0 "$@"; }
23 ip1() { pretty 1 "ip $*"; ip -n $netns1 "$@"; }
24 ip2() { pretty 2 "ip $*"; ip -n $netns2 "$@"; }
25 waitiperf() { pretty "${1//*-}" "wait for iperf:5201 pid $2";
26 while [[ $(ss -N "$1" -tlpH 'sport = 5201') != *\"iperf3\",pid=$2,fd=* ]];
27 do sleep 0.1; done;
28 }
29 ping6="ping6"
30 type $ping6 >/dev/null 2>&1 || ping6="ping -6"
31
32 cleanup() {
33 set +e
34 exec 2>/dev/null
35 ip0 link del dev wg0
36 ip1 link del dev wg0
37 ip2 link del dev wg0
38 local to_kill="$(ip netns pids $netns0) $(ip netns pids $netns1) \
39 $(ip netns pids $netns2)"
40 [[ -n $to_kill ]] && kill $to_kill
41 pp ip netns del $netns1
42 pp ip netns del $netns2
43 pp ip netns del $netns0
44 exit
45 }
46
66 configure_peers() {
67 ip1 addr add 192.168.241.1/24 dev wg0
68 ip1 addr add fd00::1/24 dev wg0
69
73 n1 wg set wg0 \
74 private-key <(echo "$key1") \
75 listen-port 1 \
76 peer "$pub2" \
77 allowed-ips 192.168.241.2/32,fd00::2/128
78 n2 wg set wg0 \
79 private-key <(echo "$key2") \
80 listen-port 2 \
81 peer "$pub1" \
82 allowed-ips 192.168.241.1/32,fd00::1/128
83
89 tests() {
90 # Ping over IPv4
91 n2 ping -c 10 -f -W 1 192.168.241.1
92 n1 ping -c 10 -f -W 1 192.168.241.2
93
130 tests "normal" "4"
131 ip1 link set wg0 mtu $big_mtu
132 ip2 link set wg0 mtu $big_mtu
133 tests "jumbo" "4"
134
10
31 # format:
32 # {tcp,udp}_ipv{4,6}_{normal,jumbo},ipv{4,6}_outer_uname.json
33
34
47 fixed_results = {}
48
52 print(values)
53 for i in range(len(labels)):
54 res = []
55 for target in targets:
56 if proto == "udp":
57 res.append(values[target][i])
58 else:
59 res.append(values[target][i][bench_type])
60
64 for i in range(len(res)):
65 val = res[i]
66 if val == best:
67 res[i] = 0.0
68 else:
69 res[i] = (float(best - val) / val) * 100
70
71 for i in range(len(targets)):
72 target = targets[i]
73 fixed_results[target].append(res[i])
74
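The loop above reduces each row of results to differences relative to the best measurement: the best value maps to 0.0, and every other value becomes the percentage by which the best exceeds it. A standalone sketch, assuming (as the elided lines suggest) that `best` is the row maximum (`relative_to_best` is a hypothetical helper, not part of the thesis script):

```python
def relative_to_best(res):
    # The best (largest) value becomes 0.0; every other value becomes
    # the percentage by which the best exceeds it, as in the table
    # generation loop above.
    best = max(res)
    return [0.0 if v == best else (best - v) / v * 100 for v in res]

# hypothetical per-target throughputs (e.g. Gb/s)
print(relative_to_best([10.0, 8.0, 5.0]))  # [0.0, 25.0, 100.0]
```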
82 out += " & \\cellcolor{{red!50}}{:.2f}".format(result)
83 else:
84 out += " & \\cellcolor{{yellow!50}}{:.2f}".format(result)
85 out += " \\\\\n"
86 out += "\\bottomrule\n"
87 out += "\\end{tabular}\n"
88 out += "\\end{adjustbox}\n"
89 out += "\\caption{{WireGuard {} {} bandwidth}}\n".format(proto, bench_type)
90 out += "\\label{{table:wg_{}_{}_bandwidth_results}}\n".format(proto, bench_type)
91 out += "\\end{table*}\n"
92
93 print(out)
94
95 def main():
96 files = os.listdir("wg_results")
97 results = {}
98
132 results[name]["udp"][res_name] = end["sum"]["bits_per_second"]
133
176 ax.set_ylabel("Throughput")
177 ax.set_title("WireGuard")
178 ax.set_xticks(x)
179 ax.set_xticklabels(labels)
180 ax.legend()
181 plt.xticks(rotation=90)
182 fig.savefig("wireguard_results_udp.pdf", bbox_inches="tight")
183
213 ax.set_ylabel("Throughput")
214 ax.set_title("WireGuard TCP receive")
215 ax.set_xticks(x)
216 ax.set_xticklabels(labels)
217 ax.legend()
218 plt.xticks(rotation=90)
219 fig.savefig("wireguard_results_tcp_recv.pdf", bbox_inches="tight")
220 tcp_values = { "gcc": list(gcc["tcp"].values()),
221 "clang": list(clang["tcp"].values()),
222 "lto": list(lto["tcp"].values()),
223 "lto100": list(lto100["tcp"].values()),
224 "cfi": list(cfi["tcp"].values()),
225 "cfi100": list(cfi100["tcp"].values()) }
226
232
249 ax.set_ylabel("Throughput")
250 ax.set_title("WireGuard TCP send")
251 ax.set_xticks(x)
252 ax.set_xticklabels(labels)
253 ax.legend()
254 plt.xticks(rotation=90)
255 fig.savefig("wireguard_results_tcp_send.pdf", bbox_inches="tight")
256 generate_table(tcp_values, labels, "sent", "tcp")
257
258
Appendix B
Code
37 diff --git a/arch/x86/pci/mmconfig-shared.c b/arch/x86/pci/mmconfig-shared.c
38 index 6fa42e9c4e6f..441d91c888a0 100644
39 --- a/arch/x86/pci/mmconfig-shared.c
40 +++ b/arch/x86/pci/mmconfig-shared.c
41 @@ -527,7 +527,7 @@ pci_mmcfg_check_reserved(struct device *dev, struct pci_mmcfg_region *cfg, int e
42 /* Don't try to do this check unless configuration
43 type 1 is available. how about type 2 ?*/
44 if (raw_pci_ops)
45 - return is_mmconf_reserved(e820__mapped_all, cfg, dev, 1);
46 + return is_mmconf_reserved(e820__mapped_all_cb, cfg, dev, 1);
47
48 return 0;
49 }
6 lto-clang-flags += -fvisibility=default $(call cc-option, -fsplit-lto-unit)
7
8 # Limit inlining across translation units to reduce binary size
9 -LD_FLAGS_LTO_CLANG := -mllvm -import-instr-limit=5
10 +LD_FLAGS_LTO_CLANG := -mllvm -import-instr-limit=$(CONFIG_LTO_INLINE_LIMIT)
11
12 KBUILD_LDFLAGS += $(LD_FLAGS_LTO_CLANG)
13 KBUILD_LDFLAGS_MODULE += $(LD_FLAGS_LTO_CLANG)
14 diff --git a/arch/Kconfig b/arch/Kconfig
15 index 6b6c82713..2c202e385 100644
16 --- a/arch/Kconfig
17 +++ b/arch/Kconfig
18 @@ -498,6 +498,17 @@ config THINLTO
19 help
20 Use ThinLTO to speed up Link Time Optimization.
21
22 +config LTO_INLINE_LIMIT
23 + int "LTO inline limit"
24 + depends on LTO_CLANG
25 + default 5
26 + help
27 + This option controls the function size limit used when inlining
28 + functions across translation units. Raising it can significantly
29 + increase binary size. The value is the maximum number of
30 + instructions a function may contain and still be inlined; with the
31 + default of 5, only functions with fewer than 5 instructions are inlined.
32 +
33 choice
34 prompt "Link-Time Optimization (LTO) (EXPERIMENTAL)"
35 default LTO_NONE