
Session 3B: Operating Systems CCS ’21, November 15–19, 2021, Virtual Event, Republic of Korea

Demons in the Shared Kernel: Abstract Resource Attacks Against OS-level Virtualization
Nanzi Yang∗, Xidian University, Xi’an, China
Wenbo Shen∗†, Zhejiang University; Key Laboratory of Blockchain and Cyberspace Governance of Zhejiang Province, Hangzhou, China
Jinku Li, Xidian University, Xi’an, China
Yutian Yang, Zhejiang University, Hangzhou, China
Kangjie Lu, University of Minnesota, Twin Cities, Minneapolis, USA
Jietao Xiao, Xidian University, Xi’an, China
Tianyu Zhou, Zhejiang University, Hangzhou, China
Chenggang Qin, Ant Group, Hangzhou, China
Wang Yu, Ant Group, Hangzhou, China
Jianfeng Ma, Xidian University, Xi’an, China
Kui Ren, Zhejiang University, Hangzhou, China

ABSTRACT
Due to its faster start-up speed and better resource utilization efficiency, OS-level virtualization has been widely adopted and has become a fundamental technology in cloud computing. Compared to hardware virtualization, OS-level virtualization leverages the shared-kernel design to achieve high efficiency and runs multiple user-space instances (a.k.a., containers) on the shared kernel.

However, in this paper, we reveal a new attack surface that is intrinsic to OS-level virtualization, affecting Linux, FreeBSD, and Fuchsia. The root cause is that the shared-kernel design in OS-level virtualization results in containers sharing thousands of kernel variables and data structures, directly and indirectly. Without exploiting any kernel vulnerabilities, a non-privileged container can easily exhaust the shared kernel variables and data structure instances to cause DoS attacks against other containers. Compared with physical resources, these kernel variables or data structure instances (termed abstract resources) are more prevalent but under-protected.

To show the importance of confining abstract resources, we conduct abstract resource attacks that target different aspects of the OS kernel. The results show that attacking abstract resources is highly practical and critical. We further conduct a systematic analysis to identify vulnerable abstract resources in the Linux kernel, which detects 1,010 abstract resources, 501 of which can be repeatedly consumed dynamically. We also conduct the attacking experiments in self-deployed shared-kernel container environments on the top 4 cloud vendors. The results show that all environments are vulnerable to abstract resource attacks. We conclude that containing abstract resources is hard and offer multiple strategies for mitigating the risks.

CCS CONCEPTS
• Security and privacy → Virtualization and security.

KEYWORDS
OS-level Virtualization; Shared Kernel; Abstract Resource Attack

ACM Reference Format:
Nanzi Yang, Wenbo Shen, Jinku Li, Yutian Yang, Kangjie Lu, Jietao Xiao, Tianyu Zhou, Chenggang Qin, Wang Yu, Jianfeng Ma, and Kui Ren. 2021. Demons in the Shared Kernel: Abstract Resource Attacks Against OS-level Virtualization. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security (CCS ’21), November 15–19, 2021, Virtual Event, Republic of Korea. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3460120.3484744

∗ Co-first authors.
† Corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
CCS ’21, November 15–19, 2021, Virtual Event, Republic of Korea
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8454-4/21/11...$15.00
https://doi.org/10.1145/3460120.3484744

1 INTRODUCTION
Operating-system-level virtualization (a.k.a., OS-level virtualization) allows multiple self-contained and isolated user-space environments to run on the same kernel [67]. Compared to hardware virtualization (i.e., virtual machines), OS-level virtualization eliminates the burden of maintaining an operating system kernel for each user-space instance and thus has a faster start-up speed and


better resource utilization efficiency. Therefore, OS-level virtualization has been widely adopted in recent years and has become a fundamental technology in cloud computing. The user-space instances in OS-level virtualization are called jails in FreeBSD [33], Zones in Solaris [59], and containers¹ in Linux [67].

¹ In this paper, we use the container to refer to the self-contained user-space execution environment that shares the kernel of the host system.

Despite its high efficiency, OS-level virtualization also introduces multiple security concerns. First, OS-level virtualization is vulnerable to kernel vulnerabilities due to the shared kernel [40]. As a result, it cannot isolate kernel bugs. Once the shared kernel is compromised, all user-space instances (referred to as containers) lose isolation and protection. Moreover, researchers recently questioned the isolation of container techniques, pointing to information leaks [22], covert channels [24], and out-of-band workloads that break control groups [23].

However, in this paper, we reveal a new attack surface that is intrinsic to OS-level virtualization. Compared to hardware virtualization, OS-level virtualization leverages the shared-kernel design to achieve high efficiency. In a typical OS-level virtualization environment, containers run on the same OS kernel and request various services via 300+ system calls. Notice that the underlying OS kernel contains hundreds of thousands of variables and data structure instances to provide services for containers. As a result, these containers are directly and indirectly sharing these kernel variables and data structure instances.

Unfortunately, these shared kernel variables and data structure instances are new attack surfaces in OS-level virtualization. Without exploiting any vulnerabilities, a non-privileged container can easily exhaust certain kernel variables and data structure instances, causing DoS attacks in OS-level virtualization environments. As a result, even if other containers have enough physical resources, they still cannot perform any meaningful tasks once the critical kernel variables or data structure instances are exhausted. In contrast to the physical resources backed by real hardware, we regard these kernel variables or data structure instances as abstract resources and the exhaustion attacks on these resources as abstract resource attacks.

Though abstract resources can be exploited for DoS attacks, they are often under-protected. The kernel and container developers focus more on protecting physical resources than abstract resources. For example, the Linux kernel provides control groups to restrict the resource usage of each container instance. However, among 13 control groups, 12 are for physical resources, restricting the usage of CPU, memory, storage, and IO. Only the PIDs control group is designed for limiting the abstract resource pid. As a result, hundreds of container-shared abstract resources do not have any restrictions, such as the global dirty ratio, open-file structs, and pseudo-terminal structs, which makes them vulnerable to DoS attacks.

To show the criticality of confining abstract resources in OS-level virtualization, we conduct attacks using Docker containers on the Linux kernel, targeting abstract resources in different aspects of the operating system services, including process management, memory management, storage management, and IO management. Our experiments show that attacking abstract resources is highly practical and critical: it can easily disable new program execution, slow down memory writes by 97.3%, crash all file-open related operations, and deny all new SSH connections. Even worse, it affects all aspects of OS services. Moreover, experiments also demonstrate that, besides Linux, FreeBSD and Fuchsia are also vulnerable to abstract resource attacks.

It is unfortunate that even though abstract resources are critical, they are inherently hard to contain for several fundamental reasons. First, it is impractical to enumerate all possible abstract resources in operating system kernels. Unlike the few physical resource types, abstract resource types in the kernel are numerous and varied. Second, it is fairly easy to form conditions leading to abstract resource exhaustion. When implementing new features in the kernel, developers are often concerned about physical resource consumption while paying much less attention to abstract resource consumption. Moreover, the OS kernel has complex data and path dependencies, leading to various ways to exhaust abstract resources in the kernel.

Therefore, we design and implement a tool based on LLVM to identify vulnerable abstract resources in the Linux kernel systematically. We propose new techniques to identify the shareable abstract resources and analyze their container controllability. We apply our tool to the latest Linux kernel and detect 1,010 abstract resources. 501 of them can be repeatedly consumed dynamically. From the detected abstract resources, we pick 7 resources that affect each aspect of OS services based on our familiarity (i.e., we know the impacts of exhausting that resource). We further conduct the attacking experiments on these selected resources in shared-kernel container environments deployed on the top 4 cloud vendors, including AWS, MS Azure, Google Cloud, and Alibaba Cloud. The results show that all environments are vulnerable to our attacks. Finally, we offer multiple strategies for mitigating the risks of abstract resource attacks.

The contributions of this paper are as follows:
• New Attack Surface: We reveal a new attack surface that is intrinsic to OS-level virtualization. We propose a new attack called abstract resource attack. We demonstrate that the abstract resource attack is highly practical and is a broad class of attacks that affect Linux, FreeBSD, and Fuchsia.
• Systematic Analysis: We design and implement a static analysis tool based on LLVM to identify vulnerable abstract resources in the Linux kernel. We propose and implement novel techniques, including configuration-based analysis and container-controllability analysis. Our tool detects 501 abstract resources that can be dynamically and repeatedly triggered in the Linux kernel.
• Practical Evaluation: We evaluate 7 abstract resource attacks in the self-deployed shared-kernel container environments on AWS, MS Azure, Google Cloud, and Alibaba Cloud. All environments are vulnerable to abstract resource attacks.² In particular, two environments are vulnerable to 6 attacks, one environment is vulnerable to 5 attacks, and the other is vulnerable to 4 attacks. We responsibly disclosed our findings to all cloud vendors. All of them confirmed the identified problems.
  ² Current public cloud vendors do not provide shared-kernel containers to different users directly. Containers in public clouds are usually isolated by virtual machines.


• Community Impact: We plan to open-source our tool and the identified abstract resources at https://github.com/ZJU-SEC/AbstractResourceAttack, so that they can help the Linux kernel community and the container community to identify the weak spots of resource isolation in OS-level virtualization.

2 BACKGROUND
OS-level virtualization relies on the underlying OS kernel for resource isolation and containment. More specifically, the Linux kernel provides namespaces for resource isolation and control groups for resource containment.

2.1 Linux Namespaces
Linux namespaces provide process-level resource isolation. Currently, Linux namespaces are divided into 8 types. We list them in order of release:
• Mount for file system isolation;
• UTS for hostname and domain name isolation;
• IPC for IPC and message queue isolation;
• PID for process ID isolation;
• Network for network resource isolation;
• User for UID/GID isolation;
• Cgroup for control group isolation;
• Time for clock time isolation.

A process can be assigned to different namespaces of different types, but for each type, it can belong to only one namespace. By default, a process is in the same namespaces as its parent. It can be added to a new namespace during process creation by passing specific flags, or while running by calling the setns system call. Ideally, only processes within the same namespace can share the namespace-isolated resources. Resources are thus isolated across namespaces. As a result, running out of an isolated resource in one namespace does not affect processes in other namespaces. Such a design inherently requires that the namespace mechanism correctly and thoroughly contains the resources.

However, there still exist hundreds of types of abstract resources that are not covered by namespaces. The large attack surface remains even with the protection of namespaces. One may argue for isolating all the abstract resources using namespaces. This is, however, impractical: the huge number and flexibility of abstract resources make the solution unacceptable due to huge code changes and high performance overhead.

2.2 Linux Control Groups
Linux control groups, on the other hand, are used to limit resource usage. A control group accounts for resources used by all processes within that control group. Control groups are organized as a tree structure, where resources accounted for children are also accounted for their parents. The limits on resource usage are also enforced recursively on the tree so that resource usage in a control group should not exceed the limits of any of its ancestors.

Control groups mainly manage hardware resources like CPU, memory, storage, IO, and so on. There are two versions of control groups, namely v1 and v2. The main difference is that control group v1 can have a tree hierarchy for each type of resource while control group v2 has only one hierarchy. The implementation of resource accounting and resource usage limiting differs little between v1 and v2, though. Control group v1 is currently used by default because it is more stable and provides control over more resources. It manages 13 types of resources while v2 supports only 9 resource types so far [44]. More specifically, among the 13 types of resources, 5 are for CPU accounting, including cpu, cpuacct, cpuset, freezer, perf_event; 3 are for memory, including memory, hugetlb, rdma; blkio is for storage; and 3 are for IO, including devices, net_cls, net_prio. Only the PIDs control group is for the abstract resource of PID.

While limiting the usage of shared abstract resources in container processes can mitigate DoS attacks, it is again impractical to extend control groups to include all abstract resources. Accounting resources and enforcing limits on so many types of resources would introduce unacceptable overhead.

3 ABSTRACT RESOURCE ATTACKS
In this section, we first clarify the threat model and assumptions. Next, we discuss weaknesses in the container isolation. Finally, we show that abstract resource attacks also work on FreeBSD and Fuchsia kernels.

Threat model and assumptions. In this paper, as we are targeting OS-level virtualization, we assume the containers are running on the same shared kernel. Containers enforce state-of-the-art protection and follow security best practices in deployment. More specifically, containers are running as different non-root users with all capabilities dropped. Meanwhile, the kernel enforces as many namespaces and control groups as possible for the container. Moreover, the kernel also uses seccomp to block sensitive system calls. We further assume that the kernel has no bugs and all security mechanisms are working properly.

On the other side, the attacker controls one container and attempts to disrupt other containers running on the same kernel. The attacker can run any code within the container and call seccomp-allowed system calls. However, the attacker is not allowed to exploit kernel vulnerabilities. Furthermore, the attacker is in a non-privileged container as a non-root user, with no capabilities at all. Finally, the attacker is not allowed to escalate privileges or regain any capabilities. In the following, we show that due to shared abstract resources in the kernel, even such an attacker can still launch DoS attacks against other containers.

3.1 Weaknesses in OS-level Virtualization
In OS-level virtualization, containers are directly and indirectly sharing thousands of kernel abstract resources, which makes them vulnerable to resource-exhaustion attacks. We use an example in the Linux kernel to illustrate the details. Figure 1 shows the global variable nr_files and function alloc_empty_file in the Linux kernel. alloc_empty_file allocates struct file (line 17). For each allocated struct file, nr_files accounts for it by incrementing the counter (line 19). In the host Linux kernel, the total number of struct file is limited by files_stat.max_files (line 13). If the limit is reached, alloc_empty_file returns an error (line 23). However, the Linux kernel does not provide any namespaces or control groups to isolate or limit nr_files. As a result, nr_files is directly controllable by all containers: any allocation of struct


file from any container increases the same shared global variable nr_files.

     1  static struct percpu_counter nr_files __cacheline_aligned_in_smp;
     2
     3  static long get_nr_files(void)
     4  {
     5      return percpu_counter_read_positive(&nr_files);
     6  }
     7
     8  struct file *alloc_empty_file(int flags, const struct cred *cred)
     9  {
    10      static long old_max;
    11      struct file *f;
    12
    13      if (get_nr_files() >= files_stat.max_files && !capable(CAP_SYS_ADMIN)) {
    14          ...
    15          goto over;
    16      }
    17      f = __alloc_file(flags, cred);
    18      if (!IS_ERR(f))
    19          percpu_counter_inc(&nr_files);
    20      ...
    21  over:
    22      ...
    23      return ERR_PTR(-ENFILE);
    24  }

Figure 1: Linux kernel source of nr_files. nr_files is a global variable shared by all containers. For each allocated struct file, nr_files increases by 1 (line 19).

Such sharing of nr_files leads to a new attack. In Linux, everything is a file. Therefore, many operations, such as file open, process creation, pipe creation, new network connection creation, and even timer creation (timerfd_create) and event generation (eventfd), increase nr_files. A malicious container can easily push nr_files to its upper limit. In our experiment, the quota of nr_files can be exhausted within several seconds. Consequently, all operations that consume struct file will fail. The impact is severe: the victim-container cannot even run a command (as it needs to open a command file) or exec a new binary, leading to program crashes. From the above example, we find that even if the container has enough physical resources, such as CPU or memory, it still cannot run any new programs without the quota in nr_files.

To demonstrate that abstract resource attacks affect all kernel functionalities, we present one abstract resource attack for each aspect of the Linux kernel functionalities, including process, memory, storage, and IO management [21]. In this section, we present the attack results on the local test environments and defer the attack results of the top 4 vendors to §5.

For the local test environment setup, the test machine has an Intel Core i5 CPU, 8 GB of memory, and a 500 GB HDD, and it runs Ubuntu 18.04 with Linux kernel v5.3.1. We refer to it as the host-machine. On the host-machine, we set up two Docker containers using Docker 18.06.0-ce, and use them as the attacker-container and the victim-container, respectively. We set up both containers following the Docker security best practices [9, 12, 30], that is, running them as different non-root users, dropping all capabilities, enabling namespaces and control groups, and applying seccomp system call blocking, as discussed in the threat model.

3.2 Attacks on Process Management
To implement process management, the Linux kernel has introduced a series of abstract resources, such as the process-control-block struct task_struct, pid, state, and various data structures to support the derived entities, such as struct thread_info for threads, struct rq runqueues for scheduling, struct shm_info and struct msginfo for inter-process communication (IPC), and struct spinlock and struct semaphores for synchronization. In fact, process management in Linux introduces thousands of abstract resources. In the following, we introduce the attack against struct idr as an example.

3.2.1 Attacking idr of PID. The Linux kernel introduces struct idr for integer ID management. Process management also uses idr for pid allocation. Figure 2 shows the alloc_pid function, which calls idr_alloc_cyclic to get a new pid. idr_alloc_cyclic, in turn, checks pid_max during the idr allocation and returns a negative error code if the idr grows beyond pid_max. Later we will show that even with the PID namespace and the PIDs control group enabled, idr can still be regarded as a globally shared resource for all processes. Similar to a fork bomb, a malicious container process can repeatedly fork to exhaust all idr. As a result, no container on the shared kernel can create any new processes or threads.

    struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid, size_t set_tid_size)
    {
        ...
        nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min, pid_max, GFP_ATOMIC);
        ...
        if (nr < 0) {
            retval = (nr == -ENOSPC) ? -EAGAIN : nr;
            goto out_free;
        }
        pid->numbers[i].nr = nr;
        ...
    }

Figure 2: Linux kernel source of idr allocation. idr_alloc_cyclic checks the idr against pid_max, returning a negative number if the idr goes beyond it.

In our experiments, the attacker-container spawns processes repeatedly by calling the fork system call. As a result, in the victim-container, all operations related to new-process creation fail with an error of “Resource temporarily unavailable”. Even root users on the host-machine suffer from the same failure.

3.2.2 The effectiveness of the PID namespace. Linux v2.6.24 introduces the PID namespace, which provides processes with a set of PIDs independent from other PID namespaces [47]. However, in the PID namespace implementation, the Linux kernel allocates an extra PID in the root PID namespace for any PID allocated in other PID namespaces, so that all PIDs in the other PID namespaces can be mapped to the root PID namespace. In other words, the root PID namespace is still globally shared. As a result, even if the attacker-container is in a separate PID namespace, its PID allocation still exhausts the PIDs in the root PID namespace, causing new-process-creation failures on both the victim-container and the host-machine. Therefore, even with the PID namespace enabled, containers are still vulnerable to the above idr-exhaustion attack.

3.2.3 The effectiveness of the PIDs control group. The PIDs control group was also introduced recently in Linux v4.3 [44]. It is used to limit the total number of PIDs that are allocated in one


control group. More specifically, the PIDs control group checks against the process limit during process forking, and returns an error and aborts the forking if the total process number in the PIDs control group (pids_cgroup->counter) reaches the upper limit (pids_cgroup->limit). The PIDs control group is effective in defending against direct forking. However, it only charges the pid number to the current process. Similar to the work-delegation approach in [23], the attacker-container can trick the kernel into forking a large number of kernel threads, for example by aborting frequently so that the kernel spawns interrupt-handling threads. In this way, the idr is exhausted by kernel threads, which bypasses the restriction enforced by the PIDs control group.

3.3 Attacks on Memory Management
The Linux kernel introduces various kernel data structures, such as mm_struct for holding all the memory-related information of a process, and vm_area_struct for representing a virtual memory area. Moreover, to improve reading and writing efficiency, the Linux kernel also uses memory as buffers to cache certain data. Besides, it also introduces the write-back scheme, in which writing is done only to memory. The dirty memory pages will be written to the disk later by a kernel thread. Using the write-back scheme, the caller only needs to write to memory, and it does not need to wait for the time-consuming disk-IO operations to finish (as it would with write-through), which significantly improves the write performance. However, we find that the kernel does not isolate or restrict dirty memory usage, giving the attacker chances to exhaust all dirty memory, which slows down other containers significantly. Next, we discuss the attack on dirty memory.

3.3.1 Attacking dirty_throttle_control memory dirty ratio. The Linux kernel introduces the dirty_throttle_control struct for dirty-area control, which uses the dirty field to represent the whole kernel-space dirty ratio. Whenever the dirty value is too high, the kernel wakes up background threads to sync the dirty memory to disk. However, in the meantime, as the dirty ratio is too high, the kernel blocks write-back and converts all writes to write-through, which slows down the write performance dramatically.

Unfortunately, the kernel does not provide any isolation for the memory dirty ratio. Any process can impact the global memory dirty ratio. In our attack, the attacker-container uses the dd command to generate files, which quickly occupies all dirty memory, reaching the memory dirty ratio limit. As a result, all writes from the host-machine or the victim-container are converted to write-through, which dramatically degrades the performance. In our experiments, the performance of the command dd if=/dev/zero of=/mnt/test bs=1M count=1024 on the victim-container drops from 1.2 GB/s to 32.6 MB/s due to the attack, a 97.3% slowdown. Besides, even the privileged root user on the host-machine sees a 96.1% performance degradation.

Note that the current Linux kernel has no namespaces related to memory management, and memory control groups are used to limit the memory usage instead of the memory dirty ratio. Therefore, they cannot defend against the attacks on the memory dirty ratio.

3.4 Attacks on Storage Management
The operating system kernel abstracts the disk or other secondary storage as files and introduces various file-related abstract resources. In fact, storage management in the Linux kernel is complicated and involves thousands of functions and data structures. In our experiment, we find that 133 storage-related abstract resources are reachable from container processes. Unfortunately, the kernel does not provide any namespaces or control groups to isolate or restrict the usage of these abstract resources. As a result, the attacker-container can exhaust these abstract resources to launch DoS attacks against other containers on the shared kernel. Next, we illustrate how a malicious container can exploit the file limit variable nr_files for DoS attacks.

3.4.1 Attacking nr_files. As mentioned in §3.1, nr_files is a global variable in the Linux kernel, which counts the total number of opened files in the kernel. More specifically, for each allocated struct file, the kernel increases nr_files by one, as shown in lines 17-19 of Figure 1. Unfortunately, nr_files is shared among all processes. It is neither isolated by namespaces nor restricted by any control groups. As a result, the attacker-container can easily exhaust nr_files to achieve DoS attacks. To verify the feasibility of this attack, our attacker-container spawns hundreds of processes, each of which opens 1,024 files. Consequently, nr_files reaches its limit. As a result, on both the host-machine and the victim-container, all file-open operations fail, and the kernel issues a warning of “Too many open files in system.”

Our attack confirms that even with a few hundred processes, the attacker is able to exhaust nr_files. For usability, however, the PIDs control group usually allows thousands of processes. Therefore, even with the PIDs control group enabled, the attacker-container can still DoS-attack nr_files successfully. Even worse, nr_files is shared among all processes, including root and non-root processes. Therefore, not only are the non-privileged container processes impacted, but the root process on the host-machine cannot perform any file-open operations either.

3.5 Attacks on IO Management
IO management is an essential part of an operating system. For management convenience, the Linux kernel abstracts IO devices into /dev files and introduces abstract resources, such as tty_struct, to implement IO device management. Similar to the previous cases, these abstract resources are not isolated or limited by any namespaces or control groups, which leads to new attacks. In the following, we introduce the attacks against pty_count, which cause DoS to SSH connections.

3.5.1 Attacking pty_count. The Linux kernel abstracts the pseudo-terminal (abbreviated as pty) to /dev/ptmx and /dev/pts [46]. Meanwhile, the kernel also uses a global variable called pty_count to count the total number of opened pseudo-terminals, which increases by one each time /dev/ptmx is opened, as shown in line 6 of Figure 3. However, the kernel does not provide any namespaces or control groups to isolate or limit pty_count usage. Consequently, the attacker can easily exhaust pty_count.

In our experiments, the attacker keeps opening /dev/ptmx in the container to trigger ptmx_open, which calls devpts_new_index and


increases pty_count. In a couple of seconds, the pty_count limit is
reached, and all subsequent ptmx_open operations fail. The
consequences are severe, as pty devices are widely used by various
applications such as SSH. As a result, all SSH connection attempts to
any other container fail because the pseudo-terminal cannot be opened.
Even worse, the host-machine cannot start any new containers, as the
connections to a new container are denied due to the same error.

     1  static atomic_t pty_count = ATOMIC_INIT(0);
     2
     3  int devpts_new_index(struct pts_fs_info *fsi)
     4  {
     5      int index = -ENOSPC;
     6      if (atomic_inc_return(&pty_count) >= (pty_limit -
                (fsi->mount_opts.reserve ? 0 : pty_reserve)))
     7          goto out;
     8      ...
     9      return index;
    10  }
    11
    12  static int ptmx_open(struct inode *inode, struct file *filp)
    13  {
    14      ...
    15      index = devpts_new_index(fsi);
    16      ...
    17  }

Figure 3: Linux kernel source of pty_count usage. pty_count is a
global atomic variable, shared by all containers on the same kernel.

3.6 Attacking FreeBSD and Fuchsia Kernels

The root cause of abstract resource attacks is the shared kernel data
(i.e., abstract resources). Next, we demonstrate that the shared
kernel data also makes both FreeBSD and Fuchsia vulnerable to
abstract resource attacks.

Attacking FreeBSD. In the FreeBSD kernel, following similar resources
in the Linux kernel, we manually identified 5 globally shared abstract
resources, namely, dp_dirty_total, numvnodes, openfiles, pid, and pty.
Our experiments further confirm that the former two can be DoS
attacked, while the latter three are limited by rctl per-jail.

The experiments are conducted on FreeBSD 13.0-RELEASE with
Ezjail-admin v3.4.2 running in a virtual machine with an Intel Core i5
processor, 8GB memory, and a 40GB hard disk. Ezjail [53] is a jail
administration framework. The ezjail commands provide a simple way to
create multiple jails using FreeBSD’s jail system. Jails here are
similar to containers on Linux. We set up two jails following the
FreeBSD handbook [18] and use rctl [54] to limit each jail’s
resources. We use these two jails as the attacker-jail and the
victim-jail, which is similar to the container setup in §3.1.

For the dirty counter dp_dirty_total, ZFS in FreeBSD introduces the
dsl_pool struct for recording the data of each ZFS pool. The dsl_pool
struct uses the dp_dirty_total field to represent the dirty data of
the whole ZFS pool. When dp_dirty_total reaches the limit
zfs_dirty_data_max, ZFS delays upcoming writes and waits for the dirty
data to be synchronized to the disk. Unfortunately, FreeBSD does not
provide any isolation for dp_dirty_total. In the attacker-jail, we run
the command dd if=/dev/zero of=/mnt/test bs=1M count=1024 (the same as
the one in §3.3) to exhaust dp_dirty_total. As a result, the
victim-jail suffers a 46% IO performance degradation.

For numvnodes, FreeBSD uses a vnode struct to represent a file system
entity, such as a file or a directory. FreeBSD also keeps a global
variable numvnodes to record the total number of vnodes in the whole
kernel, and the limit is stored in maxvnodes. In the experiments, we
can easily exhaust the host-machine’s numvnodes and reach the
maxvnodes limit by repeatedly creating directories in the
attacker-jail.

Attacking Fuchsia. Fuchsia uses the Zircon kernel, which introduces
the concept of handles to allow user-space programs to reference
kernel objects [19]. Zircon maintains a global data structure called
gHandleTableArena for allocating all handles. The limit for handles in
the kernel is kMaxHandleCount. Handles are used very frequently in
Zircon. Surprisingly, we find that the creation of handles is not
restricted. We further confirm this problem on the Fuchsia emulator. A
user with basic rights [20] (similar to capabilities in Linux) can
repeatedly create handles to exhaust all handles, which crashes the
whole system. We reported this problem to the Fuchsia developers. They
have confirmed it and plan to fix it after identifying more attack
vectors for local DoS.

3.7 Summary

From the above discussions, it is easy to see that abstract resource
attacks are highly practical and the consequences are severe. What
makes things worse is that abstract resources are pretty common in the
Linux kernel, affecting every aspect of Linux functionality.
Furthermore, abstract resource attacks are intrinsic to OS-level
virtualization. They also work on the FreeBSD and Fuchsia kernels.

4 STATIC ANALYSIS OF CONTAINER-EXHAUSTIBLE ABSTRACT RESOURCES

As mentioned before, abstract resources are critical to containers. On
the other hand, there are thousands of abstract resources, which makes
it virtually impossible to enumerate all of them. In this paper, we
take an initial step to identify exhaustible abstract resources shared
by containers.

Challenges. We need to resolve two challenges. First, it is
challenging to identify meaningful abstract resources, especially
those that are shared in the kernel. An abstract resource in the Linux
kernel can be a variable or a data structure instance. However, not
all variables or data structure instances are meaningful abstract
resources. We need to find the abstract resources that are critical to
the OS functionalities. Moreover, the identified abstract resources
need to be shared between containers so that one container can exhaust
these resources to attack other containers. Unfortunately, there is no
documentation regarding shareable abstract resources. To address this
challenge, we propose configuration-based analysis and access-based
analysis to identify various shared abstract resources in the Linux
kernel.

Second, it is challenging to decide if the container can exhaust a
specific abstract resource. Different from regular user-space
programs, resource accesses from a container face more restrictions,
such as namespaces, control groups, and seccomp. Moreover, as each
container runs as a separate user, its resource consumption is also
restricted by the per-user limitation. Thus a simple reachability
analysis to the resource consumption sites cannot tell the


controllability of the container on an abstract resource. For example,
for abstract resources that are isolated by namespaces, even though
the container can consume these abstract resources, it still may not
affect other containers due to the namespace isolation. Therefore, to
overcome this challenge, we propose container controllability
analysis, which includes seccomp restriction analysis, per-user
restriction analysis, and namespace isolation analysis, to further
filter container-exhaustible resources.

Figure 4 shows the architecture of our tool, which automatically
identifies container-exhaustible abstract resources. The analysis tool
takes kernel source IR as the input. It first identifies all the
kernel shareable abstract resources using configuration-based analysis
and access-based analysis in §4.1. Then, it conducts the syscall
reachability analysis and container restriction analysis in §4.2,
which includes seccomp, per-user, and namespace restriction analysis,
to analyze the container controllability over these abstract
resources. Finally, we present the analysis results in §4.3.

Figure 4: The architecture of the analysis tool. [Diagram components:
Kernel Source (IR); Shareable Abstract Resources Identification
(§4.1), with Configuration-based Analysis and Access-based Analysis,
yielding Shareable Abstract Resources and Sensitive Functions;
Container Controllability Analysis (§4.2), with Syscall Reachability
Analysis, Seccomp Restriction Analysis, Per-User Restriction Analysis,
and Namespace Isolation Analysis, yielding Container Controllable
Abstract Resources.]

4.1 Identification of Kernel Shareable Abstract Resources

As mentioned before, it is challenging to identify meaningful abstract
resources from thousands of kernel variables and data structure
instances. Even harder, to make sure these abstract resources are
directly or indirectly shared between containers, we need to narrow
them down to the shareable kernel abstract resources.

To overcome this challenge, we leverage kernel programming paradigms
and propose configuration-based analysis and access-based analysis to
identify kernel shareable resources.

4.1.1 Configuration-based Analysis. The Linux kernel provides sysctl
interfaces under /proc/sys to allow user-space programs to configure
kernel parameters [49]. Our key observation is that most of these
sysctl configurations are used for abstract resource limiting, such as
limiting the file number fs.file-nr or memory huge pages
vm.nr_hugepages. As a result, all containers share the same global
limit specified by sysctl configurations. Such sysctl configurations
offer important clues about the abstract resources that are shareable
between containers.

Based on the above observation, we propose to identify the shareable
kernel abstract resources using the sysctl configurations, termed the
configuration-based analysis, which consists of three basic steps.
First, it uses the specific sysctl data types to identify all
sysctl-related data structures. These data structures contain the
configurable sysctl kernel parameters. Second, the sysctl data
structure usually contains the function that displays the sysctl value
in the /proc/sys/ folder. Therefore, by analyzing that function, we
are able to pinpoint the exact variable for this kernel parameter.
Finally, if a kernel parameter is used for restricting resource
consumption, its corresponding variable should appear in comparison
instructions. Therefore, we follow the use-def chain to check the
usages of the identified variable and mark it as an abstract resource
if it is used in a comparison instruction.

     1  static struct ctl_table fs_table[] = {
     2      ...
     3      {
     4          .procname = "file-nr",
     5          .data = &files_stat,
     6          .proc_handler = proc_nr_files,
     7      },
     8      ...
     9  };
    10
    11  int proc_nr_files(...)
    12  {
    13      files_stat.nr_files = get_nr_files();
    14      ...
    15  }
    16
    17  static long get_nr_files(void)
    18  {
    19      return percpu_counter_read_positive(&nr_files);
    20  }
    21
    22  struct file *alloc_empty_file(int flags, ...)
    23  {
    24      ...
    25      if (get_nr_files() >= files_stat.max_files &&
                !capable(CAP_SYS_ADMIN)) {
    26          ...
    27          goto over;
    28      }
    29      ...
    30  }

Figure 5: The sysctl data structures in the Linux kernel. [Annotations
in the figure: (1) identify sysctl struct; (2) identify critical
variable; (3) check critical variable usages.]

We design and implement an inter-procedural analysis pass in LLVM. We
use an example in Figure 5 to illustrate the details. Specifically,
the Linux kernel uses the type struct ctl_table to configure sysctl
kernel parameters, such as the file system configurations in fs_table
shown in line 1 of Figure 5. Therefore, the pass first


traverses all kernel global variables to collect all struct ctl_table
variables, such as fs_table in Figure 5.

Second, fs_table uses the function pointer in proc_handler to display
the parameter in the /proc/sys/ file system. Therefore, from the
proc_handler field, the pass follows its points-to set and launches an
inter-procedural analysis to obtain the exact variable whose value is
displayed in the sysctl configuration interface. As shown in line 19
of Figure 5, our pass marks nr_files as the critical variable.

Third, our pass checks all usages of the identified critical
variables. If a critical variable is used in a comparison instruction
(i.e., icmp in LLVM IR), our pass records the locations and marks this
variable as an abstract resource. For example, nr_files is used for
comparison in line 25 of Figure 5. Our pass further detects that if
the comparison fails, an error is returned in lines 25 and 27.
Therefore, our pass marks nr_files as an abstract resource. By
analyzing all struct ctl_table structures, our pass gets a collection
of abstract resources.

4.1.2 Access-based Analysis. Besides sysctl configurations, the Linux
kernel also uses lock or atomic mechanisms to protect
concurrently-accessed resources. Therefore, we propose to use
concurrent accesses as an indication to identify a set of shareable
abstract resources.

As race condition and concurrency analysis is an old topic, we adopt
the existing lockset detection approaches [5, 68]. If the lock is
taken on a field of a data structure, we mark this data structure as
an abstract resource and add this function into the sensitive function
set. Moreover, if a variable is modified quantitatively between the
lock and unlock functions, we also mark it as an abstract resource.

Besides lock/unlock, we observe that atomic and percpu counters are
also used to protect concurrently-accessed data, such as
percpu_counter_inc (line 19 in Figure 1) and atomic_inc_return (line 6
in Figure 3). Therefore, we implement a pass to analyze all atomic and
percpu counter usages. Our pass first analyzes the function
parameters, and adds all functions with struct atomic_t, struct
atomic64_t, and struct percpu_counter parameters to an atomic/percpu
function set. Second, our pass traverses all statements in all kernel
functions to check all usages of atomic/percpu functions. If a
variable is passed to an atomic/percpu function, we mark it as an
abstract resource.

During the implementation, we find that the LLVM linker merges
structure types that have the same memory layout, such as typedef
struct {int counter;} atomic_t and typedef struct {uid_t val;} kuid_t.
The reason is that uid_t is of type unsigned int, which has the same
size as int. Therefore, the LLVM linker merges them and misuses kuid_t
for atomic_t. To address this problem, we trace the LLVM linker and
find that the get method in lib/Linker/IRMover.cpp compares a new type
with existing types and merges them if the memory layout is the same.
Therefore, we disable the merging by commenting out the comparing and
merging code.

4.2 Container-Controllability Analysis

With identified abstract resources, we propose container
controllability analysis to make sure that the container can actually
consume those abstract resources. Our idea of the container
controllability analysis is two-fold. First, we need to make sure the
abstract resource consumption sites from §4.1 can be reached by the
container processes. To achieve this, we perform the traditional
backward control-flow analysis based on the kernel control flow graph,
in which indirect calls are resolved based on struct-type [42, 70]. If
there are no paths from system call entries to the abstract resource
consumption sites, we mark this abstract resource unreachable from the
container.

Second, reachability analysis alone is not enough: we need to further
make sure that there are no additional container-specific restrictions
on the path. In other words, we need to check if there are any
restriction checks on the paths to make sure that the container can
exhaust these abstract resources. As mentioned before, different from
user-space programs, the container faces more restrictions such as
seccomp, namespaces, control groups as well as per-user resource
limitations. Since our reachability analysis is standard, in the
following, we focus on the restriction analysis.

Seccomp Restriction Analysis. Seccomp is a mechanism used for
system-call filtering. Our restriction analysis against seccomp is as
follows. In our implementation, we use the Docker default seccomp
profile [15], which blocks more than 50 system calls. Among all the
paths from system call entries to the resource consumption sites, we
filter out paths that originate from any blocked system calls.

Per-User Restriction Analysis. In a real deployment, the containers
are usually running as different users. Thus, the resource consumption
from each container is also restricted by the per-user resource
quotas. For example, Linux provides the user-limits command ulimit for
limiting resource consumption of a specific user [50]. The underlying
implementation of ulimit uses rlimit [39, 45] to set multiple per-user
resource quotas.

Besides ulimit, Linux also provides interfaces that allow users to
leverage PAM (Pluggable Authentication Module) [63] to deploy per-user
quotas. The PAM uses the setup_limits function [64] to set per-user
resource quotas, which calls setrlimit to configure multiple rlimit
constraints. For the resources limited by ulimit, rlimit, and the PAM,
the attacker-container cannot consume beyond the per-user quotas. As a
result, it cannot fully control those abstract resources to launch DoS
attacks. As both ulimit and the PAM use rlimit to set per-user
resource quotas, we need to analyze rlimit and filter out the abstract
resources restricted by it.

For rlimit analysis, our key observation is that an rlimit value is
usually specified in struct rlimit or struct rlimit64. Therefore, we
first traverse the kernel IR to identify all variables that are loaded
from struct rlimit or struct rlimit64. Then, we perform data-flow
analysis to follow all the propagation and usages of these variables
and mark those functions in which they are used in any comparison
instructions. In these functions, rlimit is checked to limit certain
resources. We consider those resources not exhaustible by the
attacker-container, and therefore filter out the paths based on these
functions. Our tool identifies 40 functions that check rlimit.

Namespace Isolation Analysis. As mentioned before, the Linux kernel
introduces namespaces for resource isolation. For a namespace-isolated
resource, the Linux kernel creates a “copy” for it under each
namespace so that the modification in one namespace does not affect
other namespaces. Therefore, to confirm container controllability, we
need to make sure that those abstract resources are not protected by
namespaces. Here, the challenge is that even though Linux has
documentation about namespaces, there are no specifications about
which abstract resources are isolated by namespaces.
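
The seccomp and per-user path filters described above can be illustrated with a small user-space C sketch. It is not the paper's LLVM pass: the struct call_path record, the function names path_survives and site_is_reachable, and the rule that a consumption site stays reachable when at least one path survives are assumptions made for illustration only. The blocked system calls and the rlimit-checking functions are passed in as plain string arrays.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record for one path found by the reachability analysis. */
struct call_path {
    const char *entry_syscall;      /* system call where the path starts */
    const char *const *functions;   /* kernel functions along the path */
    size_t n_functions;
};

/* Returns true if name occurs among the n strings in set. */
bool in_set(const char *name, const char *const *set, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(name, set[i]) == 0)
            return true;
    }
    return false;
}

/*
 * Seccomp filter: drop paths that start at a blocked system call.
 * Per-user filter: drop paths that pass through an rlimit-checking
 * function.
 */
bool path_survives(const struct call_path *p,
                   const char *const *blocked, size_t n_blocked,
                   const char *const *rlimit_fns, size_t n_rlimit_fns)
{
    if (in_set(p->entry_syscall, blocked, n_blocked))
        return false;
    for (size_t i = 0; i < p->n_functions; i++) {
        if (in_set(p->functions[i], rlimit_fns, n_rlimit_fns))
            return false;
    }
    return true;
}

/* Assumption: a consumption site stays reachable if any path survives. */
bool site_is_reachable(const struct call_path *paths, size_t n_paths,
                       const char *const *blocked, size_t n_blocked,
                       const char *const *rlimit_fns, size_t n_rlimit_fns)
{
    for (size_t i = 0; i < n_paths; i++) {
        if (path_survives(&paths[i], blocked, n_blocked,
                          rlimit_fns, n_rlimit_fns))
            return true;
    }
    return false;
}
```

In this sketch a path that enters through a blocked call such as ptrace is discarded no matter which functions it traverses, while a path that enters through an allowed call is discarded only if it crosses one of the rlimit-checking functions.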


Therefore, we propose namespace isolation analysis to identify the
abstract resources protected by namespaces systematically.

Our key observation is that for a namespace-isolated resource, the
corresponding data structure has a pointer field that points to the
namespace it belongs to. Therefore, our tool first traverses all
fields of each data structure type in the kernel. If the type has a
namespace pointer, we mark it as an isolated resource. Second, our
tool uses the identified isolated resources to filter the shared
abstract resources identified in §4.1.

Note that some namespace-isolated resources may still be vulnerable to
abstract resource attacks due to the mapping between different
namespaces. As mentioned in §3.2.2, idr is isolated by
pid_namespace->idr. However, each idr allocated in a non-root PID
namespace is mapped to a new idr in the root PID namespace, so that
the root namespace can manage it. As a result, the root PID namespace
is globally shared by all containers in all PID namespaces. Therefore,
it is still vulnerable to the idr exhaustion attacks. In our analysis,
we manually filter out these resources.

4.3 Analysis Results

We implement our analysis tool with about 2,500 lines of C++ code in
LLVM 12.0. The Linux kernel IR is generated based on the latest Linux
stable version v5.10 with defconfig. The results are shown in Table 1.
In particular, by applying the configuration-based analysis and the
access-based analysis, together with the reachability analysis from
system calls and the seccomp restriction analysis, our tool identifies
1,844 shared abstract resources that are reachable by containers.

Resource Filtering. With the per-user quota restriction and the
namespace isolation analysis, our tool finds 342 resources that are
limited by rlimit or have pointers pointing to namespace structures.
Those resources either have a limit check on the path or are
namespaced.

We further conduct a manual analysis. Specifically, for every resource
𝑅 in the identified abstract resources, we walk through all the
detected modifications of 𝑅 or the fields of 𝑅. If the modification is
not quantitative, such as an assignment of a boolean, enumeration, or
string value, we mark this modification as non-quantitative. If all
the modifications to 𝑅 and the fields of 𝑅 are non-quantitative, we
mark 𝑅 as non-exhaustible. Our manual analysis identifies 492 abstract
resources that are non-exhaustible, as shown in Table 1. After manual
analysis, there are still 1,010 abstract resources remaining.

Table 1: Summary of static analysis results. Res. is short for
resources; Reachable is container-reachable abstract resources.
Limited is per-user and namespace limited resources. Manual is
manually filtered out resources. CC Res. means container controllable
resources.

    Res. Type   Reachable   Limited   Manual   CC Res.
    Proc.             526        72      136       318
    Mem.              110        10       26        74
    Storage           256        57       66       133
    IO                952       203      264       485
    Total           1,844       342      492     1,010

Dynamic Validation. To further validate the dynamic exhaustion of
these 1,010 resources, we develop a dynamic validation method for
resource consumption. For each resource, we first obtain its
consumption sites and the triggered system calls from the above
controllability analysis. After that, we instrument those consumption
sites to monitor the actual resource consumption. Next, we execute the
test cases of the corresponding triggered system calls to repeatedly
trigger the consumption and record the results. We leverage 1,156 test
cases from the Linux Test Project (LTP) [14] and develop 177 new ones
to cover more cases. We also develop scripts to automate the above
steps.

We applied our dynamic validation method to test the consumption of
all 1,010 resources. The results are summarized in Table 2. Of the
1,010 detected resources, 700 are not in the driver folder, while the
other 310 are in the driver folder. For the 700 non-driver resources,
389 of them can be repeatedly triggered dynamically, leading to a true
positive rate of 55.6%. The resources in the driver folder need to be
handled specially for two reasons. First, drivers are specific to the
hardware. Without the corresponding hardware, the driver code cannot
be triggered dynamically. Our key observation is that most
hardware-supported drivers expose specific interfaces under the /dev
or /sys/class folders. Based on this observation, we remove 92
resources in drivers that are not supported by our hardware. Second,
the test cases provided by LTP might not cover a specific driver. To
resolve this problem, we modify the LTP test cases and develop new
test cases for the drivers. Among the remaining 218 driver resources,
112 of them can be repeatedly triggered, leading to a true positive
rate of 51.4%, as shown in Table 2.

Table 2: Summary of validation results. Res. Dir. is the directory of
resources. The drivers either have hardware support, or have no
hardware support. No. is the number of resources; Repeatedly means the
resource consumption can be repeatedly triggered.

    Res. Dir.           No.   Repeatedly   True Positive
    Non-Driver          700          389           55.6%
    Driver (Have HW)    218          112           51.4%
    Driver (No HW)       92            -               -

Identifying container-exhaustible abstract resources is a very
challenging task, as it requires domain knowledge to trigger the
exhaustion of abstract resources and to assess the impacts when these
resources are exhausted. In this paper, we conduct a preliminary
analysis. Note that a thorough analysis and risk assessment needs help
from the Linux kernel and the container community. Therefore, we plan
to open source our tool and the detected abstract resources. We think
it will help the Linux kernel and the container community to identify
the weak spots of resource isolation and develop robust resource
containment schemes.

5 ABSTRACT RESOURCE ATTACKS ON CLOUD PLATFORMS

In this section, we further evaluate abstract resource attacks on the
container environments of public cloud vendors. We first present the
environment setup and then give the results.


Table 3: Summary of the abstract resources chosen from analysis results.

    OS Service   Resource Name     Identification   Consumption Function     Syscall
    Process      PID idr           Access           alloc_pid()              fork()
    Memory       dirty ratio       Access           balance_dirty_pages()    write()
    Storage      inode             Access           __ext4_new_inode()       creat()
    Storage      nr_files          Configuration    alloc_empty_file()       open()
    IO           pty_count         Configuration    devpts_new_index()       open()
    IO           netns_ct->count   Access           __nf_conntrack_alloc()   connect()
    IO           random entropy    Access           extract_entropy()        read()
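
To make the pty_count row of Table 3 concrete, the following user-space C sketch models a single counter shared by every caller, in the spirit of the check in Figure 3. It is a model under stated assumptions, not kernel code: the names model_pty_open and model_exhaust, the limit of 8, and the reserve of 2 are invented for illustration, and the rollback of a failed attempt is assumed because Figure 3 elides what happens after the goto.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative values only; the real limits are kernel tunables. */
enum { MODEL_PTY_LIMIT = 8, MODEL_PTY_RESERVE = 2 };

/* One counter for the whole model "kernel": no per-container copy. */
static atomic_int model_pty_count;

/*
 * Models the comparison in Figure 3. Returns 0 on success or -ENOSPC
 * once the shared counter reaches the limit. The reserve flag plays
 * the role of fsi->mount_opts.reserve. A failed attempt is rolled back
 * (assumption: the figure does not show this part).
 */
int model_pty_open(bool reserve)
{
    int limit = MODEL_PTY_LIMIT - (reserve ? 0 : MODEL_PTY_RESERVE);

    /* atomic_fetch_add returns the old value, so +1 is the new one. */
    if (atomic_fetch_add(&model_pty_count, 1) + 1 >= limit) {
        atomic_fetch_sub(&model_pty_count, 1);
        return -ENOSPC;
    }
    return 0;
}

/* One tenant opens until the first failure; returns successful opens. */
int model_exhaust(void)
{
    int opened = 0;

    while (model_pty_open(false) == 0)
        opened++;
    return opened;
}
```

With these illustrative values, one caller of model_exhaust() consumes the whole non-reserved budget, after which every other caller of model_pty_open(false) fails with -ENOSPC even though it never opened anything itself. That is the essence of an abstract resource attack: the counter is global, so there is no per-container budget to fall back on.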

5.1 Environment Setup and Ethical Considerations

To evaluate the effectiveness of the abstract resource attack, we set
up the container environments on both local and cloud platforms. The
local test environment has been presented in §3.1.

Ethical Considerations. For the cloud platforms, we intend to minimize
the impact of our attacks on other cloud users as much as possible.
Therefore, we use a dedicated virtual server, e.g., AWS EC2, Azure VM,
Google GCE, and Alibaba ECS, to conduct the experiments. In addition,
we ensure that we are the only user of that server.

Moreover, most container users leverage container orchestration
systems to deploy and manage containers [36]. Therefore, we choose the
most popular one, Kubernetes, and leverage cloud vendors’ Kubernetes
services to deploy two docker containers (i.e., the attacker-container
and the victim-container) on the virtual server. For strong isolation,
we apply different Kubernetes namespaces [37] for the
attacker-container and the victim-container. As mentioned in §4.2,
containers are also subject to per-user quota restrictions. To enforce
the per-user quotas in our experiments, we run the attacker-container
and the victim-container as separate users with the per-user quota
enforced. We also discuss restrictions that can be deployed by the PAM
in §6.

Amazon AWS. For the container services, we use Elastic Kubernetes
Service (EKS) [2] to deploy two container instances on an EC2
instance. The EC2 instance contains 4 CPUs, 8 GB memory, and a 20 GB
SSD disk. During the container deployment, we surprisingly find that
the “Amazon EKS default pod security policy” uses eks.privileged as
the default pod security policy [3]. Note that this policy allows
containers to run as a privileged user and also allows privilege
escalation as well as host network accesses.

To better demonstrate the effectiveness of our proposed attack, we
apply the stronger security policy of our local test environment to
the EKS containers, which runs containers as non-root users, drops all
privileges, enables all namespaces and control groups, and uses the
docker seccomp profile [15] to block 50+ sensitive system calls
including ptrace, pivot_root, etc. We apply the same security policy
for both the attacker-container and the victim-container.

MS Azure. We use Azure Kubernetes Service (AKS) [51] to deploy two
container instances on an Azure virtual machine. The Azure VM contains
2 CPUs, 8 GB memory, and a 120 GB disk. To improve the security of the
deployed containers, Azure provides best practices for pod security
policy in AKS [52], which runs a container as a non-root user by
setting runAsUser:1000 in the yaml file and denies privilege
escalation by setting allowPrivilegeEscalation: false. However, it
still adds two capabilities, i.e., CAP_NET_ADMIN and CAP_SYS_TIME, and
does not enforce seccomp.

As with the AWS settings, we adopt a tighter security policy for
containers on AKS. In addition to the best practices suggestions
(i.e., non-root user and disallowing privilege escalation), we run AKS
containers as non-root users, drop all capabilities, enable all
namespaces and control groups, and use the docker seccomp profile [15]
to block 50+ sensitive system calls. We apply the same security policy
for both the attacker-container and the victim-container.

Google Cloud. For the container services, we choose Kubernetes and use
Google Kubernetes Engine (GKE) [27] to deploy two container instances
on a Google Compute Engine instance [28]. The Google Compute Engine
(GCE) instance we use contains 4 CPUs, 16 GB memory, and a 100 GB SSD.
More specifically, we use one GCE instance and deploy two containers
(i.e., the attacker-container and the victim-container) based on the
regular runtime on that GCE instance.

For the container deployment, we follow the GKE container setup
wizard. Google Cloud provides best practices for operating containers
[29], which suggest avoiding privileged containers. Therefore, in
securityContext of the yaml configuration file, we disallow privilege
escalation, run the container as a non-privileged user, and drop all
the capabilities. The GKE setup wizard enables 6 namespaces and 13
control groups by default. Besides, we apply the docker default
seccomp profile to filter out sensitive system calls.

Furthermore, GKE also offers Google’s secure container runtime, gVisor
[31], which leverages a user-space kernel named Sentry to serve the
system calls from applications. Sentry calls about 50 system calls of
the host machine to provide services as needed. gVisor is regarded as
a secure sandboxed runtime for containers [31]. For the container
deployment based on gVisor, all its security settings (including
non-privileged user and dropping capabilities) are the same as the GKE
docker runtime settings.

Alibaba Cloud. For the container services, Alibaba Cloud provides
Elastic Container Instance, Container Service for Kubernetes,
Container Registry, and Alibaba Cloud Service Mesh [1]. We use the
Container Service for Kubernetes to deploy two container instances on
an Elastic Computing Service (ECS) instance. The ECS instance contains
4 CPUs, 16 GB memory, and a 120 GB SSD disk. For container security,
we follow the official guide for container service deployment [11],
which runs containers as a non-root user by setting runAsUser to 1000.
However, it does not disallow privilege escalation and does not
enforce seccomp or SELinux either.

773
Session 3B: Operating Systems CCS ’21, November 15–19, 2021, Virtual Event, Republic of Korea

We adopt a stronger security policy, which is the same as the previous
ones. We run containers as non-root users, drop all capabilities,
enable all namespaces and control groups, and use the docker seccomp
profile [15] to block sensitive system calls. We apply the same
security policy for both the attacker-container and the
victim-container.

5.2 Selection of Abstract Resources

To conduct the attacks, we need to select meaningful abstract
resources. To demonstrate the effectiveness of abstract resource
attacks, we want to select abstract resources that affect each aspect
of the operating system services, including process management, memory
management, storage management, and IO management. Therefore, we first
classify all the identified resources into these four categories,
i.e., process, memory, storage, and IO management, according to their
declaration locations. Then, we pick at least one resource from each
category based on our domain knowledge, i.e., resources for which we
know the impacts of exhaustion.

Eventually, we select 7 abstract resources covering all four aspects,
as shown in Table 3. The resource names are listed in the second
column of Table 3. Among the selected abstract resources, 5 of them
(i.e., PID idr, dirty ratio, inode, netns_ct->count, and random
entropy) are identified by the access-based analysis, and the other 2
(i.e., nr_files and pty_count) are identified by the
configuration-based analysis, as shown in the third column of Table 3.
We also list the resource consumption functions in the fourth column
and the system calls we can use to trigger the attacks in the last
column of Table 3.

5.3 Attacking Results on Cloud Platforms

As mentioned in the previous section, we set up 5 test environments
for our proposed attack, including the local one and those on AWS,
Azure, Google Cloud, and Alibaba Cloud. For each test environment, we
set up two containers with tight security policies, as the
attacker-container and the victim-container. The attacker-container
launches attacks targeting certain abstract resources. We use the
above 7 selected abstract resources to launch the attacks. A benchmark

The inode attack. In the inode attack, the attacker-container keeps
allocating inode structures. Unfortunately, the mount namespace does
not isolate the inode, nor does the Linux kernel provide any
inode-related control groups. As a result, all inodes on the partition
are exhausted. All operations consuming inodes fail, including the
ones from the victim-container or the host machine. In our
experiments, Alibaba Cloud is vulnerable to the inode attack. The
victim-container even gets evicted. Moreover, the host-machine cannot
create any new files either.

The nr_files attack. The nr_files attack has been discussed in §3.4.1.
nr_files is globally shared by all containers. There are no namespaces
or control groups to limit its usage. With the nr_files quota
exhausted, various operations fail, including file open, executing a
new program, pipe creation, socket creation, and timer creation, as
everything in Linux is a file. Our experiments show that all of the
top 4 vendors are vulnerable to the nr_files attack.

The pty_count attack. The pty_count attack has been discussed in
§3.5.1; it uses up the whole quota of open pseudo-terminals. As a
result, all operations that need to open a new pseudo-terminal fail,
such as SSH connections. Unfortunately, all of the top 4 vendors are
vulnerable to the pty_count attack.

The netns_ct->count attack. Netfilter in the Linux kernel provides
connection tracking functionality, which keeps track of all logical
network connections [66]. The total number of connections has a limit,
and it is counted by struct netns_ct->count [34]. Both the host
machine and the containers need to maintain the connections. Even
though the containers are in different net namespaces, all of their
connections need to consume the init_net.ct.count [35] of the init net
namespace of the host machine. Therefore, if one can generate a large
number of TCP connections in a short time, it can consume the whole
quota of init_net.ct.count, causing Netfilter to malfunction. In our
experiments, the attacker-container can exhaust init_net.ct.count in a
few seconds, which causes random packet dropping. Again, all
environments of the top 4 vendors are vulnerable to the struct
netns_ct->count attack.

The random entropy attack. In the Linux kernel, every read from
/dev/random consumes the random entropy. Whenever the random entropy
drops below a threshold, the Linux kernel blocks read
is running on both the victim-container and the host-machine to operations to /dev/random and waits for the entropy to increase [41].
measures their performance downgrade under abstract resource As there are no namespace or control groups to isolate the random
attacks. The results are shown in Table 4. entropy, the attacker-container can easily consume all random en-
The PID idr attack. The PID idr attack and its root cause have tropy by repeatedly reading /dev/random, and lead to benign reads
been detailed in §3.2.1. For the PID attack on the vendors, all victim blocked. The latest Linux kernel v5.10 fixed this issue by redirect-
containers and even the host-machine in Local, AWS, Azure, and ing /dev/random reads to /dev/urandom. However, both Azure and
Google test environments cannot fork new processes. The victim Alibaba Cloud are vulnerable to this attack.
containers even get evicted. Alibaba Cloud is not vulnerable to the
PID attack.
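Several of the globally shared counters discussed in this section (nr_files, pty_count, and the connection tracking count) are exposed through procfs on a typical Linux host, so their utilization can be inspected directly. The following is a minimal monitoring sketch under stated assumptions, not the tooling used in the experiments: it assumes the standard procfs entries /proc/sys/fs/file-nr, /proc/sys/kernel/pty/nr and max, and /proc/sys/net/netfilter/nf_conntrack_count and nf_conntrack_max (the last pair is present only when connection tracking is loaded). The helper names and the 0.8 threshold are illustrative choices.

```python
from pathlib import Path

# Paired procfs entries: (label, current-value path, limit path).
# These are standard Linux paths, but availability depends on the kernel
# configuration (the conntrack entries need nf_conntrack to be loaded).
PAIRED_COUNTERS = [
    ("pty_count", "/proc/sys/kernel/pty/nr", "/proc/sys/kernel/pty/max"),
    ("conntrack",
     "/proc/sys/net/netfilter/nf_conntrack_count",
     "/proc/sys/net/netfilter/nf_conntrack_max"),
]
FILE_NR = "/proc/sys/fs/file-nr"


def read_proc(path):
    """Default reader: return the text content of a procfs entry."""
    return Path(path).read_text()


def parse_file_nr(text):
    """file-nr holds three integers: allocated, unused, maximum."""
    allocated, _unused, maximum = (int(field) for field in text.split())
    return allocated, maximum


def collect_usage(read_text=read_proc):
    """Return {label: used/limit} for every counter that can be read."""
    usage = {}
    try:
        allocated, maximum = parse_file_nr(read_text(FILE_NR))
        usage["nr_files"] = allocated / maximum
    except (OSError, ValueError, ZeroDivisionError):
        pass  # entry missing or unreadable in this environment
    for label, current_path, limit_path in PAIRED_COUNTERS:
        try:
            current = int(read_text(current_path))
            limit = int(read_text(limit_path))
            usage[label] = current / limit
        except (OSError, ValueError, ZeroDivisionError):
            pass  # skip counters that are not available here
    return usage


def near_limit(usage, threshold=0.8):
    """Labels whose utilization is at or above the illustrative threshold."""
    return sorted(label for label, ratio in usage.items() if ratio >= threshold)


if __name__ == "__main__":
    snapshot = collect_usage()
    for label, ratio in sorted(snapshot.items()):
        print(f"{label}: {ratio:.1%} of limit in use")
    print("near limit:", near_limit(snapshot))
```

The sketch is intended to be run on the host; values read from inside a container may be restricted or reflect a namespaced view, so they should be interpreted with care.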
The dirty ratio attack. The dirty ratio attack has been discussed in §3.3.1. Without the attack, the IO performance is regarded as 100%. Under the dirty ratio attack, the IO performance of the victim-container on AWS, Azure, and Alibaba Cloud drops to 6.3%, 1.2%, and 6.7%, respectively. Even worse, the host-machine is also vulnerable to this attack: its IO performance drops to 8.3% on AWS and to 8.6% on Alibaba Cloud. Azure does not provide any access to the host machine, so we cannot measure the Azure host IO performance. Google Cloud is not vulnerable to the dirty ratio attack.

5.4 Attacking gVisor

We also conduct the 7 resource attacks on gVisor. To set up the gVisor environment, we select runsc, instead of runc, as the container runtime in Google Kubernetes Engine (GKE), as mentioned in §5.1. Among the 7 attacks, two attacks, i.e., the nr_files attack and the netns_ct->count attack, still work in the gVisor environment. In the following, we present our analysis to show why these two attacks work on gVisor.

For the nr_files, gVisor uses Sentry to serve syscalls and Gofer to handle different types of IO for the Sentry. Sentry intercepts


Table 4: Summary of the attack results on different environments. “Y” indicates a successful attack, “-” indicates a failed attack.

Abstract Resources   Local   AWS   Azure   Google   Alibaba   Attacking Results
PID idr              Y       Y     Y       Y        -         Fork fail, victim container is evicted
dirty ratio          Y       Y     Y       -        Y         IO performance down for over 90%
inode                Y       -     -       -        Y         Victim container gets evicted
nr_files             Y       Y     Y       Y        Y         Operations requiring open-new-file fail
pty_count            Y       Y     Y       Y        Y         New SSH connections are rejected
netns_ct->count      Y       Y     Y       Y        Y         Random packets dropping
random entropy       Y       -     Y       -        Y         /dev/random read blocked
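
The per-environment totals implied by Table 4 can be tallied mechanically. The short sketch below transcribes the Y/- entries of the table and counts the successful attacks in each environment; the dictionary layout and the function name are illustrative and not taken from the paper.

```python
ENVIRONMENTS = ["Local", "AWS", "Azure", "Google", "Alibaba"]

# One string per row of Table 4, one character per environment column.
TABLE4 = {
    "PID idr":         "YYYY-",
    "dirty ratio":     "YYY-Y",
    "inode":           "Y---Y",
    "nr_files":        "YYYYY",
    "pty_count":       "YYYYY",
    "netns_ct->count": "YYYYY",
    "random entropy":  "Y-Y-Y",
}


def successful_attacks(table):
    """Count the 'Y' entries in each environment column."""
    return {
        env: sum(row[column] == "Y" for row in table.values())
        for column, env in enumerate(ENVIRONMENTS)
    }


if __name__ == "__main__":
    for env, count in successful_attacks(TABLE4).items():
        print(f"{env}: {count} of {len(TABLE4)} attacks succeed")
```

For the four cloud environments the tally gives 5 (AWS), 6 (Azure), 4 (Google), and 6 (Alibaba), which agrees with the totals stated in §5.5.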

the open syscall from the container and sends the request to Gofer. On the other side, Gofer handles the request by calling the openat syscall of the host OS. Eventually, the openat syscall on the host OS triggers the alloc_empty_file function, which consumes the nr_files. In this way, the attacker in gVisor is able to exhaust the nr_files of the host machine.

For the netns_ct->count, Sentry intercepts the connect syscall and uses its own network stack to forward the data packets to the veth-peer network card created in the host. The veth-peer is attached to the virtual bridge in the host. When a network frame is forwarded via a virtual bridge, the netfilter on the host is triggered to call the nf_conntrack_alloc function, which in turn consumes the netns_ct->count. Therefore, attackers in gVisor can still exhaust the netns_ct->count of the host machine.

5.5 Summary

For the self-deployed shared-kernel container environments, two of them are vulnerable to 6 attacks, one is vulnerable to 5 attacks, and the other one is vulnerable to 4 attacks. Surprisingly, the gVisor runtime is also vulnerable to 2 attacks: the nr_files attack and the netns_ct->count attack. We have reported these attacks to all four vendors. All of them confirmed that the problems exist in their shared-kernel container environments.

Though the top vendors use virtual machines to isolate the containers of different tenants, abstract resource attacks are still practical for several reasons. First, as demonstrated on Linux, FreeBSD, and Fuchsia, abstract resource attacks are intrinsic to OS-level virtualization and thus form a broad class of attacks. Second, inexperienced users may not understand the risks of the shared kernel and may use containers for sandboxing [62]. Our paper would help to improve the awareness of the risks. Third, even within the same tenant, competing teams might attack each other by exploiting abstract resources. Therefore, it is still necessary to monitor and mitigate such attacks.

6 MITIGATION DISCUSSIONS

In this paper, we reveal that other than physical resources, containers are also sharing the abstract resources of the underlying running kernel. These abstract resources are easy to attack and the consequences are severe. In the following, we give multiple strategies for mitigating the risks introduced by abstract resources.

Using PAM for per-user quota restrictions. As mentioned in §4.2, the Linux kernel provides interfaces allowing the user to load user-customized PAM. PAM is able to limit 18 resources, 5 of which are for abstract resources, including maxlogins/maxsyslogins, nofile, nproc, and sigpending [48]. From our communication with the cloud vendors, we are not aware that any cloud vendors adopt PAM. Therefore, it is suggested to use PAM for certain abstract resource restrictions.

Using VM for strong isolation. For security-critical applications, we suggest not using multi-tenancy container environments. Stronger isolation schemes, such as virtual machine-based virtualization, are preferable.

Using Monitoring Tools. We recommend using monitoring tools for Kubernetes clusters, such as Falco [61], to monitor the resource consumption of containers. For sensitive abstract resources such as nr_files and inode, users should customize their own rules to monitor specific resource consumption in the system.

Improving current isolation design. Existing namespaces, such as the PID namespace, cannot defend against resource exhaustion attacks because of their mapping to the root namespace. As detailed in §3.2.2, the Linux kernel allocates an extra idr in the root PID namespace for any idr allocated in other PID namespaces. As a result, the root PID namespace is still globally shared. The attacker can still easily exhaust the PIDs in the root PID namespace, causing DoS. For a similar reason, the nf-conntrack count netns_ct->count can be attacked even though it is isolated by the network namespace. Therefore, the Linux community needs to review the namespace design and eliminate the namespace dependencies to improve the isolation.

New kernel containment mechanisms. The Linux kernel community and the container community need to put more effort into the protection of abstract resources. We reported this problem to the Docker security team. The feedback is that “Linux containers can only use available kernel isolation mechanisms. If there are no kernel mechanisms to control the limit, the container cannot do anything to restrict it”. Therefore, we first need a thorough analysis of all container-shareable abstract resources, so that we can understand and, more importantly, clear up their data dependencies. This requires comprehensive kernel domain knowledge and substantial kernel code changes. Moreover, the Linux kernel was not initially designed to support OS-level virtualization. Its resource isolation and containment are incomplete. Therefore, new namespaces and control groups are needed.

More restrictive system call blocking. From the container side, currently, even with seccomp enforced, the applications in the containers can still access about 250 system calls. Before we understand the data dependency of those system calls, it is suggested to enforce a stricter seccomp profile to block more unnecessary system calls. The container users can use techniques in [13, 25, 26, 38, 55]


to get tighter seccomp profiles, to reduce the potential of abstract resource attacks.

7 RELATED WORK

In this section, we present the studies that are related to virtualization, resource isolation, and container security.

7.1 Virtualization Techniques

There are two mainstream virtualization techniques used in the cloud environment, VM-based virtualization and OS-level virtualization. Compared with VM-based virtualization, OS-level virtualization is becoming popular for enabling full application capability with light-weight virtualization. To fully understand the performance advantages, researchers have conducted a series of studies. Felter et al. show that Docker can achieve better performance than KVM in all cases by using a set of benchmarks covering multiple resources [17]. Joy et al. make a comparison between Linux containers and virtual machines in terms of performance and scalability [32]. Zhang et al. show that containers have better performance than virtual machines in big data environments [69].

All these works demonstrate that OS-level virtualization has better performance than traditional VM-based virtualization. However, none of them pay attention to the potential influence of underlying kernel abstract resources. Our paper reveals the new attack surfaces introduced by abstract resources.

7.2 Resource Isolation

Linux uses capabilities [43] to prohibit processes without certain capabilities from accessing resource instances of corresponding types. Researchers have proposed approaches that are based on Linux capabilities, such as Wedge [7], Capsicum [65], and ACES [10]. These works enforce more fine-grained capability control to mitigate memory corruption attacks. However, they cannot defend against our DoS attacks, which exhaust accessible shared resources.

Memory address space isolation [56] is a typical resource space isolation scheme, which prevents memory address resources from being exhausted. Linux namespaces [47] isolate the 8 types of resources listed in §2.1. These schemes can isolate only limited types of resources. Resource containers [6] propose extending the monolithic kernel to isolate system resources and account for resources at thread level, which is similar to control groups. Using resource containers to protect all abstract resources is impractical due to the large performance overhead. EdgeOS [57] deploys an OS with strong isolation for edge clouds. However, adopting a micro-kernel without hardware support introduces more overhead than a monolithic kernel. Faasm [58] uses software-fault isolation (SFI) for memory isolation while using namespaces to isolate the network resource space in serverless computing. However, most shared resources are still exposed to the threat of DoS attacks.

7.3 Container Security

Besides resource isolation, there are studies on container security. Gao et al. find that information leaks from /proc or /sys can be exploited to launch power attacks [22]. The same research group also conducts five attacks to generate out-of-band workloads to break the resource constraints of Linux control groups [23]. However, they mainly focus on the information leakage problem or on attacking physical resources such as CPU and IO, not abstract resources.

Lin et al. show that containers cannot isolate kernel vulnerabilities [40]. Another work uses static analysis to analyze Docker’s code in order to find differences between the vulnerable and the patched code [16]. However, these works focus on existing vulnerabilities and exploits. On the contrary, our work introduces new attacks targeting the shared abstract resources.

There are also works on securing containers. Lei et al. propose a container security mechanism called SPEAKER to reduce the application’s available system calls inside a container [38]. Sun et al. develop security namespaces that provide security policy isolation for each container [60]. Another work uses Intel SGX to secure containers [4], which provides a small trusted computing base with low performance overhead. Brady et al. implement a security assessment system for container images [8]. However, containers in all of these works still rely on the kernel for various services and thus are still vulnerable to abstract resource attacks.

8 CONCLUSION

In this paper, we reveal a new attack surface introduced by the shared kernel in OS-level virtualization. The containers are directly and indirectly sharing thousands of abstract resources, which can be exhausted easily to cause DoS attacks against other containers. To show the importance of confining abstract resources, we have conducted abstract resource attacks, targeting abstract resources on different aspects of the operating system kernel. The results show that attacking abstract resources is highly practical and critical.

Abstract resources are inherently hard to contain. To understand the attack surfaces, we take an initial trial by conducting a systematic analysis to identify vulnerable abstract resources in the Linux kernel. Our tool successfully detects 501 dynamically triggered abstract resources, from which we pick 7 and conduct the attacking experiments in the self-deployed shared-kernel container environments on the top 4 cloud vendors. The results show that all environments are vulnerable to our attacks. As a mitigation, we provide several suggestions for container users and developers to reduce the risks.

ACKNOWLEDGMENTS

The authors would like to thank all reviewers for the insightful comments. Those comments helped to re-shape this paper. This work is partially supported by the National Natural Science Foundation of China (Grants No. 62002317, 62032021, and 61772236), by the National Key R&D Program of China (Grant No. 2020AAA0107700), by the Key R&D Program of Shaanxi Province of China (Grant No. 2019ZDLGY12-06), by the Leading Innovative and Entrepreneur Team Introduction Program of Zhejiang (Grant No. 2018R01005), and by the Ant Group Funds for Security Research.

REFERENCES

[1] Alibaba. 2020. Alibaba Cloud. https://us.alibabacloud.com/.
[2] Amazon. 2020. Containers on AWS. https://aws.amazon.com/containers.
[3] Amazon. 2020. Pod security policy. https://docs.aws.amazon.com/eks/latest/userguide/pod-security-policy.html.
[4] Sergei Arnautov, Bohdan Trach, Franz Gregor, Thomas Knauth, Andre Martin, Christian Priebe, Joshua Lind, Divya Muthukumaran, Dan O’keeffe, Mark L Stillwell, et al. 2016. SCONE: Secure linux containers with intel SGX. In 12th


USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). USENIX Association, 689–703.
[5] Jia-Ju Bai, Julia Lawall, Qiu-Liang Chen, and Shi-Min Hu. 2019. Effective static analysis of concurrency use-after-free bugs in Linux device drivers. In 2019 USENIX Annual Technical Conference (USENIX ATC 19). USENIX Association, 255–268.
[6] Gaurav Banga, Peter Druschel, and Jeffrey C Mogul. 1999. Resource containers: A new facility for resource management in server systems. In Proceedings of the Third USENIX Symposium on Operating Systems Design and Implementation (OSDI), New Orleans, Louisiana, USA, February 22-25, 1999. USENIX Association, 45–58.
[7] Andrea Bittau, Petr Marchenko, Mark Handley, and Brad Karp. 2008. Wedge: Splitting applications into reduced-privilege compartments. In 5th USENIX Symposium on Networked Systems Design & Implementation, NSDI 2008, April 16-18, 2008, San Francisco, CA, USA, Proceedings. USENIX Association, 309–322.
[8] Kelly Brady, Seung Moon, Tuan Nguyen, and Joel Coffman. 2020. Docker container security in cloud computing. In 2020 10th Annual Computing and Communication Workshop and Conference (CCWC). IEEE, 975–980.
[9] Thanh Bui. 2015. Analysis of docker security. arXiv preprint arXiv:1501.02967 (2015). http://arxiv.org/abs/1501.02967
[10] Abraham A Clements, Naif Saleh Almakhdhub, Saurabh Bagchi, and Mathias Payer. 2018. ACES: Automatic compartments for embedded systems. In 27th USENIX Security Symposium (USENIX Security 18). USENIX Association, 65–82.
[11] Alibaba Cloud. 2020. Pod security policy. https://www.alibabacloud.com/help/doc-detail/149547.html.
[12] Theo Combe, Antony Martin, and Roberto Di Pietro. 2016. To docker or not to docker: A security perspective. IEEE Cloud Computing 3, 5 (2016), 54–62.
[13] Nicholas DeMarinis, Kent Williams-King, Di Jin, Rodrigo Fonseca, and Vasileios P Kemerlis. 2020. Sysfilter: Automated system call filtering for commodity software. In 23rd International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2020). USENIX Association, 459–474.
[14] LTP Developers. 2021. Linux Test Project. https://linux-test-project.github.io/.
[15] Docker. 2020. Seccomp security profiles for Docker. https://docs.docker.com/engine/security/seccomp/.
[16] Ana Duarte and Nuno Antunes. 2018. An empirical study of docker vulnerabilities and of static code analysis applicability. In 2018 Eighth Latin-American Symposium on Dependable Computing (LADC). IEEE, 27–36.
[17] Wes Felter, Alexandre Ferreira, Ram Rajamony, and Juan Rubio. 2015. An updated performance comparison of virtual machines and linux containers. In 2015 IEEE international symposium on performance analysis of systems and software (ISPASS). IEEE Computer Society, 171–172.
[18] FreeBSD. 2021. freeBSD handbook. https://docs.freebsd.org/en/books/handbook/jails/.
[19] Fuchsia. 2020. Zircon handles. https://fuchsia.dev/fuchsia-src/concepts/kernel/handles.
[20] Fuchsia. 2020. ZX RIGHTS BASIC. https://fuchsia.dev/fuchsia-src/concepts/kernel/rights#zx_rights_basic.
[21] Peter B Galvin, Greg Gagne, Abraham Silberschatz, et al. 2003. Operating system concepts. John Wiley & Sons.
[22] Xing Gao, Zhongshu Gu, Mehmet Kayaalp, Dimitrios Pendarakis, and Haining Wang. 2017. ContainerLeaks: Emerging security threats of information leakages in container clouds. In 2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). IEEE Computer Society, 237–248.
[23] Xing Gao, Zhongshu Gu, Zhengfa Li, Hani Jamjoom, and Cong Wang. 2019. Houdini’s Escape: Breaking the Resource Rein of Linux Control Groups. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, CCS 2019, London, UK, November 11-15, 2019. ACM, 1073–1086.
[24] Xing Gao, Benjamin Steenkamer, Zhongshu Gu, Mehmet Kayaalp, Dimitrios Pendarakis, and Haining Wang. 2018. A study on the security implications of information leakages in container clouds. IEEE Transactions on Dependable and Secure Computing 18, 1 (2018), 174–191.
[25] Seyedhamed Ghavamnia, Tapti Palit, Azzedine Benameur, and Michalis Polychronakis. 2020. Confine: Automated system call policy generation for container attack surface reduction. In 23rd International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2020). USENIX Association, 443–458.
[26] Seyedhamed Ghavamnia, Tapti Palit, Shachee Mishra, and Michalis Polychronakis. 2020. Temporal system call specialization for attack surface reduction. In 29th USENIX Security Symposium (USENIX Security 20). USENIX Association, 1749–1766.
[27] Google. 2020. GKE quick start. https://cloud.google.com/kubernetes-engine/docs/quickstart.
[28] Google. 2020. google compute engine of Containers. https://cloud.google.com/compute/docs/containers.
[29] Google. 2021. Best practices for operating containers. https://cloud.google.com/kubernetes-engine/docs/best-practices/enterprise-multitenancy.
[30] Aaron Grattafiori. 2016. Understanding and hardening linux containers. Whitepaper, NCC Group (2016).
[31] The gVisor Authors. 2020. What is gVisor. https://gvisor.dev/docs.
[32] Ann Mary Joy. 2015. Performance comparison between linux containers and virtual machines. In 2015 International Conference on Advances in Computer Engineering and Applications. 342–346.
[33] Poul-Henning Kamp and Robert NM Watson. 2000. Jails: Confining the omnipotent root. In Proceedings of the 2nd International SANE Conference, Vol. 43. 116.
[34] Linux Kernel. 2020. Kernel source - nf-conntrack-core.c. https://elixir.bootlin.com/linux/v5.10/source/net/netfilter/nf_conntrack_core.c#L1480.
[35] Linux Kernel. 2020. Kernel source - nf-conntrack-standalone.c. https://elixir.bootlin.com/linux/v5.10/source/net/netfilter/nf_conntrack_standalone.c#L614.
[36] Kubernetes. 2020. Kubernetes. https://kubernetes.io/.
[37] Kubernetes. 2020. Kubernetes Namespaces. https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/.
[38] Lingguang Lei, Jianhua Sun, Kun Sun, Chris Shenefiel, Rui Ma, Yuewu Wang, and Qi Li. 2017. SPEAKER: Split-phase execution of application containers. In International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment (Lecture Notes in Computer Science, Vol. 10327). Springer, 230–251.
[39] GNU C Library. 2021. ulimit source code. https://sourceware.org/git/?p=glibc.git;a=blob_plain;f=sysdeps/posix/ulimit.c.
[40] Xin Lin, Lingguang Lei, Yuewu Wang, Jiwu Jing, Kun Sun, and Quan Zhou. 2018. A measurement study on linux container security: Attacks and countermeasures. In Proceedings of the 34th Annual Computer Security Applications Conference. ACM, 418–429.
[41] Linux. 2020. random read kernel function. https://elixir.bootlin.com/linux/v5.3.1/source/drivers/char/random.c#L1948.
[42] Kangjie Lu and Hong Hu. 2019. Where does it go? refining indirect-call targets with multi-layer type analysis. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security. ACM, 1867–1881.
[43] Linux man-pages project. 2020. capabilities(7) - Linux manual page. https://man7.org/linux/man-pages/man7/capabilities.7.html.
[44] Linux man-pages project. 2020. cgroups - Linux control groups. http://man7.org/linux/man-pages/man7/cgroups.7.html.
[45] Linux man-pages project. 2020. getrlimit man page. https://man7.org/linux/man-pages/man2/getrlimit.2.html.
[46] Linux man-pages project. 2020. Linux pty. https://man7.org/linux/man-pages/man7/pty.7.html.
[47] Linux man-pages project. 2020. namespace - Linux Namespace. https://man7.org/linux/man-pages/man7/namespaces.7.html.
[48] Linux man-pages project. 2020. PAM limits conf man page. https://www.man7.org/linux/man-pages/man5/limits.conf.5.html.
[49] Linux man-pages project. 2020. sysctl man page. https://man7.org/linux/man-pages/man8/sysctl.8.html.
[50] Linux man-pages project. 2020. ulimit man page. https://man7.org/linux/man-pages/man3/ulimit.3.html.
[51] Microsoft. 2020. Containers on Azure. https://azure.microsoft.com/en-us/product-categories/containers/.
[52] Microsoft. 2020. Security policy on Azure. https://docs.microsoft.com/azure/aks/developer-best-practices-pod-security.
[53] FreeBSD Manual Pages. 2021. ezjail man page. https://www.freebsd.org/cgi/man.cgi?query=ezjail.
[54] FreeBSD Manual Pages. 2021. rctl man page. https://www.freebsd.org/cgi/man.cgi?query=rctl&sektion=8.
[55] Shankara Pailoor, Xinyu Wang, Hovav Shacham, and Isil Dillig. 2020. Automated policy synthesis for system call sandboxing. Proceedings of the ACM on Programming Languages 4, OOPSLA (2020), 135:1–135:26.
[56] James L Peterson and Abraham Silberschatz. 1985. Operating system concepts. Addison-Wesley Longman Publishing Co., Inc.
[57] Yuxin Ren, Guyue Liu, Vlad Nitu, Wenyuan Shao, Riley Kennedy, Gabriel Parmer, Timothy Wood, and Alain Tchana. 2020. Fine-Grained Isolation for Scalable, Dynamic, Multi-tenant Edge Clouds. In 2020 USENIX Annual Technical Conference (USENIX ATC 20). USENIX Association, 927–942.
[58] Simon Shillaker and Peter Pietzuch. 2020. Faasm: lightweight isolation for efficient stateful serverless computing. In 2020 USENIX Annual Technical Conference (USENIX ATC 20). USENIX Association, 419–433.
[59] Solaris. 2020. Solaris Zones. https://docs.oracle.com/cd/E26502_01/html/E29024/toc.html.
[60] Yuqiong Sun, David Safford, Mimi Zohar, Dimitrios Pendarakis, Zhongshu Gu, and Trent Jaeger. 2018. Security namespace: making linux security frameworks available to containers. In 27th USENIX Security Symposium (USENIX Security 18). USENIX Association, 1423–1439.
[61] Sysdig. 2021. Sysdig Falco. https://sysdig.com/opensource/falco/.
[62] William Viktorsson, Cristian Klein, and Johan Tordsson. 2020. Security-Performance Trade-offs of Kubernetes Container Runtimes. In 28th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, MASCOTS 2020, Nice, France, November 17-19, 2020. IEEE, 1–4. https://doi.org/10.1109/MASCOTS50786.2020.9285946


[63] Dmitry V. Levin. 2020. pam model source code. https://github.com/linux-pam/linux-pam/releases/tag/v1.5.1.
[64] Dmitry V. Levin. 2021. setup_limits source code. https://github.com/linux-pam/linux-pam/blob/v1.5.1/modules/pam_limits/pam_limits.c#L984.
[65] Robert NM Watson, Jonathan Anderson, Ben Laurie, and Kris Kennaway. 2010. Capsicum: Practical Capabilities for UNIX. In USENIX Security Symposium, Vol. 46. USENIX Association, 2.
[66] Wikipedia. 2020. Connection tracking. https://en.wikipedia.org/wiki/Netfilter#Connection_tracking.
[67] Wikipedia. 2020. OS-level virtualization. https://en.wikipedia.org/wiki/OS-level_virtualization.
[68] Meng Xu, Chenxiong Qian, Kangjie Lu, Michael Backes, and Taesoo Kim. 2018. Precise and scalable detection of double-fetch bugs in OS kernels. In 2018 IEEE Symposium on Security and Privacy, SP 2018, Proceedings, 21-23 May 2018, San Francisco, California, USA. IEEE Computer Society, 661–678. https://doi.org/10.1109/SP.2018.00017
[69] Qi Zhang, Ling Liu, Calton Pu, Qiwei Dou, Liren Wu, and Wei Zhou. 2018. A comparative study of containers and virtual machines in big data environment. In 2018 IEEE 11th International Conference on Cloud Computing (CLOUD). IEEE Computer Society, 178–185. https://doi.org/10.1109/CLOUD.2018.00030
[70] Tong Zhang, Wenbo Shen, Dongyoon Lee, Changhee Jung, Ahmed M Azab, and Ruowen Wang. 2019. Pex: A permission check analysis framework for linux kernel. In 28th USENIX Security Symposium (USENIX Security 19). USENIX Association, 1205–1220.
