Linux Userspace Kernel Interaction
Abstract: System calls based on context switches from user to kernel space are the established concept for interaction in operating systems. On top of them the Linux kernel offers various paradigms for communication and management of resources and tasks. The principles and basic workings of system calls, interrupts, virtual system calls, special purpose virtual filesystems, process signals, shared memory, pipes, Unix or IP sockets and other IPC methods like the POSIX or System V message queue and Netlink are explained and related to each other in their differences. Because Linux is not a purist project but home to many different concepts, only an overview is presented here with a focus on system calls.

1 INTRODUCTION

Kernels in the Unix family normally do not initiate any actions with outside effects, but rely on requests from user space to perform these actions. In a similar way, a user space program running without invoking kernel services has no visible effect outside of its internal computations. Therefore, both need to interact to produce results, and a common program execution trace consists of interwoven kernel and user space code.

In a widespread end user operating system the management of resources needs to happen through a defined, stable interface to enhance portability of applications. An operating system also needs to provide a security model based on privileges if it is to execute untrusted code or serve as a multi-user environment which should, e.g., shield filesystem and network operations from each other.

The interfaces between user and kernel space in Linux will be introduced in the following sections with the intention to give insights into how they work and what they are used for. System calls as the primitives of interaction in Linux are covered first, and all other concepts involve system calls. It makes sense to look at interrupts in general and also at the option of virtual system calls, which avoid the context switch.

The filesystem tree exposes certain kernel interfaces in an accessible way at various points. While the idea that everything is a file is not completely fulfilled in Linux, there are still many possibilities to introspect or manage processes, resources and configurations. Some of these special purpose virtual filesystems are commonly present on all systems, others can be activated on demand.

Since a process can always be the subject of signals, they are a good entry point to inter-process communication (IPC). Semaphores, shared memory and message queues are supported in both POSIX and System V style. A very common principle for IPC are sockets, and pipes can be seen as their simplest case. Besides the popular IP family with TCP/UDP, the local Unix domain sockets play a big role for many applications, while Netlink sockets are specific to the Linux kernel and are not often found in user space applications because they are not portable to other operating systems.

There have been attempts to bring the D-BUS IPC into the kernel with a Netlink implementation, kdbus and then Bus1, but this will not be covered here. Comparative studies with other operating systems are also out of scope.

2 KERNEL AND USER SPACE

Processes have virtualized access to the memory as well as to the processor and its registers, in the sense that they do not have to care about other programs also making use of them because the kernel saves and restores the state [1]. In its x86_64 variant Linux uses the memory management unit (MMU) to provide a flat memory layout of contiguous logical address space by mapping it to the physical memory in units of pages [2].

The page table and its cache, the translation lookaside buffer (TLB), are always present and need to be exchanged if a different process is scheduled. The kernel memory is mapped to the higher canonical address space of every process, and code execution there also makes use of logical addresses. If a kernel thread is to be scheduled, the page table of the previous process can thus remain [2].

A user space program is not allowed to access arbitrary logical addresses, because not every address is mapped and some are mapped for a special purpose like the kernel memory. Entries in the page table are annotated with attributes like read, write and execute permissions, presence of the mapping and whether it is meant for access from kernel or user space.

The program stack is used during the execution of user code, and an additional kernel stack is maintained for execution in kernel mode. This transition needs raised privileges to jump into the kernel memory and is usually done by registering interrupt handlers in the CPU.

Besides the hardware interrupts to handle device notifications there are software interrupts, which are triggered through execution of instructions. Common exceptions are invalid memory access, invalid instruction usage or computation errors [3]. In the Intel 32-bit architecture Linux uses the software interrupt int 0x80 to trigger a system
call and in the 64-bit variant there are special system call instructions (syscall and sysret) to enter and leave a system call, but the result is the same. The registers are saved, the kernel stack of this process is used, the requested system call function is invoked and afterwards execution returns to user space code.

Code in the kernel section can perform almost all functionality known from user space. This ranges from a simple printk(), which is always allowed and issues a message for the kernel log (accessible via the SYSLOG(2) syscall or /proc/kmsg), up to e.g. an implementation of an in-kernel TCP server, since all system call functions are available. But because saving and restoring them on every context switch would be expensive, kernel code avoids touching the floating point and SIMD registers.

Memory Layout

Kernel 0xffffffffffffffff-ffff800000000000
(cf. x86/x86_64/mm.txt in [6])

  • vsyscalls
  • kernel module mapping
  • kernel code
  • stacks
  • kasan shadow memory
  • virtual memory map
  • vmalloc/ioremap space
  • direct mapping of all phys. memory
  • guard hole, reserved for hypervisor

[Canonical address sign extension hole]

User 0x00007fffffffffff-0000000000000000
(cf. /proc/[PID]/maps, randomized)

  • virtual dynamic shared object (vDSO)
  • stack
  • dynamic linker
  • mmaps
  • other shared objects (libraries)
  • heap
  • binary executable

Fig. 1. Virtual Address Space

3 LINUX SYSTEM CALLS

System calls are the main primitives for communication with the kernel. Together they define an abstraction interface for the management of files, devices, processes and communication, with the advantage that e.g. writing to a file does not need knowledge about the filesystem and disk drivers performing the write.

Restrictions are also put in place by the user permissions the process is running under. In addition to this traditional security model Linux offers capabilities, which are single aspects of the highest (root) privilege level and can be supplied to processes either during runtime or attached to the binary executable. The list in CAPABILITIES(7) covers network operations, changing file ownership, killing processes, mounting, loading kernel modules and more [4].

Another circumstance which has an effect on system calls is the control group of the process, where various controllers can specify resource limits and access policies, cf. CGROUPS(7). With Linux NAMESPACES(7) the different contexts can also have an impact on available mount points, user and process IDs or visible network and IPC resources. Finally, SELinux policies and seccomp strict mode (allowing little more than reads and writes on already open file descriptors) are to be mentioned, and it is the kernel's duty to check all restrictions when a system call is requested.

Now to the details of the call procedure, starting from the C standard library used by an application. In a POSIX compatible system many functions are just wrappers for the system calls. But linking against the kernel C functions is not possible, and an architecture specific system calling convention has to be followed.

A call to e.g. fopen(3) or open(2) for file access in the libc implementation uses a macro from linux/x86_64/sysdep.h to resolve the syscall name to the syscall number from asm/unistd_64.h. The calling convention demands that the arguments are loaded into registers, which involves inline assembly, and the number of the system call is passed like an additional leading argument (in rax on x86_64). Then the system call instruction is issued and execution continues in kernel space. The return value will afterwards be in the usual register as with normal calls, so there is no additional assembly involved.

If the system call should not be invoked by a wrapper function there is also the generic syscall(2) function, which only needs the number of the system call and its arguments. In its documentation one can find the calling convention by architecture, which makes it easy to write the call in plain assembly and monitor the system calls of the process with the strace utility:

    global _start
    section .data
    fpath: db '/dev/null', 0   ;; NUL-terminated path
    section .text
    _start:
        mov rax, 2        ;; open(
        mov rdi, fpath    ;;   fpath,
        mov rsi, 0        ;;   O_RDONLY)
        syscall
        ;; returns file descriptor in rax
        mov rax, 60       ;; exit(
        mov rdi, 0        ;;   0)
        syscall

    $ nasm -f elf64 -o open.o open.S
    $ ld -o open open.o
    $ strace ./open
    execve("./open", ["./open"], [/* 62 vars */]) = 0
    open("/dev/null", O_RDONLY) = 3
    exit(0) = ?
    +++ exited with 0 +++

Fig. 2. Interfacing the Linux Kernel

A detail of process creation in Linux is that the current process needs to be duplicated by fork and can then be replaced with a new executable image by execve
as the first action in this new process. Some other well-known syscalls are open, read, write, close for file descriptors, clone for threads or tracking file events with inotify_init. The full list of around 300 available system calls is documented in SYSCALLS(2).

The system call instruction lets the CPU fetch the position of the handler function from a model specific register as the new value for the instruction pointer, and the privilege level is set to kernel mode [3]. On the kernel side this handler activates the kernel stack and saves the registers to it. Interrupts are disabled only for a short period when the handler starts and are enabled again in order to keep system calls preemptible [2].

    /* linux/fs/open.c
     * Copyright (C) 1991, 1992 Linus Torvalds
     */
    [...]
    /* sys_open(const char __user *filename,
     *          int flags, umode_t mode)
     */
    SYSCALL_DEFINE3(open, const char __user *,
                    filename, int, flags,
                    umode_t, mode)
    {
        if (force_o_largefile())
            flags |= O_LARGEFILE;

        return do_sys_open(AT_FDCWD,
                           filename,
                           flags, mode);
    }
    [...]

Fig. 3. Implementation of open(2) which is sys_open() internally [3]

System calls are written as C functions using the SYSCALL_DEFINE macro, which takes care of the metadata. This ensures that the sys_call_table contains the function pointers according to the syscall numbers, and defines asmlinkage as calling convention, i.e. they receive their arguments internally from the stack [2]. The system call handler can then issue a normal call instruction to the address of the related syscall entry in the table. Inside the system call it is important to distinguish between pointers into user space and in-kernel pointers, and helper functions like copy_from_user() and copy_to_user() are provided for this.

When the function is finished, execution returns to the user process with restored state and privilege level, and the return value of the system call is available.

Normally system calls have a single purpose, but ioctl(2) sends requests to special device files and is used for a variety of operations instead of new system calls, since adding those requires consensus among kernel developers and recompilation of the kernel. Nowadays it is preferred to expose attributes in sysfs.

In contrast to calls to the kernel from user space, Linux also supports shifting tasks to the user space with call_usermodehelper(), which provides a wrapper function to spawn a user process and wait for the result.

4 VIRTUAL SYSTEM CALLS

Not every request really needs a context switch, and for commonly used functions where shared data is accessed there are two mechanisms. One is the legacy vsyscall memory mapping, which contains simple pseudo-syscall functions like gettimeofday() and a shared memory region which holds the data to return.

For security reasons related to the static vsyscall mapping, a new mechanism, the virtual ELF dynamic shared object (vDSO), was developed because it supports address space layout randomization (ASLR). Where it resides is passed to the process as the AT_SYSINFO_EHDR entry of the auxiliary vector, cf. VDSO(7).

5 VIRTUAL FILESYSTEMS

Unix shells are well-suited for file processing, and exposing the operating system through file objects is a powerful idea. The tree structure helps with orientation and the set of actions on files is comprehensible.

    /
    |-- dev/
    |   |-- audio
    |   |-- null
    |   |-- sda
    |   |-- block/
    |   |   `-- 8:0 -> ../sda
    |   |-- bus/
    |   |   `-- usb/
    |   `-- char/
    |       `-- 1:3 -> ../null
    |-- proc/
    |   |-- 12345/
    |   |   |-- cgroup
    |   |   |-- cmdline
    |   |   |-- cwd
    |   |   |-- environ
    |   |   |-- fd/
    |   |   |   `-- 0
    |   |   |-- io
    |   |   |-- mem
    |   |   |-- mounts
    |   |   |-- syscall
    |   |   `-- tasks/
    |   |-- cgroups
    |   |-- partitions
    |   `-- sys/
    `-- sys/
        |-- block/
        |-- bus/
        |-- class/
        |-- devices/
        |-- fs/
        |   |-- cgroup/
        |-- kernel/
        |   |-- debug/
        |   |-- config/
        |   `-- cpuset/
        |-- module/
        `-- power/

Fig. 4. Parts of the virtual filesystem tree

The devtmpfs virtual filesystem is filled by the kernel with all device nodes requested by drivers. It is normally mounted in /dev and is additionally managed by the udev service. These device files represent block or character stream devices (not necessarily physical) which are handled in the kernel. They can also be created with mknod(1) and are identified by a major and a minor ID.

Examples for character devices are the random generator or the null device. The common storage volumes and their partitions are accessible as block devices and can be manipulated like files.
The simplest case is piping with anonymous pipes, a form of FIFO buffers. A single pipe implies unidirectional communication. They connect two processes through linked file descriptors, e.g. one for the standard input and the other for the output stream. Then there are named pipes, which are located in the filesystem and can be created with mknod(1) or mkfifo(1).

Concerning sockets there are various families available, and a distinction is to be made between datagram mode and stream mode, the latter giving guaranteed ordering without the notion of packets.

TCP/UDP could cover many use cases but also comes with additional overhead and complexity. If there is no need to route the packets through a network, then the local Unix domain sockets tend to be used. They can be anonymous or named, where the name is either an abstract identifier or a special file in the filesystem. Through socket(2) they can be created with the AF_UNIX domain family and are configured with setsockopt(2).

More similar to AF_INET IP sockets with UDP, but not intended for network usage, are Netlink sockets. Netlink provides a uni- and multicast message bus with general or special purpose protocols [8]. The NETLINK_ROUTE protocol is used in the IP network routing stack of the kernel, others are NETLINK_FIREWALL and NETLINK_FILTER.

In the AF_NETLINK domain the number of protocols is limited to 32 and thus the generic GeNetlink multiplexer has a special role [9]. Through it 65520 families with multicast groups are available to be used as special protocols for kernel and user space communication, yet all using one single bus. An implementation can define message attributes as well as commands which have a callback function [5]. Libraries like libnl exist for user space applications.

9 CONCLUSION

The different concepts for interaction between the Linux kernel and user space have been briefly explained. System calls are the core mechanism and all others involve system calls. Through the evolution of Unix operating systems and the pragmatic approach in the Linux project there are many overlaps between them as well as historical baggage.

New implementations should not be based on legacy concepts, but the borders are not always clear and the decision on what to use depends heavily on the purpose.

Adding system calls does not need too many changes, but has disadvantages in the compile workflow and the resulting portability. In their basic principle system calls only assume the user space to initiate communication and are not very extensible.

All other approaches can mostly be implemented in additional kernel modules and may thus provide a quicker development workflow. The use of filesystems for information exposure is a proven common practice. Particularly during development there are many ways to ease debugging with special filesystems.

For dynamic exchange the Netlink message bus is to be recommended, since consumers can attach to it and the kernel is able to start communication. It is extensible and also suited for the transport of larger amounts of data.

In comparison with e.g. microkernel operating systems that necessarily feature an appropriate IPC mechanism, Netlink does not fill this gap for Linux. With the rise of containers there might be more attempts to implement the functionality of D-BUS in the kernel space.

But it is unlikely that system calls based on context switches will be replaced soon by parallel execution of kernel and user space. The presence of multiple CPU cores has already promoted the use of asynchronous calls within user space.

REFERENCES

[1] Robert Love, Linux System Programming, 1st ed. O'Reilly Media, Inc., 2007, ISBN 978-0-596-00958-8
[2] Robert Love, Linux Kernel Development, 3rd ed. Addison-Wesley Professional, 2010, ISBN 978-0-672-32946-3
[3] Alexander Kuleshov, Linux Insides, 2017, Commit 3410012, https://0xax.gitbooks.io/linux-insides/content/index.html
[4] Various, The Linux man-pages project, 1994-2017, https://www.kernel.org/doc/man-pages/
[5] Ariane Keller, Kernel Space, User Space Interfaces, 2008, Rev. #11, http://wiki.tldp.org/Kernel_userspace_howto
[6] Various, Linux Kernel 4.9 Documentation Files, 2017, https://www.kernel.org/doc/Documentation/
[7] S. Maliye, S. Krishnaswamy and H. Gajula, Quick access of sysfs entries through custom system call, MicroCom, 2016, DOI: 10.1109/MicroCom.2016.7522511
[8] Neil Horman, Understanding and Programming with Netlink Sockets, 2004, http://people.redhat.com/nhorman/papers/netlink.pdf
[9] Pablo Neira-Ayuso, Rafael M. Gasca and Laurent Lefevre, Communicating between the kernel and user-space in Linux using Netlink sockets, Softw. Pract. Exper., 2010, DOI: 10.1002/spe.981