Michal Sekletar Systemd
Michal Sekletár
[email protected]
Agenda
systemd recap
Cgroup v2 and resource management
Service sandboxing
PART I
What is systemd?
Components of systemd
Units
Unit types
service
target
socket
mount
automount
swap
device
path
timer
slice
scope
See man systemd.service, man systemd.socket, ..., for more information.
Unit files
Unit file – example
# /usr/lib/systemd/system/cups.service
[Unit]
Description=CUPS Scheduler
Documentation=man:cupsd(8)
After=network.target
[Service]
ExecStart=/usr/sbin/cupsd -l
Type=notify
[Install]
Also=cups.socket cups.path
WantedBy=printer.target
Unit files – Hierarchy of configuration
To list the directories that systemd searches for unit files, run
systemd-analyze unit-paths
Difference between unit and unit file
Dependency model in systemd
Relational dependencies
Ordering dependencies
Transactions
Transactions
Interesting options related to dependencies
Service management – Basics
Note: you don’t need to type the .service suffix, because service is the default unit type.
Service management – Managing unit files
Enable service to start after a reboot,
systemctl enable httpd.service
Disable the service, i.e. systemd won’t attempt to start it after a
reboot,
systemctl disable httpd.service
Reset to default unit file state,
systemctl preset httpd.service
List all unit files,
systemctl list-unit-files
Determine current enablement state,
systemctl is-enabled httpd.service
Mask a unit file. Note that masked units can’t be started, even
when they are requested as dependencies,
systemctl mask httpd.service
Notice that operations acting on unit files create or remove symlinks in
the filesystem. To achieve the same end result you could create
symlinks on your own.
Service management – Unit file [Install] section
Let’s consider this example [Install] section,
[Install]
WantedBy=multi-user.target
Also=sysstat-collect.timer
Also=sysstat-summary.timer
Alias=monitoring.service
Service management – Service types
PART II
Resource management – Control groups
Process tracking
Resource distribution
Resource management – Control groups - terminology
Resource management – cgroup v1 and cgroup v2
Service – Normal service units. Each service has its own cgroup.
Scope – As with a service, a scope’s processes share a cgroup.
However, scope processes are not children of systemd.
Slice – Services and scopes can be further grouped into slices.
To get an overview of the current cgroup hierarchy on your system,
run the systemd-cgls command.
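As a minimal sketch of how these unit types relate, a service can be placed into a custom slice with Slice=. The unit names batch.slice and batch-job.service, the binary path and the weight are hypothetical and only serve as illustration.

```ini
# /etc/systemd/system/batch.slice  (hypothetical slice name)
[Unit]
Description=Slice for low-priority batch work

[Slice]
# Resource settings on a slice apply to everything placed inside it.
CPUWeight=50

# /etc/systemd/system/batch-job.service  (hypothetical service, separate file)
[Unit]
Description=Example batch job

[Service]
# Hypothetical binary; replace with a real command.
ExecStart=/usr/local/bin/batch-job
# Put the service's cgroup under batch.slice instead of system.slice.
Slice=batch.slice
```

After starting such a service, it should show up under batch.slice in the systemd-cgls output.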
Resource management – Control groups hierarchy
Control group /:
-.slice
├─user.slice
│ └─user-0.slice
│   ├─session-6.scope
│   │ ├─27 login -- root
│   │ ├─34 -bash
│   │ ├─52 systemd-cgls
│   │ └─53 systemd-cgls
│   └─[email protected]
│     └─init.scope
│       ├─28 /usr/lib/systemd/systemd --user
│       └─29 (sd-pam)
├─init.scope
│ └─1 /usr/lib/systemd/systemd
└─system.slice
  ├─dbus.service
  │ └─23 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile
  ├─systemd-logind.service
  │ └─22 /usr/lib/systemd/systemd-logind
  ├─systemd-resolved.service
  │ └─21 /usr/lib/systemd/systemd-resolved
  └─systemd-journald.service
    └─15 /usr/lib/systemd/systemd-journald
Resource management – CPU
Resource management – Memory
Partitioning available memory with systemd and the cgroup v2 memory
controller is rather complicated. Multiple options are available,
MemoryMin – Hard memory protection. If memory usage is below
this value, the cgroup’s memory won’t be reclaimed.
MemoryLow – Soft memory protection. If memory usage is below
this value, the cgroup’s memory is reclaimed only if there is no
memory to be reclaimed from unprotected cgroups.
MemoryHigh – Memory throttle limit. If memory usage goes
above this value, the processes in the cgroup are throttled and put
under heavy reclaim pressure.
MemoryMax – Hard limit for memory usage. You can use K, M,
G, T suffixes (e.g. MemoryMax=1G).
MemorySwapMax – Hard limit on swap usage.
Once a service reaches its memory limit, it is very likely to get
killed by the OOM killer. To make that less likely, adjust the
OOMScoreAdjust value as well.
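The memory options above could be combined in a drop-in file along the following lines. This is only a sketch: the unit name example.service and all values are assumptions chosen for illustration, not recommendations.

```ini
# /etc/systemd/system/example.service.d/memory.conf  (hypothetical unit and values)
[Service]
# Protection: usage below these thresholds is shielded from reclaim
# (MemoryMin unconditionally, MemoryLow on a best-effort basis).
MemoryMin=256M
MemoryLow=512M
# Throttling and heavy reclaim start above MemoryHigh.
MemoryHigh=1G
# Hard caps on memory and swap usage.
MemoryMax=1536M
MemorySwapMax=512M
# Lower values make the OOM killer less likely to pick this service
# (valid range is -1000 to 1000).
OOMScoreAdjust=-500
```

Keeping MemoryHigh below MemoryMax gives the kernel room to throttle and reclaim before the hard limit is reached.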
Resource management – Block I/O
The block I/O controller in cgroup v2 allows for quite fine-grained tuning.
systemd provides the following options for configuring this subsystem,
IOWeight – Set the default I/O weight
IODeviceWeight – Set the I/O weight for a specific block device
(e.g. IODeviceWeight=/dev/sda 200)
IOReadBandwidthMax, IOWriteBandwidthMax – Absolute
per device (or mount point) bandwidth limit. E.g.
IOWriteBandwidthMax=/var/log 5M
IOReadIOPSMax, IOWriteIOPSMax – Same as the above,
except that the limit is configured in IOPS
IODeviceLatencyTargetSec – Define the per device I/O latency
target (e.g. IODeviceLatencyTargetSec=/dev/sda 10ms)
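A drop-in using these I/O settings might look like the sketch below. The unit name, devices and numbers are hypothetical. Note that systemd.resource-control(5) spells the bandwidth options with the Max suffix and the latency option as IODeviceLatencyTargetSec=.

```ini
# /etc/systemd/system/example.service.d/io.conf  (hypothetical unit, devices and values)
[Service]
# Relative I/O weight; the default is 100.
IOWeight=50
# Override the weight for one block device.
IODeviceWeight=/dev/sda 200
# Absolute bandwidth caps in bytes per second (K, M, G, T suffixes allowed).
# A regular path selects the block device backing that file system.
IOReadBandwidthMax=/dev/sda 20M
IOWriteBandwidthMax=/var/log 5M
# Absolute caps on I/O operations per second.
IOReadIOPSMax=/dev/sda 1000
IOWriteIOPSMax=/dev/sda 500
# Average latency target for the device.
IODeviceLatencyTargetSec=/dev/sda 10ms
```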
Resource management – CPU and NUMA placement
Resource management – Task limits
Using the pids cgroup controller you can limit the number of tasks that
a unit is allowed to spawn,
TasksMax – Set the maximum number of tasks (processes and
threads) that the unit can create using fork() or clone().
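For illustration, a task limit could be set in a drop-in like this one. The unit name and the value 64 are assumptions.

```ini
# /etc/systemd/system/example.service.d/tasks.conf  (hypothetical unit and value)
[Service]
# Once 64 tasks exist in the unit's cgroup, further fork()/clone()
# calls fail with EAGAIN.
TasksMax=64
```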
Resource management – Dynamic reconfiguration
Resource management – Exercise: Database and low priority batch job
Resource management – Solution
Resource management – Exercise: Critical workload
You have a mission critical workload running on the server and you
want to make sure that it runs undisturbed whenever possible. Our
goals are,
Workload is running isolated on a subset of CPUs
Workload can use all memory on NUMA nodes corresponding to
those CPUs
System services may consume at most 1 GB of system memory
before memory reclaim pressure is applied
Resource management – Solution
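One possible approach to the critical workload exercise, shown only as a sketch and not necessarily the intended solution: pin the workload with AllowedCPUs= and AllowedMemoryNodes=, and throttle system services with MemoryHigh= on system.slice. The names critical.service and workload.slice, the CPU range and the NUMA node number are hypothetical and depend on the machine's topology.

```ini
# /etc/systemd/system/critical.service.d/placement.conf
# (critical.service, workload.slice, CPU list and NUMA node are hypothetical)
[Service]
# Keep the workload out of system.slice so the limit below does not apply to it.
Slice=workload.slice
# Restrict the workload to CPUs 8-15 (cpuset controller).
AllowedCPUs=8-15
# Allocate memory only from the NUMA node those CPUs belong to.
AllowedMemoryNodes=1

# /etc/systemd/system/system.slice.d/memory.conf  (separate file)
[Slice]
# Above 1G, system services are throttled and put under reclaim pressure.
MemoryHigh=1G
```

For the workload to run undisturbed, the remaining slices would also need AllowedCPUs= set to the complementary CPU set, since the setting above only limits where the workload itself may run.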
PART III
Sandboxing – Linux Namespaces
Sandboxing – Linux Namespaces
# ls -l /proc/self/ns
total 0
lrwxrwxrwx. 1 root root 0 Nov 6 09:09 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx. 1 root root 0 Nov 6 09:09 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx. 1 root root 0 Nov 6 09:09 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx. 1 root root 0 Nov 6 09:09 net -> 'net:[4026531969]'
lrwxrwxrwx. 1 root root 0 Nov 6 09:09 pid -> 'pid:[4026531836]'
lrwxrwxrwx. 1 root root 0 Nov 6 09:09 user -> 'user:[4026531837]'
lrwxrwxrwx. 1 root root 0 Nov 6 09:09 uts -> 'uts:[4026531838]'
Sandboxing – Mount Namespace
Sandboxing – PID Namespace
Sandboxing – User Namespace
Sandboxing – Network Namespace
unshare -n /bin/bash
Virtualization of network related system resources,
Interfaces
IPv4 stack
IPv6 stack
Routing tables
Ports
veth pair to create tunnel between namespaces
Sandboxing – Other Kernel Namespaces
IPC
Isolation of System V IPC resources and POSIX message queues
unshare -i /bin/bash
UTS
Virtualization of hostname and NIS domain name
unshare -u /bin/bash
Cgroup
Virtualization of a cgroup tree view
unshare -C /bin/bash
Sandboxing
systemd provides a lot of options that help you further constrain and
secure services running on your system. In most cases the only thing
you need to do is enable the given feature in a unit file.
PrivateTmp – Service has its own /tmp and /var/tmp
ProtectHome – /home, /root and /run/user will appear empty
ProtectSystem – Directories /usr and /boot are mounted
read-only (with "full" /etc is read-only too, with "strict" the entire
filesystem is read-only)
ReadOnlyDirectories – Service will have read-only access to the
listed directories (current name: ReadOnlyPaths)
InaccessibleDirectories – Listed directories will appear empty
and will have 0000 access mode (current name: InaccessiblePaths)
RootDirectory – Runs the service in a chroot()-ed environment
PrivateDevices – Service gets its own /dev with only basic device
nodes, e.g. /dev/null. The CAP_MKNOD capability is disabled.
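As a sketch of how the filesystem-related options are enabled, consider the drop-in below. The unit name and the listed paths are hypothetical. ReadOnlyPaths= and InaccessiblePaths= are the current spellings of ReadOnlyDirectories= and InaccessibleDirectories=.

```ini
# /etc/systemd/system/example.service.d/sandbox-fs.conf  (hypothetical unit and paths)
[Service]
# Private /tmp and /var/tmp for this service.
PrivateTmp=yes
# /home, /root and /run/user appear empty.
ProtectHome=yes
# /usr, /boot and /etc are mounted read-only.
ProtectSystem=full
# Additional read-only and hidden locations (hypothetical paths).
ReadOnlyPaths=/var/lib/example-data
InaccessiblePaths=/srv
# Minimal private /dev without physical device nodes.
PrivateDevices=yes
```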
Sandboxing
NoNewPrivileges – Ensures that the service can never gain new
privileges
SystemCallFilter – You can whitelist or blacklist system
calls (note: list the predefined groups with systemd-analyze
syscall-filter [syscall-group])
PrivateNetwork – Completely isolate the service from network access
(network namespace with only loopback)
JoinsNamespaceOf – Enables multiple units to share the
PrivateTmp and PrivateNetwork namespaces
CapabilityBoundingSet – List of capabilities to be included in
the capability bounding set of the executed process
AmbientCapabilities – List of capabilities to be included in
the ambient capability set
TemporaryFileSystem – List of mount points on which to mount
tmpfs
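The privilege-related options could be combined roughly as follows. This is a sketch under assumptions: the unit name, the chosen capability and the tmpfs mount point are hypothetical, and @system-service is one of the predefined system call groups listed by systemd-analyze syscall-filter.

```ini
# /etc/systemd/system/example.service.d/sandbox-privs.conf  (hypothetical unit)
[Service]
# The service and its children can never gain new privileges.
NoNewPrivileges=yes
# Allow only the predefined @system-service group of system calls.
SystemCallFilter=@system-service
# Keep a single capability in the bounding set ...
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
# ... and pass it on to the service process (useful together with User=).
AmbientCapabilities=CAP_NET_BIND_SERVICE
# Mount an empty tmpfs over /var/cache, visible to this service only.
TemporaryFileSystem=/var/cache
```

PrivateNetwork=yes is left out here on purpose, because a service that binds to network ports needs access to the host network.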
Sandboxing
PrivateUsers – Run the service in its own user namespace,
mapping the root user and the unit’s own user to themselves and
everybody else to the "nobody" user
ProtectKernelTunables – Make directories containing kernel
runtime variables (e.g. /proc/sys, /sys) read-only
ProtectKernelModules – Disable the ability to load and unload
kernel modules
ProtectControlGroups – Mount /sys/fs/cgroup read-only
RestrictAddressFamilies – Whitelist address families (e.g.
AF_UNIX) that the unit is allowed to use
RestrictNamespaces – Limit access to namespace manipulation
system calls (e.g. unshare, setns)
MemoryDenyWriteExecute – Disable memory mappings that are
simultaneously writable and executable
PrivateMounts – Execute the service in its own mount
namespace and turn off mount propagation towards the host’s
mount namespace
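The kernel-facing options from this list can be switched on together, as in the sketch below. The unit name and the selected address families are assumptions.

```ini
# /etc/systemd/system/example.service.d/sandbox-kernel.conf  (hypothetical unit)
[Service]
PrivateUsers=yes
ProtectKernelTunables=yes
ProtectKernelModules=yes
ProtectControlGroups=yes
# Only local, IPv4 and IPv6 sockets may be created.
RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6
# "yes" prohibits creating or joining any type of namespace.
RestrictNamespaces=yes
MemoryDenyWriteExecute=yes
PrivateMounts=yes
```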
Sandboxing