xv6 Containers, Namespaces and Cgroups

This document describes how containers are implemented in the Linux and xv6 operating systems. In Linux, containers rely on namespaces like PID, mount, network, and IPC namespaces to isolate processes, as well as control groups (cgroups) to limit and account for resource usage. xv6 implements a simpler version of containers using PID and mount namespaces along with a cgroup mechanism via the "pouch" command line utility, which allows users to create and manage isolated containers. The document then provides technical details on how namespaces and cgroups are implemented in both Linux and xv6.


xv6 cgroup, namespaces and containers

Table of Contents

Introduction
How containers are implemented in Linux
How containers are implemented in xv6

Functional Specification
    pouch - the command line utility for container management in xv6
        Commands explained
            pouch start
            pouch connect
            pouch disconnect
            pouch destroy
            pouch info
            pouch list
            pouch cgroup
            pouch help
    PID namespaces in xv6
    Mount namespaces in xv6
    cgroup mechanism in xv6
        mount / umount cgroup pseudo-filesystem
        cgroup core and subsystems
            Core part
            CPU Subgroup
            Memory subgroup
        Understanding cgroup hierarchy
        Practical use examples
            Starting from a clean file system image
            Attaching a shell to newly created cgroup
            Enable CPU controller
            Restricting processes in a cgroup to use 50% of the CPU

Technical (implementational) specification
    General
    pouch - the command line tool for container management in xv6
        tty devices
        How the pouch utility manages containers
            pouch start
            pouch connect
            pouch disconnect
            pouch destroy
            pouch info
            pouch list
            pouch cgroup
    namespaces in xv6
        PID namespaces in xv6
        Mount namespaces in xv6
        Mountpoint related structures in xv6

Other learning resources

Contributors
Alon Yoav
Bolotin Alexander
Kordon Oleg
Meiran Niv
Saraf Gahl
Sariel David
Simkin Michael
Strauss Michael
Zabernin-Frenk Daniel
Zedaka Eyal
Introduction
The popularity of container technologies has grown over the past decade and this trend is
expected to continue. The influence of containers is prominent in many ways, spanning every
aspect of the software product lifecycle: from design and architecture considerations,
through programming language adoption, to product deployment and production environment
maintenance.

Looking under the hood to understand how containers are implemented, and becoming familiar
with the operating system features that container technologies rely on, is compelling. This
awareness also matters when weighing the pros and cons at the early stages of a product
design. The implementation of the operating system features necessary for containers is much
easier to study in operating systems intended for educational use. One of them, xv6, is
simple enough, yet contains the important concepts and organisation of a Unix-like operating
system. As of 2021, the Linux source contains 31M lines of code (~14% of which is kernel
code) compared with only 18K in xv6, so going from ‘simple to complex’ seems to be a
sensible path.

This whitepaper describes the way containers were added to the xv6 operating system. The
design and user interface were modeled on the Linux operating system. Therefore, we start
with a general overview of the features that container implementations depend on in Linux
and then continue with the functional and technical details of the feature subset used for
the lean container implementation in xv6.
How containers are implemented in Linux
A Linux container is a set of one or more processes isolated from the rest of the system. It
provides resource management through control groups and resource isolation via
namespaces.

The Linux kernel has several types of namespaces and each of them may be considered a
kernel feature. Let’s examine them one by one.

PID namespace - provides processes with an independent set of process IDs (PIDs)
separated from other namespaces. Processes inside the child PID namespace are visible
from the parent PID namespace. In Pic 1, the process with PID 8 is a direct descendant of
the process with PID 6, but inside the child PID namespace the processes are organized in a
‘parallel’ hierarchy. The process with PID 8 is the init process inside that parallel
universe and is therefore referred to there as the process with PID 1. Processes with
PIDs 1-3 in the child PID namespace have no knowledge of the existence of other processes,
while the processes in the parent PID namespace can still see them as the processes with
PIDs 8-10.

Pic 1 (from ‘Separation Anxiety: A Tutorial for Isolating Your System with Linux Namespaces’)

Mount namespace - isolates and controls mount points. The global view of mount points can
be altered by child mount namespaces. As depicted below, the first and second child
namespaces refer to the same virtual disk where their root filesystem is located, which is
different from the root filesystem visible from the global (initial) mount namespace. In
addition, the child mount namespaces refer to different filesystems mounted on their
respective ‘/mnt’ mountpoints, which gives each of them an individual view of the tree
hierarchy.
Pic 2 (from ‘Separation Anxiety: A Tutorial for Isolating Your System with Linux Namespaces’)
Network namespace - isolates system networking resources. The global view of networking
resources is altered by the child net namespaces, and processes in those namespaces are
given a (presumably) different set of network interfaces.

Pic 3 (from ‘Separation Anxiety: A Tutorial for Isolating Your System with Linux Namespaces’)

Additional namespaces provided by Linux include:

UTS namespaces - to isolate host and domain names, meaning that different processes
may appear to be running on different hosts and domains while running on the same system.

IPC namespaces - to isolate interprocess communications. E.g. processes in different IPC
namespaces are able to use the same identifier for a shared memory region and end up with
two distinct regions.

User namespaces - to isolate user and group ID spaces. This namespace is useful when one
needs to have the root user with ID 0 inside the namespace while the actual user ID
for that user in the global namespace differs from 0.

Time namespace - to isolate machine time, allowing processes in different time namespaces
to see different system times.

All the namespaces mentioned so far provide different means of resource isolation. But
unless one is willing to grant an unlimited amount of system resources to the processes
that use namespace segregation, resource accounting and limitation are required as well.
Therefore, the Linux kernel provides a cgroup mechanism that offers limiting,
prioritization, accounting and control features for a collection of processes:

Resource limiting - a group of processes can be set to not exceed CPU, memory, disk I/O
and network limits.
Prioritization - some process groups may get a larger share of resources than others.
Accounting - measures a group's resource usage.
Control - facilitates freezing, checkpointing and restarting of groups of processes.

Pic 4 (Resources allocated to the group1-web and groups2-db cgroups and associated sets of
processes)
Control groups and resource isolation via namespaces make it possible to isolate processes
and to create containers. Containers belong to the type of virtualization known as ‘system
level virtualization’. This type of virtualization is also called C-type virtualization
(C stands for ‘Container’). While VMs (Virtual Machines) running on top of hypervisors
provide a higher level of security at the expense of heavier resource consumption and
(to some extent) slower performance, system level virtualization is much more lightweight
(resource wise) and sacrifices several security aspects as a tradeoff.

Pic 5 (from ‘What is a Linux container’)

How containers are implemented in xv6


As in Linux, an xv6 container is an instance that facilitates the isolation of xv6 operating
system resources along with their accounting, limitation and control. Since xv6 has a limited
set of features, the isolation is based on PID and mount namespaces augmented with the cgroup
mechanism. A userspace command line utility called ‘pouch’ lets xv6 users easily create and
manage xv6 containers.

In the following section the pouch utility, cgroup, PID and mount namespaces are described
from the user's perspective, while the chapter after it dives into the implementation details.

Functional Specification

pouch - the command line utility for container management in xv6


The purpose of the ‘pouch’ command line utility is to create containers, to apply cgroup
limitations, to get into a running container and perform operations inside of it (this
operation is also called an “attachment”), to exit the attached container, to get information
about a running container and, finally, to destroy it.

For simplicity, the ‘pouch’ utility allows at most 3 containers to be created and, for the
same reason, nested containers are not allowed. The former limitation follows from the
number of tty devices created by xv6 during boot (3 tty devices) and from the simplifying
assumption that rigidly ties tty devices to the created containers. It would be a nice
exercise to break this rigid dependency and to allocate tty devices only upon attachment to
a container rather than allotting them at container creation. The latter limitation follows
from the fact that the implementation of the PID namespace, on which xv6 container isolation
is based, has no support for nesting. ‘Pouch’ utility users are able to create and destroy
containers only in the ‘detached’ mode. Only a limited set of ‘pouch’ utility commands is
available in the ‘attached’ mode. All commands run from the shell in the attached mode
create processes isolated from other containers.
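
The rigid tie between tty devices and containers can be pictured as a small fixed-size slot table. The following is only a sketch of that idea, not code from the pouch sources: the names `toy_slots`, `slot_init`, `slot_alloc` and `slot_free`, the return codes and the name length limit are all hypothetical, and it assumes one slot per tty device and unique container names.

```c
#include <assert.h>
#include <string.h>

#define TOY_NTTY 3        /* one slot per tty device created at boot */
#define TOY_NAME_LEN 32   /* hypothetical limit on the container name length */

/* Hypothetical bookkeeping: slot i holds the name of the container tied
 * to tty i, or an empty string when that tty is free. */
struct toy_slots {
    char name[TOY_NTTY][TOY_NAME_LEN];
};

void slot_init(struct toy_slots *t)
{
    memset(t, 0, sizeof(*t));
}

/* Returns the tty index given to the container,
 * -1 if the name is already in use, -2 if all ttys are taken. */
int slot_alloc(struct toy_slots *t, const char *name)
{
    int free_idx = -1;

    assert(name != 0 && name[0] != '\0' && strlen(name) < TOY_NAME_LEN);
    for (int i = 0; i < TOY_NTTY; i++) {
        if (strcmp(t->name[i], name) == 0)
            return -1;                       /* container already started */
        if (t->name[i][0] == '\0' && free_idx < 0)
            free_idx = i;                    /* remember the first free tty */
    }
    if (free_idx < 0)
        return -2;                           /* no tty left: limit reached */
    strcpy(t->name[free_idx], name);
    return free_idx;
}

/* Releases the tty tied to the container; 0 on success, -1 if unknown. */
int slot_free(struct toy_slots *t, const char *name)
{
    for (int i = 0; i < TOY_NTTY; i++) {
        if (t->name[i][0] != '\0' && strcmp(t->name[i], name) == 0) {
            t->name[i][0] = '\0';
            return 0;
        }
    }
    return -1;
}
```

With such a table, a fourth start request fails until one of the three containers is destroyed. Allocating ttys at attachment time, as suggested in the exercise above, would move the `slot_alloc` call from container creation to connection.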

The table below summarizes the supported commands according to the mode:

Attached mode: ‘pouch’ commands      | Detached mode: ‘pouch’ commands
available inside a container         | available outside a container
-------------------------------------+---------------------------------------------
                                     | pouch start {name}
pouch disconnect                     | pouch connect {name}
pouch info                           | pouch info {name}
                                     | pouch list all
                                     | pouch destroy {name}
                                     | pouch cgroup {name} {state-object} [value]
pouch --help                         | pouch --help

Tab 1. ‘pouch’ commands according to the mode

Commands explained

pouch start
Name:
pouch start - creates and starts a container.

Synopsis:
pouch start { name }
{ name } - a container identification string

Description:
A pouch container unshares the pid and mount namespaces. By default no cgroup limitations
are applied at this stage. Limitations have to be explicitly specified in a separate command.
Nested containers are not supported.

Output messages:
● “Pouch: {name} starting” - successfully started a container.
● “{name} is already started” - trying to create a container but a container with the
same identification string already exists.
pouch connect
Name:
pouch connect - attaches the user terminal to a running container using the container’s
identification string

Synopsis:
pouch connect { name }
{ name } - a container identification string

Description:
The user terminal is connected to the tty device that is allocated to the container. The
connection happens transparently to the user. The user gets a command line interface (shell)
and is able to launch processes in the isolated environment of the container. When connected,
only a subset of the ‘pouch’ utility commands is available (see Tab 1).

Output messages:
● “Pouch: { name } connecting. tty{n} connected” - The container was successfully
connected to the preallocated tty.
● “There is no container {name} in the starting stage. Pouch: operation failed.” - Trying
to connect to the container but no container with {name} identifier exists.

pouch disconnect
Name:
pouch disconnect - detaches the user terminal from a running container

Synopsis:
pouch disconnect { name }
{ name } - a container identification string

Description:
A user will be disconnected from a running container back to the console.

Output messages:
● “Pouch: {name} disconnecting. Console connected” - The container was successfully
disconnected, a console is now connected to the user terminal.

pouch destroy
Name:
pouch destroy - stops and destroys a running container identified by the identifier string

Synopsis:
pouch destroy { name }
{ name } - a container identification string
Description:
Stops and removes a running container from the system. Detaches the tty and removes the group
that corresponds to the {name} container from the cgroup filesystem. The command is available
only in detached mode.

Output messages:
● “There is no container {name} in a started stage. Pouch: operation failed.” - Trying to
destroy a non existing container.
● “Pouch: {name} destroying. Exiting container. Zombie!” - The container was
successfully removed from the system. xv6 prints “Zombie!” to the console when a
process is killed.

pouch info
Name:
pouch info - gets information about a container and its state

Synopsis:
pouch info { name }
{ name } - a container identification string

Description:
Pouch info gets information about a container and its state.

Output messages:
● “There is no container {name} in a started stage” - Trying to get information on a
non-existing container.

pouch list
Name:
pouch list - gets status information about all running containers

Synopsis:
pouch list all

Description:
Gets status information about all running containers. This command is available only in
the detached mode.

Output messages:
● “Pouch containers: None.” - There are no running containers.
pouch cgroup
Name:
pouch cgroup - limit, account or control resources associated with a container that is
specified with an identification string

Synopsis:
pouch cgroup { name } { state-object } [ value ]
{ name } - a container identification string
{ state-object } - specifies the state object name.
[value] - specify the value to assign to the state object. Note: xv6 shell doesn’t treat a
string with spaces enclosed by quotes as a single argument. Thus, multiple values
have to be separated using commas (see examples below).

Description:
Sets the value of a state-object (e.g. ‘cpu.max’) in the container’s cgroup for the
corresponding subsystem (e.g. ‘cpu’). The cpu controller is the only cgroup controller
verified at this stage. Refer to the chapter on cgroup for more information.

Output messages:
● “Incorrect cgroup object-state provided. Not applied.” - The given state-object
doesn’t exist. Refer to the chapter on cgroup and select one that is implemented.
● “There is no container: {name} in a started stage” - Trying to apply limitations on a
non existing container.

Examples:
“pouch cgroup c1 cpu.max 10000” - updates cpu.max property to 10000, leaving the
period default.
“pouch cgroup c1 cpu.max 10000,20000” - updates cpu.max property to 10000 and
sets period to 20000.
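
The two value forms from the examples above can be told apart with a small parser. The sketch below is an illustration under stated assumptions and is not the xv6 implementation: `struct cpu_max`, `cpu_max_parse`, `cpu_max_percent` and `CPU_MAX_UNLIMITED` are hypothetical names, and the caller is assumed to supply the current setting so that a missing period keeps its previous value.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define CPU_MAX_UNLIMITED (-1L)   /* hypothetical marker for the word "max" */

struct cpu_max {
    long max;     /* CPU time budget per period in microseconds, or CPU_MAX_UNLIMITED */
    long period;  /* period length in microseconds */
};

/* Parses "<max>" or "<max>,<period>", where <max> is a positive number or
 * the word "max". On success updates *out and returns 0. On malformed input
 * returns -1 and leaves *out untouched. */
int cpu_max_parse(const char *value, struct cpu_max *out)
{
    struct cpu_max tmp;
    const char *p = value;
    char *end;

    assert(value != 0 && out != 0);
    tmp = *out;                        /* keep the old period if none is given */
    if (strncmp(p, "max", 3) == 0) {
        tmp.max = CPU_MAX_UNLIMITED;
        p += 3;
    } else {
        tmp.max = strtol(p, &end, 10);
        if (end == p || tmp.max <= 0)
            return -1;
        p = end;
    }
    if (*p == ',') {                   /* optional ",<period>" part */
        p++;
        tmp.period = strtol(p, &end, 10);
        if (end == p || tmp.period <= 0)
            return -1;
        p = end;
    }
    if (*p != '\0')                    /* trailing garbage, e.g. a space */
        return -1;
    *out = tmp;
    return 0;
}

/* CPU share in percent implied by the setting; 100 when unlimited. */
long cpu_max_percent(const struct cpu_max *m)
{
    if (m->max == CPU_MAX_UNLIMITED || m->max >= m->period)
        return 100;
    return m->max * 100 / m->period;
}
```

The comma is used as the separator because, as noted in the synopsis, the xv6 shell does not pass a quoted string with spaces as a single argument.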

pouch help
“pouch --help” - displays all available pouch commands according to the mode
(attached/detached).
PID namespaces in xv6
PID namespaces facilitate the creation of an independent set of process IDs (PIDs) separated
from other namespaces in such a manner that processes inside the child PID namespace
are visible from the parent PID namespace but not vice versa.

In order to put a newly created process in a separate pid namespace, the unshare system
call must be called prior to the fork. The parameter that tells the unshare system call
that a PID namespace segregation is requested is PID_NS. In other words, the unshare system
call puts the calling process in a state where the namespace separation will take effect
for the child process upon the actual call to fork. The child forked after the call to
unshare(PID_NS) gets PID=1 in a newly created PID namespace and all its descendants will
belong to that namespace. PID namespaces in xv6 do not support nesting.

If a process in a PID namespace is voluntarily or involuntarily terminated while having live
descendants, they will be reparented to the process with PID=1. If the process with PID=1
dies, every other process in the pid namespace will be forcibly terminated and the
namespace will be cleaned up.
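
The two termination rules can be traced on a toy process table. This is a userspace simulation written for illustration only: `struct toy_proc`, `toy_exit` and `toy_count` are hypothetical names that do not appear in the xv6 kernel, and the table models a single PID namespace in which the array index is the namespace-local PID.

```c
#include <assert.h>

#define TOY_MAX_PROCS 8

/* Toy process table for one PID namespace. The index is the PID, so
 * slot 0 is unused and slot 1 is the init process of the namespace. */
struct toy_proc {
    int alive;
    int parent;   /* namespace-local PID of the parent, 0 for init */
};

/* Terminates `pid`. Children of an ordinary process are handed over to
 * PID 1; if PID 1 itself exits, every remaining process is terminated. */
void toy_exit(struct toy_proc tbl[TOY_MAX_PROCS], int pid)
{
    assert(pid >= 1 && pid < TOY_MAX_PROCS && tbl[pid].alive);
    tbl[pid].alive = 0;
    for (int i = 1; i < TOY_MAX_PROCS; i++) {
        if (!tbl[i].alive)
            continue;
        if (pid == 1)
            tbl[i].alive = 0;      /* init died: tear down the namespace */
        else if (tbl[i].parent == pid)
            tbl[i].parent = 1;     /* orphan: reparent to init */
    }
}

/* Number of live processes left in the namespace. */
int toy_count(const struct toy_proc tbl[TOY_MAX_PROCS])
{
    int n = 0;
    for (int i = 1; i < TOY_MAX_PROCS; i++)
        n += tbl[i].alive;
    return n;
}
```

For example, if PID 2 has children 3 and 4 and exits, both children end up with parent 1. If PID 1 exits afterwards, the count of live processes drops to zero.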

Creating a new PID namespace is fairly easy, as the practical use example below shows:

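// request a new PID namespace; it takes effect for the next forked child
if(unshare(PID_NS) != 0){
  printf(stderr, "Cannot create pid namespace\n");
  exit(1);
}

pid = fork();
if(pid == -1){
  printf(stderr, "FAILURE: fork\n");
  exit(1);
}

if(pid == 0)
  // child: sees its own PID inside the new namespace
  printf(stdout, "New namespace. PID=%d \n", getpid());
else
  // parent: sees the child's PID in the parent namespace
  printf(stdout, "Parent's perspective on the child. PID=%d \n", pid);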

Tab 2. Forking a process in a new PID namespace

Compiling and running the code snippet above demonstrates that the child process runs in a
“separate” hierarchy with PID=1, while from the parent process’s perspective the child’s
PID=4 (also see Pic. 1).
Mount namespaces in xv6
Mount namespaces facilitate the isolation of mount points. In order to achieve mountpoint
segregation, the unshare system call with the MOUNT_NS parameter must be called by the
process that wants to have a separate (hidden from other processes) view of mount
points. Descendants of that process inherit that separated view, preserving the
mountpoint segregation.

Let’s see how a new mount namespace is created in the practical use example below:

// create the file at path and write len bytes of str into it
void createNwrite(char *path, char *str, int len){
  int fd;

  // open returns a negative value on failure
  if((fd = open(path, O_CREATE|O_RDWR)) < 0){
    printf(stderr, "open failed\n");
    exit(1);
  }

  if(write(fd, str, len) != len){
    printf(stderr, "write failed\n");
    exit(1);
  }

  close(fd);
}

int main(void){
  // ****************************************************************************
  // create a child process with a separate mount namespace
  // create a mount point and mount on it a preformatted internal_fs_a
  // create a file on the mounted file system
  int pid = fork();

  if(pid < 0){
    printf(stderr, "fork failed\n");
    exit(1);
  }

  if(pid == 0){
    if(unshare(MOUNT_NS) != 0){
      printf(stderr, "Cannot create mount namespace\n");
      exit(1);
    }

    if(mkdir("dirA") != 0){
      printf(stderr, "mkdir failed\n");
      exit(1);
    }

    if(mount("internal_fs_a", "dirA", 0) != 0){
      printf(stderr, "mount failed\n");
      exit(1);
    }

    createNwrite("dirA/file.txt", "123456789\n", 10);
  }

  // ****************************************************************************
  if(pid > 0){
    // make sure child process runs first to create a new ns
    sleep(10000);

    // create a mount point and mount on it a preformatted internal_fs_b
    if(mkdir("dirB") != 0){
      printf(stderr, "mkdir failed\n");
      exit(1);
    }

    if(mount("internal_fs_b", "dirB", 0) != 0){
      printf(stderr, "mount failed\n");
      exit(1);
    }

    createNwrite("dirB/file.txt", "987654321\n", 10);
  }

  // ****************************************************************************
  // both processes will sleep for a while to enable each
  // other to reach this point
  sleep(10000);

  // ****************************************************************************
  // at this point it is guaranteed that the child process is able to access
  // dirA/file.txt while the parent process is able to access dirB/file.txt but
  // not vice versa. We just need to check it.
  if(pid == 0){
    if(open("dirA/file.txt", O_RDONLY) < 0){
      printf(stderr, "open was expected to succeed but failed\n");
      exit(1);
    }
    if(open("dirB/file.txt", O_RDONLY) >= 0){
      printf(stderr, "open was expected to fail but succeeded\n");
      exit(1);
    }
  }
  else{
    if(open("dirB/file.txt", O_RDONLY) < 0){
      printf(stderr, "open was expected to succeed but failed\n");
      exit(1);
    }
    if(open("dirA/file.txt", O_RDONLY) >= 0){
      printf(stderr, "open was expected to fail but succeeded\n");
      exit(1);
    }
  }

  exit(0);
}

Tab 3. Creating a new mount namespace

Compiling and running the code snippet above creates a mount namespace for the child
process by unsharing it from the global mount namespace. Once the namespace is created,
the child process mounts internal_fs_a, a device with a preformatted file system on it,
on the dirA mountpoint. The parent process mounts the internal_fs_b device on dirB
respectively.

At this stage the root directory contains the dirA and dirB subfolders. But only the child
process is able to see the file.txt that was created on the internal_fs_a filesystem, while
only the parent process is able to access the file.txt that was created on internal_fs_b.

After the call to unshare we would have:

Global mount namespace                      | New mount namespace
--------------------------------------------+--------------------------------------------
Mounted device    Mount point               | Mounted device    Mount point
internal_fs_b     dirB                      | internal_fs_a     dirA
The root file system contains an empty dirA | The root file system contains an empty dirB

Tab 4. Mount namespaces provide different “views”
cgroup mechanism in xv6
The cgroup mechanism in xv6 is a leaner version of its Linux counterpart that allows
processes to be organized into hierarchical groups. Resource usage can be limited and
monitored for each group of processes in the hierarchy. These groups are sometimes also
called cgroups. Each group has several resource controllers, which provide the means to
limit and account for different system resources. Resource controllers are also simply
called controllers, and are sometimes referred to as resource control subsystems or just
“subsystems”.

mount / umount cgroup pseudo-filesystem

The cgroup interface is provided through a pseudo-filesystem that, in the case of xv6, has
to be mounted on a pre-created mount point prior to its usage.

$mkdir cgroup
$mount /cgroup -t cgroup

Prior to unmounting the cgroup file system it is considered best practice to change
the cwd to somewhere outside of it. E.g.:

$cd /
$umount /cgroup

cgroup core and subsystems

Files prefixed with “cgroup.” (e.g cgroup.stat) belong to the core part of the cgroup
mechanism responsible for hierarchy organization. Files related to different subsystems start
with a controller name: “cpu.”, “memory.” etc. The following table summarizes control and
configuration options that xv6 core and subsystems supply.

Core part

cgroup.procs
    A read-write file that exists in every cgroup.
    When cgroup.procs is read, it lists the PIDs of all the processes which belong to
    the cgroup that contains it (see Pic 3).

cgroup.controllers
    A read-only file that exists in non-root cgroups.
    Location: inside non-root cgroups.
    It “contains” an unordered, space separated list of all controllers available to
    the cgroup.

cgroup.subtree_control
    A read-write file that exists in non-root cgroups.
    Location: inside non-root cgroups.
    The file is empty when the cgroup fs is mounted. A controller name prefixed with '+'
    or '-' can be written to the file in order to enable/disable controllers. When read,
    it shows a space separated list of the controllers which are enabled for the cgroup
    and all its descendants.

cgroup.events
    A read-only file that exists in non-root cgroups.
    Location: inside non-root cgroups.
    The file contents are key-value pairs (delimited by newline characters, with the key
    and value separated by spaces) providing state information about the cgroup:

    populated - The value of this key is 1 if this cgroup or any of its descendants has
    member processes, and 0 otherwise.

    frozen - The value of this key is 1 if this cgroup is currently frozen, or 0 if it
    is not.

cgroup.freeze
    A read-write file that exists in non-root cgroups.
    Location: inside non-root cgroups.
    Allowed values are “0” and “1” (“0” is the default).

    Writing “1” to the file causes freezing of the cgroup and all its descendants. This
    means that all belonging processes will be stopped and will not run until the cgroup
    is explicitly unfrozen. After freezing, the “frozen” value in the cgroup.events
    control file will be updated to “1”.

    A cgroup can be frozen either by its own settings, or by the settings of any ancestor
    cgroups. If any ancestor cgroup is frozen, the cgroup will remain frozen.

cgroup.max.descendants
    A read-write file that exists in every cgroup.
    Contains a number indicating the maximum allowed number of descendant cgroups. An
    attempt to create a new descendant cgroup in the hierarchy will fail if the number
    of descendants is going to exceed this maximum. This feature is not implemented (yet).

cgroup.max.depth
    A read-write file that exists in every cgroup.
    Contains a number indicating the maximum allowed descent depth under the current
    cgroup. An attempt to create a new descendant cgroup in the hierarchy will fail if
    the hierarchy depth is going to exceed it. This feature is not fully functional (yet).

cgroup.stat
    A read-only file that exists in every cgroup.
    It “contains” nr_descendants - the total number of descendant cgroups, and
    nr_dying_descendants - the total number of dying descendant cgroups. A cgroup becomes
    dying after being deleted by a user. The cgroup will remain in a dying state for some
    undefined time (which can depend on system load) before being completely destroyed.

CPU Subgroup

cpu.stat
    A read-only file that exists in non-root cgroups.
    Location: inside non-root cgroups.
    This file exists whether the controller is enabled or not.

    It always reports the following three stats:
    - usage_usec
    - user_usec
    - system_usec

    and the following three when the controller is enabled:
    - nr_periods
    - nr_throttled
    - throttled_usec

cpu.weight
    A read-write file that exists in non-root cgroups.
    Location: inside non-root cgroups.
    The feature is not functional (yet).

    A parent's resource is distributed by adding up the weights of all active children
    and giving each the fraction matching the ratio of its weight against the sum. All
    weights are in the range [1, 10000] with the default at 100.

cpu.max
    A read-write file that exists in non-root cgroups.
    Location: inside non-root cgroups.
    Contains 2 key-value entries: (max, max_value) and (period, period_value). The value
    of “max” indicates the maximal cpu time in microseconds that a group can consume
    during each period, where the period length is given by “period_value” (also
    measured in microseconds).

    Possible values that can be written to the file:
    - max,100000 - no limit over the 100000 μs periods
    - max - no limit over the previously specified period
    - 10000,20000 - limit the group to 50% of the CPU bandwidth

Memory subgroup

memory.current
    A read-only file that exists in non-root cgroups.
    It contains the total amount of memory currently being used by the cgroup and its
    descendants.

memory.max
    A read-write file that exists in non-root cgroups.
    The default is the maximum amount of memory a userland process can have. Used to set
    a memory usage hard limit. If a cgroup’s memory usage reaches this limit, processes
    that belong to the group will be denied any additional memory allocation.
Tab 4. cgroup mechanism controls and subsystems
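
The distribution rule described for cpu.weight is plain arithmetic and can be checked in isolation. The sketch below only illustrates the formula (child weight divided by the sum of the weights of all active children). It is not part of xv6, where the feature is not functional yet, and the function name `weight_share_bp` is hypothetical. The result is expressed in hundredths of a percent to stay within integer arithmetic.

```c
#include <assert.h>

/* Share of the parent's CPU resource, in hundredths of a percent, that
 * goes to child `idx` given the weights of all active children. */
long weight_share_bp(const int weights[], int n, int idx)
{
    long sum = 0;

    assert(n > 0 && idx >= 0 && idx < n);
    for (int i = 0; i < n; i++) {
        assert(weights[i] >= 1 && weights[i] <= 10000);  /* documented range */
        sum += weights[i];
    }
    return (long)weights[idx] * 10000 / sum;
}
```

With three active children weighted 100, 100 and 200, the third child receives 5000 hundredths of a percent (50%) and the other two receive 25% each.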

Understanding cgroup hierarchy

cgroups form a tree structure and every process in the system belongs to one and
only one cgroup. Upon creation, a new process is attached to the same cgroup that its
parent process belongs to. A process can be migrated to another cgroup, and the
migration doesn't affect the cgroup attachment of the process’ ancestors or
descendants. Subsystem controllers may be enabled or disabled for a cgroup. Enabling
or disabling a subsystem affects all processes that belong to the cgroup. xv6 supports
nested cgroups. A nested cgroup is created with the mkdir command. When a nested
cgroup is created, the files that are responsible for cgroup management (the core
part) are created along with the files of the subsystem controllers that were
enabled in the parent cgroup. Changes in a parent cgroup affect (or may influence)
its descendant (nested) cgroups. Note that at this stage the memory controller
functionality is limited and no min/max recalculations are performed across the
nested cgroup hierarchy.
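
One concrete case of a parent influencing its descendants is freezing: a cgroup counts as frozen if its own cgroup.freeze is set or if any ancestor is frozen. The following is a minimal sketch of that rule on a toy tree. `struct toy_cgroup` and `toy_is_frozen` are hypothetical names, and the structure holds only the fields needed for the illustration rather than the layout used by the xv6 kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal cgroup tree node: only what this illustration needs. */
struct toy_cgroup {
    struct toy_cgroup *parent;  /* NULL for the root cgroup */
    int freeze;                 /* value written to this group's cgroup.freeze */
};

/* A cgroup is effectively frozen if it or any of its ancestors has
 * freeze set. Walks from the group up to the root. */
int toy_is_frozen(const struct toy_cgroup *cg)
{
    assert(cg != NULL);
    for (; cg != NULL; cg = cg->parent)
        if (cg->freeze)
            return 1;
    return 0;
}
```

Walking up on every query keeps the sketch short. A kernel could instead propagate the state downwards when cgroup.freeze is written, which makes the query constant time at the cost of a tree traversal on each write.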

Pic 6. cgroup hierarchy


Practical use examples

The easiest way to understand cgroups is through some practical use examples as
described below.

Starting from a clean file system image

It is highly recommended to start from a clean file system since the cgroup implementation
does not have full test coverage.

From the host CLI:

#delete old fs.img
$make clean
#build clean fs.img
$make qemu

From the guest CLI:

$mkdir cgroup
$mount /cgroup -t cgroup
$cd /cgroup
$ls

Attaching a shell to newly created cgroup

#create an additional shell process
$sh
$cd cgroup
#observe processes attached to the root group: init (PID=1), sh (PID=2), new sh (PID=7)
#Note: PID=10 is the ‘cat’ that is still attached to the cgroup as a sh(PID=2) child.
$cat cgroup.procs

#create a nested group under the root
$mkdir group1
$ls group1
#observe that there is no process attached to the group1 (empty cgroup.procs)
$cd group1
$cat cgroup.procs
#attach new sh process (PID=7) to the group1
$ctrl_grp 7 cgroup.procs
#observe that the new shell was attached
$cat cgroup.procs

#observe that the new shell (PID=7) was detached from the root cgroup (/cgroup)
$cd ..
$cat cgroup.procs

Enable CPU controller

#As a follow up on the previous example, a cpu controller will be enabled for the group1.
$cd /cgroup/group1
#observe that cgroup.subtree_control is empty
$cat cgroup.subtree_control
$ctrl_grp +cpu cgroup.subtree_control
#observe that more cpu subsystem control files were added to the group1
$ls

#observe that cgroup.subtree_control indicates that cpu controller is enabled
$cat cgroup.subtree_control
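
The effect of writing ‘+cpu’ or ‘-cpu’ to cgroup.subtree_control can be modelled as setting or clearing a bit in a mask of enabled controllers. The sketch below illustrates that idea only: the bit constants and the function `subtree_control_apply` are hypothetical and are not the names used in the xv6 sources. The controller names “cpu” and “memory” are the two subsystems described earlier.

```c
#include <assert.h>
#include <string.h>

#define CTRL_CPU    (1u << 0)   /* hypothetical bit for the cpu controller */
#define CTRL_MEMORY (1u << 1)   /* hypothetical bit for the memory controller */

/* Applies one token such as "+cpu" or "-memory" to the mask of enabled
 * controllers. Returns 0 on success, -1 if the token has no +/- prefix or
 * names an unknown controller (the mask is then left unchanged). */
int subtree_control_apply(unsigned *enabled, const char *token)
{
    unsigned bit;

    assert(enabled != 0 && token != 0);
    if (token[0] != '+' && token[0] != '-')
        return -1;
    if (strcmp(token + 1, "cpu") == 0)
        bit = CTRL_CPU;
    else if (strcmp(token + 1, "memory") == 0)
        bit = CTRL_MEMORY;
    else
        return -1;
    if (token[0] == '+')
        *enabled |= bit;     /* enable the controller */
    else
        *enabled &= ~bit;    /* disable the controller */
    return 0;
}
```

Reading the file back then amounts to printing the name of every controller whose bit is set, separated by spaces.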
Restricting processes in a cgroup to use 50% of the CPU

#observe the default values in cpu.max
$cat cpu.max

#limit all processes attached to the group1 to use the CPU for 10000 out of every 20000 μs
$ctrl_grp 10000,20000 cpu.max
#make sure the setting was applied
$cat cpu.max
Technical (implementational) specification

General
This chapter is devoted to the implementation details. It mainly covers the amendments
required for C-type virtualization support in xv6, with an emphasis on the reasoning
behind the modifications made.

pouch - the command line tool for container management in xv6

tty devices
Upon xv6 boot, 3 tty devices are created with the mknod syscall. All devices are controlled
by the same device driver and have a common major device number. The major number is the
offset into the kernel’s device driver table, which tells the kernel what kind of device
driver to use. The minor number tells the kernel which particular device is to be accessed
and its special characteristics.
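
The role of the two numbers can be sketched as a table lookup followed by a call that passes the minor number through. This is a simplified userspace model and not the xv6 `devsw` definition: `struct toy_devsw`, `toy_table`, `toy_dev_write`, the table size and the major number chosen for ttys are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NDEV 10        /* hypothetical size of the driver table */
#define TOY_TTY_MAJOR 2    /* hypothetical major number shared by all ttys */

/* One driver entry: the handler receives the minor number so that a
 * single driver can serve several devices. */
struct toy_devsw {
    int (*write)(int minor, const char *buf, int n);
};

static int last_minor = -1;   /* records which tty was addressed last */

/* Toy tty driver: remembers the minor and pretends to write n bytes. */
static int toy_tty_write(int minor, const char *buf, int n)
{
    assert(n >= 0);
    (void)buf;
    last_minor = minor;
    return n;
}

/* Driver table indexed by major number; unused entries stay NULL. */
static struct toy_devsw toy_table[TOY_NDEV] = {
    [TOY_TTY_MAJOR] = { toy_tty_write },
};

/* Dispatch: the major selects the driver, the minor is handed to it. */
int toy_dev_write(int major, int minor, const char *buf, int n)
{
    if (major < 0 || major >= TOY_NDEV || toy_table[major].write == NULL)
        return -1;
    return toy_table[major].write(minor, buf, n);
}
```

Because all three ttys share one major number, they share one table entry, and only the minor number distinguishes tty0 from tty1 and tty2.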

As already mentioned, 3 tty devices were added. The tty devices are initialized in main.c
by ttyinit().

Pic. 7. NTTY=3 tty devices were added with the same functionality as a console device

ttywrite() and ttyread() functions are wrappers of the existing consolewrite()/consoleread()
functions. They implement the actual writing to and reading from the user terminal for the tty that is
active at the moment. A ‘flags’ property was added to ‘devsw’ to support different operations on tty
devices.

Operations on tty devices were defined in fcntl.h and the corresponding functions are
implemented in tty.c.
Pic. 8 tty operations

The ‘ioctl’ syscall was added to control tty devices. It allows connecting / disconnecting / attaching /
detaching tty devices as well as setting and getting their properties.

How the pouch utility manages containers


Containers in xv6 are identified by name (at least at this stage). The identifier name is used
to specify a container in (almost) all the commands that the pouch utility supplies.

Every container started with the pouch utility has a file /name where the name corresponds
to the container identification string as it was specified at the container creation stage. The
file records which tty the container is attached to and the PID of the process that
forked the shell running inside the container. Additionally, the tty.c{0|1|2} files each hold the
identification string of the container that is tied to the corresponding tty device.

For example, listing the files of the root (/) directory on xv6 right after the ‘pouch start
myFirstContainer’ command has completed successfully reveals that a file named
myFirstContainer was created. The content of myFirstContainer confirms that the
container is tied to tty0 and that the ID of the parent process that forked the shell inside the
container is 5 (see Pic. 9). In addition, tty.c0 contains the myFirstContainer identification string.
Pic. 9 tty.c0 and /myFirstContainer files used by the pouch utility

Processes running inside xv6 containers are organized by the pouch utility in a flat cgroup
hierarchy. Pouch mounts the cgroup fs on the /cgroup mountpoint, creates a directory
/cgroup/name for the container identified by the name identification string, and uses
the control means of the cgroup mechanism to allocate resources for the processes
running inside the container. The directory is removed from the cgroup hierarchy when the
container is destroyed. The pouch utility limits the container hierarchy to be flat, i.e. nesting is
not supported. The layout of /cgroup is depicted below, continuing the previous
example (myFirstContainer):

Pic. 10 myFirstContainer related files under cgroup fs


pouch start
The pouch start command translates to the pouch_fork function in pouch.c. The table below
describes the process of container creation:

1. Find an available tty device. If find_tty(tty_name) fails no more tty devices are available
and no more containers can be created.

2. Check if a container with the same name is already running. Try to
open(container_name). If the operation succeeds, the container cannot be created (either a
running container exists or a system cleanup is required).

3. Create a directory under /cgroup for the container using the create_pouch_cgroup
function.

4. Update the tty.cN using write_to_pconf function.

5. fork a child process that invokes a call to unshare(PID_NS) followed by another fork
that will create an sh process inside the container. Attach a tty and replace the executable
by calling attach_tty(tty_fd) and exec("sh", argv).

6. Write the PID of the sh process to cgroup.procs in the corresponding cgroup to attach
the process to the group.

7. Finally, update the /name file where name is the container identification string via the
call to the write_to_cconf function.
Tab. 5 pouch_fork described
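
The bookkeeping part of the steps above (steps 1, 2 and 4) can be sketched as an in-memory model. This is an illustration only, written in hosted C under stated assumptions: the tty_owner array stands in for the tty.c0..tty.c2 files, the name check is done against that array rather than by opening /name, and all model_* names and CNAME_MAX are hypothetical. Forking, unshare and cgroup setup are left out.

```c
#include <string.h>

#define NTTY      3    /* number of tty devices, as in the document */
#define CNAME_MAX 32   /* assumed maximum length of a container name */

/* Models tty.c0..tty.c2: an empty string means the tty is free. */
static char tty_owner[NTTY][CNAME_MAX];

/* Step 1: find an available tty, or -1 if all are taken. */
static int model_find_tty(void)
{
    for (int i = 0; i < NTTY; i++)
        if (tty_owner[i][0] == '\0')
            return i;
    return -1;
}

/* Step 2: check whether a container with this name is already running. */
static int model_is_running(const char *name)
{
    for (int i = 0; i < NTTY; i++)
        if (strcmp(tty_owner[i], name) == 0)
            return 1;
    return 0;
}

/* Returns the tty index assigned to the new container, or -1 on failure. */
int model_start(const char *name)
{
    int tty = model_find_tty();

    if (tty < 0)
        return -1;
    if (name[0] == '\0' || strlen(name) >= CNAME_MAX || model_is_running(name))
        return -1;
    strcpy(tty_owner[tty], name);   /* step 4: record the name for this tty */
    return tty;
}

/* Frees the tty slot held by the named container; -1 if it is not running. */
int model_destroy(const char *name)
{
    if (name[0] == '\0')
        return -1;
    for (int i = 0; i < NTTY; i++)
        if (strcmp(tty_owner[i], name) == 0) {
            tty_owner[i][0] = '\0';
            return 0;
        }
    return -1;
}
```

With NTTY=3 the model accepts three distinct names, rejects a duplicate name and a fourth container, and accepts a new container again once one slot has been freed.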

pouch connect
pouch connect command simply translates to the call of the connect_tty function that
eventually causes the appropriate devsw device to be connected (and others be
disconnected).

Pic. 11 mark devsw for the corresponding tty as connected


pouch disconnect
pouch disconnect command translates to the call of the disconnect_tty function that causes
the devsw[CONSOLE] to be connected back and disconnects the devsw for the tty that was
associated with a container.

Pic. 12 mark devsw for the CONSOLE as connected
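
The connect and disconnect behaviour described above amounts to keeping exactly one device marked as connected. A minimal model in hosted C follows; the flag name DEV_CONNECT, the slot layout (slot 0 for the console, slots 1..3 for the ttys) and the model_* function names are assumptions made for illustration and are not taken from tty.c.

```c
#define NTTY         3
#define CONSOLE_SLOT 0
#define NSLOTS       (NTTY + 1)   /* console plus the tty devices */
#define DEV_CONNECT  0x1          /* hypothetical "connected" flag bit */

/* Per-device flags; the console starts out connected. */
static int dev_flags[NSLOTS] = { DEV_CONNECT };

/* Mark one slot as connected and every other slot as disconnected. */
static void connect_only(int slot)
{
    for (int i = 0; i < NSLOTS; i++)
        dev_flags[i] &= ~DEV_CONNECT;
    dev_flags[slot] |= DEV_CONNECT;
}

/* Models pouch connect: route the terminal to the given tty (0..NTTY-1). */
int model_connect_tty(int tty)
{
    if (tty < 0 || tty >= NTTY)
        return -1;
    connect_only(tty + 1);
    return 0;
}

/* Models pouch disconnect: route the terminal back to the console. */
void model_disconnect_tty(void)
{
    connect_only(CONSOLE_SLOT);
}

/* Returns the slot that is currently connected, or -1 if none is. */
int model_connected_slot(void)
{
    for (int i = 0; i < NSLOTS; i++)
        if (dev_flags[i] & DEV_CONNECT)
            return i;
    return -1;
}
```

Clearing all flags before setting one keeps the invariant that reads and writes reach a single active device at any moment.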

pouch destroy
The pouch destroy command translates to the several operations described below:

1. Attaching the process with the PID number that was obtained by the
read_from_cconf(container_name, ...) call to the root cgroup
(/cgroup/cgroup.procs) and terminating the process (kill(pid))
2. Removing the /container_name file (from the root directory)
3. Removing the directory corresponding to the container cgroup from the cgroup fs
by calling to unlink(cg_cname).
4. Removing the container name from the corresponding /tty.cN file (the call to the
remove_from_pconf(tty_name) function).
5. Finally, detaching the corresponding tty (the call to the detach_tty(tty_fd) function).
if(remove_from_pconf(tty_name) < 0)
return -1;
Tab. 6 pouch destroy described

pouch info
pouch info command can be invoked both in attached and detached stages. The command
call translates to the print_cinfo function in pouch.c. The information is obtained according to
the following description, where the name stands for a container identification string.
Information              Source
tty and pid              From /name
cpu.max                  From /cgroup/name/cpu.max
cpu.stat                 From /cgroup/name/cpu.stat
connected/disconnected   Based on the parent process ID


Tab. 7 Source of information for ‘pouch info’ command

Pouch info output for cccc container is illustrated below.


Pic 13. Pouch info - an example

pouch list
The pouch list command reads the tty.c{0|1|2} files to obtain information about the
running containers. It relies on the assumption that the identification string of every running
container is held by a tty.cN file. Although running containers belong to separate mount
namespaces, they still share the root from the global mount namespace.

pouch cgroup
pouch cgroup command call translates to the pouch_limit_cgroup function in pouch.c. The
table below describes how it works:

1. Open /name where name stands for a container identification string. If open fails,
print error message indicating that there is no running container identified as name.
2. Open /cgroup/name (name is a container identification string)
3. Write the limitation to the appropriate file under /cgroup/name. The limitation and
the filename are passed as pouch cgroup command parameters.

Tab. 7 how pouch cgroup works

For example, the command pouch cgroup myFirstContainer cpu.max 10000,20000
translates to a write operation that puts the “10000,20000” string into
/cgroup/myFirstContainer/cpu.max, and the command pouch cgroup myFirstContainer
cpu.weight 50 translates to a write operation that puts the “50” string into
/cgroup/myFirstContainer/cpu.weight.
Note that, according to the functional specification of the cgroup mechanism, the cpu.max and
cpu.weight files appear only if the cpu controller is enabled. The function create_pouch_cgroup
enables the cpu controller for every container launched by the pouch utility.
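
The mapping from the command parameters to the target file can be sketched as a path construction. The helper below is hypothetical and written in hosted C (snprintf is used for brevity; the xv6 user library may not provide it), so it shows the naming scheme rather than the code in pouch.c.

```c
#include <stdio.h>

/* Hypothetical helper: build "/cgroup/<name>/<file>" for a container.
 * Returns 0 on success and -1 if the result does not fit into out. */
int cgroup_file_path(char *out, size_t outsz, const char *name, const char *file)
{
    int n = snprintf(out, outsz, "/cgroup/%s/%s", name, file);

    if (n < 0 || (size_t)n >= outsz)
        return -1;
    return 0;
}
```

For the container myFirstContainer and the file cpu.max the helper produces /cgroup/myFirstContainer/cpu.max, the file that receives the “10000,20000” string in the example above.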

namespaces in xv6
The implementation of namespaces in xv6 resembles the way they are implemented in Linux. struct proc, the
xv6 counterpart of Linux’s task_struct, holds a pointer to the namespace proxy
object, struct nsproxy *nsproxy, which contains references to the namespaces that the respective
process belongs to:

Pic. 14 struct proc (proc.h) and struct nsproxy (namespace.h)

xv6 has an upper limit of NNAMESPACE namespaces that can be created in the system.
The global namespacetable (defined in namespace.c) holds the information of all the xv6
namespaces. Access to the namespacetable is protected by a spinlock. struct nsproxy
contains a reference counter and points to struct mount_ns and struct pid_ns.
Pic. 15 namespaces data structures in xv6

PID namespaces in xv6


A new PID namespace is created by the unshare(PID_NS) system call. The calling process
gets a new PID namespace for its children, which is not shared with any previously existing
process. The calling process itself is not moved into the new namespace. The first child created
by the calling process will have the process ID 1 and will assume the role of the init process in
the new namespace. The pseudocode of the unshare function is:

● Reserve a row for a new namespace in the global namespacetable using the
allocnsproxyinternal function. If the number of namespaces exceeds
NNAMESPACE the call results in ENOMEM (see footnote 1).
● Increase the reference count of the mount_ns and pid_ns structs. Note that
myproc()->nsproxy points to the same namespaces (just increased the count).
● Reserve a new pid namespace (pid_ns_new function) and update the
myproc()->child_pid_ns field to ensure that all calling process children will execute
in a newly created namespace.

Tab. 8 how unshare works for PID ns

1 Currently, if the number of namespaces exceeds NNAMESPACE the call results in a kernel panic. The
problem is reported in https://trello.com/c/4TN0ovsq/80-maman-12-system-call-error-conidtions.
The pid_ns_new function reserves a row in a pidnstable (pid_ns.c). Actually all pid_ns
structs are preallocated and as Pic 15 depicts, nsproxy[i] simply holds a pointer to the
specific row in a global pidnstable.
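
The unshare steps for a PID namespace and the use of preallocated tables can be modelled as follows. This is a sketch in hosted C under stated assumptions, not the kernel code: locking is omitted, a row with a zero reference count is treated as free, the release of the previous nsproxy is left out, failure is reported as -1, and the names proc_model, alloc_nsproxy, alloc_pid_ns and model_unshare_pid are hypothetical.

```c
#include <stddef.h>

#define NNAMESPACE 20

struct pid_ns   { int ref; };
struct mount_ns { int ref; };
struct nsproxy  { int ref; struct mount_ns *mount_ns; struct pid_ns *pid_ns; };

/* Only the fields of the per-process state that matter here. */
struct proc_model { struct nsproxy *nsproxy; struct pid_ns *child_pid_ns; };

/* Preallocated tables; a row with ref == 0 is considered free. */
static struct nsproxy nsproxy_table[NNAMESPACE];
static struct pid_ns  pid_ns_table[NNAMESPACE];

static struct nsproxy *alloc_nsproxy(void)
{
    for (int i = 0; i < NNAMESPACE; i++)
        if (nsproxy_table[i].ref == 0) {
            nsproxy_table[i].ref = 1;
            return &nsproxy_table[i];
        }
    return NULL;
}

static struct pid_ns *alloc_pid_ns(void)
{
    for (int i = 0; i < NNAMESPACE; i++)
        if (pid_ns_table[i].ref == 0) {
            pid_ns_table[i].ref = 1;
            return &pid_ns_table[i];
        }
    return NULL;
}

/* Models unshare(PID_NS): 0 on success, -1 when a table is exhausted. */
int model_unshare_pid(struct proc_model *p)
{
    struct nsproxy *np = alloc_nsproxy();      /* reserve a proxy row */
    struct pid_ns *child;

    if (np == NULL)
        return -1;
    child = alloc_pid_ns();                    /* reserve a pid namespace row */
    if (child == NULL) {
        np->ref = 0;                           /* roll back the proxy row */
        return -1;
    }
    np->mount_ns = p->nsproxy->mount_ns;       /* same namespaces as before, */
    np->pid_ns = p->nsproxy->pid_ns;           /* only the counts increase   */
    np->mount_ns->ref++;
    np->pid_ns->ref++;
    p->nsproxy = np;
    p->child_pid_ns = child;                   /* children start in the new ns */
    return 0;
}
```

The caller keeps its own pid_ns pointer, which reflects the statement that the calling process is not moved into the new namespace; only child_pid_ns refers to the freshly reserved row.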

To complete the picture, the changes required to make the fork, kill and wait functions PID
namespace aware need to be mentioned. kill and wait only operate on the pid that is
visible in the namespace. fork creates a new process as a PID namespace leader (init
role) if myproc()->child_pid_ns was set by the unshare system call prior to fork.

fork
The changes required in fork are related to the implementation of process ID mapping. xv6 PID
namespaces support up to 4 nested namespaces. The struct pid_entry pids[4]
field in the per-process state describes the mapping. Let’s see how nesting is implemented
based on the following example:

Pic 16. Another level of a nested namespace (reiterated based on Pic1)

As one can observe, process IDs in the third namespace start from PID=1. However, in the
namespace at the second level the same process is known as PID=4, while in the parent PID
namespace it holds PID=11. Pic. 16 describes how the array pids[4] of struct pid_entry holds these
numbers.

Pic 16. How struct pid_entry pids is used to implement nested PID namespaces
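
The mapping in the example above can be reproduced with a small model. It is a sketch in hosted C, not the xv6 implementation: the field layout of struct pid_entry and struct pid_ns, the ordering of the pids array (innermost namespace first) and the names model_assign_pids and model_get_pid_for_ns are assumptions made for illustration.

```c
#include <stddef.h>

#define MAX_PID_NS_DEPTH 4

struct pid_ns {
    struct pid_ns *parent;   /* enclosing namespace, NULL for the outermost */
    int next_pid;            /* next free pid inside this namespace */
};

struct pid_entry {
    struct pid_ns *pid_ns;
    int pid;
};

struct proc_model {
    struct pid_entry pids[MAX_PID_NS_DEPTH];
    int depth;               /* number of valid entries in pids */
};

/* Give the process one pid in its own namespace and in every ancestor. */
void model_assign_pids(struct proc_model *p, struct pid_ns *ns)
{
    int i = 0;

    for (struct pid_ns *cur = ns; cur != NULL && i < MAX_PID_NS_DEPTH;
         cur = cur->parent, i++) {
        p->pids[i].pid_ns = cur;
        p->pids[i].pid = cur->next_pid++;
    }
    p->depth = i;
}

/* Return the pid of p as seen from ns, or 0 if p is not visible there. */
int model_get_pid_for_ns(const struct proc_model *p, const struct pid_ns *ns)
{
    for (int i = 0; i < p->depth; i++)
        if (p->pids[i].pid_ns == ns)
            return p->pids[i].pid;
    return 0;
}
```

With three nested namespaces whose next free pids are 11, 4 and 1, a process created in the innermost one is seen as PID=1 there, PID=4 one level up and PID=11 in the outermost namespace, matching the numbers in the example. A lookup from an unrelated namespace yields 0 in this model.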

To make fork PID namespaces aware the following changes were introduced (red color is
used to indicate completely new code, orange color is used for partially overlapped lines):

xv6-public fork - pseudocode

● Set struct proc current to point the current process
● Allocate process with allocproc for a child
● Copy process state from proc. Given a parent process's page table, create a copy of it
for a child. Update a parent process, state and stack for a child process.
● Clear %eax so that fork returns 0 in the child.
● For every open file in the parent process Increment ref count using filedup
● Update cwd inode for a new process using idup
● Copy parent’s name[16] to the child’s proc struct proc using safestrcpy (name is used for
debugging purposes only)
● Set pid to be np->pid
● Update np->state to be RUNNABLE (ptable.lock has to be acquired)
● Return pid

PID namespace aware fork - pseudocode

● Fail if curproc->child_pid_ns && curproc->child_pid_ns->pid1_ns_killed
● Check if cgroup limit was reached when pid controller is enabled
● Check if cgroup reached its memory limit when memory controller is enabled
● Set struct proc current to point the current process
● Allocate process with allocproc for a child
● Copy process state from proc. Given a parent process's page table, create a copy of it
for a child. Update a parent process, state and stack for a child process.
● Clear %eax so that fork returns 0 in the child.
● For every open file in the parent process Increment ref count using filedup
● Update cwd inode for a new process using idup
● Copy cwdp from the parent using safestrcpy
● Increase reference to the curproc->cwdmount using mntdup
● Update np->nsproxy
● For each one of MAX_PID_NS_DEPTH pid_ns update the corresponding pid (see pic 16)
● Copy parent’s name[16] to the child’s proc struct proc using safestrcpy (name is used for
debugging purposes only)
● Set pid according to the namespace using get_pid_for_ns
● Update np->state to be RUNNABLE (ptable.lock has to be acquired)
● Return pid

Mount namespaces in xv6
Mount namespaces facilitate the isolation of mount points. A new mount namespace is created
by the unshare(MOUNT_NS) system call. The calling process has a new mount namespace
for its children which is not shared with any previously existing process. To handle mount
namespaces, the pseudocode of the unshare function described in Tab. 8 is amended with
the following step:

● Create a new mount namespace (copymount_ns function) and update the
myproc()->nsproxy->mount_ns field to ensure the calling process has a
segregated view on mountpoints.

Tab. 9 mount ns amendment to the unshare system call

The copymount_ns function reserves a row in a mountnstable (mount_ns.c). Actually all
mount_ns structs are preallocated and, as Pic 15 depicts, nsproxy[i] simply holds a pointer to
the specific row in a global mountnstable.

Mountpoint related structures in xv6


Mount namespaces in xv6 are preallocated and accessible via the global mountnstable:
#define NNAMESPACE 20 // maximum number of namespaces

struct {
struct spinlock lock;
struct mount_ns mount_ns[NNAMESPACE];
} mountnstable;

The following diagram depicts what happens in the xv6 kernel when the command line
mount utility is invoked on a preformatted file system.
Pic 17. Mount system call trace (see footnote 2)

The fsinit method reads a superblock from a device into a preallocated slot of
superblocks.

Pic 18. xv6 supports up to NIDEDEVS IDE devices and up to NLOOPDEVS loopback
devices (a preformatted internal_fs_a/b/c file can be mounted as a block device)

2 Probably getorallocatedevice would better describe what the getorcreatedevice method aims to do, since it
operates on a preallocated dev_holder struct that holds superblocks for xv6 devices.
Sources and other learning resources
1. cgroups(7) - Linux manual page
https://man7.org/linux/man-pages/man7/cgroups.7.html

2. cgroupv2: Linux's new unified control group hierarchy (QCON London 2017)
https://www.youtube.com/watch?v=ikZ8_mRotT4

3. Containerization Mechanisms: Cgroups
https://blog.selectel.com/containerization-mechanisms-cgroups/

4. Containerization Mechanisms: Namespaces
https://blog.selectel.com/containerization-mechanisms-namespaces/

5. Control Group v2
https://www.kernel.org/doc/Documentation/cgroup-v2.txt

6. Relationships Between Subsystems, Hierarchies, Control Groups and Tasks
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-relationships_between_subsystems_hierarchies_control_groups_and_tasks

7. https://stackoverflow.com/questions/49971604/how-does-xv6-write-to-the-terminal
