2016-12-14

Introspecting namespace relationships

One of the interesting new features added in the just-released Linux 4.9 kernel is the ability to introspect namespace relationships. Two kinds of relationship can be discovered: the parent-child relationships for hierarchical namespace types (i.e. PID namespaces and user namespaces), and the ownership relationship between a non-user namespace and its associated user namespace.

There are various uses for this sort of introspection. One is to answer the question: what capabilities does process X have in namespace Y? The rules that determine the answer to that question have been documented in the user_namespaces(7) manual page for quite a while, but until now, there was no way of empirically answering that question with respect to a particular process and a particular namespace on a running system. This changes in Linux 4.9, thanks to work that Andrei Vagin did after I asked about this possibility on the Linux kernel mailing list back in July.

The solution, suggested by Eric Biederman, is rather elegant (even if implemented as ioctl() operations), and is based on returning file descriptors referring to objects in the (unmounted) namespace filesystem (NSFS). Given a file descriptor, fd, that refers to one of the /proc/PID/ns/xxxx symbolic links, two operations can be performed:

  • ioctl(fd, NS_GET_USERNS): Returns a file descriptor that refers to the owning user namespace for the namespace referred to by fd.
  • ioctl(fd, NS_GET_PARENT): Returns a file descriptor that refers to the parent namespace of the namespace referred to by fd. This operation can be applied only to hierarchical namespaces (PID namespaces and user namespaces). The operation fails with the error EPERM if the parent namespace is outside the namespace scope of the caller. This might be the case if, for example, the parent of a PID namespace is an ancestor namespace of the caller's PID namespace. The same error occurs when trying to find the parent of the initial PID or user namespace. When working our way backward through the chain of ancestors of a namespace, this fact can be used to determine whether we have reached the initial namespace.

By applying fstat() to a file descriptor returned by either of these operations, we can discover the device ID and inode number of the NSFS object referred to by the descriptor. By comparing these two values with the values for another namespace file descriptor, we can determine whether the two file descriptors refer to the same namespace.
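
As a rough illustration of the two operations and the fstat() comparison, here is a minimal Go sketch. It asks which user namespace owns the caller's network namespace, and then checks whether that is the caller's own user namespace. The helper names (nsID, getNsID, nsIoctl, fatal) are invented for this example, the choice of /proc/self/ns/net is arbitrary, and the request number for NS_GET_USERNS is assumed to be 0xb701, as defined in <linux/nsfs.h>.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// ioctl() request numbers, as defined in <linux/nsfs.h>
const (
	NS_GET_USERNS = 0xb701
	NS_GET_PARENT = 0xb702
)

// nsID identifies a namespace by the device ID and inode number
// of its NSFS object (hypothetical helper type for this sketch).
type nsID struct {
	dev, ino uint64
}

// getNsID applies fstat() to a namespace file descriptor.
func getNsID(fd int) (nsID, error) {
	var sb syscall.Stat_t
	if err := syscall.Fstat(fd, &sb); err != nil {
		return nsID{}, err
	}
	return nsID{uint64(sb.Dev), uint64(sb.Ino)}, nil
}

// nsIoctl performs NS_GET_USERNS or NS_GET_PARENT on 'fd' and returns
// the resulting file descriptor, which the caller must close.
func nsIoctl(fd int, op uintptr) (int, error) {
	r, _, e := syscall.Syscall(syscall.SYS_IOCTL, uintptr(fd), op, 0)
	if e != 0 {
		return -1, e
	}
	return int(r), nil
}

func fatal(msg string, err error) {
	fmt.Fprintln(os.Stderr, msg+":", err)
	os.Exit(1)
}

func main() {
	netFD, err := syscall.Open("/proc/self/ns/net", syscall.O_RDONLY, 0)
	if err != nil {
		fatal("open /proc/self/ns/net", err)
	}
	defer syscall.Close(netFD)

	userFD, err := syscall.Open("/proc/self/ns/user", syscall.O_RDONLY, 0)
	if err != nil {
		fatal("open /proc/self/ns/user", err)
	}
	defer syscall.Close(userFD)

	// Get a descriptor for the user namespace that owns our network namespace
	ownerFD, err := nsIoctl(netFD, NS_GET_USERNS)
	if err != nil {
		fatal("ioctl(NS_GET_USERNS)", err)
	}
	defer syscall.Close(ownerFD)

	owner, err := getNsID(ownerFD)
	if err != nil {
		fatal("fstat owner", err)
	}
	mine, err := getNsID(userFD)
	if err != nil {
		fatal("fstat user", err)
	}

	fmt.Println("owner of network namespace:", owner)
	fmt.Println("same as our user namespace:", owner == mine)
}
```

Because nsID contains only two integer fields, two IDs can be compared directly with ==, which is also what makes such a value usable as a map key.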

Another possible use of this feature is to introspect across all processes on the system to discover the PID and user namespace hierarchies on a live system. (And also to discover the relationship of non-user namespaces to their owning user namespaces.)
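
To give a feel for how a hierarchy can be walked, the following sketch follows the chain of PID namespace ancestors for one process, stopping when NS_GET_PARENT fails with EPERM. It is only an illustration under some assumptions: the function name ancestorChain and the nsID type are made up here, and the program cannot tell whether EPERM means "this is the initial namespace" or "the parent is outside our namespace scope" (when run with privilege in the initial namespaces, it should be the former).

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

const NS_GET_PARENT = 0xb702 // from <linux/nsfs.h>

// nsID is a hypothetical helper type: device ID plus inode number.
type nsID struct {
	dev, ino uint64
}

// ancestorChain returns the ID of the namespace referred to by 'fd',
// followed by the IDs of its visible ancestors, nearest first.
// The caller's descriptor is left open; descriptors obtained along
// the way are closed here.
func ancestorChain(fd int) ([]nsID, error) {
	var chain []nsID
	cur := fd
	for {
		var sb syscall.Stat_t
		if err := syscall.Fstat(cur, &sb); err != nil {
			if cur != fd {
				syscall.Close(cur)
			}
			return nil, err
		}
		chain = append(chain, nsID{uint64(sb.Dev), uint64(sb.Ino)})

		r, _, e := syscall.Syscall(syscall.SYS_IOCTL,
			uintptr(cur), NS_GET_PARENT, 0)
		if cur != fd {
			syscall.Close(cur)
		}

		switch e {
		case 0:
			cur = int(r) // Continue with the parent namespace
		case syscall.EPERM:
			return chain, nil // No (visible) parent: end of the chain
		default:
			return nil, e // For example, ENOTTY on older kernels
		}
	}
}

func main() {
	// Optional argument: a PID; the default is the calling process
	path := "/proc/self/ns/pid"
	if len(os.Args) > 1 {
		path = "/proc/" + os.Args[1] + "/ns/pid"
	}

	fd, err := syscall.Open(path, syscall.O_RDONLY, 0)
	if err != nil {
		fmt.Fprintln(os.Stderr, "open:", err)
		os.Exit(1)
	}
	defer syscall.Close(fd)

	chain, err := ancestorChain(fd)
	if err != nil {
		fmt.Fprintln(os.Stderr, "ancestorChain:", err)
		os.Exit(1)
	}
	for depth, id := range chain {
		fmt.Printf("%d level(s) up: %v\n", depth, id)
	}
}
```

The same loop should work for user namespaces if the path is changed to end in /ns/user, since NS_GET_PARENT applies to both hierarchical namespace types.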

The following Go program provides an example of such introspection. It inspects the /proc/PID/ns/user files for all processes on the system and builds up a map of the user namespace hierarchy along with the processes that reside in each namespace.

The program is fairly well commented, so without further explanation, I'll just present the code. (I should add that this is my first attempt at using Go, a nice language, so the code may not be idiomatic and may also contain some errors, but it should serve to illustrate what's going on.) An example run is shown below. The program code can be found in the code tarball available for download on my website.

 /* userns_overview.go  
   
   Display a hierarchical view of the user namespaces on the  
   system along with the member processes for each namespace.  
   This requires features new in Linux 4.9. See the  
   namespaces(7) man page.  
   (http://man7.org/linux/man-pages/man7/namespaces.7.html)
 */  
   
 package main  
   
 import (  
     "fmt"  
     "io/ioutil"  
     "os"  
     "sort"  
     "strconv"  
     "strings"  
     "syscall"  
     "unsafe"  
 )  
   
 // A namespace is identified by device ID and inode number  
   
 type NamespaceID struct {  
     device  uint64 // dev_t  
     inode_num uint64 // ino_t  
 }  
   
 // A namespace has associated attributes: a set of  
 // child namespaces and a set of member processes  
   
 type NamespaceAttribs struct {  
     children []NamespaceID // Child namespaces  
     pids   []int     // Member processes  
 }  
   
 // The following map records all of the namespaces that  
 // we find on the system  
   
 var NSList = make(map[NamespaceID]*NamespaceAttribs)  
   
 // Along the way, we'll discover the ancestor of all user  
 // namespaces (the root of the user namespace hierarchy).  
   
 var initialNS NamespaceID  
   
 // AddNamespace adds a PID to the list of PIDs associated with  
 // the user namespace referred to by 'namespaceFD'.  
 //  
 // The set of namespaces is recorded in the 'NSList' map.  
 // If the map does not yet contain an entry corresponding to  
 // 'namespaceFD', then an entry is created. This process is  
 // recursive: if the parent of the user namespace referred  
 // to by 'namespaceFD' does not have an entry in 'NSList'  
 // then an entry is created for the parent, and the namespace  
 // referred to by 'namespaceFD' is made a child of that namespace.  
 //  
 // When called recursively to create the ancestor namespace  
 // entries, this function is called with 'pid' as -1, meaning  
 // that no PID needs to be added for this namespace entry.  
 //  
 // The return value of the function is the ID of the namespace  
 // entry (i.e., the device ID and inode number corresponding to  
 // the user namespace file referred to by 'namespaceFD').  
   
 func AddNamespace(namespaceFD int, pid int) NamespaceID {  
     const NS_GET_PARENT = 0xb702 // ioctl() to get namespace parent  
     var sb syscall.Stat_t  
     var err error  
   
     // Obtain the device ID and inode number of the namespace  
     // file. These values together form the key for the 'NSList'  
     // map entry.  
   
     err = syscall.Fstat(namespaceFD, &sb)  
     if err != nil {  
         fmt.Println("syscall.Fstat(): ", err)  
         os.Exit(1)  
     }  
   
     ns := NamespaceID{sb.Dev, sb.Ino}
   
     if _, fnd := NSList[ns]; fnd {  
   
         // Namespace already exists; nothing to do  
   
     } else {  
   
         // Namespace entry does not yet exist; create it  
   
         np := new(NamespaceAttribs)  
         NSList[ns] = np  
   
         // Get file descriptor for parent user namespace  
   
         r, _, e := syscall.Syscall(syscall.SYS_IOCTL,  
             uintptr(namespaceFD), uintptr(NS_GET_PARENT), 0)  
         parentFD := (int)((uintptr)(unsafe.Pointer(r)))  
   
         if parentFD == -1 {  
             switch e {
             case syscall.EPERM:
                 // This is the initial NS; remember it
                 initialNS = ns
             case syscall.ENOTTY:
                 fmt.Println("This kernel doesn't support " +
                     "namespace introspection")
                 os.Exit(1)  
             default:  
                 // Unexpected error; bail  
                 fmt.Println("ioctl()", e)  
                 os.Exit(1)  
             }  
   
         } else {  
   
             // We have a parent user namespace; make sure it  
             // has an entry in the map. No need to add any  
             // PID for the parent entry.  
   
             par := AddNamespace(parentFD, -1)  
   
             // Make the current namespace entry ('ns') a child of  
             // the parent namespace entry  
   
             NSList[par].children = append(NSList[par].children, ns)  
   
             syscall.Close(parentFD)  
         }  
     }  
   
     // Add PID to PID list for this namespace entry  
   
     if pid > 0 {  
         NSList[ns].pids = append(NSList[ns].pids, pid)  
     }  
   
     return ns  
 }  
   
 // ProcessProcFile processes a single /proc/PID entry, creating  
 // a namespace entry for this PID's /proc/PID/ns/user file  
 // (and, as necessary, namespace entries for all ancestor namespaces  
 // going back to the initial user namespace).  
 // 'name' is the name of a PID directory under /proc.  
   
 func ProcessProcFile(name string) {  
     var namespaceFD int  
     var err error  
   
     // Obtain a file descriptor that refers to the user namespace  
     // of this process  
   
     namespaceFD, err = syscall.Open("/proc/"+name+"/ns/user",
         syscall.O_RDONLY, 0)

     if err != nil {
         if err == syscall.ENOENT {
             // Process terminated after we listed /proc; skip it
             return
         }
         fmt.Println("Open: ", err)
         os.Exit(1)
     }
   
     pid, _ := strconv.Atoi(name)  
   
     AddNamespace(namespaceFD, pid)  
   
     syscall.Close(namespaceFD)  
 }  
   
 // DisplayNamespaceTree() recursively displays the namespace  
 // tree rooted at 'ns'. 'level' is our current level in the  
 // tree, and is used for producing suitably indented output.  
   
 func DisplayNamespaceTree(ns NamespaceID, level int) { 
     prefix := strings.Repeat(" ", level*4)  
   
     // Display the namespace ID (device ID + inode number)  
   
     fmt.Print(prefix)  
     fmt.Println(ns)  
   
     // Print a sorted list of the PIDs that are members of this  
     // namespace. We do a bit of a dance here to produce a list  
     // of PIDs that is suitably wrapped, rather than a long  
     // single-line list.  
   
     sort.Ints(NSList[ns].pids)  
     base := len(prefix) + 25  
     col := base  
     for i, p := range NSList[ns].pids {  
         if i == 0 || col >= 80 && col > base+32 {  
             col = base  
             if i > 0 {  
                 fmt.Println()  
             }  
             fmt.Print(prefix)  
             fmt.Print("      ")  
             if i == 0 {  
                 fmt.Print("PIDs: ")  
             } else {  
                 fmt.Print("   ")  
             }  
         }  
         fmt.Print(strconv.Itoa(p) + " ")  
         col += len(strconv.Itoa(p)) + 1  
     }  
     fmt.Println()  
   
     // Recursively display the children namespaces  
   
     for _, v := range NSList[ns].children {  
         DisplayNamespaceTree(v, level+1)  
     }  
 }  
   
 func main() {  
   
     // Fetch a list of files from /proc  
   
     files, err := ioutil.ReadDir("/proc")  
     if err != nil {  
         fmt.Println("ioutil.ReadDir(): ", err)
         os.Exit(1)  
     }  
   
     // Process each /proc/PID (PID starts with a digit)  
   
     for _, f := range files {  
         if f.Name()[0] >= '0' && f.Name()[0] <= '9' {  
             ProcessProcFile(f.Name())  
         }  
     }  
   
     // Display the namespace tree rooted at the initial  
     // user namespace  
   
     DisplayNamespaceTree(initialNS, 0)  
 }  

The following (abbreviated) output shows what happens when we run the program on a system where there are a few user namespaces. (We must run the program with privilege so that we can access the /proc/PID/ns/user files of all users' processes.)

 $ sudo go run userns_overview.go   
 {3 4026531837}  
       PIDs: 1 2 3 5 7 8 9 10 11 12 13 14 15 16 17 18 20 21 22 23 24   
          25 26 28 29 30 31 32 33 34 36 37 38 39 40 41 42 43 44 45   
          ...  
          27101 27225 27245 27971 28142 28619 28870 28922 28995 29043   
          29109 29209 29279 29455 29466 29481 29489 29532 29533 29550   
    {3 4026532459}

        {3 4026532663}
                    PIDs: 29745 29749 29823 29847 
        {3 4026532450}

            {3 4026532662}
                        PIDs: 29746 

The output of the program is somewhat primitive, but employs indentation to show the hierarchical relationships between the user namespaces. In all, there are five user namespaces shown above.

The first few lines show the initial user namespace and its member processes. The other user namespaces were created by an instance of the Google Chrome browser. The namespace with the inode number 4026532459 is a child of the initial user namespace. That namespace in turn has two children (4026532663 and 4026532450), and the last of those namespaces in turn has a child (4026532662).

The output also shows the PIDs of the processes that reside in each namespace. Two of the namespaces (inode numbers 4026532459 and 4026532450) have no member processes (but are pinned into existence by the presence of descendant user namespaces).
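
The inode numbers in this output can be matched against what other tools report, because reading a /proc/PID/ns/user symbolic link yields a string of the form user:[4026531837]. The following small sketch extracts the namespace type and inode number from such a string. The function name parseNsLink is an invention for this example, and the parsing assumes exactly the type:[inode] format described in namespaces(7).

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseNsLink splits a namespace link target such as
// "user:[4026531837]" into its type ("user") and inode number.
func parseNsLink(target string) (string, uint64, error) {
	i := strings.Index(target, ":[")
	if i < 0 || !strings.HasSuffix(target, "]") {
		return "", 0, fmt.Errorf("unexpected link target %q", target)
	}

	// The inode number sits between ":[" and the closing "]"
	ino, err := strconv.ParseUint(target[i+2:len(target)-1], 10, 64)
	if err != nil {
		return "", 0, err
	}
	return target[:i], ino, nil
}

func main() {
	// Optional argument: a PID; the default is the calling process
	pid := "self"
	if len(os.Args) > 1 {
		pid = os.Args[1]
	}

	target, err := os.Readlink("/proc/" + pid + "/ns/user")
	if err != nil {
		fmt.Fprintln(os.Stderr, "readlink:", err)
		os.Exit(1)
	}

	nsType, ino, err := parseNsLink(target)
	if err != nil {
		fmt.Fprintln(os.Stderr, "parse:", err)
		os.Exit(1)
	}
	fmt.Println(nsType, ino)
}
```

Running this for one of the PIDs listed under a namespace in the output above should print the same inode number that the tree shows for that namespace.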

Some more details about the namespace introspection feature, as well as a simpler example program (in C), can be found in the namespaces(7) manual page.