NFV Performance
Tobias Rang
Bachelor’s Project
NFV performance benchmarking with OVS and
Linux containers
Approved, 170605
Abstract
One recent innovation in the networking industry is the concept of Network Function
Virtualization (NFV). NFV is based on a networking paradigm in which network functions,
which have typically been implemented in the form of dedicated hardware appliances in the
past, are implemented in software and deployed on commodity hardware using modern
virtualization techniques. While the most common approach is to place each virtual network
function in a virtual machine – using hardware-level virtualization – the growing influence
and popularity of Docker and other container-based solutions has naturally led to the idea of
containerized deployments. This is a promising concept, as containers (or operating system
level virtualization) can offer a flexible and lightweight alternative to hardware-level
virtualization, with the ability to use the resources of the host directly. The main problem with
this concept is that the default behavior of Docker and similar technologies is to rely
on the networking stack of the host, which typically cannot meet the
performance requirements associated with NFV. In this dissertation, an attempt is made to
evaluate the feasibility of using userspace networking to accelerate the network performance
of Docker containers, bypassing the standard Linux networking stack by moving the packet
processing into userspace.
Contents
1 Introduction
1.1 Motivation
1.2 Scope
1.3 Outline
2 Related Work
2.1 The traditional Linux networking stack
2.1.1 High-level overview
2.1.2 Performance limitations and potential optimizations
2.2 Containers
2.2.1 Historical overview
2.2.2 Container isolation
2.2.3 Container networking
2.2.4 Containers and traditional virtual machines
2.3 Network function virtualization
2.3.1 NFV and its relation to Software Defined Networking
2.4 DPDK
2.5 CloudLab
2.6 Open vSwitch
2.6.1 Vanilla OVS
2.6.2 OVS-DPDK
2.7 MoonGen
2.7.1 Overview
2.7.2 Timestamping and latency measurements
3 Methodology
3.1 Approach
3.2 Test environment
3.3 Experiments
4 Results
4.1 Baseline performance
4.1.1 Throughput
4.1.2 Latency
4.2 Test case 1
4.2.1 Throughput
4.2.2 Latency
4.3 Test case 2
4.3.1 Throughput
4.3.2 Latency
4.4 Test case 3
4.4.1 Throughput
4.4.2 Latency
5 Conclusion
5.1 Future work
List of Figures
List of Tables
Table 1: Software versions
Table 2: Balanced DPDK configuration
Table 3: Standard OVS - merged result from multiple invocation of the latency benchmark (microseconds)
Table 4: Merged result from multiple invocation of the latency benchmark (microseconds)
Table 5: Throughput and standard deviation for 64 byte packets with increasingly limited CPU access
Table 6: Latency and standard deviation for different packet sizes with increasingly limited CPU access
Table 7: Merged result from multiple invocation of the latency benchmark (microseconds)
List of Scripts
Script 1: snippet from start_single_container.py - container command line arguments
Script 2: setup_ovs.sh
Script 3: Kernel options
1 Introduction
Another concept that has grown in influence and popularity over the last few years is that of
operating system level virtualization – better known as containers or container-based
technologies. Compared to traditional Virtual Machines (VMs), containers can offer a more
flexible and lightweight form of virtualization, since they are able to use the resources of the
host directly [3][4].
The maturation of container-based technologies, such as Docker [5], has naturally led to the
idea of using these technologies as a more lightweight way of deploying Virtual Network
Functions (VNFs), as compared to the standard approach of using VMs [6]. One of the main
problems with such a solution, however, is that the default behavior of Docker and
similar technologies is to rely on the networking stack of the host Operating System (OS) [7].
As NFV places an almost unheard-of level of demand on the network performance of the
underlying systems [8], this quickly becomes a bottleneck when containers rely on the
standard Linux networking stack. Various attempts have been made to find a way around this
limitation, by accelerating or simply bypassing the default networking behavior of the
container engines, in a number of ways. One such optimization strategy – commonly known
as userspace networking – is focused on bypassing the standard Linux networking stack by
moving the bulk of the packet processing from kernel space to userspace, and thereby
avoiding costly context switching [9][10][11]. This dissertation is a continuation of these
previous efforts, but focused specifically on userspace networking as a way of bypassing the
default networking stack.
1.1 Motivation
Several schemes for optimizing the network performance of virtual hosts have been
developed over the years, in response to growing workloads. These include statically
assigning a Network Interface Controller (NIC) connected to the host to a guest OS, through
techniques such as PCI-passthrough and VFIO [2][12]. While this type of solution can deliver
native performance to virtual environments, it is both costly and limiting, as it requires
dedicating a physical NIC to a single guest [13]. Technologies such as SR-IOV and macvlan
attempt to mitigate these limitations, by allowing multiple guests to share a single NIC
connected to the host [14][6]. An alternative solution is to place a virtual switch – a software-
based layer 2 switch – such as the Linux bridge or Open vSwitch (OVS) [15] between the
physical NIC and the VMs.
1.2 Scope
As mentioned in the previous section, the primary purpose of this study is to evaluate the
potential benefits of containerized deployments of VNFs leveraging userspace networking.
Several prominent frameworks aimed at enabling userspace networking exist
[18], but the scope of this study is limited to Intel’s DPDK. Similarly, there are
several well-established container engines [19][4], but this study is exclusively
focused on the Docker container engine. The main technologies and tools used in
this dissertation project fall within its scope and are therefore
described in some detail in this report. Further, the concept of packet processing in general is
within the scope, while the implementation details and inner workings of specific networking
protocols are beyond the scope.
1.3 Outline
This dissertation is divided into five chapters. The first chapter contains,
in addition to this outline, a brief introduction and two sections dedicated to the
motivation and scope of this project. The second chapter provides detailed descriptions of
various tools and technologies relevant to this work. Chapter three contains the methodology
employed during this work, while the results and findings are presented in chapter four.
Chapter five, finally, contains the conclusions drawn and a discussion of these.
2 Related Work
The purpose of this chapter is to introduce and explore some of the underlying and enabling
technologies and concepts, on top of which NFV and container-based systems are built. The
first section of this chapter is a high-level overview of the Linux networking stack; this will
serve as a point of reference when making comparisons with the userspace-based approach
evaluated in this thesis. In addition to this, the main tools and technologies used during
the experimental phase of this project are similarly introduced.
Figure 1: Linux networking stack, packet RX overview [10]
As depicted in Figure 1, packets – when received at the hardware level – are loaded into main
memory by the corresponding device driver via Direct Memory Access (DMA). More
specifically, packets are loaded into a data structure known as a ring buffer, which is
maintained by the device driver. A ring buffer consists of a variable (device specific) number
of packet descriptors, each of which has a corresponding sk_buff structure – a structure
provided by the kernel to hold the data of a single packet. A second ring buffer is maintained
by the device driver, containing TX packets. Once the packet is accessible to the kernel and
ready for further processing, a soft interrupt (softirq) is scheduled by the device driver, and a
pointer to the NIC is appended to the poll queue of the CPU on which the interrupt was
scheduled. Upon triggering of said softirq, the CPU goes through its poll queue, fetching the
available packets from the ring buffer of each device by calling the poll function of the
corresponding device driver. This subsection of the Linux networking stack – the NIC and
device driver – corresponds to the first and second layers of the OSI model [20].
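The receive path described above can be illustrated with a small model. The following Python sketch is not kernel code: the class RxRing, the function run_softirq and the per-poll budget are names and parameters assumed here purely for illustration. It shows a fixed-size ring that drops packets when full, and a softirq handler that walks the poll queue and drains the ring of each scheduled device.

```python
from collections import deque


class RxRing:
    """Fixed-size receive ring, standing in for the driver's descriptor ring."""

    def __init__(self, size):
        self.slots = deque()
        self.size = size
        self.dropped = 0

    def dma_write(self, packet):
        # NIC side: a packet that arrives while the ring is full is dropped.
        if len(self.slots) >= self.size:
            self.dropped += 1
            return False
        self.slots.append(packet)
        return True

    def poll(self, budget):
        # Driver side: hand at most `budget` packets to the protocol stack.
        out = []
        while self.slots and len(out) < budget:
            out.append(self.slots.popleft())
        return out


def run_softirq(poll_queue, budget=64):
    """Visit every device on the CPU's poll queue and drain its ring."""
    delivered = []
    while poll_queue:
        ring = poll_queue.popleft()
        delivered.extend(ring.poll(budget))
        if ring.slots:  # work left over: keep the device scheduled
            poll_queue.append(ring)
    return delivered


ring = RxRing(size=4)
for i in range(6):  # six arrivals, but only four slots
    ring.dma_write(f"pkt{i}")
poll_queue = deque([ring])  # the "interrupt" schedules the device
packets = run_softirq(poll_queue, budget=2)
```

With four slots and six arrivals, the last two packets are counted as dropped, while the four queued packets are delivered over two polls of two packets each.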
At this point, the dequeued packets are processed by the kernel, as depicted in the Kernel
Protocol Stack section of Figure 1. All IP packets undergo the IP processing step, during
which the integrity of the packet is verified, and potential firewall rules are applied.
Subsequently, packets are forwarded to the next processing step based on protocol type – each
transport layer protocol has a corresponding handler function, which is invoked for every
packet of that type. These handler functions perform various forms of protocol-specific logic
on the packets passed to them, which won’t be explained further here, as the focus of this
thesis is on packet processing in general, as opposed to the internal mechanisms of specific
protocols. For the interested reader, Wenji Wu et al. provide a detailed
description of this process in [10].
After going through the protocol-specific processing step, packets are placed in the receive
queue of the socket they belong to, finally ready to leave kernel space. The packet data isn’t
actually copied into userspace until the process associated with that socket
requests to read data from it, using one of the socket-related system calls. When this
happens, the userspace application passes a buffer and the number of bytes to be read to the
kernel, which in turn attempts to fill the buffer with the desired number of bytes from the
receive queue. If the number of bytes requested exceeds the amount of data available in the
queue, all of the available data is copied into the supplied buffer; the return value of the
system call is used to indicate the number of bytes actually read [10][21].
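The short-read behavior just described can be observed from userspace. The sketch below uses Python's standard socket module with a local socket pair standing in for a network connection; the payload and the requested size of 1024 bytes are arbitrary choices. Python's recv returns the bytes themselves, so the length of the result plays the role of the system call's return value.

```python
import socket

# A connected pair of local sockets stands in for a network connection.
sender, receiver = socket.socketpair()
sender.sendall(b"hello")  # only 5 bytes end up in the receive queue

# Ask for up to 1024 bytes; the call returns whatever is available.
data = receiver.recv(1024)
bytes_read = len(data)

sender.close()
receiver.close()
```

Although 1024 bytes were requested, only the five queued bytes are returned, and the application learns this from the size of the result.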
Figure 2: Linux networking stack overview, categorized according to kernel
space / userspace [21]
2.2 Containers
Both of these concepts were implemented on top of the kernel-provided system call chroot,
which was introduced into the Unix kernel as early as 1979. The chroot system call is used
to change or limit the view of the file system as seen by a particular process – and,
subsequently, the children of that process – by changing the root directory of that process.
This can be applied to any directory within the file system hierarchy, and the affected
process and its children will thereafter view that directory as their root directory (/). Although
chroot is an excellent tool for providing the mechanism for which it was intended – restricting
a process to a particular subset of the file system – it alone cannot provide everything needed
to isolate a process from the rest of the system. This becomes apparent when one considers
the fact that it has no way of restricting other, potentially malicious processes from accessing
arbitrary parts of the file system [25].
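The effect of chroot on path resolution can be modelled without root privileges. The sketch below is a simplification that ignores symbolic links; host_path is a hypothetical helper and the directory /srv/jail is an arbitrary example. The real operation is performed by the chroot system call (available as os.chroot in Python, which requires root).

```python
import posixpath


def host_path(new_root, path_seen_by_process):
    """Map a path as seen inside a chroot to the corresponding host path."""
    # Normalising an absolute path collapses "..", and "/.." stays "/",
    # so the result can never climb above the new root.
    inside = posixpath.normpath(posixpath.join("/", path_seen_by_process))
    return posixpath.normpath(new_root + inside)


plain = host_path("/srv/jail", "/etc/passwd")
escape_attempt = host_path("/srv/jail", "/../../etc/passwd")
root_itself = host_path("/srv/jail", "/")
```

Both the plain path and the attempted escape resolve to the same file below /srv/jail, which is the restriction chroot is intended to provide for the confined process.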
Another kernel-based abstraction that would eventually help enable the creation of the
container-based technologies we know today is cgroups, an abbreviation of
Control Groups. The cgroups project was originally started by a group of Google-employed
engineers in 2006, and it was merged into the official Linux kernel in 2008. Using cgroups, a
set of processes can be treated as a single, logical unit – allowing the user to set configuration
options that apply to the entire group [26].
Truly widespread use and industry adoption of container-based technologies seem to have
coincided with the release and subsequent maturation of the Docker container engine and
similar software projects a few years ago, which sparked a massive amount of interest in such
technologies throughout the industry [5][7].
2.2.3 Container networking
Container networking typically relies on the networking stack of the host. Various kernel
features and modules are used to connect the network namespace of the container to the main
namespace of the host, usually through a software switch such as the Linux Bridge. To
achieve this, pairs of virtual network interfaces are created and used to connect the namespace
of the container to the main namespace through such a bridge. As a concrete example of this,
the Docker engine, by default, creates a pair of virtual veth interfaces for every container –
one is attached to the network namespace of the container, and the other is connected to the
standard docker0 bridge in the main networking namespace [28].
In addition to the bridged mode just described, Docker supports another networking mode
called host. A container started with this mode uses the main network namespace of the host
machine – meaning the container is not isolated from the rest of the host system through its
own namespace, as is the case when using the bridged mode. If a process running in a
container using the host networking mode starts some network-based service on a specific
port, that service will be available on the host system, on the same port. While the host mode
lacks the isolation present in bridged mode, it is superior from a performance perspective, as
the container can access the network resources of the host directly, without the indirection
associated with the bridged mode [21][28].
The network performance of the approach just described is sufficient for many use-cases for
which network I/O is not the primary priority. There are, however - as will be explained
below - plenty of use-cases for which network I/O is extremely important. In such cases,
alternative approaches to container networking must be applied; some of these will be
explained and evaluated in great detail in this dissertation [27].
copy of its entire OS [6]. Another important distinction between a VM and a container is the
level of abstraction targeted by the respective technology. One of the main problem-areas
addressed by container-based technologies is that of software packaging and distribution. As
such, container engines include features for managing the life cycle of the containerized
application, typically by providing functions for starting and stopping it, as well as for
upgrading it to a newer version if one exists. In addition to this, using a container-based
approach can reduce the complexity of managing external dependencies by a great deal, as the
containerized application is packaged along with all of its dependencies and any necessary
configuration [4].
network functions. This approach enables deployment of new specialized network functions,
while removing the need for purchasing and installing new hardware devices. Besides the
reduced equipment costs already mentioned, the main benefits of NFV include a significantly
shorter time-to-market for new technologies and the ability to perform experiments and test
scenarios on the same infrastructure that will be used when deployed to the actual production
environment [1][2].
2.4 DPDK
The Data Plane Development Kit (DPDK) is an open source framework provided by Intel,
released under the BSD software license [29]. As the name implies, it includes a number of
software components – libraries, drivers and the like – targeted at the processing of packets.
By leveraging the powerful tools provided by the DPDK framework, it is possible to write
extremely fast packet processing software. One of the major factors contributing to this
speedup, when compared to more traditional approaches to packet processing and network
I/O, is the fact that DPDK makes it possible to circumvent the standard networking stack,
and instead perform the necessary processing without ever leaving userspace. This is in stark
contrast to the approach taken by the standard Linux networking stack, which – as explained
in section 2.1 – relies on interrupts to signal the presence of newly received packets and
requires context switching into kernel mode and back again to process those packets. The
developers of DPDK have gone to great lengths to design ways of avoiding those context
switches, the best known of these being the Poll Mode Driver (PMD). When using the
PMD, one or more CPU cores are dedicated to the task of constantly polling the
corresponding NIC for new packets. When new data is detected, the driver makes it available
to the proper DPDK application by accessing the RX descriptor of the NIC directly – thus
avoiding any indirection and any context switching.
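In contrast to the interrupt-driven path, a poll mode driver can be pictured as a loop that never sleeps. The sketch below is a toy model in Python and not DPDK's API: rx_burst and poll_loop are hypothetical helpers, the burst size of 32 is an assumption, and the loop is bounded only so that the example terminates.

```python
from collections import deque

BURST_SIZE = 32  # assumed batch size, chosen for illustration


def rx_burst(rx_ring, max_packets=BURST_SIZE):
    """Take up to max_packets from the ring without blocking."""
    burst = []
    while rx_ring and len(burst) < max_packets:
        burst.append(rx_ring.popleft())
    return burst


def poll_loop(rx_ring, handle, max_polls):
    """Busy-poll the ring; a real PMD thread would loop indefinitely."""
    empty_polls = 0
    for _ in range(max_polls):
        burst = rx_burst(rx_ring)
        if not burst:
            empty_polls += 1  # nothing arrived: the core spins anyway
            continue
        for packet in burst:
            handle(packet)
    return empty_polls


rx_ring = deque(range(70))  # 70 packets waiting in the RX ring
forwarded = []
empty_polls = poll_loop(rx_ring, forwarded.append, max_polls=5)
```

The empty polls make the trade-off visible: the dedicated core does work even when no packets arrive, in exchange for never waiting on an interrupt or a context switch.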
During this study, the DPDK-enhanced version of OVS was used, in an attempt to increase
the network performance of containerized virtual network functions [16].
2.5 CloudLab
CloudLab is a service provided by a group of collaborating universities, based mainly at the
University of Utah, Clemson University and the University of Wisconsin. It enables academic
research and experimentation to be performed in a realistic environment, by providing access
to varying types of computing resources – including both virtual and real, bare-metal
hardware – to researchers. All of the experiments in this study were performed
on bare-metal hardware provided by CloudLab [31][32].
OpenFlow [34] switch from the beginning, meaning it can be dynamically re-programmed by
an SDN (OpenFlow) controller.
Newly received packets – from a physical or a virtual NIC – are first encountered by the
datapath module in the kernel. At this point, there are two possible outcomes – either a flow
table entry specifying how packets of this type should be handled has already been inserted
by the userspace daemon, or no such entry exists. If the former is true, the datapath module
simply executes the set of actions associated with the matching entry in the flow table and
forwards the packet accordingly; if the latter is true, ovs-vswitchd (the userspace daemon)
must be consulted. When this happens, ovs-vswitchd decides how packets of this particular
type are to be handled, and the packet is passed back to the datapath kernel module, along with
a set of actions to be applied – the flow table is subsequently updated with an entry describing
the packet type. Future packets of the same type can thus be handled without leaving the fast
path (the datapath module). While this approach reduces the amount of context-switching
between userspace and kernel space, the throughput is still limited by the forwarding capacity
of the Linux networking stack [35][36].
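The division of labor between the fast path and the userspace daemon can be sketched as a cache in front of a slower decision function. The following Python model is a simplification under stated assumptions: the Datapath class, the flow key (ingress port and destination MAC) and the forwarding rule inside vswitchd are all invented for illustration and do not reflect the actual OVS data structures.

```python
class Datapath:
    """Toy model of the kernel fast path with a flow-table cache."""

    def __init__(self, slow_path):
        self.flow_table = {}
        self.slow_path = slow_path  # stands in for ovs-vswitchd
        self.upcalls = 0

    def receive(self, in_port, dst_mac):
        key = (in_port, dst_mac)
        actions = self.flow_table.get(key)
        if actions is None:
            # Miss: consult the userspace daemon, then cache its answer.
            self.upcalls += 1
            actions = self.slow_path(key)
            self.flow_table[key] = actions
        return actions


def vswitchd(key):
    """Hypothetical policy: forward traffic from port 1 to port 2."""
    in_port, _dst_mac = key
    return ("output", 2) if in_port == 1 else ("drop",)


dp = Datapath(vswitchd)
results = [dp.receive(1, "aa:bb:cc:dd:ee:ff") for _ in range(3)]
```

Only the first of the three identical packets reaches the daemon; the remaining two are handled from the flow table alone.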
2.6.2 OVS-DPDK
The main idea behind the DPDK-accelerated version of OVS is to move the packet processing
entirely into userspace – effectively replacing the datapath kernel module of the standard
OVS implementation with a DPDK-based datapath. In practice, this is achieved by using a
DPDK PMD to deliver packets straight from the NIC to the OVS userspace daemon [36].
Figure 4 highlights the main architectural differences between standard OVS and OVS-
DPDK.
Figure 4: Standard OVS architecture and OVS-DPDK architecture [2]
OVS-DPDK makes forwarding decisions based on three switching tables – the Exact Match
Cache (EMC), the Datapath Classifier and the Ofproto Classifier – which are used to classify
incoming packets. An incoming packet is matched against these tables, in order, until a
matching entry is found or all tables have been searched; when a match is found, the
corresponding set of actions is performed and the packet is forwarded accordingly. When a
packet reaches the Ofproto Classifier – meaning no matching entry was found in the EMC or
in the Datapath Classifier – the OpenFlow controller is consulted for the appropriate action to
take. As illustrated in Figure 5, the lookup cost increases with each table miss.
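The ordered lookup can be sketched as three tiers of decreasing speed. The Python model below rests on several assumptions that are not stated above: packets are reduced to an (ingress port, source, destination) tuple, the Datapath Classifier is modelled as matching on the ingress port only, and a hit in a slower table is copied into the faster ones. The function names are hypothetical.

```python
def lookup(packet, emc, dpcls, ofproto_rules):
    """Return (action, tables_searched) for an (in_port, src, dst) tuple."""
    if packet in emc:  # 1. exact match on all header fields
        return emc[packet], 1
    in_port = packet[0]
    if in_port in dpcls:  # 2. wildcarded match (ingress port only)
        emc[packet] = dpcls[in_port]
        return dpcls[in_port], 2
    action = ofproto_rules(in_port)  # 3. full rule lookup
    dpcls[in_port] = action
    emc[packet] = action
    return action, 3


def rules(in_port):
    """Hypothetical rule set: port 1 is forwarded to port 2."""
    return "output:2" if in_port == 1 else "drop"


emc, dpcls = {}, {}
first = lookup((1, "10.0.0.1", "10.0.0.2"), emc, dpcls, rules)
second = lookup((1, "10.0.0.1", "10.0.0.2"), emc, dpcls, rules)
third = lookup((1, "10.0.0.9", "10.0.0.2"), emc, dpcls, rules)
```

The second element of each result counts the tables searched, a rough stand-in for lookup cost: three for the first packet of a flow, one for an exact repeat, and two for a new flow that matches an existing wildcarded entry.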
2.7 MoonGen
2.7.1 Overview
MoonGen is a high-speed load generator implemented on top of DPDK and released under
the MIT software license [37]. An embedded Lua runtime is included, along with library code
– written in Lua – that provides convenient and high-level interfaces to some of the core
functionality of DPDK. Users of the load generator control it by writing Lua scripts,
leveraging the included libraries just mentioned. This architecture combines the flexibility of a
high-level scripting language with the extremely fast packet processing capabilities of
DPDK. As performance is critical in this type of application, the LuaJIT virtual machine is
used, rather than the standard Lua runtime. As the name implies, this is an implementation of
the Lua VM with a Just In Time (JIT) compiler, making it very performant despite Lua being a
dynamic scripting language. In addition to the modern JIT compilation techniques, the
internals of the LuaJIT VM are heavily optimized for performance; the interpreter is written
entirely in assembler [38]. One potential source of performance pitfalls when using LuaJIT
stems from the fact that – just like the vanilla Lua runtime – it uses Garbage Collection (GC)
rather than manual memory management. This should rarely be an issue for MoonGen,
however, as packet buffers are allocated and manipulated by DPDK, which means they are
beyond the scope of the Lua GC. As long as care is taken not to allocate new Lua objects
inside the main TX and RX loops of a MoonGen script, the GC should rarely be triggered at
all [39][13].
to configure the desired hardware interfaces and to perform any other initial setup necessary,
while the slave processes are used to perform the actual RX and TX tasks. As Lua has no
built-in support for multi-threading, MoonGen includes its own implementation, capable of
spawning new LuaJIT processes and passing serialized arguments to these. The standard
paradigm is for the master process to pass a hardware queue, along with any other arguments,
to a slave process as it is started, which is subsequently used for transmitting or receiving
packets in the slave process [39].
3 Methodology
3.1 Approach
All of the experiments and benchmarks conducted during this dissertation project were
performed on bare-metal hardware provided by the CloudLab platform – which is described
briefly in section 2.5. The benchmarking environment consisted of two Dell R430 (aka
“d430”) [41][42] nodes. Each of the d430 nodes is equipped with two Intel E5-2630 Haswell
CPUs [43] running at 2.4 GHz with a total of 16 physical cores, a total of 64 GB DDR4
RAM evenly distributed between the two CPU sockets, and a dual-port (or quad-port, in some
of the nodes) Intel X710 10GbE NIC [44]. The two nodes were connected by two 10GbE
links, used during the benchmarks, and also to a 1Gbps control network used for remote
access.
Both of the nodes ran Ubuntu 14.04 LTS, and the low-latency version of the 3.13.0-117 Linux
kernel was used in an attempt to make the results as deterministic and reproducible as
possible. In addition to this, care was taken to pin DPDK PMD threads to separate cores on
the same Non-Uniform Memory Access (NUMA) [45] node, and to allocate all of the huge
page memory [46] used by DPDK from that same NUMA node. This, too, was done with
reproducibility in mind, as using a single NUMA node removes the need for
intercommunication between nodes – which can be a performance bottleneck [47][45][2].
Table 1 shows the versions of the core software components used during the benchmarking
process.
Software    Version
OVS         2.7.90
DPDK        17.02
Docker      1.13.1
MoonGen     Commit ef3aa3f
Table 1: Software versions
3.2 Test environment
Figure 7 depicts a high-level overview of the test environment – MoonGen was used on the
load generator node to send packets out on one interface, while receiving packets on a second
interface – both of which were bound to DPDK. The Device Under Test (DUT), meanwhile,
ran OVS-DPDK in a Docker container; an OpenFlow [34] flow rule was used to specify that
packets entering DPDK port 0 should be transmitted on DPDK port 1. UDP packets of
varying sizes were used during all of the benchmarks.
The code snippet displayed in Script 1 shows the command line used to launch the Docker
container in which OVS-DPDK was run on the DUT. One of the limitations of running
OVS-DPDK in a container is that the container must be granted root privileges – otherwise it
isn’t allowed to access the devices passed to it through the --device flags. One upside to this
approach, however, is that the container is able to access these devices directly, without any
indirection.
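As a rough illustration of the kind of command line involved, the following Python sketch assembles a docker run invocation with a privileged container and explicit device flags. It is a minimal sketch under assumptions and not the script used in this project: the helper build_docker_command, the image name example/ovs-dpdk and the device path are hypothetical, while --detach, --privileged and --device are standard docker run options.

```python
def build_docker_command(image, devices, privileged=True):
    """Assemble a `docker run` argument list for a DPDK container."""
    command = ["docker", "run", "--detach"]
    if privileged:
        # Without elevated privileges the container cannot use the devices.
        command.append("--privileged")
    for device in devices:
        command += ["--device", device]
    command.append(image)
    return command


# Hypothetical image name and device path, for illustration only.
command = build_docker_command(
    image="example/ovs-dpdk",
    devices=["/dev/vfio/vfio"],
)
```

Building the command as a list rather than a single string keeps each argument separate, which is the form expected by subprocess.run if the command were to be executed.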
#! /bin/bash
GRUB_CMDLINE_LINUX_DEFAULT="default_hugepagesz=1GB hugepagesz=1G hugepages=8 intel_iommu=on"
Script 2 shows the script used to set up OVS-DPDK with the configuration used in the
benchmarks presented in the following chapter. The intel_iommu kernel option, displayed in
Script 3, is necessary for the VFIO-PCI driver to work properly [48] with DPDK interfaces.
The remaining kernel options are related to huge page memory; with these options, four 1GB
huge pages are allocated on each of the two NUMA sockets available in the d430. Explicitly
specifying the huge page size in the kernel command line is necessary – according to the
author’s experience – to enable the allocation of 1GB huge pages. Without this option, the
size was implicitly set to 2MB.
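The arithmetic behind these kernel options can be checked with a few lines of Python. The helpers parse_size and hugepage_plan are hypothetical, and the even split across NUMA nodes is taken from the description above rather than queried from a running system.

```python
UNITS = {"K": 2**10, "M": 2**20, "G": 2**30}


def parse_size(text):
    """Convert a size such as '1G', '1GB' or '2M' to bytes."""
    text = text.upper().rstrip("B")
    return int(text[:-1]) * UNITS[text[-1]]


def hugepage_plan(cmdline, numa_nodes):
    """Summarise the huge page reservation requested on a kernel command line."""
    opts = dict(item.split("=", 1) for item in cmdline.split() if "=" in item)
    page_size = parse_size(opts["hugepagesz"])
    pages = int(opts["hugepages"])
    return {
        "total_bytes": page_size * pages,
        "pages_per_node": pages // numa_nodes,  # assumes an even split
    }


plan = hugepage_plan(
    "default_hugepagesz=1GB hugepagesz=1G hugepages=8 intel_iommu=on",
    numa_nodes=2,
)
```

Eight pages of 1 GB amount to 8 GB of reserved memory in total, or four pages per node on a two-socket machine such as the d430.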
Setting                              Value
Number of PMD threads                3
CPU cores dedicated to PMD threads   Core 0, 2, 4 (all on NUMA node 0)
Number of RX queues                  2
RX queue affinity                    RX queue 0 pinned to core 2
Huge pages                           1024MB, allocated from NUMA node 0
DPDK driver                          VFIO-PCI
Table 2: Balanced DPDK configuration
Table 2 contains a summary of the DPDK configuration used during the benchmarks
presented in the following chapter; this configuration was chosen after extensive
experimentation with different settings. The settings were chosen in an attempt to lower the
latency as much as possible, without sacrificing any throughput – hence the term balanced
configuration. The following steps were taken to lower the latency:
• The number of RX queues was reduced to two, as lowering the amount of batching
and parallelism in the packet processing pipeline typically has a positive effect on
latency – although sometimes at the expense of throughput.
• One RX queue on the ingress interface of the DUT (DPDK port 0) was pinned to a
specific PMD thread, while the other two PMD threads were left to poll the remaining
queues in a round-robin fashion.
• Huge page memory was allocated from the same NUMA node, to avoid costly
communication across socket boundaries.
• As mentioned in section 3.1, the low-latency kernel was used.
The results presented in chapter 4 were all gathered while running DPDK with the balanced
configuration just mentioned. It is worth noting, however, that different configurations can –
unsurprisingly – lead to very different results. Some of the configurations that were tried, for
instance, were able to reduce the latency substantially when compared to the balanced
configuration – but never without sacrificing some throughput. Some of these results are
included in appendix B.
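OVS-DPDK accepts the set of cores available to PMD threads as a hexadecimal bitmask, through the other_config:pmd-cpu-mask setting. As a worked example, assuming the cores listed in Table 2 were selected this way, the mask can be derived as follows; cpu_mask is a hypothetical helper.

```python
def cpu_mask(cores):
    """Build a hexadecimal CPU bitmask with one bit set per core number."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return hex(mask)


# Cores 0, 2 and 4 from Table 2: binary 10101.
pmd_mask = cpu_mask([0, 2, 4])
```

Each core contributes the bit at its own index, so cores 0, 2 and 4 give 1 + 4 + 16 = 21, which is 0x15 in hexadecimal.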
3.3 Experiments
The following three test scenarios were conducted:
1. In this test case, the latency and throughput of OVS-DPDK running in a container
with unrestricted access to the host’s CPU – as depicted in Figure 7 – were measured.
The same measurements were repeated with OVS-DPDK – using the same
configuration – running directly on the host, in an attempt to gauge the potential
performance penalty of the containerized version. Lastly, the benchmarks were
repeated for the DPDK application L2fwd [49], configured to emulate the forwarding
behavior of OVS-DPDK – packets entering DPDK port 0 forwarded to and
transmitted from DPDK port 1. This was done to establish a baseline performance for
DPDK, as it involves neither OVS nor a container.
2. The purpose of this test case was to analyze how varying degrees of resource
limitations impact the performance. To achieve this, OVS-DPDK was run in a
container – again, as depicted in Figure 7 – and Docker’s built-in support for resource
limitation [50] was used to decrease the container’s access to the CPU in incremental
steps. Measurements of the throughput and latency were recorded for each increment.
3. Finally, a test case evaluating the performance impact of a noisy neighbor scenario
was conducted. Again, OVS-DPDK was placed inside a container (with no resource
limitations), and the Phoronix test suite [51] was used to launch a competing
benchmark. Specifically, the throughput and latency were measured while the Apache
benchmark included in the Phoronix test suite was running. The benchmarks were
repeated for OVS-DPDK configured with explicit core pinning and OVS-DPDK with
no explicit core pinning.
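The incremental CPU restriction of test case 2 can be sketched as follows. The example assumes that the limits are applied through the CFS period and quota options of docker run (--cpu-period and --cpu-quota); the period of 100000 microseconds, the number of cores and the image name are illustrative assumptions, and the steps mirror the percentages quoted in chapter 5.

```python
def cfs_quota_us(share, cores, period_us=100_000):
    """CFS quota (microseconds per period) granting `share` of `cores` full cores."""
    if not 0 < share <= 1:
        raise ValueError("share must be in the interval (0, 1]")
    return round(share * cores * period_us)


def docker_run_args(share, cores, image, period_us=100_000):
    """Argument list for a docker run invocation with the given CPU limit."""
    quota = cfs_quota_us(share, cores, period_us)
    return ["docker", "run", f"--cpu-period={period_us}", f"--cpu-quota={quota}", image]


if __name__ == "__main__":
    # Hypothetical: the container may use at most three cores when unrestricted.
    for share in (1.0, 0.75, 0.5, 0.25, 0.125):
        print(" ".join(docker_run_args(share, 3, "ovs-dpdk:example")))
```

A quota larger than the period simply means that the container may consume more than one core's worth of CPU time per scheduling period.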
In addition to the three test cases just described, a general baseline performance was
established, by measuring the latency and throughput of the standard, kernel-based version of
OVS. These measurements were repeated with OVS running directly on the host and with
OVS running in a Docker container.
4 Results
In this chapter, the results of the three benchmark scenarios outlined in the previous chapter
are presented, in addition to the results of the baseline measurements. Latency and throughput
data is presented for each test case, in various forms. Throughput data is represented visually
in the form of (bar) graphs, displaying the measured values over a set of different packet
sizes. The packet sizes 64, 128, 256, 512, 1024 and 1500 bytes were chosen, and this same set
of sizes was used for all throughput benchmarks. Latency data is represented either as graphs
similar to those used for throughput, or in the form of Cumulative Distribution Function
(CDF) graphs. The latter were created by merging the data from 10 separate
invocations of the benchmark – for a single packet size (64 bytes, specifically). For any
arbitrary point along one of the lines in these graphs, the x and y values of that point can be
interpreted as: y percent of the sampled packets corresponding to this particular line have a
latency less than or equal to x.
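The construction of such a CDF can be sketched in a few lines. This is a minimal illustration rather than the script used for the graphs in this chapter; the function names and the sample values are hypothetical.

```python
import math
from bisect import bisect_right


def merge_runs(runs):
    """Merge the latency samples of several benchmark runs into one sorted list."""
    return sorted(sample for run in runs for sample in run)


def cdf_percent(sorted_samples, x):
    """Percentage of samples with a latency less than or equal to x."""
    return 100.0 * bisect_right(sorted_samples, x) / len(sorted_samples)


def latency_at_percent(sorted_samples, y):
    """Smallest sampled latency x such that at least y percent of samples are <= x."""
    rank = max(1, math.ceil(y * len(sorted_samples) / 100))
    return sorted_samples[rank - 1]


if __name__ == "__main__":
    # Made-up latency samples (in microseconds) from three short runs.
    runs = [[20.1, 19.8, 22.4], [20.3, 21.0], [19.9, 20.6, 23.1]]
    merged = merge_runs(runs)
    for y in (50, 90):
        print(f"{y}% of the packets had a latency of {latency_at_percent(merged, y)} us or less")
```

Reading a point (x, y) off the graph corresponds to cdf_percent(merged, x) == y, while latency_at_percent performs the inverse lookup.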
4.1 Baseline performance
4.1.1 Throughput
Figure 8 shows the throughput results for the kernel-based version of OVS, running directly
on the host and running in a container. For each packet size, very similar results were
measured, indicating no additional overhead for the containerized version. For smaller packet
sizes, this version of OVS wasn’t able to saturate the 10GbE link.
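To put the notion of saturating the link into numbers, the theoretical maximum packet rate of a 10GbE link can be computed for each packet size. The sketch below assumes that the packet sizes denote complete Ethernet frames (including the frame check sequence), which the text does not state explicitly, and adds the 20 bytes of preamble, start-of-frame delimiter and inter-frame gap that every frame occupies on the wire.

```python
# Preamble (7) + start-of-frame delimiter (1) + inter-frame gap (12), in bytes.
WIRE_OVERHEAD_BYTES = 20


def max_packets_per_second(frame_bytes, link_bps=10_000_000_000):
    """Theoretical maximum packet rate on an Ethernet link for a given frame size."""
    bits_on_wire = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


if __name__ == "__main__":
    for size in (64, 128, 256, 512, 1024, 1500):
        mpps = max_packets_per_second(size) / 1e6
        print(f"{size:>5} bytes: {mpps:.2f} Mpps at line rate")
```

Because the per-packet processing cost is largely independent of the packet size, small packets place by far the highest load on the switch, which is consistent with the kernel-based version falling short of line rate only for the smaller sizes.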
4.1.2 Latency
As depicted in Figure 9 and table 3, the latency benchmark did indicate a performance
overhead associated with running OVS in a container. In addition to a higher median latency,
the containerized version suffered from an increased amount of jitter – indicated by the
standard deviation.
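The two statistics referred to here can be computed from the raw samples as sketched below. The sample values are made up, and the use of the sample standard deviation (rather than the population standard deviation) is an assumption.

```python
import statistics


def summarize_latency(samples_us):
    """Median, mean and sample standard deviation of a list of latency samples."""
    return {
        "median": statistics.median(samples_us),
        "mean": statistics.mean(samples_us),
        "stdev": statistics.stdev(samples_us),  # used here as a measure of jitter
    }


if __name__ == "__main__":
    # Hypothetical samples (in microseconds) for two setups.
    setups = {
        "host": [410.0, 415.5, 409.8, 420.1, 412.3],
        "container": [455.2, 498.7, 440.9, 510.3, 447.0],
    }
    for name, samples in setups.items():
        print(name, summarize_latency(samples))
```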
4.2 Test case 1
4.2.1 Throughput
As indicated by Figure 10, there was no noticeable difference in the throughput of the three
setups tested; for each packet size, identical results were produced.
4.2.2 Latency
Each line in the cumulative distribution function graph, displayed in Figure 11, is a
visualization of the merged results of multiple invocations of the latency benchmark. As
depicted in the graph, L2fwd performed better – both in terms of latency and jitter – than the
two OVS-DPDK setups. For L2fwd, 45% of the sampled packets had a latency of 14.26us or
less, and 95% of the packets 15.6us or less. As for OVS-DPDK running in a container, 50% of
the packets had a latency less than or equal to 20.06us and 90% less than or equal to 22.66us.
OVS-DPDK running directly on the host performed very similarly, as indicated in the graph.
4.3 Test case 2
4.3.1 Throughput
Figure 12: Throughput for different packet sizes with increasingly limited CPU access
Figure 12 and table 5 show the throughput data gathered for this test case, grouped according
to packet size. As expected, there is a substantial drop in throughput each time the restriction
on the container’s CPU access is increased.
4.3.2 Latency
Figure 13: Latency for different packet sizes with increasingly limited CPU access
As indicated by Figure 13 and table 6, increasing the limitation on the container’s CPU access
makes the latency less deterministic and increasingly sporadic.
4.4 Test case 3
4.4.1 Throughput
Figure 14 displays the throughput data gathered during test case 3 – the noisy neighbor
scenario. This test was conducted with and without explicit CPU core pinning (CPU affinity),
with a competing benchmark colocated on the same CPU. While there was no massive impact
on the throughput, the competing benchmark did introduce some variance – indicated by the
standard deviation – and there was a slight decrease in the throughput itself. It should be noted
that the balanced configuration used for these benchmarks had PMD threads on three separate
CPU cores. A configuration with, say, a single PMD thread would likely be more severely
impacted by the competing benchmark.
4.4.2 Latency
This benchmark, like the previous one, was conducted with a competing benchmark running
on the system. As indicated by Figure 15 and table 7, the competing benchmark clearly
impacted the latency and increased the amount of jitter (delay variation), when compared to
the data presented in section 4.2.2. A slight but noticeable difference between the
configuration with pinning and the one without is also indicated; the pinned configuration
isn’t as heavily impacted by the competing benchmark.
5 Conclusion
Comparing the performance characteristics of the standard, kernel-based version of OVS with
that of OVS-DPDK, a number of interesting conclusions can be drawn. In terms of
throughput, the most obvious difference is the fact that OVS-DPDK is able to saturate the
10GbE link – even when using minimum-sized packets – which standard OVS isn’t capable
of. Interestingly, the latency benchmark (section 4.1.2) indicates a performance penalty
associated with running standard OVS in a container, which isn’t the case for OVS-DPDK.
Further, there is a significant difference in the average latency achieved by OVS and by OVS-
DPDK: 470.414us for OVS running in a container (section 4.1.2) compared to 21.134us for
OVS-DPDK running in a container (section 4.2.2).
According to previous research [13][6], the CPU is typically the primary performance
bottleneck in high-speed packet processing pipelines based around a virtual switch; the
performance data gathered during this work seem to confirm that statement. The results from
test case 2 (section 4.3) indicate a significant decrease in performance associated with each
incremental increase of the resource limitations placed upon the container – the throughput
drops and the latency grows increasingly sporadic and unpredictable (in other words, the
amount of jitter increases). One interesting detail is the fact that, for most packet sizes, the
throughput degradation follows the corresponding CPU limitation quite closely (e.g. 8.18 at
100%, 6.0 at 75%, 4.0 at 50%, 2.02 at 25% and 0.99 at 12.5% for 128 byte packets, in the
throughput units of table 5).
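The proportionality can be checked with a short calculation that normalizes each of the figures quoted above for 128 byte packets to the unrestricted case and compares the result with the CPU share. The variable and function names are illustrative.

```python
# CPU share -> throughput figure quoted above for 128 byte packets.
MEASURED_128B = {1.0: 8.18, 0.75: 6.0, 0.5: 4.0, 0.25: 2.02, 0.125: 0.99}


def relative_throughput(measured):
    """Throughput at each CPU share, relative to the unrestricted (100%) case."""
    baseline = measured[1.0]
    return {share: value / baseline for share, value in measured.items()}


if __name__ == "__main__":
    for share, relative in relative_throughput(MEASURED_128B).items():
        print(f"CPU share {share:6.1%}: relative throughput {relative:6.1%}")
```

Each relative throughput value lies within two percentage points of the corresponding CPU share, which is what would be expected if the CPU were the limiting resource.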
The noisy neighbor benchmark scenario conducted during test case 3 (section 4.4) reinforces
the fact that the CPU is critical to the performance of setups such as those tested, and it hints
at the importance of using explicit CPU pinning. One of the advantages of the busy-waiting
model used by DPDK PMD threads is the fact that the PMD thread is typically consuming
100% of the CPU core on which it is running. Because of this, a DPDK application is able to
handle scenarios in which it is competing for resources with other processes fairly well. How
much the DPDK application is impacted depends on factors including the number of PMD
threads and, as illustrated by test case 3, careful CPU pinning can make a difference.
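The general mechanism behind CPU pinning can be illustrated with the Linux scheduler affinity interface. The sketch below pins the calling process to a single core; it is a generic, Linux-only illustration and not the mechanism through which the PMD threads were pinned in this work.

```python
import os


def pin_current_process(cpu):
    """Restrict the calling process to one CPU core and return the new affinity set."""
    os.sched_setaffinity(0, {cpu})  # pid 0 refers to the calling process
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    allowed = sorted(os.sched_getaffinity(0))
    print("allowed cores before pinning:", allowed)
    print("allowed cores after pinning:", sorted(pin_current_process(allowed[0])))
```

Pinning only restricts where the pinned process may run. Keeping other processes off that core requires restricting their affinity as well.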
Test case 1 (section 4.2) confirms that for each of the three setups tested, in terms of
throughput, the performance is bounded by the maximum line rate, rather than the resources
of the host. This would suggest that there is no significant (throughput) overhead associated
with running OVS-DPDK in a container – at least none that is noticeable in the scenarios
tested. Running the same tests but over 40GbE links, or using bi-directional traffic, could
potentially change this situation - but this was never attempted during this study.
On the other hand, L2fwd does outperform OVS-DPDK in terms of latency. However, OVS-
DPDK running directly on the host and OVS-DPDK running in a container performed very
similarly, indicating that this performance difference is mainly related to OVS-DPDK and not
the container.
The main motivation behind this dissertation was to attempt to evaluate the feasibility of
using userspace networking – specifically, OVS-DPDK - to deploy VNFs in Docker
containers. The results presented in chapter 4 do provide some useful insights into the
performance characteristics of such a setup, and the metrics do indicate that containers could
be a viable alternative to full-fledged VMs in NFV deployments. However, further
investigation is necessary before the question can be answered definitively.
has been exposed to. Studying the feasibility of using SnabbSwitch to power containerized
VNFs could be something worth exploring [13].
Acknowledgments
First of all, I’d like to thank my supervisor, Prof. Andreas Kassler, for introducing me to
wonderful concepts such as SDN and NFV – not to mention the invaluable help and guidance
I’ve received while working on this dissertation. Secondly, I want to thank my girlfriend, Eva,
for being so patient through all of the long hours and late nights.
References
1. Margaret Chiosi et al. - Network Functions Virtualisation – Introductory White Paper, 2012
2. Ajay Simha - NFV reference architecture for deployment of mobile networks, 2017
3. G. Calarco, M. Casoni - On the effectiveness of Linux containers for network
virtualization, 2013
4. Fernando Rodríguez-Haro et al. - A summary of virtualization techniques, 2012
5. Docker Project - https://www.docker.com/. Accessed: 2017
6. Jason Anderson, Udit Agarwal, Hongda Li, Hongxin Hu, Craig Lowery, Amy Apon -
Performance Considerations of Network Functions Virtualization using Containers, 2016
7. Emiliano Casalicchio, Vanessa Perciballi - Measuring Docker Performance: What a
mess!!!, 2017
8. Joao Martins et al. - ClickOS and the Art of Network Function Virtualization, 2014
9. Chuanpeng Li et al. - Quantifying The Cost of Context Switch, 2007
10. Wenji Wu, Matt Crawford, Mark Bowden - The performance analysis of linux
networking – Packet receiving, 2006
11. Ivano Cerrato, Mauro Annarumma, Fulvio Risso - Supporting Fine-Grained Network
Functions through Intel DPDK, 2014
12. VFIO kernel documentation - https://www.kernel.org/doc/Documentation/vfio.txt.
Accessed: 2017
13. Michele Paolino et al. - SnabbSwitch user space virtual switch benchmark and
performance optimization for NFV, 2015
14. P. Primer - An introduction to sr-iov technology, 2008
15. OVS website - http://openvswitch.org/. Accessed: 2017
16. DPDK website - http://dpdk.org/. Accessed: 2017
17. Yoshihiro Nakajima et al. - High-performance vNIC framework for hypervisor-based
NFV with userspace vSwitch, 2015
18. Sebastian Gallenmüller et al. - Comparison of Frameworks for High-Performance
Packet IO,
19. LXD website - https://linuxcontainers.org/lxd/introduction/. Accessed: 2017
20. Yadong Li et al. - Research based on OSI model, 2011
21. Rohprimardho - Measuring The Impact of Docker on Network I/O Performance, 2015
22. Eliezer Tamir - net: low latency Ethernet device polling, 2013
23. Poul-Henning Kamp, Robert N. M. Watson - Jails: Confining the omnipotent root.,
2000
24. Slightly Skeptical View on Solaris Zones -
http://www.softpanorama.org/Solaris/Virtualization/zones.shtml. Accessed: 2017
25. Lee D. McFearin - Chroot Jail, 2011
26. Timothy Prickett Morgan - Google Leverages Container Expertise On Its Cloud, 2014
27. Joris Claassen, Ralph Koning, Paola Grosso - Linux containers networking:
performance and scalability of kernel modules, 2016
28. Docker Networking Guide -
https://docs.docker.com/engine/userguide/networking/#default-networks. Accessed: 2017
29. The 3-Clause BSD License - https://opensource.org/licenses/BSD-3-Clause. Accessed:
2017
30. DPDK documentation: SR-IOV - http://dpdk.org/doc/guides/nics/intel_vf.html.
Accessed: 2017
31. Kate Keahey, Robert Ricci - Enabling a new future for cloud computing, 2014
32. CloudLab website - https://cloudlab.us/. Accessed: 2017
33. Apache software license - http://www.apache.org/licenses/LICENSE-2.0.html.
Accessed: 2017
34. Nick McKeown, Tom Anderson, Hari Balakrishnan, Jennifer Rexford, Jonathan
Turner, Scott Shenker - OpenFlow: Enabling Innovation in Campus Networks, 2008
35. Ben Pfaff et al. - The Design and Implementation of Open vSwitch, 2015
36. Robin G. (Intel) - OvS-DPDK High-level Architecture, 2016
37. The MIT License - https://opensource.org/licenses/MIT. Accessed: 2017
38. LuaJIT website - http://luajit.org/. Accessed: 2017
39. Paul Emmerich et al. - MoonGen: A Scriptable High-Speed Packet Generator, 2015
40. Pang Fubin et al. - The accuracy of IEEE 1588 time synchronization protocol and its
improvement, 2015
41. Emulab Dell R430 (aka "d430") machines - specifications -
https://wiki.emulab.net/wiki/d430. Accessed: 2017
42. Dell PowerEdge R430 specifications -
http://i.dell.com/sites/doccontent/shared-content/data-sheets/en/Documents/Dell-PowerEdge-R430-Spec-Sheet.pdf. Accessed: 2017
43. Intel® Xeon® Processor E5-2630 v3 - http://ark.intel.com/products/83356/. Accessed:
2017
44. Intel® Ethernet Converged Network Adapters X710 10/40 GbE -
http://www.intel.com/content/www/us/en/ethernet-products/converged-network-adapters/ethernet-x710-brief.html. Accessed: 2017
45. Christoph Lameter - NUMA (Non-Uniform Memory Access): An Overview, 2013
46. Mel Gorman - Huge pages part 1 (Introduction), 2010
47. DPDK FAQ - http://dpdk.org/doc/guides-2.0/rel_notes/faq.html. Accessed: 2017
48. DPDK documentation - building DPDK from source -
http://dpdk.org/doc/guides/linux_gsg/build_dpdk.html. Accessed: 2017
49. L2fwd documentation -
http://dpdk.org/doc/guides/sample_app_ug/l2_forward_real_virtual.html. Accessed: 2017
50. Docker Documentation - resource constraints -
https://docs.docker.com/engine/admin/resource_constraints/#configure-the-default-cfs-scheduler. Accessed: 2017
51. Phoronix Test Suite website - https://www.phoronix-test-suite.com/. Accessed: 2017
52. Vhost-User Feature for QEMU -
http://www.virtualopensystems.com/en/solutions/guides/snabbswitch-qemu/. Accessed:
2017
A Appendix
B Low latency configuration results
Figure 17: Low latency configuration - throughput for different packet sizes
Figure 18: Low latency configuration - latency, cumulative distribution function, 64 byte packets
C Reproducibility
In an effort to increase the reproducibility of this research, the following steps were taken:
• The low-latency Linux kernel was used during all benchmarks
• DPDK PMD threads were pinned to cores on the same NUMA node
• The huge page memory used by the DPDK process was allocated from the same
NUMA node mentioned above
• All of the scripts created during this dissertation – along with all of the data gathered –
are publicly available in a GitHub repository created for the purpose
• The CloudLab experiment profile created for, and used during, this project has been
made public. This means that anyone with a CloudLab account can instantiate an
experiment from this profile – getting instant access to an environment identical to the
one used for this project, including all of the scripts and data gathered
• The Docker container used for benchmarks is publicly available on Docker Hub
Resource                 Value
GitHub repository URL    https://github.com/edagar/dissertation_scripts
CloudLab profile name    two_connected_d430_nodes