

Throughput and Latency of Virtual Switching with Open
vSwitch: A Quantitative Analysis
Paul Emmerich · Daniel Raumer · Sebastian Gallenmüller · Florian
Wohlfart · Georg Carle

Abstract Virtual switches, like Open vSwitch, have emerged as an important part of today's data centers. They connect interfaces of virtual machines and provide an uplink to the physical network via network interface cards. We discuss usage scenarios for virtual switches involving physical and virtual network interfaces. We present extensive black-box tests to quantify the throughput and latency of software switches with emphasis on the market leader, Open vSwitch. Finally, we explain the observed effects using white-box measurements.

Keywords Network measurement · Cloud · Performance evaluation · Performance characterization · MoonGen

P. Emmerich · D. Raumer · S. Gallenmüller · F. Wohlfart · G. Carle
Technical University of Munich, Department of Informatics, Chair of Network Architectures and Services
Boltzmannstr. 3, 85748 Garching, Germany
E-mail: {emmericp|raumer|gallenmu|wohlfart|carle}@net.in.tum.de

1 Introduction

Software switches form an integral part of any virtualized computing setup. They provide network access for virtual machines (VMs) by linking virtual and physical network interfaces. The deployment of software switches in virtualized environments has led to the extended term virtual switches and paved the way for the mainstream adoption of software switches [37], which did not receive much attention before. In order to meet the requirements in a virtualized environment, new virtual switches have been developed that focus on performance and provide advanced features in addition to the traditional benefits of software switches: high flexibility, vendor independence, low costs, and conceptual benefits for switching without Ethernet bandwidth limitations. The most popular virtual switch implementation – Open vSwitch (OvS [43]) – is heavily used in cloud computing frameworks like OpenStack [7] and OpenNebula [6]. OvS is an open source project that is backed by an active community, and supports common standards such as OpenFlow, SNMP, and IPFIX.

The performance of packet processing in software depends on multiple factors including the underlying hardware and its configuration, the network stack of the operating system, the virtualization hypervisor, and traffic characteristics (e.g., packet size, number of flows). Each factor can significantly hurt the performance, which motivates systematic experiments to study the performance of virtual switching. We carry out experiments to quantify performance influencing factors and describe the overhead that is introduced by the network stack of virtual machines, using Open vSwitch in representative scenarios.

Knowing the performance characteristics of a switch is important when planning or optimizing the deployment of a virtualization infrastructure. We show how one can drastically improve performance by using a different IO backend for Open vSwitch. Explicitly mapping virtual machines and interrupts to specific cores is another important configuration aspect, as we show with a measurement.

The remainder of this paper is structured as follows: Section 2 provides an overview of software switching. We explain recent developments in hardware and software that enable sufficient performance in general purpose PC systems based on commodity hardware, highlight challenges, and provide an overview of Open vSwitch. Furthermore, we present related work on performance measurements in Section 3. The following Section 4 explains the different test setups for the measurements of this paper. Section 5 and Section 6 describe our study on the performance of software switches and their delay respectively. Ultimately, Section 7 sums up our results and gives advice for the deployment of software switches.

2 Software Switches

A traditional hardware switch relies on special purpose hardware, e.g., content addressable memory to store the forwarding or flow table, to process and forward packets. In contrast, a software switch is the combination of commodity PC hardware and software for packet switching and manipulation. Packet switching in software grew in importance with the increasing deployment of host virtualization. Virtual machines (VMs) running on the same host system must be interconnected and connected to the physical network. If the focus lies on switching between virtual machines, software switches are often referred to as virtual switches. A virtual switch is an addressable switching unit of potentially many software and hardware switches spanning over one or more physical nodes (e.g., the "One Big Switch" abstraction [29]). Compared to the default VM bridging solutions, software switches like OvS are more flexible and provide a whole range of additional features like advanced filter rules to implement firewalls and per-flow statistics tracking.

[Fig. 1 Application scenario of a virtual switch: VMs attach via vNICs to a software switch, which connects to pNICs and the physical switching infrastructure]

Figure 1 illustrates a typical application for virtual switching with both software and hardware switches. The software switch connects the virtual network interface cards (NIC) vNIC with the physical NICs pNIC. Typical applications in virtualization environments include traffic switching from pNIC to vNIC, vNIC to pNIC, and vNIC to vNIC. For example, OpenStack recommends multiple physical NICs to separate networks and forwards traffic between them on network nodes that implement firewalling or routing functionality [39]. As components of future network architectures, packet flows traversing a chain of VMs are also discussed [33].

The performance of virtual data plane forwarding capabilities is a key issue for migrating existing services into VMs when moving from a traditional data center to a cloud system like OpenStack. This is especially important for applications like web services which make extensive use of the VM's networking capabilities.

Although hardware switches are currently the dominant way to interconnect physical machines, software switches like Open vSwitch come with broad support of OpenFlow features and were the first to support new versions. Therefore, pNIC to pNIC switching allows software switches to be an attractive alternative to hardware switches in certain scenarios. For software switches the number of entries in the flow table is just a matter of configuration, whereas it is limited to a few thousand in hardware switches [49].

2.1 State of the Art

Multiple changes in the system and CPU architectures significantly increase the packet processing performance of modern commodity hardware: integrated memory controllers in CPUs, efficient handling of interrupts, and offloading mechanisms implemented in the NICs. Important support mechanisms are built into the network adapters: checksum calculations and distribution of packets directly to the addressed VM [8]. NICs can transfer packets into memory (DMA) and even into the CPU caches (DCA) [4] without involving the CPU. DCA improves the performance by reducing the number of main memory accesses [26]. Further methods such as interrupt coalescence aim at allowing batch style processing of packets. These features mitigate the effects of interrupt storms and therefore reduce the number of context switches. Network cards support modern hardware architecture principles such as multi-core setups: Receive Side Scaling (RSS) distributes incoming packets among queues that are attached to individual CPU cores to maintain cache locality on each packet processing core.

These features are available in commodity hardware, and the driver needs to support them. These considerations apply for packet switching in virtual host environments as well as between physical interfaces. As the CPU proves to be the main bottleneck [18, 35, 13, 47], features like RSS and offloading are important to reduce CPU load and help to distribute load among the available cores.
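Whether these hardware features are actually in use depends on driver and system configuration. As a quick illustration – not part of the measurement setup described later – the following minimal Python sketch queries RSS queues, offloads, and interrupt coalescing via the standard ethtool utility; the interface name eth0 is a placeholder.

```python
#!/usr/bin/env python3
"""Sketch: inspect NIC features (RSS queues, offloads, interrupt coalescing)
on a Linux host via ethtool. The interface name is a placeholder."""
import subprocess

IFACE = "eth0"  # placeholder, e.g. an 82599-based port used as pNIC

def ethtool(*args: str) -> str:
    # Run ethtool with the given flags against IFACE and return its output.
    return subprocess.run(["ethtool", *args, IFACE],
                          capture_output=True, text=True).stdout

print(ethtool("-l"))   # RX/TX queue (RSS channel) configuration
print(ethtool("-k"))   # offload features (checksumming, segmentation, ...)
print(ethtool("-c"))   # interrupt coalescing settings

# Example tuning (root required): distribute packets over four RSS queues.
# subprocess.run(["ethtool", "-L", IFACE, "combined", "4"], check=True)
```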

Packet forwarding applications such as Open vSwitch [43, 5], the Linux router, or the Click Modular Router [31] avoid copying packets when forwarding between interfaces by performing the actual forwarding in a kernel module. However, forwarding a packet to a user space application or a VM requires a copy operation with the standard Linux network stack. There are several techniques based on memory mapping that can avoid this by giving a user space application direct access to the memory used by the DMA transfer. Prominent examples of frameworks that implement this are PF_RING DNA [16], netmap [46], and DPDK [9, 1]. E.g., with DPDK running on an Intel Xeon E5645 (6x 2.4 GHz cores) an L3 forwarding performance of 35.2 Mpps can be achieved [9]. We showed in previous work that these frameworks not only improve the throughput but also reduce the delay [21]. Virtual switches like VALE [48] achieve over 17 Mpps vNIC to vNIC bridging performance by utilizing shared memory between VMs and the hypervisor. Prototypes similar to VALE exist [45, 33]. Virtual switches in combination with guest OSes like ClickOS [33] achieve notable packet processing performance in VMs. All these techniques rely on changes made to drivers, VM environments, and network stacks. These modified drivers are only available for certain NICs. Experiments which combine OvS with the described high-speed packet processing frameworks [44, 47] demonstrate performance improvements.

2.2 Packet Reception in Linux

The packet reception mechanism implemented in Linux is called NAPI. Salim et al. [50] describe NAPI in detail. A network device signals incoming traffic to the OS by triggering interrupts. During phases of high network load, the interrupt handling can overload the OS. To keep the system reactive for tasks other than handling these interrupts, a NAPI-enabled device allows reducing the number of interrupts generated. Under high load, one interrupt signals the reception of multiple packets.

A second possibility to reduce the interrupts is offered by the Intel network driver. There the Interrupt Throttling Rate (ITR) specifies the maximum number of interrupts per second a network device is allowed to generate. The following measurements use the ixgbe driver, which was investigated by Beifuß et al. [11]. This driver has an ITR of 100,000 interrupts per second in place for traffic below 10 MB/s (156.25 kpps); the ITR is decreased to 20,000 if the traffic hits up to 20 MB/s (312.5 kpps); above that throughput value the ITR is reduced to 8,000.

2.3 Open vSwitch

Open vSwitch [5, 41, 42, 43] can be used both as a pure virtual switch in virtualized environments and as a general purpose software switch that connects physically separated nodes. It supports OpenFlow and provides advanced features for network virtualization.

[Fig. 2 Open vSwitch architecture representing the data processing flows: the ovs-vswitchd daemon (ofproto, dpif) in user space talks to an OpenFlow controller and controls the datapath kernel module; packets take either the fast path inside the kernel or the slow path through the daemon]

Figure 2 illustrates the different processing paths in OvS. The two most important components are the switch daemon ovs-vswitchd that controls the kernel module and implements the OpenFlow protocol, and the datapath, a kernel module that implements the actual packet forwarding. The datapath kernel module processes packets using a rule-based system: it keeps a flow table in memory, which associates flows with actions. An example for such a rule is forwarding all packets with a certain destination MAC address to a specific physical or virtual port. Rules can also filter packets by dropping them depending on specific destination or source IP addresses. The ruleset supported by OvS in its kernel module is simpler than the rules defined by OpenFlow. These simplified rules can be executed faster than the possibly more complex OpenFlow rules. So the design choice to explicitly not support all features OpenFlow offers results in a higher performance for the kernel module of OvS [42].
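As a concrete, hedged illustration of this rule-based processing (not taken from the paper's test setup; bridge name, MAC address, and port numbers are placeholders), an OpenFlow rule can be installed with ovs-ofctl and the resulting kernel flow cache inspected with ovs-dpctl:

```python
#!/usr/bin/env python3
"""Sketch: install an OpenFlow rule on an OvS bridge and inspect the
datapath flow cache that serves the fast path. Bridge name, MAC address,
and port numbers are placeholders."""
import subprocess

def run(cmd):
    print("$", " ".join(cmd))
    return subprocess.run(cmd, capture_output=True, text=True).stdout

# OpenFlow rule handled by ovs-vswitchd: forward a destination MAC to port 2.
run(["ovs-ofctl", "add-flow", "br0",
     "priority=100,dl_dst=52:54:00:12:34:56,actions=output:2"])

# Rules in the OpenFlow table (slow path).
print(run(["ovs-ofctl", "dump-flows", "br0"]))

# Cached datapath flows (fast path); entries show up after the first packet
# of a flow has traversed the slow path and expire after an inactivity timeout.
print(run(["ovs-dpctl", "dump-flows"]))
```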

A packet that can be processed by a datapath rule takes the fast path and is directly processed in the kernel module without invoking any other parts of OvS. Figure 2 highlights this fast path with a solid orange line. Packets that do not match a flow in the flow table are forced on the slow path (dotted blue line), which copies the packet to the user space and forwards it to the OvS daemon in the user space. This is similar to the encapsulate action in OpenFlow, which forwards a packet that cannot be processed directly on a switch to an OpenFlow controller. The slow path is implemented by the ovs-vswitchd daemon, which operates on OpenFlow rules. Packets that take this path are matched against OpenFlow rules, which can be added by an external OpenFlow controller or via a command line interface. The daemon derives datapath rules for packets based on the OpenFlow rules and installs them in the kernel module so that future packets of this flow can take the fast path. All rules in the datapath are associated with an inactivity timeout. The flow table in the datapath therefore only contains the rules required to handle the currently active flows, so it acts as a cache for the bigger and more complicated OpenFlow flow table in the slow path.

3 Related Work

Detailed performance analyses of PC-based packet processing systems have been continuously addressed in the past. In 2005, Tedesco et al. [51] presented measured latencies for packet processing in a PC and subdivided them into different internal processing steps. In 2007, Bolla and Bruschi [13] performed pNIC to pNIC measurements (according to RFC 2544 [14]) on a software router based on Linux 2.6¹. Furthermore, they used profiling to explain their measurement results. Dobrescu et al. [18] revealed performance influences of multi-core PC systems under different workloads [17]. Contributions to the state of the art of latency measurements in software routers were also made by Angrisani et al. [10] and Larsen et al. [32] who performed a detailed analysis of TCP/IP traffic latency. However, they only analyzed the system under low load while we look at the behavior under increasing load up to 10 Gbit/s. A close investigation of the latency in packet processing software like OvS is presented by Beifuß et al. [11].

¹ The "New API" network interface was introduced with this kernel version.

In the context of different modifications to the guest and host OS network stack (cf. Section 2.1), virtual switching performance was measured [48, 33, 15, 44, 47, 27], but the presented data provide only limited possibility for direct comparison. Other studies addressed the performance of virtual switching within a performance analysis of cloud datacenters [53], but provide less detailed information on virtual switching performance.

Running network functions in VMs and connecting them via a virtual switch can be used to implement network function virtualization (NFV) with service function chaining (SFC) [23]. Martins et al. present ClickOS, a software platform for small and resource-efficient virtual machines implementing network functions [34]. Niu et al. discuss the performance of ClickOS [34] and SoftNIC [24] when used to implement SFC [38]. Panda et al. consider the overhead of virtualization too high to implement SFC and present NetBricks [40], an NFV framework for writing fast network functions in the memory-safe language Rust. Our work does not focus on NFV: we provide benchmark results for Open vSwitch, a mature and stable software switch that supports arbitrary virtual machines.

The first two papers from the OvS developers [41, 42] only provide coarse measurements of throughput performance in bits per second in vNIC to vNIC switching scenarios with Open vSwitch. Neither frame lengths nor measurement results in packets per second (pps) nor delay measurements are provided. In 2015 they published design considerations for efficient packet processing and how they are reflected in the OvS architecture [43]. In this publication, they also presented a performance evaluation with focus on the FIB lookup, as this is supported by hierarchical caches in OvS. In [12] the authors measured a software OpenFlow implementation in the Linux kernel that is similar to OvS. They compared the performance of the data plane of the Linux bridge-utils software, the IP forwarding of the Linux kernel, and the software implementation of OpenFlow, and studied the influence of the size of the used lookup tables. A basic study on the influence of QoS treatment and network separation on OvS can be found in [25]. The authors of [28] measured the sojourn time of different OpenFlow switches. Although the main focus was on hardware switches, they measured a delay between 35 and 100 microseconds for the OvS datapath. Whiteaker et al. [54] observed a long tail distribution of latencies when packets are forwarded into a VM, but their measurements were restricted to a 100 Mbit/s network due to hardware restrictions of their time stamping device. Rotsos et al. [49] presented OFLOPS, a framework for OpenFlow switch evaluation. They applied it, amongst others, to Open vSwitch. Deployed on systems with a NetFPGA, the framework measures accurate time delay of OpenFlow table updates but not the data plane performance. Their study revealed actions that can be performed faster by software switches than by hardware switches, e.g., requesting statistics. We previously presented delay measurements of VM network packet processing in selected setups on an application level [21].

Latency measurements are sparse in the literature as they are hard to perform in a precise manner [20]. Publications often rely either on special-purpose hardware, often only capable of low rates (e.g., [13, 54]), or on crude software measurement tools that are not precise enough to get insights into latency distributions on low-latency devices such as virtual switches. We use our packet generator MoonGen, which supports hardware timestamping on Intel commodity NICs, for the latency evaluation here [20].

We addressed the throughput of virtual switches in a previous publication [22] on which this paper is based. This extended version adds latency measurements and new throughput measurements on updated software versions of the virtual switches.

4 Test Setup

The description of our test setup reflects the specific hardware and software used for our measurements and includes the various VM setups investigated. Figure 3 shows the server setup.

[Fig. 3 Investigated test setups: (a) pNIC to vNIC, (b) pNIC to pNIC through VM, (c) vNIC to vNIC, (d) pNIC forwarding]

4.1 Hardware Setup for Throughput Tests

Our device under test (DuT) is equipped with an Intel X520-SR2 and an Intel X540-T2 dual 10 GbE network interface card, which are based on the Intel 82599 and Intel X540 Ethernet controllers. The processor is a 3.3 GHz Intel Xeon E3-1230 V2 CPU. We disabled Hyper-Threading, Turbo Boost, and power saving features that scale the frequency with the CPU load because we observed measurement artifacts caused by these features.

In black-box tests we avoid any overhead on the DuT through measurements, so we measure the offered load and the packet rate on the packet generator and sink. The DuT runs the Linux tool perf for white-box tests; this overhead reduces the maximum packet rate by ∼1%.

Figure 3 shows the setups for tests involving VMs on the DuT (Figure 3a, 3b, and 3c) and a pure pNIC switching setup (Figure 3d), which serves as baseline for comparison.

4.2 Software Setup

The DuT runs the Debian-based live Linux distribution Grml with a 3.7 kernel, the ixgbe 3.14.5 NIC driver with interrupts statically assigned to CPU cores, OvS 2.0.0 and OvS 2.4.0 with DPDK using a static set of OpenFlow rules, and qemu-kvm 1.1.2 with VirtIO network adapters unless mentioned otherwise.

The throughput measurements use the packet generator pfsend from the PF_RING DNA [16] framework. This tool is able to generate minimally sized UDP packets at line rate on 10 Gbit interfaces (14.88 Mpps). The packet rate is measured by utilizing statistics registers of the NICs on the packet sink.

As the PF_RING based packet generator does not support delay measurements, we use our traffic generator MoonGen [20] for those. This tool can also generate full line rate traffic with minimally sized UDP packets. MoonGen uses hardware timestamping features of Intel commodity NICs to measure latencies with sub-microsecond precision and accuracy. MoonGen also precisely controls the inter-departure times of the generated packets. The characteristics of the inter-packet spacings are of particular interest for latency measurements as they can have significant effects on the processing batch size in the system under test [20].

4.3 Setup with Virtual Machines

Figure 3a, 3b, and 3c show the setups for tests involving VMs on the DuT. Generating traffic efficiently directly inside a VM proved to be a challenging problem because both of our load generators are based on packet IO frameworks, which only work with certain NICs. Porting them to a virtualization-aware packet IO framework (e.g., vPF_RING [15]) would circumvent the VM-hypervisor barrier, which we are trying to measure.

The performance of other load generators was found to be insufficient, e.g., the iperf utility only managed to generate 0.1 Mpps. Therefore, we generate traffic externally and send it through a VM. A similar approach to load generation in VMs can be found in [2]. Running profiling in the VM shows that about half of the time is spent receiving traffic and half of it is spent sending traffic out. Therefore, we assume that the maximum possible packet rate for a scenario in which a VM internally generates traffic is twice the value we measured in the scenario where traffic is sent through a VM.
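The 14.88 Mpps line-rate figure quoted in Section 4.2 follows directly from the 10 GbE bit rate and the per-frame overhead on the wire; a small worked example using standard Ethernet overhead values (not specific to our setup):

```python
#!/usr/bin/env python3
"""Maximum packet rate on a 10 Gbit/s link for a given Ethernet frame size.
Each frame occupies frame + 20 bytes on the wire (7 B preamble, 1 B SFD,
12 B inter-frame gap)."""

LINE_RATE = 10e9          # bit/s
PER_FRAME_OVERHEAD = 20   # bytes of preamble, SFD, and inter-frame gap

def max_pps(frame_size: int) -> float:
    bits_on_wire = (frame_size + PER_FRAME_OVERHEAD) * 8
    return LINE_RATE / bits_on_wire

for size in (64, 128, 256, 512, 1024, 1518):
    print(f"{size:5d} byte frames: {max_pps(size) / 1e6:6.2f} Mpps")

# 64 byte frames -> 14.88 Mpps, the figure quoted in Section 4.2.
```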

4.4 Adoptions for Delay Measurements

Precise and accurate latency measurements require a synchronized clock on the packet generator and sink. To avoid complex synchronization, we send the output from the DuT back to the source host. The measurement server generates traffic on one port, the DuT forwards traffic between the two ports of its NIC and sends it back to the second port on the measurement server. Therefore, the delay measurements are not possible on all setups (cf. Figure 3). For the first delay measurements the DuT forwards traffic between two pNICs as depicted in Figure 3d. We use these results as baseline to compare them with delays in the second setup, which uses a VM to forward traffic between the two pNICs as shown in Figure 3b.

Our latency measurements require the traffic to be sent back to the source. We used an X540 NIC for the latency tests because this interface was the only available dual port NIC in our testbed.

5 Throughput Measurements

We ran tests to quantify the throughput of several software switches with a focus on OvS in scenarios involving both physical and virtual network interfaces. Throughput can be measured as packet rate in Mpps or bandwidth in Gbit/s. We report the results of all experiments as packet rate at a given packet size.

5.1 Throughput Comparison

Table 1 compares the performance of several forwarding techniques with a single CPU core per VM and switch. DPDK vSwitch started as a port of OvS to the user space packet processing framework DPDK [3] and was later merged into OvS. DPDK support is a compile-time option in recent versions of OvS. We use the name DPDK vSwitch here to refer to OvS with DPDK support enabled.

Table 1 Single Core Data Plane Performance Comparison (all values in Mpps)

  Application      pNIC to pNIC   pNIC to vNIC   vNIC to pNIC   vNIC to vNIC
  Open vSwitch         1.88           0.85           0.3            0.27
  IP forwarding        1.58           0.78           0.19           0.16
  Linux bridge         1.11           0.74           0.2            0.19
  DPDK vSwitch        13.51           2.45           1.1            1.0

DPDK vSwitch is the fastest forwarding technique, but it is still experimental and not yet ready for real-world use: we found it cumbersome to use and it was prone to crashes requiring a restart of all VMs to restore connectivity. Moreover, the administrator needs to statically assign CPU cores to DPDK vSwitch. It then runs a busy wait loop that fully utilizes these CPU cores at all times – there is 100% CPU load, even when there is no network load. There is no support for any power-saving mechanism or yielding the CPU between packets. This is a major concern for green computing, efficiency, and resource allocation. Hence, our main focus is the widely deployed kernel forwarding techniques, but we also include measurements for DPDK vSwitch to show possible improvements for the next generation of virtual switches.

Guideline 1 Use Open vSwitch instead of the Linux bridge or router.

Open vSwitch proves to be the second fastest virtual switch and the fastest one that runs in the Linux kernel. The Linux bridge is slightly faster than IP forwarding when it is used as a virtual switch with vNICs. IP forwarding is faster when used between pNICs. This shows that OvS is a good general purpose software switch for all scenarios. The rest of this section will present more detailed measurements of OvS. All VMs were attached via VirtIO interfaces.

There are two different ways to include VMs in DPDK vSwitch: Intel ivshmem and vhost user with VirtIO. Intel ivshmem requires a patched version of qemu and is designed to target DPDK applications running inside the VM. The latter is a newer implementation of the VM interface; it is the default in DPDK vSwitch, works with stock qemu, and targets VMs that are not running DPDK.

Intel ivshmem is significantly faster with DPDK running inside the VM [2]. However, it was removed from DPDK in 2016 [19] due to its design issues and low number of users [52]. The more generic, but slower, vhost user API connects VMs via the stable and standardized VirtIO interface [36]. All our measurements involving VMs in DPDK vSwitch were thus conducted with vhost user and VirtIO.
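For reference, attaching a VM via vhost user typically looks like the following sketch; it assumes an OvS build with DPDK support already initialized, follows the OvS-DPDK documentation of that era, and uses placeholder bridge and port names (details differ between versions):

```python
#!/usr/bin/env python3
"""Sketch: attach a VM to a DPDK-enabled OvS bridge via a vhost user port.
Assumes OvS was built and started with DPDK support; bridge and port names
are placeholders and command details may differ between versions."""
import subprocess

def vsctl(*args: str) -> None:
    subprocess.run(["ovs-vsctl", *args], check=True)

# Bridge using the user space (netdev) datapath instead of the kernel module.
vsctl("add-br", "br0", "--", "set", "bridge", "br0", "datapath_type=netdev")

# Physical port handled by DPDK.
vsctl("add-port", "br0", "dpdk0", "--", "set", "Interface", "dpdk0", "type=dpdk")

# vhost user port; qemu connects a virtio-net device to the created socket.
vsctl("add-port", "br0", "vhost-user-1", "--",
      "set", "Interface", "vhost-user-1", "type=dpdkvhostuser")
```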

5.2 Open vSwitch Performance in pNIC to pNIC Forwarding

Figure 4 shows the basic performance characteristics of OvS in a unidirectional forwarding scenario between two pNICs with various packet sizes and flows. Flow refers to a combination of source and destination IP addresses and ports. The packet size is irrelevant until the bandwidth is limited by the 10 Gbit/s line rate. We ran further tests in which we incremented the packet size in steps of 1 Byte and found no impact of packet sizes that are not multiples of the CPU's word or cache line size. The throughput scales sub-linearly with the number of flows as the NIC distributes the flows to different CPU cores. Adding an additional flow increases the performance by about 90% until all four cores of the CPU are utilized.

[Fig. 4 Packet rate with various packet sizes and flows]

As we observed linear scaling with earlier versions of OvS, we investigated further. Figure 5 compares the throughput and scaling with flows of all recent versions of OvS that are compatible with Linux kernel 3.7. Versions prior to 1.11.0 scale linearly whereas later versions only scale sub-linearly, i.e. adding an additional core does not increase the throughput by 100% of the single flow throughput. Profiling reveals that this is due to a contended spin lock that is used to synchronize access to statistics counters for the flows. Later versions support wild card flows in the kernel and match the whole synthetic test traffic to a single wildcarded datapath rule in this scenario. So all packets of the different flows use the same statistics counters, which leads to lock contention. A realistic scenario with multiple rules or more (virtual) network ports does not exhibit this behavior. Linear scaling with the number of CPU cores can therefore be assumed in real-world scenarios, and further tests are restricted to a single CPU core. The throughput per core is 1.88 Mpps.

[Fig. 5 Packet rate of different Open vSwitch versions (1.9.3 to 2.4.0), 1 to 4 flows: (a) packet rates, (b) normalized to one flow]

5.3 Larger Number of Flows

We derive a test case from the OvS architecture described in Section 2.3: Testing more than four flows exercises the flow table lookup and update mechanism in the kernel module due to increased flow table size. The generated flows for this test use different layer 2 addresses to avoid the generation of wild card rules in the OvS datapath kernel module. This simulates a switch with multiple attached devices.

Figure 6 shows that the total throughput is affected by the number of flows due to increased cache misses during the flow table lookup. The total throughput drops from about 1.87 Mpps² with a single flow to 1.76 Mpps with 2000 flows. The interrupts were restricted to a single CPU core.

² Lower than the previously stated figure of 1.88 Mpps due to active profiling.

[Fig. 6 Flow table entries vs. cache misses]

Another relevant scenario for a cloud system is cloning a flow and sending it to multiple output destinations, e.g., to forward traffic to an intrusion detection system or to implement multicast. Figure 7 shows that performance drops by 30% when a flow is sent out twice and another 25% when it is copied one more time. This demonstrates that a large amount of the performance can be attributed to packet I/O and not processing. About 30% of the CPU time is spent in the driver and network stack sending packets. This needs to be considered when a monitoring system is to be integrated into a system involving software switches. An intrusion detection system often works via passive monitoring of mirrored traffic. Hardware switches can do this without overhead in hardware, but this is a significant cost for a software switch.

[Fig. 7 Effects of cloning a flow]

5.4 Open vSwitch Throughput with Virtual Network Interfaces

Virtual network interfaces exhibit different performance characteristics than physical interfaces. For example, dropping packets in an overload condition is done efficiently and concurrently in hardware on a pNIC, whereas a vNIC needs to drop packets in software. We therefore compare the performance of the pNIC to pNIC forwarding with the pNIC to vNIC scenario shown in Figure 3a.

Figure 8 compares the observed throughput under increasing offered load with both physical and virtual interfaces. The graph for traffic sent into a VM shows an inflection point at an offered load of 0.5 Mpps. The throughput then continues to increase until it reaches 0.85 Mpps, but a constant ratio of the incoming packets is dropped. This start of drops is accompanied by a sudden increase in CPU load in the kernel. Profiling the kernel with perf shows that this is caused by increased context switching and functions related to packet queues. Figure 9 plots the CPU load caused by context switches (kernel function switch_to) and functions related to virtual NIC queues at the tested offered loads with a run time of five minutes per run. This indicates that congestion occurs at the vNICs and the system tries to resolve this by forcing a context switch to the network task of the virtual machine to retrieve the packets. This additional overhead leads to drops.

[Fig. 8 Offered load vs. throughput with pNICs and vNICs]
[Fig. 9 CPU load of context switching and vNIC queuing]
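This kind of congestion can also be spotted without a full profiling run: the kernel exports a global context-switch counter in /proc/stat. A minimal sketch that samples it once per second while load is applied (the values are system-wide, so this is only a coarse indicator compared to the per-function breakdown in Figure 9):

```python
#!/usr/bin/env python3
"""Sketch: sample the system-wide context switch rate from /proc/stat while
offered load increases. A coarse, dependency-free indicator of the effect
profiled in Figure 9."""
import time

def context_switches() -> int:
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("ctxt "):
                return int(line.split()[1])
    raise RuntimeError("ctxt line not found")

prev = context_switches()
for _ in range(10):
    time.sleep(1)
    cur = context_switches()
    print(f"{cur - prev} context switches/s")
    prev = cur
```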

Packet sizes are also relevant in comparison to the pNIC to pNIC scenario because the packet needs to be copied to the user space to forward it to a VM. Figure 10 plots the throughput and the CPU load of the kernel function copy_user_enhanced_fast_string, which copies a packet into the user space, in the forwarding scenario shown in Figure 3a. The throughput drops only marginally from 0.85 Mpps to 0.8 Mpps until it becomes limited by the line rate with packets larger than 656 Byte. Copying packets poses a measurable but small overhead. The reason for this is the high memory bandwidth of modern servers: our test server has a memory bandwidth of 200 Gbit per second. This means that VMs are well-suited for running network services that rely on bulk throughput with large packets, e.g., file servers. Virtualizing packet processing or forwarding systems that need to be able to process a large number of small packets per second is, however, problematic.

[Fig. 10 Packet size vs. throughput and memory copy overhead with vNICs]

We derive another test case from the fact that the DuT runs multiple applications: OvS and the VM receiving the packets. This is relevant on a virtualization server where the running VMs generate substantial CPU load. The VM was pinned to a different core than the NIC interrupt for the previous test. Figure 11 shows the throughput in the same scenario under increasing offered load, but without pinning the VM to a core. This behavior can be attributed to a scheduling conflict because the Linux kernel does not measure the load caused by interrupts properly by default. Figure 12 shows the average CPU load of a core running only OvS as seen by the scheduler (read from the procfs pseudo filesystem with the mpstat utility) and compares it to the actual average load measured by reading the CPU's cycle counter with the profiling utility perf.

[Fig. 11 Throughput without explicitly pinning all tasks to CPU cores]
[Fig. 12 CPU load when forwarding packets with Open vSwitch: perf (cycle counter) vs. mpstat]

Guideline 2 Virtual machine cores and NIC interrupts should be pinned to disjoint sets of CPU cores.

The Linux scheduler does not measure the CPU load caused by hardware interrupts properly and therefore schedules the VM on the same core, which impacts the performance. CONFIG_IRQ_TIME_ACCOUNTING is a kernel option which can be used to enable accurate reporting of CPU usage by interrupts, which resolves this conflict. However, this option is not enabled by default in the Linux kernel because it slows down interrupt handlers, which are designed to be executed as fast as possible.

Guideline 3 CPU load of cores handling interrupts should be measured with hardware counters using perf.

We conducted further tests in which we sent external traffic through a VM and into a different VM or to another pNIC as shown in Figure 3b and 3c in Section 4. The graphs for the results of more detailed tests in these scenarios provide no further insight beyond the already discussed results from this section because sending and receiving traffic from and to a vNIC show the same performance characteristics.
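Whether the running kernel has this accounting enabled can be checked from its build configuration; a small hedged sketch (config file locations vary between distributions):

```python
#!/usr/bin/env python3
"""Sketch: check whether the running kernel was built with
CONFIG_IRQ_TIME_ACCOUNTING, which makes mpstat's %irq/%soft columns
trustworthy. Config file locations differ between distributions."""
import gzip
import os

OPTION = "CONFIG_IRQ_TIME_ACCOUNTING"
release = os.uname().release
candidates = [f"/boot/config-{release}", "/proc/config.gz"]

for path in candidates:
    if not os.path.exists(path):
        continue
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as f:
        for line in f:
            if line.startswith(OPTION):
                print(path, "->", line.strip())
                break
        else:
            print(path, "->", OPTION, "not set")
    break
else:
    print("no kernel config found; fall back to perf-based measurements")
```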

5.5 Conclusion

Virtual switching is limited by the number of packets, not the overall throughput. Applications that require a large number of small packets, e.g., virtualized network functions, are thus more difficult for a virtual switch than applications relying on bulk data transfer, e.g., file servers. Overloading virtual ports on the switch can lead to packet loss before the maximum throughput is achieved.

Using the DPDK backend in OvS can improve the throughput by a factor of 7 when no VMs are involved. With VMs, an improvement of a factor of 3 to 4 can be achieved, cf. Table 1. However, DPDK requires statically assigned CPU cores that are constantly being utilized by a busy-wait polling logic, causing 100% load on these cores. Using the slower default Linux IO backend results in a linear correlation between network load and CPU load, cf. Figure 12.

6 Latency Measurements

In another set of measurements we address the packet delay introduced by software switching in OvS. Therefore, we investigate two different scenarios. In the first experiment, traffic is forwarded between two physical interfaces (cf. Figure 3d). For the second scenario the packets are not forwarded between the physical interfaces directly but through a VM as shown in Figure 3b.

6.1 Forwarding between Physical Interfaces

Figure 13 shows the measurement for forwarding between two pNICs by Open vSwitch. This graph features four different levels of delay. The first level has an average latency of around 15 µs and a packet transfer rate of up to 179 kpps. Above that transfer rate the second level has a delay value of around 28 µs and lasts up to 313 kpps. Beyond that rate the third level offers a latency of around 53 µs up to a transfer rate of 1.78 Mpps. We selected three different points (P1 - P3) – one as a representative for each of the levels before the system becomes overloaded (cf. Figure 13). Table 2 also includes these points to give typical values for their corresponding level.

[Fig. 13 Latency of packet forwarding between pNICs (25th, 50th, 75th, and 99th percentiles)]

The reason for the shape and length of the first three levels is the architecture of the ixgbe driver as described by Beifuß et al. [11]. This driver limits the interrupt rate to 100k per second for packet rates lower than 156.2 kpps, which relates to the highest transfer rate measured for the first level in Figure 13. The same observation holds for the second and the third level. The interrupt rate is limited to 20k per second for transfer rates lower than 312.5 kpps, and to 8k per second above that. These packet rates correspond to the steps into the next plateaus of the graph.

At the end of the third level the latency drops again right before the switch is overloaded. Note that the drop in latency occurs at the point at which the Linux scheduler begins to recognize the CPU load caused by interrupts (cf. Section 5.4, Figure 12). The Linux scheduler is now aware that the CPU core is almost fully loaded with interrupt handlers and therefore stops scheduling other tasks on it. This causes a slight decrease in latency. Then the fourth level is reached and the latency increases to about 1 ms as Open vSwitch can no longer cope with the load and all queues fill up completely.

We visualized the distributions of latency at three measurement points P1 – P3 (cf. Figure 13 and Table 2). The distributions at these three measurements are plotted as histograms with a bin width of 0.25 µs in Figure 14. The three selected points show the typical shapes of the probability density function of their respective levels.

[Fig. 14 Latency distribution for forwarding between pNICs at P1 (44.6 kpps), P2 (267.9 kpps), and P3 (1161.7 kpps)]

At P1 the distribution shows the behavior before the interrupt throttle rate affects processing, i.e. one interrupt per packet is used. The latencies are approximately normally distributed as each packet is processed independently.

The distribution at P2 demonstrates the effect of the ITR used by the driver. A batch of packets accumulates on the NIC and is then processed by a single interrupt. This causes a uniform distribution as each packet is in a random position in the batch.

For measurement P3 the distribution depicts a high load at which both the interrupt throttle rate and the poll mechanism of the NAPI affect the distribution. A significant number of packets accumulates on the NIC before the processing is finished. Linux then polls the NIC again after processing, without re-enabling interrupts in between, and processes a second smaller batch. This causes an overlay of the previously seen uniform distribution with additional peaks caused by the NAPI processing.

Overloading the system leads to an unrealistically excessive latency of ≈ 1 ms and its exact distribution is of little interest. Even the best-case 1st percentile shows a latency of about 375 µs in all measurements during overload conditions, far higher than even the worst case of the other scenarios.

Guideline 4 Avoid overloading ports handling latency-critical traffic.
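A small back-of-the-envelope calculation with the ITR levels quoted above illustrates how the interrupt period and the per-interrupt batch size grow with load; it is only meant as intuition for the plateaus, not as a model of the exact latency values:

```python
#!/usr/bin/env python3
"""Rough intuition for the latency plateaus: the ixgbe ITR levels bound the
interrupt frequency, so higher packet rates mean larger batches per
interrupt. Upper bounds per level are taken from the text above."""

itr_levels = [            # (interrupts/s, packet rate the level lasts up to)
    (100_000, 156_250),
    (20_000, 312_500),
    (8_000, 1_780_000),
]

for itr, max_rate in itr_levels:
    period_us = 1e6 / itr
    batch = max_rate / itr
    print(f"ITR {itr:>7}/s: one interrupt every {period_us:6.1f} us, "
          f"up to ~{batch:5.1f} packets per interrupt")
```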

6.2 Forwarding Through Virtual Machines

For delay measurements of VMs we use the setup depicted in Figure 3b. There the traffic originating from the measurement server is forwarded through the VM and back to the measurement server.

Figure 15 compares the latency in this scenario with the pNIC forwarding from Section 6.1. Note that the plot uses a logarithmic y-axis, so the gap between the slowest and the fastest packets for the vNIC scenario is wider than in the pNIC scenario even though it appears smaller.

[Fig. 15 Latency of packet forwarding through a VM compared to pNIC forwarding (25th, 50th, 75th, and 99th percentiles)]

The graph for the vNICs does not show plateaus of steady latency like the pNIC graph but a rather smooth growth of latency. Analogous to P1 – P3 in the pNIC scenario, we selected three points V1 – V3 as depicted in Figure 15, each representative for its level of latency. The three points are also available in Table 2. The development of the latency under increasing load shows the same basic characteristics as in the pNIC scenario due to the interrupt-based processing of the incoming packets. However, the additional workload and the emulated NICs smooth the sharp inflection points and also increase the delay.

[Fig. 16 Latency distribution of packet forwarding through a VM at V1 (39.0 kpps), V2 (283.3 kpps), and V3 (322.3 kpps)]

The histograms for the latency at the lowest investigated packet rates – P1 in Figure 14 and V1 in Figure 16 – have a similar shape. The shape of the histogram in V3 is a long-tail distribution, i.e. while the average latency is low, there is a significant number of packets with a high delay. This distribution was also observed by Whiteaker et al. [54] in virtualized environments. However, we could only observe this type of traffic under an overload scenario like V3. Note that the maximum packet rate for this scenario was previously given as 300 kpps in Table 1, so V3 is already an overload scenario. We could not observe such a distribution under normal load. The worst-case latency is also significantly higher than in the pNIC scenario due to the additional buffers in the vNICs.

Guideline 5 Avoid virtualizing services that are sensitive to a high 99th percentile latency.

6.3 Improving Latency with DPDK

Porting OvS to DPDK also improves the latency. Figure 17 visualizes representative histograms of latency probability distributions for forwarding between physical and virtual interfaces.

DPDK uses a busy-wait loop polling all NICs instead of relying on interrupts, resulting in the previously mentioned constant CPU load of 100% on all assigned cores. The resulting latency thus follows a normal distribution and there are no sudden changes under increasing load as no interrupt moderation algorithms are used. We measured a linearly increasing median latency from 7.8 µs (99th percentile: 8.2 µs) at 0.3 Mpps to 13.9 µs (99th percentile: 23.0 µs) at 13.5 Mpps when forwarding between two physical network interfaces.

Adding virtual machines leads again to a long-tail distribution as the traffic is processed by multiple different processes on different queues. The virtual machine still uses the VirtIO driver which also does not feature sophisticated interrupt adaptation algorithms. Hence, the distribution stays stable regardless of the applied load. The overall median latency stayed within the range of 13.5 µs to 14.1 µs between 0.03 Mpps and 1 Mpps. However, the 99th percentile increases from 15 µs to 150 µs over the same range, i.e. the long tail grows longer as the load increases.

Table 2 Comparison of Latency

  Scenario     Load     Load*   Average   Std.Dev.   25th        50th        95th        99th
               [kpps]   [%]     [µs]      [µs]       Perc. [µs]  Perc. [µs]  Perc. [µs]  Perc. [µs]
  P1 (pNIC)      44.6     2.3     15.8       4.6       13.4        15.0        17.3        23.8
  P2 (pNIC)     267.9    14.0     28.3      11.2       18.6        28.3        37.9        45.8
  P3 (pNIC)    1161.7    60.5     52.2      27.0       28.5        53.0        77.0        89.6
  V1 (vNIC)      39.0    11.4     33.0       3.3       31.2        32.7        35.1        37.3
  V2 (vNIC)     283.3    82.9    106.7      16.6       93.8       105.5       118.5       130.8
  V3 (vNIC)     322.3    94.3    221.1      49.1      186.7       212.0       241.9       319.8

  *) Normalized to the load at which more than 10% of the packets were dropped, i.e. a load ≥ 100% would indicate an overload scenario

[Fig. 17 Latency distribution of forwarding with DPDK between pNICs at DP1 (4.1 Mpps) and through a VM at VP1 (0.8 Mpps)]

6.4 Conclusion

Latency depends on the load of the switch. This effect is particularly large when OvS is running in the Linux kernel due to interrupt moderation techniques leading to changing latency distributions as the load increases. Overloading a port leads to excessive worst-case latencies that are one to two orders of magnitude worse than latencies before packet drops occur. Virtual machines exhibit a long-tail distribution of the observed latencies under high load. These problems can be addressed by running OvS with DPDK, which exhibits a more consistent and lower latency profile.

7 Conclusion

We analyzed the performance characteristics and limitations of the Open vSwitch data plane, a key element in many cloud environments. Our study showed good performance when compared to other Linux kernel forwarding techniques, cf. Section 5.1. A few guidelines for system operators can be derived from our results:

Guideline 1 To improve performance, OvS should be preferred over the default Linux tools when using cloud frameworks. Consider DPDK as backend for OvS for future deployments.

Guideline 2 Virtual machine cores and NIC interrupts should be pinned to disjoint sets of CPU cores. Figure 11 shows performance drops when no pinning is used. The load caused by processing packets on the hypervisor should also be considered when allocating CPU resources to VMs. Even a VM with only one virtual CPU core can load two CPU cores due to virtual switching. The total system load of Open vSwitch can be limited by restricting the NIC's interrupts to a set of CPU cores instead of allowing them on all cores. If pinning all tasks is not feasible, make sure to measure the CPU load caused by interrupts properly.

Guideline 3 CPU load of cores handling interrupts should be measured with hardware counters using perf. The kernel option CONFIG_IRQ_TIME_ACCOUNTING can be enabled despite its impact on the performance of interrupt handlers, to ensure accurate reporting of CPU utilization with standard tools, cf. Figure 12. Note that the performance of OvS is not impacted by this option as the Linux kernel prefers polling over interrupts under high load.

Guideline 4 Avoid overloading ports handling latency-critical traffic. Overloading a port impacts latency by up to two orders of magnitude due to buffering in software. Hence, bulk traffic should be kept separated from latency-critical traffic.

Guideline 5 Avoid virtualizing services that are sensitive to a high 99th percentile latency. Latency doubles when using a virtualized application compared to a natively deployed application, cf. Section 6. This is usually not a problem as the main share of latency is caused by the network and not by the target server. However, the worst-case latency (99th percentile) for packets increases by an order of magnitude for packets processed by a VM, cf. Section 6.2. This can be problematic for protocols with real-time requirements.

Virtualized services that rely on bulk data transfer via large packets, e.g., file servers, achieve a high throughput measured in Gbit/s, cf. Figure 10. The per-packet overhead dominates over the per-byte overhead. Services relying on smaller packets are thus more difficult to handle. Not only the packet throughput suffers from virtualization: latency also increases by a factor of 2 and the 99th percentile even by an order of magnitude. However, moving packet processing systems or virtual switches and routers into VMs is problematic because of the high overhead per packet that needs to cross the VM/host barrier and because of their latency-sensitive nature.

The shift to user space packet-processing frameworks like DPDK promises substantial improvements for both throughput (cf. Section 5.1) and latency (cf. Section 6). DPDK is integrated, but disabled by default, in Open vSwitch. However, the current version we evaluated still had stability issues and is not yet fit for production. Further issues with the DPDK port are usability, as complex configuration is required, and the lack of debugging facilities, as standard tools like tcpdump are currently not supported. Intel is currently working on improving these points to get DPDK vSwitch into production [30].

Acknowledgments

This research has been supported by the DFG (German Research Foundation) as part of the MEMPHIS project (CA 595/5-2) and in the framework of the CELTIC EUREKA project SENDATE-PLANETS (Project ID C2015/3-1), partly funded by the German BMBF (Project ID 16KIS0460K). The authors alone are responsible for the content of the paper.

References

1. Intel DPDK: Data Plane Development Kit. http://dpdk.org. Last visited 2016-03-27
2. Intel DPDK vSwitch. https://01.org/sites/default/files/page/intel_dpdk_vswitch_performance_figures_0.10.0_0.pdf. Last visited 2016-03-27
3. Intel DPDK vSwitch. https://github.com/01org/dpdk-ovs. Last visited 2016-03-27
4. Intel I/O Acceleration Technology. http://www.intel.com/content/www/us/en/wireless-network/accel-technology.html. Last visited 2016-03-27
5. Open vSwitch. http://openvswitch.org. Last visited 2016-03-27
6. OpenNebula. https://opennebula.org. Last visited 2016-03-27
7. OpenStack. https://openstack.org. Last visited 2016-03-27
8. Virtual Machine Device Queues: Technical White Paper (2008)
9. Impressive Packet Processing Performance Enables Greater Workload Consolidation (2013)
10. Angrisani, L., Ventre, G., Peluso, L., Tedesco, A.: Measurement of Processing and Queuing Delays Introduced by an Open-Source Router in a Single-Hop Network. IEEE Transactions on Instrumentation and Measurement 55(4), 1065–1076 (2006)
11. Beifuß, A., Raumer, D., Emmerich, P., Runge, T.M., Wohlfart, F., Wolfinger, B.E., Carle, G.: A Study of Networking Software Induced Latency. In: 2nd International Conference on Networked Systems 2015. Cottbus, Germany (2015)
12. Bianco, A., Birke, R., Giraudo, L., Palacin, M.: OpenFlow Switching: Data Plane Performance. In: International Conference on Communications (ICC). IEEE (2010)
13. Bolla, R., Bruschi, R.: Linux Software Router: Data Plane Optimization and Performance Evaluation. Journal of Networks 2(3), 6–17 (2007)
14. Bradner, S., McQuaid, J.: Benchmarking Methodology for Network Interconnect Devices. RFC 2544 (Informational) (1999)
15. Cardigliano, A., Deri, L., Gasparakis, J., Fusco, F.: vPF_RING: Towards Wire-Speed Network Monitoring using Virtual Machines. In: ACM Internet Measurement Conference (2011)
16. Deri, L.: nCap: Wire-speed Packet Capture and Transmission. In: IEEE Workshop on End-to-End Monitoring Techniques and Services, pp. 47–55 (2005)
17. Dobrescu, M., Argyraki, K., Ratnasamy, S.: Toward Predictable Performance in Software Packet-Processing Platforms. In: USENIX Conference on Networked Systems Design and Implementation (NSDI) (2012)
18. Dobrescu, M., Egi, N., Argyraki, K., Chun, B., Fall, K., Iannaccone, G., Knies, A., Manesh, M., Ratnasamy, S.: RouteBricks: Exploiting Parallelism To Scale Software Routers. In: 22nd ACM Symposium on Operating Systems Principles (SOSP) (2009)
19. DPDK Project: DPDK 16.11 Release Notes. http://dpdk.org/doc/guides/rel_notes/release_16_11.html (2016). Last visited 2016-03-27
20. Emmerich, P., Gallenmüller, S., Raumer, D., Wohlfart, F., Carle, G.: MoonGen: A Scriptable High-Speed Packet Generator. In: 15th ACM SIGCOMM Conference on Internet Measurement (IMC'15) (2015)
21. Emmerich, P., Raumer, D., Wohlfart, F., Carle, G.: A Study of Network Stack Latency for Game Servers. In: 13th Annual Workshop on Network and Systems Support for Games (NetGames'14). Nagoya, Japan (2014)

22. Emmerich, P., Raumer, D., Wohlfart, F., Carle, G.: Performance Characteristics of Virtual Switching. In: 2014 IEEE 3rd International Conference on Cloud Networking (CloudNet’14). Luxembourg (2014)
23. ETSI: Network Functions Virtualisation (NFV); Architectural Framework, V1.1.1 (2013)
24. Han, S., Jang, K., Panda, A., Palkar, S., Han, D., Ratnasamy, S.: SoftNIC: A Software NIC to Augment Hardware. Tech. Rep. UCB/EECS-2015-155, EECS Department, University of California, Berkeley (2015)
25. He, Z., Liang, G.: Research and Evaluation of Network Virtualization in Cloud Computing Environment. In: Networking and Distributed Computing (ICNDC), pp. 40–44. IEEE (2012)
26. Huggahalli, R., Iyer, R., Tetrick, S.: Direct Cache Access for High Bandwidth Network I/O. In: Proceedings of the 32nd Annual International Symposium on Computer Architecture, pp. 50–59 (2005)
27. Hwang, J., Ramakrishnan, K.K., Wood, T.: NetVM: High Performance and Flexible Networking Using Virtualization on Commodity Platforms. In: 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14), pp. 445–458. USENIX Association, Seattle, WA (2014)
28. Jarschel, M., Oechsner, S., Schlosser, D., Pries, R., Goll, S., Tran-Gia, P.: Modeling and Performance Evaluation of an OpenFlow Architecture. In: Proceedings of the 23rd International Teletraffic Congress. ITCP (2011)
29. Kang, N., Liu, Z., Rexford, J., Walker, D.: Optimizing the “One Big Switch” Abstraction in Software-defined Networks. In: Proceedings of the Ninth ACM Conference on Emerging Networking Experiments and Technologies, CoNEXT ’13, pp. 13–24. ACM, New York, NY, USA (2013). DOI 10.1145/2535372.2535373. URL http://doi.acm.org/10.1145/2535372.2535373
30. Traynor, K.: OVS, DPDK and Software Dataplane Acceleration. https://fosdem.org/2016/schedule/event/ovs_dpdk/attachments/slides/1104/export/events/attachments/ovs_dpdk/slides/1104/ovs_dpdk_fosdem_16.pdf (2016). Last visited 2016-03-27
31. Kohler, E., Morris, R., Chen, B., Jannotti, J., Kaashoek, M.F.: The Click Modular Router. ACM Transactions on Computer Systems (TOCS) 18(3), 263–297 (2000). DOI 10.1145/354871.354874
32. Larsen, S., Sarangam, P., Huggahalli, R., Kulkarni, S.: Architectural Breakdown of End-to-End Latency in a TCP/IP Network. International Journal of Parallel Programming 37(6), 556–571 (2009)
33. Martins, J., Ahmed, M., Raiciu, C., Olteanu, V., Honda, M., Bifulco, R., Huici, F.: ClickOS and the Art of Network Function Virtualization. In: 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14), pp. 459–473. USENIX Association, Seattle, WA (2014)
34. Martins, J., Ahmed, M., Raiciu, C., Olteanu, V., Honda, M., Bifulco, R., Huici, F.: ClickOS and the Art of Network Function Virtualization. In: 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14), pp. 459–473. USENIX Association, Seattle, WA (2014)
35. Meyer, T., Wohlfart, F., Raumer, D., Wolfinger, B., Carle, G.: Validated Model-Based Prediction of Multi-Core Software Router Performance. Praxis der Informationsverarbeitung und Kommunikation (PIK) (2014)
36. Tsirkin, M., Huck, C., Moll, P.: Virtual I/O Device (VIRTIO) Version 1.0, Committee Specification 04. OASIS (2016)
37. Munch, B.: Hype Cycle for Networking and Communications. Report, Gartner (2013)
38. Niu, Z., Xu, H., Tian, Y., Liu, L., Wang, P., Li, Z.: Benchmarking NFV Software Dataplanes. arXiv:1605.05843 (2016)
39. OpenStack: Networking Guide: Deployment Scenarios. http://docs.openstack.org/liberty/networking-guide/deploy.html (2015). Last visited 2016-03-27
40. Panda, A., Han, S., Jang, K., Walls, M., Ratnasamy, S., Shenker, S.: NetBricks: Taking the V out of NFV. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 203–216. USENIX Association, GA (2016)
41. Pettit, J., Gross, J., Pfaff, B., Casado, M., Crosby, S.: Virtual Switching in an Era of Advanced Edges. In: 2nd Workshop on Data Center Converged and Virtual Ethernet Switching (DC-CAVES) (2011)
42. Pfaff, B., Pettit, J., Koponen, T., Amidon, K., Casado, M., Shenker, S.: Extending Networking into the Virtualization Layer. In: Proc. of Workshop on Hot Topics in Networks (HotNets-VIII) (2009)
43. Pfaff, B., Pettit, J., Koponen, T., Jackson, E., Zhou, A., Rajahalme, J., Gross, J., Wang, A., Stringer, J., Shelar, P., Amidon, K., Casado, M.: The Design and Implementation of Open vSwitch. In: 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15). USENIX Association (2015)
44. Pongracz, G., Molnar, L., Kis, Z.L.: Removing Roadblocks from SDN: OpenFlow Software Switch Performance on Intel DPDK. Second European Workshop on Software Defined Networks (EWSDN’13), pp. 62–67 (2013)
45. Ram, K.K., Cox, A.L., Chadha, M., Rixner, S.: Hyper-Switch: A Scalable Software Virtual Switching Architecture. In: Presented as part of the 2013 USENIX Annual Technical Conference (USENIX ATC 13), pp. 13–24. USENIX, San Jose, CA (2013)
46. Rizzo, L.: netmap: a novel framework for fast packet I/O. In: USENIX Annual Technical Conference (2012)
47. Rizzo, L., Carbone, M., Catalli, G.: Transparent Acceleration of Software Packet Forwarding using Netmap. In: INFOCOM, pp. 2471–2479. IEEE (2012)
48. Rizzo, L., Lettieri, G.: VALE, a switched ethernet for virtual machines. In: C. Barakat, R. Teixeira, K.K. Ramakrishnan, P. Thiran (eds.) CoNEXT, pp. 61–72. ACM (2012)
49. Rotsos, C., Sarrar, N., Uhlig, S., Sherwood, R., Moore, A.W.: OFLOPS: An Open Framework for OpenFlow Switch Evaluation. In: Passive and Active Measurement, pp. 85–95. Springer (2012)
50. Salim, J.H., Olsson, R., Kuznetsov, A.: Beyond Softnet. In: Proceedings of the 5th Annual Linux Showcase & Conference, vol. 5, pp. 18–18 (2001)
51. Tedesco, A., Ventre, G., Angrisani, L., Peluso, L.: Measurement of Processing and Queuing Delays Introduced by a Software Router in a Single-Hop Network. In: IEEE Instrumentation and Measurement Technology Conference, pp. 1797–1802 (2005)
52. Monjalon, T.: dropping librte_ivshmem. http://dpdk.org/ml/archives/dev/2016-June/040844.html (2016). Mailing list discussion
53. Wang, G., Ng, T.E.: The Impact of Virtualization on Network Performance of Amazon EC2 Data Center. In: INFOCOM, pp. 1–9. IEEE (2010)
54. Whiteaker, J., Schneider, F., Teixeira, R.: Explaining Packet Delays under Virtualization. ACM SIGCOMM Computer Communication Review 41(1), 38–44 (2011)

Author Biographies

Paul Emmerich is a Ph.D. student at the Chair of Network Architectures and Services at Technical University of Munich. He received his M.Sc. in Informatics at the Technical University of Munich in 2014. His research interests include packet generation as well as software switches and routers.

Daniel Raumer is a Ph.D. student at the Chair of Network Architectures and Services at Technical University of Munich, where he received his B.Sc. and M.Sc. in Informatics, in 2010 and 2012. He is concerned with device performance measurements with relevance to Network Function Virtualization as part of Software-defined Networking architectures.

Sebastian Gallenmüller is a Ph.D. student at the Chair of Network Architectures and Services at Technical University of Munich. There he received his M.Sc. in Informatics in 2014. He focuses on the topic of assessment and evaluation of software systems for packet processing.

Florian Wohlfart is a Ph.D. student working at the Chair of Network Architectures and Services at Technical University of Munich. He received his M.Sc. in Informatics at Technical University of Munich in 2012. His research interests include software packet processing, middlebox analysis, and network performance measurements.

Georg Carle is professor at the Department of Informatics at Technical University of Munich, holding the Chair of Network Architectures and Services. He studied at University of Stuttgart, Brunel University, London, and Ecole Nationale Superieure des Telecommunications, Paris. He did his Ph.D. in Computer Science at University of Karlsruhe, and worked as postdoctoral scientist at Institut Eurecom, Sophia Antipolis, France, at the Fraunhofer Institute for Open Communication Systems, Berlin, and as professor at the University of Tübingen.