
Comparison of Frameworks for High-Performance Packet IO

Sebastian Gallenmüller, Paul Emmerich, Florian Wohlfart, Daniel Raumer, and Georg Carle
Technische Universität München, Department of Informatics
Chair for Network Architectures and Services
Boltzmannstr. 3, 85748 Garching b. München, Germany
{gallenmu | emmericp | wohlfart | raumer | carle}@net.in.tum.de

ABSTRACT

Network stacks currently implemented in operating systems can no longer cope with the packet rates offered by 10 Gbit Ethernet. Thus, frameworks were developed claiming to offer a faster alternative for this demand. These frameworks enable arbitrary packet processing systems to be built from commodity hardware handling a traffic rate of several 10 Gbit interfaces, entering a domain previously only available to custom-built hardware.

In this paper, we survey various frameworks for high-performance packet IO. We introduce a model to estimate and assess the performance of these packet processing frameworks. Moreover, we analyze the performance of the most prominent frameworks based on representative measurements in packet forwarding scenarios. Therefore, we quantify the effects of caching and look at the tradeoff between throughput and latency.

Categories and Subject Descriptors

C.2 [Computer-Communication Networks]: Miscellaneous; C.4 [Performance of Systems]: Measurement techniques

General Terms

Measurement, Performance

Keywords

netmap; PF_RING ZC; DPDK; software packet processing; performance measurement

1. INTRODUCTION

Nowadays, 10 Gbit Ethernet adapters are commonly used in servers. However, due to the overhead imposed by the architectural design of the network stacks, the CPU quickly becomes the bottleneck, so that packet handling – even without any complex processing – is impossible at line speed for small packet sizes. Software frameworks for high-speed packet IO, e.g. netmap [1], Intel DPDK [2], or PF_RING ZC [3], promise to fix this issue by offering a stripped-down alternative to the Linux network stack. Their performance increase allows using commodity hardware systems as routers and (virtual) switches [4, 5], network middleboxes like firewalls [6], or network monitoring systems [7]. Motivated by the potential gain, we analyzed the performance characteristics of these frameworks.

Initially, we investigate the factors influencing the performance of arbitrary packet processing applications: network bandwidth, CPU performance, PCI Express bandwidth, and the connection to main memory. These factors are combined into a model describing this type of program based on the previously mentioned frameworks. Various measurements show the applicability of this model, with packet forwarding as the basic test scenario. Each measurement is designed to investigate the influence of a specific factor on the forwarding throughput, e.g. the clock speed of the CPU, the number of processed packets per call, or the cache utilization. Moreover, the latency of packet forwarding is also reviewed. All measurements are designed to ensure the comparability of our test results: they run on the same CPU, equipped with the same 10 Gbit NICs, and use packet forwarders applying the same algorithm for each of the three frameworks to provide a fair comparison between them.

The paper is organized as follows: In Section 2, we describe the state of the art in fast packet processing. Related work that serves as basis for our research is presented in Section 3. Section 4 identifies bottlenecks for packet processing in software and derives a model describing such applications. Subsequently, Section 5 presents our comparison of techniques for fast packet processing. We conclude with a summary of our results in Section 6.

This paper is based on a master's thesis [8] offering a more comprehensive discussion including the usability of the frameworks and a closer description of the APIs.

© 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

2. PACKET PROCESSING IN SOFTWARE

Network traffic processing performance backed by commodity hardware systems has increased continuously in the last years. The increase came both from software optimizations and hardware developments like the move from 1 Gbit to 10 Gbit Ethernet, multicore CPUs, and offloading features that save CPU cycles.
2.1 Utilization of Hardware Features

On the hardware side the performance increase, besides the higher bandwidth, came from offloading features that allow shifting workload from the CPU to the Network Interface Card (NIC). Checksum offloading, for instance, relieves the CPU from CRC checksum handling: the NIC takes care of calculating the CRC checksum and adding it to the packet before transfer. On the receiving end, the NIC validates the checksum and drops the packet in case of an error, without involving the CPU [9].

Direct Memory Access (DMA) allows the NIC to write or read packets directly into or from RAM, bypassing the CPU. Modern NICs can even copy packets directly into the cache of the CPU, which leads to a further increase in performance [10].

Another part of the speed-up comes from increased CPU performance. In addition to higher clock rates, the number of CPU cores has grown from one to many. NICs need to support multi-core architectures explicitly by distributing incoming packets of different traffic flows to different cores. One of these techniques is Receive Side Scaling (RSS), which allows scaling packet processing with the number of cores [9].

2.2 Linux Network Stack

On the software side the performance increase came, besides the support of the new hardware features, from a more efficient way to handle incoming traffic. The first approach of generating one interrupt per incoming packet was unsuitable for high packet rates due to livelocks caused by interrupt storms [11]. In such a livelock the system is almost fully occupied with handling the overhead caused by interrupts instead of processing packets.

NAPI, a network driver API introduced in kernel 2.4.20, reduces the number of interrupts generated by incoming traffic with the ability to switch to polling for packets during phases of high load, effectively reducing system overhead [11]. The NAPI-based network stack is sufficiently powerful to scale software routers to multiple Gbit/s [12, 13]. But even though performance improved, the Linux network stack primarily focuses on offering a fully featured general-purpose network stack for an operating system (OS) rather than providing an interface for the optimal performance needed by software router applications [14].

2.3 High-Speed Packet Processing

Compared with a general-purpose network stack like the one implemented in Linux, high-speed packet IO frameworks offer only basic IO functionality: layer 3 and above must be implemented by the application, whereas the Linux network stack handles layer 3 and 4 protocols like IP and TCP. As a benefit, these frameworks offer increased performance compared to a full-blown network stack. In this paper we focus on the most important representatives: netmap [1], PF_RING ZC [15], and Intel DPDK [2]. All three frameworks require modified drivers and use the same techniques for acceleration:

• Bypassing the default network stack, i.e. the packets are only processed by the processing framework and by the applications running on top of them.

• Relying on polling to receive packets instead of interrupts.

• Preallocating packet buffers at the start of an application with no further allocation or deallocation of memory during execution of an application.

• No copying of data between user and kernel memory space, as a packet is copied once to memory via DMA by the NIC and this memory location is used by processing framework and applications alike.

• Processing batches of packets with one API call on reception and sending (see the sketch after this list).
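To make the batching principle concrete, the following minimal sketch shows a forwarding loop written against the burst API of DPDK (cf. [21]). Port and queue numbers are placeholders, initialization and error handling are omitted, and this is an illustration of the technique rather than one of the forwarders measured in Section 5.

```c
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BATCH_SIZE 32 /* packets handled per API call, cf. Section 5.5 */

/* Minimal forwarding loop in the style of the DPDK burst API:
 * receive a batch of packets on port 0 and transmit it on port 1. */
static void forward_loop(void)
{
    struct rte_mbuf *bufs[BATCH_SIZE];

    for (;;) { /* poll in a busy-wait loop, no interrupts */
        /* one API call fetches up to BATCH_SIZE received packets */
        uint16_t nb_rx = rte_eth_rx_burst(0, 0, bufs, BATCH_SIZE);
        if (nb_rx == 0)
            continue; /* cycles spent here are the c_busy of Section 4.3 */

        /* one API call hands the whole batch to the NIC; the mbufs are
         * preallocated buffers, no copy between kernel and user space */
        uint16_t nb_tx = rte_eth_tx_burst(1, 0, bufs, nb_rx);

        /* drop packets the transmit queue did not accept */
        for (uint16_t i = nb_tx; i < nb_rx; i++)
            rte_pktmbuf_free(bufs[i]);
    }
}
```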
netmap.

netmap [1] exposes packet buffers to the application and uses standard system calls, like poll() or ioctl(), to initiate the data transfer. The work behind these system calls is reduced compared to a default network stack: they only update the packet buffers and check the data provided by user programs for validity to prevent crashes.

The network drivers of netmap are based on regular Linux drivers. As long as no netmap application is active, the driver works transparently for the OS and traditional applications. Upon starting a netmap-enabled application the NIC is put into a special "netmap mode", i.e. the NIC becomes inactive for the OS and no packets are delivered to the standard OS interfaces and traditional applications. Instead, the packets are transferred to netmap-specific data structures where they are available to the netmap-enabled application. When closing this application the driver switches back to transparent mode. Maintaining this compatibility in the driver allows for an easy integration into general-purpose operating systems. netmap has already been integrated into the FreeBSD kernel [16] and inclusion in the Linux kernel has been discussed [17].

Multiple applications have shown increased performance by adapting netmap: the software router Click [18], the virtual switch VALE [4], and the FreeBSD firewall ipfw [6].

A notable difference between the network APIs is the usage of system calls. Linux does the entire packet handling in kernel space to ensure a high degree of security and robustness. DPDK and PF_RING ZC perform their packet processing entirely in user space to provide high performance. To provide robust and fast packet processing, netmap combines both approaches: most of the workload, i.e. packet processing, is done in user space, while system calls perform only basic checks on the packet buffers to initiate reception and transfer of packets.
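The following sketch illustrates this combination for the receive path, using the helper functions from net/netmap_user.h. The interface name is a placeholder and error handling is omitted; it shows the principle, not our measured forwarder.

```c
#include <poll.h>
#include <net/netmap_user.h> /* nm_open(), nm_nextpkt(), NETMAP_FD() */

/* Minimal netmap receive loop: poll() is the cheap system call that
 * synchronizes the rings; the packet data is then read from the shared
 * buffers in user space without any copy. */
static void netmap_rx_loop(void)
{
    struct nm_desc *d = nm_open("netmap:eth0", NULL, 0, NULL);
    struct pollfd pfd = { .fd = NETMAP_FD(d), .events = POLLIN };
    struct nm_pkthdr hdr;
    unsigned char *buf;

    for (;;) {
        poll(&pfd, 1, -1); /* kernel only validates and syncs the rings */
        while ((buf = nm_nextpkt(d, &hdr)) != NULL) {
            /* process hdr.len bytes at buf; the buffer is reused later */
        }
    }
}
```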
PF_RING ZC.

PF_RING ZC does not use standard system calls but offers its own functions. The API of PF_RING ZC emphasizes convenient multicore support [19]. It is NUMA-aware, i.e. on systems with multiple CPU sockets the packet buffers can be allocated in memory regions a CPU can directly access. Moreover, processes can be clustered for easy data sharing amongst them.

PF_RING ZC features a driver with capabilities similar to those of netmap, i.e. the driver is based on a regular Linux driver and acts transparently as long as no special application is started. While such an application is active, regular applications cannot send or receive packets using the respective default OS interfaces. This driver may also be configured to deliver a copy of the packets to the OS while a PF_RING ZC application is active, but the duplication process lowers the performance.

Ntop [20] offers a number of applications running on top of PF_RING ZC, for example n2disk, a packet capturing tool, or nProbe, a tool for traffic monitoring.

DPDK.

DPDK is a collection of libraries which offers not only basic functions for sending and receiving packets, but also provides additional functionality like a longest prefix matching algorithm for the implementation of routing tables and efficient hash maps. It relies on a custom user space API similar to PF_RING ZC instead of the traditional system calls used by netmap. The DPDK API [21] offers multicore support and additional libraries used for packet processing, and features the highest degree of configurability amongst the investigated frameworks.

The driver of DPDK does not feature a transparent mode, i.e. as soon as this driver is loaded, the NIC becomes available to DPDK but is made unavailable to the Linux kernel, regardless of whether any DPDK-enabled application is running or not. DPDK uses a special kind of driver aiming to do most of its processing in user space. This UIO driver [22] still has a part of its code realized as a kernel module, but its tasks are reduced: it only initializes the used PCI devices by mapping their memory regions into the user space process.

A notable example for an application using DPDK, which gained attention, is an accelerated version of Open vSwitch called DPDK vSwitch [5]. An additional high-performance software switching solution is xDPd [23], which supports DPDK for network access.

Other Frameworks.

The already mentioned frameworks are not the sole solutions offering high-speed packet processing capabilities in user space.

PacketShader [24] is a packet processing framework using the general purpose capabilities of GPUs for packet processing. It also features a separate engine for fast packet IO. The GPU part of PacketShader is not publicly available; only the code of the packet engine was released, which can be used on its own. However, this engine is not developed as actively as netmap, PF_RING ZC, or DPDK, leading to a low number of updates in the repository [25].

PFQ [26] is a framework that is optimized for fast packet capturing. It does not rely on specialized drivers like the other frameworks and can be used with every NIC as long as this card is supported by Linux. However, without modified drivers the NIC cannot push the packets directly to user space. The lack of this feature leads to a performance disadvantage compared to the other frameworks presented in this paper. A notable feature of PFQ is the integration of a Haskell-based domain specific language for implementing packet processing algorithms [27]. PFQ focuses on providing a framework that makes packet processing easy and safe rather than providing the highest possible performance. Therefore the typical use cases for PFQ differ from the use cases of the other frameworks.

Snabb Switch [28] aims to combine powerful packet processing capabilities with the scripting language Lua. Using a scripting language lowers the entry barriers for developers, as these languages are designed to be learned easily and also allow developing applications with few lines in little time. Snabb Switch's design philosophy even applies to the driver, as it is also written in Lua. The framework was released in 2012. Being the youngest framework in this enumeration, it is not as mature as the other frameworks; also, no applications are known to use Snabb Switch as their backend.

The three frameworks presented in this paragraph were excluded from a thorough examination for various reasons. For PacketShader the reasons were the unavailability of the GPU part and low development activity. PFQ was not investigated because of different design goals leading to a performance disadvantage compared to the other frameworks. Snabb Switch was excluded because of its immaturity. This paper only focuses on the three most established competitors: netmap, PF_RING ZC, and DPDK, which appear to be more mature, leading to a higher availability of applications using them.

3. RELATED WORK

A survey of various packet IO frameworks was published by García-Dorado et al. [14]. The theoretical part of their investigation is quite comprehensive, and the paper includes measurements showing selected aspects of these frameworks, e.g. the influence of the number of available cores and packet sizes on the throughput. They investigate the packet IO engines of PacketShader, PFQ, netmap, and PF_RING DNA, a predecessor of PF_RING ZC. DPDK and Snabb Switch are not investigated. However, the authors only analyze packet capturing capabilities and neglect other aspects of packet processing.

Throughput measurements of software packet forwarding systems on commodity hardware have been conducted previously: Bolla and Bruschi analyze a Linux software router [13]. Studies of software router performance and the influence of various workloads were published by Dobrescu et al. [12]. The highest throughput of a software solution implementing an OpenFlow switch with DPDK was presented and measured in [29]. We also measured the throughput of Linux-based forwarding tools in previous work [30]. These measurements allow for a direct comparison with results from this paper because they were performed on the same test system.

The latency of a Linux software router was also measured by Bolla and Bruschi in [13]. A technique to measure different parts of packet processing systems using commodity hardware, based on an understanding of internal queuing, was described by Tedesco et al. [31]. Rotsos et al. [32] present an FPGA-based method to measure the latency of various software and hardware OpenFlow switches. They present measurements for Open vSwitch running on Linux as an example. A discussion of latency in software routers can also be found in [33]; the authors describe a method that can be used to distinguish the latency introduced by queuing from the processing delay.

The selected literature shows that the performance of the Linux networking part is thoroughly researched and well-known. There are also papers investigating a specific framework exclusively. However, measurements require similar test conditions, i.e. comparable hardware and software setups, to ensure comparability. The paper by García-Dorado et al. [14] provides those conditions but measures only a few selected aspects. Therefore we try to give a fair comparison by testing each framework on the same hardware. This paper includes measurements not yet published in similar papers, e.g. the transmission of packets or latency determination. We also introduce a new model to provide a basic understanding of how packet processing applications work and how their performance can be estimated.
4. PERFORMANCE CONSIDERATIONS

We present a model that provides insights into the performance of packet processing applications built on high-speed IO frameworks. It uses the main factors influencing performance to provide an upper bound for the capabilities of a software-based packet processing system and to show the limits and potential bottlenecks.

4.1 Limits and Influencing Factors

Performance limits are grounded on four different characteristics of the hardware:

1. The maximum transfer rate of the used NICs. It is determined by the Ethernet standard in use (i.e. 1 GbE, 10 GbE, or 40 GbE).

2. PCI Express is used to connect the NICs to the rest of the system. Nowadays typical hardware uses PCIe v2.0 with 8 lanes per NIC, which offers a usable link bandwidth of 32 Gbit/s [34] for every interface in rx and tx direction respectively. Commonly available two-port NICs cannot reach the 32 Gbit/s limit, which renders this limit irrelevant for this type of NIC.

3. As packet data is sent to the memory, the RAM could restrict the possible network bandwidth. A typical system with DDR3 memory provides a bandwidth of 21.2 GByte/s (dual channel, effective clock speed of 1333 MHz) [35]. Our measurements showed that this bandwidth is high enough to support at least eight network ports transferring and sending concurrently at 10 Gbit/s. In more sophisticated hardware setups, i.e. servers with several CPUs, the actual configuration may limit the achievable transfer rates: if a CPU has to access a NIC or RAM attached to a different CPU, the interconnect between the CPUs may act as a bottleneck.

4. The fourth component involved in packet processing is the CPU. Due to modern offloading features of NICs the processing load on the CPU can be kept low. Examples like netmap show that handling a fully loaded 10 Gbit/s link is possible even if the traffic consists of a high number of short packets. This is not the case if the Linux network stack is used. However, if complex packet processing algorithms are performed, the CPU may lower the transfer rate even for high-speed packet frameworks. Therefore the CPU is considered to be the dominating bottleneck [1].

4.2 Upper Bound for Packet Processing

In this section, we use the identified bottlenecks to construct a generic model for packet processing. These limits act as upper bounds for the number of packets per second that can be processed.

The first upper bound is a fixed limit determined by the used Ethernet standard (cf. Point 1 of the enumeration in Section 4.1). This limit is called r_max.

The second upper bound depends on the CPU of the packet processing system (cf. Point 4 of the enumeration in Section 4.1). That computational limit is called c_max.

For simplicity we only consider these two bounds and omit the upper bounds for memory and PCIe bandwidth, as these are not a bottleneck, at least in our test setup. A combination of the upper bounds leads to the maximum number of packets per second that can be processed, also known as the upper bound of the throughput T_max. As the throughput cannot exceed either upper bound, T_max can be described by the following formula:

    T_max = min(r_max, c_max)    (1)

The value r_max itself is influenced by two factors: the line rate allowed by the Ethernet standard, v_ethernet, and the individual packet sizes s_i^packet (this size also includes the bits for preamble, start frame delimiter, padding, and inter frame gap).

    v_ethernet ≥ Σ_{i=0}^{n} s_i^packet    (2)

If n in Equation 2 is maximal, it equals r_max, i.e. n is the number of packets that can be sent per second with respect to their individual packet lengths s_i^packet and the bandwidth of the Ethernet v_ethernet.

The value for c_max depends on the resource provided by the CPU: cycles. These processing cycles can either be used to handle packets or to process other tasks. The following formula uses f_CPU to describe the number of cycles provided by the CPU per second. Costs for an individual packet are represented by c_i^packet. All costs for other processing tasks running on the CPU are summed up in c_other.

    f_CPU ≥ c_other + Σ_{i=0}^{n} c_i^packet    (3)

The total cost of packet processing in CPU cycles is given by the sum in Equation 3. This sum contains the number of processed packets per second, n, and the individual costs c_i^packet of the packets. The maximum number of cycles per second f_CPU is a fixed value depending on the hardware. Therefore, the CPU resources available for packet processing and the other system tasks are bounded by this limit, i.e. they have to be smaller than or equal to f_CPU. To get the maximum number of packets c_max, n has to be maximized with respect to Inequality 3.
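As a worked example for these bounds, the following sketch (added for illustration) computes r_max from Equation 2 for minimally sized packets and combines it with an assumed per-packet CPU cost. The 20 B of preamble, start frame delimiter, and inter frame gap per frame are given by the Ethernet standard, while the value of c_packet here is a made-up assumption.

```c
#include <stdio.h>

/* Worked example for Section 4.2: r_max from Equation (2) for fixed-size
 * packets, c_max from Inequality (3) with c_other = 0, and T_max from
 * Equation (1). */
int main(void)
{
    double v_ethernet = 10e9;        /* 10 GbE line rate [bit/s]         */
    double frame_size = 64;          /* minimal Ethernet frame [byte]    */
    double overhead   = 7 + 1 + 12;  /* preamble + SFD + inter frame gap */
    double s_packet   = (frame_size + overhead) * 8; /* bits on the wire */

    double r_max = v_ethernet / s_packet;       /* = 14.88 Mpps          */

    double f_cpu    = 3.3e9; /* cycles/s of the Xeon E3-1230 V2 we use   */
    double c_packet = 300;   /* assumed constant per-packet cost         */
    double c_max    = f_cpu / c_packet;         /* = 11.0 Mpps           */

    double t_max = r_max < c_max ? r_max : c_max;
    printf("r_max = %.2f Mpps, c_max = %.2f Mpps, T_max = %.2f Mpps\n",
           r_max / 1e6, c_max / 1e6, t_max / 1e6);
    return 0;
}
```

With these inputs the NIC is the bottleneck for per-packet costs below roughly 222 cycles (the c_equal of Figure 1); beyond that, the CPU bound dominates.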
Figure 1 shows the combination of the two upper bounds into T_max as the combination of r_max and c_max with respect to growing costs per packet on the x-axis. For this section only the dashed and dotted lines are relevant. The value r_max depends on the size of the packets: when using minimally sized packets of 64 Byte and taking overhead data like preamble or inter frame gap into account, 14.88 million packets per second (Mpps) can be achieved on 10 GbE. As long as r_max is reached, the costs are low enough to be fully handled by the available CPU; the traffic is bound by the limit of the NIC. At the point c_equal the throughput begins to decline. Beyond this point, the processing time of the CPU does not suffice for the traffic capabilities of the NIC, i.e. the traffic becomes CPU bound and the throughput subsequently sinks.

[Figure 1: Model for packet processing. Throughput n over per-packet cost c_task, showing the bounds r_max and c_max, the crossover point c_equal, and the relative cost shares c%_IO, c%_task, and c%_busy.]

The costs per packet determine how many packets can be processed without surpassing the computational limit c_max. The actual shape of c_max cannot be determined as it depends on the traffic and the processing task. Regardless of the precise shape of this curve, the outcome stays the same, i.e. higher per-packet costs decrease the throughput. The hyperbolic shape of c_max depicted in Figure 1 holds for packet processing frameworks and is explained in detail in the following section.

4.3 High-Performance Prediction Model

According to Rizzo, packet processing costs can be divided into per-byte and per-packet costs, with the latter dominating for IO frameworks; i.e. it is only slightly more expensive to send a 1.5 KB packet than a 64 B packet [1]. This leads to two assumptions. The first assumption is that the per-packet costs are constant for high-performance IO frameworks. The second one is that experiments are performed under the most demanding circumstances if the highest packet rate is chosen, i.e. 64 B packets have to be used.

Constant costs per packet mean ∀i : c_i^packet = c_const^packet, and a dedicated core for packet processing leads to c_other = 0. As the high-performance frameworks have roughly constant costs per packet and use dedicated cores for packet processing, and if the packet processing application itself also generates a constant load per packet, Inequality 3 can be simplified to the following inequality:

    f_CPU ≥ n · c_const^packet    (4)

If packet processing includes actions that depend on the type of packet or the traffic characteristics, computation may become infeasible. Such a scenario may be packet monitoring, where certain types of packets require additional CPU cycles for further analysis [36]. Without restriction to certain traffic patterns, it is still possible to approximate the overall costs with average per-packet costs or to do a worst-case estimation.

Due to the architecture of the frameworks, which all poll the NIC in a busy-waiting manner, an application uses all the available CPU cycles all the time. If the limit of the NIC is reached but n · c_const^packet is lower than the available CPU cycles, the remaining cycles are spent waiting for new packets in the busy-wait loop. If these costs are included, a new value c*_const^packet is introduced and both sides of the former Inequality 4 are balanced:

    f_CPU = n · c*_const^packet    (5)

The costs per packet c*_const^packet can originate from different sources:

    c*_const^packet = c_IO + c_task + c_busy    (6)

1. c_IO: These costs are incurred by the framework for sending and receiving a packet. The framework determines the amount of these costs. In addition, these costs are constant per packet due to the design of the frameworks, which completely avoid operations depending on the length of the packet, e.g. buffer allocation.

2. c_task: The application running on top of the framework determines those costs, which depend on the complexity of the processing task.

3. c_busy: These costs are introduced by the busy wait on sending or receiving packets. If the throughput is lower than r_max, i.e. the per-packet costs are higher than c_equal, c_busy becomes 0. The cycles spent on c_busy are effectively wasted as no actual processing is done.

Combining Equations 5 and 6 leads to:

    f_CPU = n · (c_IO + c_task + c_busy)    (7)

Figure 1 depicts the behavior of the throughput while gradually increasing c_task as described by Equation 7. The highlighted areas show the relative parts of the three components of c*_const^packet. Each area depicts the accumulated per-packet costs of its respective component x, called c%_x.

The relative importance of c%_IO compared to c%_task decreases for more demanding tasks for two reasons. The first reason is the decreasing throughput: fewer packets need a lower amount of processing power. The second reason is that while c_task increases, the relative portion of cycles needed for IO gets smaller.

Low values of c_task leave only part of the cycles spent on c_IO and increase busy waiting, which leads to a high value of c_busy. c%_busy decreases linearly while c%_task grows accordingly until c_equal is reached. This point marks the cost value where no cycles are wasted on busy waiting anymore. c_task increases steadily, which leads to a growing relative portion of c%_task.
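Equation 7 can be used directly as a throughput predictor. The following sketch (an illustration added here, not part of our measurement code) caps the CPU-bound rate at the line-rate bound r_max; with f_CPU = 3.3 GHz, the c_IO of roughly 100 cycles measured for DPDK in Section 5.3, and c_task = 350 cycles, it reproduces the 7.3 Mpps estimate that Section 5.4 compares against a measured 7.1 Mpps.

```c
/* Throughput prediction following Equation (7): below r_max the CPU is
 * the bottleneck and c_busy = 0; at r_max the NIC caps the rate and any
 * surplus cycles are spent as c_busy. */
double predict_pps(double f_cpu, double r_max, double c_io, double c_task)
{
    double n = f_cpu / (c_io + c_task); /* CPU-bound packet rate   */
    return n < r_max ? n : r_max;       /* capped by the line rate */
}

/* Example: predict_pps(3.3e9, 14.88e6, 100, 350) yields 7.33e6 packets
 * per second. */
```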
5. PERFORMANCE COMPARISON

The available CPU cycles are the main limiting factor of software packet processing (cf. Section 4.1). Consequently, the throughput of a packet processing application heavily depends on the amount of CPU cycles available for its processing task. This amount is influenced by numerous factors, and the following measurements present a selection of factors we consider relevant for real-world applications: the overhead caused by the complexity of packet processing, the time the CPU spends waiting for data to arrive in the cache, and the effect of different batch sizes, i.e. whether the packet throughput rises if more packets are processed per call. For every factor a dedicated measurement is performed. As the batch size in particular determines the queuing delay of the packets on the processing system, latency during packet forwarding is also investigated.

Initially, we explain the test setup and various methods to precisely determine the used CPU cycles, and check them for applicability to our tests.

5.1 Measurement Setup

Our test setup consists of three distinct servers: a forwarder running the investigated frameworks, a source, and a sink connected via 10 GbE links. The device under test is equipped with a dual-port Intel X520-SR2 NIC; the load generator and sink use single-port X520-SR1 NICs. These cards use PCIe v2.0 with 8 lanes, which offers a usable link bandwidth of 32 Gbit/s in both directions. The Intel cards were chosen as driver implementations exist for each of the investigated frameworks; this also avoids possible performance influences introduced by different NICs. The server acting as forwarder runs on an Intel Xeon E3-1230 V2 CPU. The clock speed was fixed at 3.3 GHz, with power conserving mechanisms, Turbo Boost, and Hyper-Threading deactivated to make the measurements consistent and repeatable.

The forwarder statically forwards packets between the two interfaces without consulting a routing or flow table. It modifies a single byte in the packet to ensure that the packet is loaded into the first level cache. Forwarding is done in a single thread pinned to a specific core.

As performance depends on the number of processed packets rather than the length of the individual packets, we use constant bit rate traffic with the minimum packet size of 64 B for all measurements in this paper to maximize the load on the frameworks. The packets are counted on the sink using the statistics registers of the NIC.

We conducted measurements with a version of netmap published on March 23, 2014 in the official repository [37], PF_RING ZC version 6.0.2 [20], and DPDK version 1.6.0 [38].

Our packet generator MoonGen [39] was used for latency measurements. It uses hardware features of our Intel NICs for sub-microsecond latency determination.

Every data point in our performance measurements is an average value, calculated from 30 single measurements over a period of 30 s. Confidence intervals are unnecessary as results are stable and reproducible for all frameworks, an observation also made by Rizzo in the initial presentation of netmap [1].

5.2 Determining the Transmission Efficiency

In our testbed all of the frameworks are able to forward packets at full line rate with a single CPU core. To measure the transmission efficiency, expressed by the CPU load caused by packet transmission, the CPU load generated by each framework needs to be compared. In Equation 7 this efficiency is referred to as c_IO. A low number of cycles spent on c_IO increases the number of cycles available for the actual packet processing task, i.e. c_task. This in return allows more demanding applications to be built on more efficient frameworks without performance penalties.

5.2.1 Known Approaches for Measuring CPU Load

Due to their architecture (cf. Section 2.3), excessive polling on the NIC causes the CPU cores used by the frameworks to be under full load at all times. Therefore, a simple comparison of CPU usage with a tool like top does not work: for this kind of measurement there is no way to tell the relative portions of the three components of c*_const^packet in Equation 7 apart.

The use of a profiling tool would list the relative portion of each called function. By adding up the results for the functions associated with the efficiency c_IO, this component could be determined. This method was also rejected, as the overhead introduced by the interruptions caused by the profiling tool itself lowers the throughput and affects the measurement.

Rizzo measured efficiency by reducing the CPU clock frequency until the throughput of the NIC began to decline [1]. At this point no busy waiting cycles happen, as depicted in Figure 1; this results in a c_busy value of 0. The packet processing task was simplified so that the component named c_task can also be neglected. Only the component c_IO remains, which is the efficiency of the framework. However, even at the lowest supported clock speed (1.6 GHz) in our test setup the forwarders transmitted at full line rate. Therefore this solution could not be applied either.

5.2.2 Novel Method

To overcome the flaws of the previously presented methods for determining the efficiency, we introduce a novel method for our measurements. We add a piece of software producing a constant load per packet on the CPU. The load can be specified as a number of CPU cycles to wait. This value can be increased until the throughput begins to decline. Intel provides a benchmark method [40] based on a clock counter called the TSC. We used this guide to design and calibrate this load mechanism. The code containing the load generation and benchmarking mechanisms is available at [41].

At the point of decline, c_busy is known to be 0 and c_task is known by design. Subsequently c_IO can be calculated. For this experiment the forwarders were modified to implement this emulated CPU load c_task by spending a predefined number of CPU cycles per packet beside the framework's packet IO operations. For the basic performance tests we ignore cache effects that can occur during a lookup in a data structure (e.g. a forwarding or flow table). Therefore, we can assume that packet processing applications spend a fixed amount of CPU cycles per packet.
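The core of such a load generator can be sketched in a few lines around the TSC. This is only the idea; the released implementation [41] additionally calibrates the loop following the guide in [40], e.g. to account for the overhead of reading the counter. A fixed clock frequency, as in our setup with Turbo Boost disabled, is assumed so that TSC cycles correspond to core cycles.

```c
#include <stdint.h>
#include <x86intrin.h> /* __rdtsc() */

/* Sketch of the task emulator of Section 5.2.2: burn a constant number
 * of cycles per packet, independent of the packet's content or length. */
static inline void emulate_task(uint64_t c_task)
{
    uint64_t start = __rdtsc();
    while (__rdtsc() - start < c_task)
        ; /* busy wait until c_task cycles have elapsed */
}
```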
5.3 Measuring the Transmission Efficiency

Figure 2(a) presents the results of throughput measurements with different CPU loads for the task emulator. As anticipated by our model in Figure 1, an increasing workload decreases the measured throughput. In the next step, we get back to our goal of measuring the per-packet CPU load consumed by each framework. To forward a packet, a CPU core must dedicate a number of cycles to transmission (c_IO), i.e. to receiving and sending a packet, cycles to the emulated task (c_task), and possibly polls the NIC unnecessarily (c_busy).

Knowing f_CPU, and taking the packet rate n and c_task from Figure 2(a), allows for the calculation of c_busy + c_IO by applying Equation 7. These results are shown in Figure 2(b) for each framework. Starting at around 220 cycles for c_busy + c_IO, the graph decreases until the throughput is no longer limited by the 10 Gbit/s line rate. At this point, the throughput becomes limited by the CPU and no busy wait cycles happen any longer, i.e. c_busy = 0. This allows for the separation of the two components c_busy and c_IO into two individual graphs, also depicted in Figure 2(b).

[Figure 2: Transmission efficiency measurements. (a) Forwarding throughput [Mpps] over the emulated task cost c_task [CPU cycles] for netmap (nm), PF_RING (PR), and DPDK (DK). (b) Transmission cycles c_IO and busy polling cycles c_busy per framework over c_task.]

netmap becomes CPU bound with 50 cycles of additional workload per packet, DPDK and PF_RING ZC after 150 cycles. At this point c_IO, which describes the cycles needed for a packet to be received and sent by the respective framework, reaches its lowest value and stays roughly constant for all higher packet rates. DPDK has the lowest CPU cost per packet forwarding operation with approximately 100 cycles.

We measured a c_IO of approximately 900 cycles for forwarding applications based on the Linux network stack in previous work [30]. This means that the frameworks discussed in this paper can lead to a nine-fold performance increase over classical applications.

5.4 Influence of Caches

The forwarding scenarios in the previous section ignored the influence of caches, which can introduce a delay when accessing a data structure, e.g. the routing table. To imitate this behavior, the task emulator described in the preceding section was enhanced to access a data structure while transferring packets.

The time needed to access data residing in RAM is shortened by the ability of modern CPUs to buffer accesses to RAM in a hierarchy of several caches differing in size and access time. To test different scenarios with only partly filled caches, the size of the data structure was made adaptable. Software influences what is put into the cache indirectly, by accessing data in RAM, which is then put into the cache, or by giving hints about memory addresses via specialized commands. To optimize for common access patterns, data close to already accessed addresses can be prefetched by the CPU before it is accessed [42]. Our tests showed that if a data structure is accessed linearly, this prefetching works efficiently enough to hide the slow access speed of RAM. In the scenario of a routing table, however, the data to be accessed is determined by the traffic, and the access pattern is likely to be non-linear.

To mimic a worst-case scenario, the accessed addresses were randomized. Aiming for a realistic scenario, the prefetching was counteracted by using a circular linked list with a random access pattern. This was achieved by randomly choosing the links between the list elements while ensuring that the permutation contains a single cycle, so that all memory locations are accessed when the list is traversed. This guarantees random accesses to RAM or cache by iterating one step through the list for each received packet (see the sketch below). The size of the linked list can be varied to emulate different routing or flow table sizes. An implementation of this data structure is publicly available [41].
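A sketch of this data structure follows; the published implementation [41] is authoritative. Encoding the list as an array of indices and shuffling it with Sattolo's algorithm, which produces a permutation consisting of a single cycle, guarantees that a traversal visits every element exactly once in a pseudo-random order and thereby defeats the prefetcher (rand() is used here for brevity).

```c
#include <stdlib.h>

/* Build a circular linked list in an array: next[i] is the index of the
 * successor of element i. Sattolo's algorithm (swap only with j < i)
 * yields a permutation forming one single cycle over all n elements. */
static size_t *build_cyclic_list(size_t n)
{
    size_t *next = malloc(n * sizeof(*next));
    for (size_t i = 0; i < n; i++)
        next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = rand() % i; /* j is strictly smaller than i */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
    return next;
}

/* One emulated table lookup per received packet: a single dependent
 * load whose latency depends on which cache level holds the element. */
static inline size_t lookup_step(const size_t *next, size_t pos)
{
    return next[pos];
}
```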
Figure 3(a) depicts the throughput of the investigated frameworks in relation to the list size of our task emulator. For every packet processed, one emulated table lookup was performed. To investigate CPU-limited, rather than NIC-limited, throughput, a constant CPU load of 100 cycles was introduced, the point in Figure 2(a) where the throughput was beginning to decline for all three frameworks. This offset explains the lower throughput of netmap in Figure 3(a), as expected from the data in Figure 2(a).

The CPU in our test server has 3 cache levels: a L1-cache with 32 KB, a L2-cache with 256 KB, and a L3-cache with 8 MB [42]. Measurements showed that the average access time is 10 cycles for list sizes ≤ 32 KB, growing to 20 cycles for list sizes ≤ 256 KB, growing to 60 cycles for list sizes ≤ 8 MB, and finally reaching 250 cycles for list sizes larger than that.

The graph in Figure 3(a) shows no clear transition from L1 to L2 due to the low 10-cycle increase. The decline at around 256 KB is visible due to the larger speed difference between L2-cache and L3-cache. The next drop in the graph is the transition between L3-cache and non-cached RAM accesses. DPDK is slightly slower than PF_RING ZC when the L2-cache is fully occupied by the data structure. This means that DPDK has a slightly higher cache footprint compared to PF_RING ZC.

Figure 3(b) plots the cache misses, obtained by reading the CPU's performance registers. Only the results for DPDK are given; the results for netmap and PF_RING ZC are similar and are not shown for improved readability of the graph. The number of cache misses starts at a certain level and begins to rise as a cache fills up, until the size of the test data exceeds the respective cache size. This observation holds for every cache level.

[Figure 3: Cache measurements. (a) Throughput [Mpps] over the working set size [Bytes, log2 scale] for DPDK (DK), PF_RING (PR), and netmap (nm), with the L1, L2, and L3 cache sizes marked. (b) L1, L2, and L3 cache misses [Hz, log10 scale] for DPDK over the working set size.]

The data shown in Figure 3(a) can be used to test our model against a different problem. In contrast to the previously fixed load per packet, in this experiment the load per packet was determined by the cache access times. However, even under these circumstances the model provides a good estimation if the average per-packet costs are used. For instance, at a list size of 256 MB the average cost to access a list element is 250 cycles. Taking the 100 extra cycles into account, this leads to average costs of 350 cycles for c_task. For DPDK, c_IO is roughly 100 cycles and f_CPU is 3.3 GHz. The expected throughput is 7.3 Mpps with our model, and the measured value in Figure 3(a) is 7.1 Mpps. The minor difference can be explained by the fact that the test data structure also competes for cache space with data required by the framework, which results in additional overhead beyond the cache miss when sending or receiving packets. Therefore, the size of the data structures required for routing also needs to be considered when designing a software router.
5.5 Influence of Batch Sizes

In the following measurements, we analyze the influence of the batch size, i.e. the number of packets handled by one API call.

The tests shown in Figure 4 were conducted using different batch sizes with increasing CPU load using the task emulator. For each iteration of the test, the batch size was doubled, starting at a batch size of 8 up to a batch size of 256. The results show that each framework profits from larger batch sizes. PF_RING and DPDK reach their highest throughput at a batch size of 32. The larger batch sizes are therefore omitted for those frameworks in Figure 4, as they also do not have adverse effects on the throughput. netmap needs a batch size of at least 256 to reach a throughput performance close to the other two frameworks. This is due to the relatively expensive system calls required to send or receive a batch (cf. Section 2.3).

[Figure 4: Throughput influenced by batch sizes. Throughput [Mpps] over c_task [CPU cycles] for batch sizes 8 and 32 (nm, DK, PR) and batch size 256 (nm).]

5.6 Latency

Increasing the batch size boosts throughput but raises latency, because the packets spend a longer time queued if processed in larger batches. Overloading a software forwarding application causes worst-case behavior for the latency because all queues fill up. So a high latency is expected for all cases where packets are dropped due to insufficient processing resources.

We used the IEEE 1588 hardware time stamping features of the Intel 82599 controller to measure the latency of the forwarding applications [9]. The packets are time stamped in hardware on the source and sink immediately before sending and after receiving them from the physical layer. The time stamps do not include any software latency or queuing delays on the source and sink. This achieves sub-microsecond accuracy [39].

Figure 5 shows the latency for different batch sizes under a packet rate of 99% of the line rate and no additional workload. (Using full line rate with constant bit rate traffic causes delays after a minor interruption, like printing statistics, because it is not possible to send faster than the incoming traffic.) The latencies were acquired by sending time stamped packets periodically (up to 350 time stamped packets per second) at randomized intervals, using a different transmit queue on the load generator. The time stamped packets are indistinguishable from the normal load packets for the forwarding application.

Both DPDK and PF_RING ZC are overloaded with a batch size of 8, netmap with all batch sizes smaller than 256, as described in the previous section. This causes all queues to fill up and the applications exhibit the worst-case behavior that is typical for an overloaded system. DPDK and PF_RING achieve a latency of 9 µs with a batch size of 16, and the latency then gradually increases with the batch size. PF_RING ZC gets slightly faster than DPDK for larger batch sizes. netmap achieves a forwarding latency of 34 µs with a batch size of 256.

[Figure 5: Latency by batch size. Latency [µs, log10 scale] over batch sizes 8 to 256 for DPDK (DK), PF_RING (PR), and netmap (nm).]

These latencies can be compared to other forwarding methods and hardware switches: Rotsos et al. measured a latency of 35 µs for Open vSwitch under light load and 3 - 6 µs for hardware-based switches [32]. Bolla and Bruschi measured ∼15 µs to ∼80 µs for the Linux router in various scenarios without packet loss, and latencies in the order of 1000 µs for overload scenarios [13].

6. CONCLUSIONS AND OUTLOOK

High-speed packet IO frameworks are no longer in fledgling stages and allow for a multiple of the packet rates of classical network stacks. The performance increase comes from processing in batches, preallocated buffers, and avoiding costly interrupts.

We described the processing performance of high-speed packet IO frameworks. Starting with a model describing packet processing software in general, this model is gradually adapted to reflect applications using high-performance frameworks. For our experiments we rely on a precisely generated load on the CPU. The code generating these different kinds of load on the CPU is publicly available [41].

Experiments showed the performance characteristics predicted by this model, thus confirming the assumptions made during its development. The CPU time spent on receiving and transmitting packets, for instance, remained constant despite the influence of the varying processing times per packet. Further measurements showed that this model can be applied to estimate processing tasks which can be approximated with a constant average load. A possible use case for this model is to evaluate the eligibility of PC systems for specific packet processing tasks.

We also showed the trade-off between throughput and latency with different batch sizes. Larger batch sizes increase the performance but also the average latency. However, there is also a minimal batch size below which the frameworks are overloaded. In that case latency is a multiple of what it could be if the packets were sent in larger batches. These results can be used to choose the configuration and the framework best fit for an application's requirements, i.e. smaller batch sizes for applications sensitive to high latency or larger batch sizes for applications where raw performance is critical.

If mere performance and latency figures are considered, DPDK and PF_RING ZC seem to be superior to netmap. Though netmap has advantages: it uses well-known OS interfaces and modified system calls for packet IO, leading to increased performance while retaining a certain degree of interface continuity and system robustness by performing checks on the user-provided packet buffers. DPDK and PF_RING ZC favor more radical approaches by breaking with those concepts, resulting in even higher performance gains, but lack the robustness and familiarity of the API. An application built on DPDK or PF_RING ZC can crash the system by misconfiguring the NIC, a scenario that is prevented by netmap's kernel driver.

Our conclusion is that the modification of the classical design of system interfaces results in higher performance. The more these interfaces are modified, the higher the packet rates that can be achieved. As a drawback, this requires applications to be ported to one of these frameworks.

Acknowledgment

This research was supported by the DFG as part of the MEMPHIS project (CA 595/5-2), the KIC EIT ICT Labs on SDN, and the BMBF under EUREKA-Project SASER (01BP12300A). We also want to express our sincere thanks to the anonymous reviewers, especially reviewer #1, for the constructive feedback.

7. REFERENCES

[1] L. Rizzo, "netmap: a novel framework for fast packet I/O," in USENIX Annual Technical Conference, April 2012.
[2] "Impressive Packet Processing Performance Enables Greater Workload Consolidation," Intel Solution Brief, Intel Corporation, 2013, Whitepaper.
[3] "PF_RING ZC," http://www.ntop.org/products/pf_ring/pf_ring-zc-zero-copy/, last visited 2015-03-31.
[4] L. Rizzo and G. Lettieri, "VALE, a switched ethernet for virtual machines," in CoNEXT. ACM, 2012, pp. 61–72.
[5] "Intel Open Source Technology Center," https://01.org/packet-processing, last visited 2015-03-31.
[6] "netmap-ipfw," https://code.google.com/p/netmap-ipfw/, last visited 2015-03-31.
[7] F. Fusco and L. Deri, "High Speed Network Traffic Analysis with Commodity Multi-core Systems," in Internet Measurement Conference, November 2010, pp. 218–224.
[8] S. Gallenmüller, "Comparison of Memory Mapping Techniques for High-Speed Packet Processing," http://www.net.in.tum.de/fileadmin/bibtex/publications/theses/2014-gallenmueller-high-speed-packet-processing.pdf, 2014, last visited 2015-03-31.
[9] "Intel 82599 10 GbE Controller Datasheet Rev. 2.76," Intel Corporation, 2012.
[10] "Intel I/O Acceleration Technology," http://www.intel.com/content/www/us/en/wireless-network/accel-technology.html, Intel, last visited 2015-03-31.
[11] J. H. Salim, "When NAPI Comes To Town," in Linux 2005 Conf, 2005.
[12] M. Dobrescu, K. Argyraki, and S. Ratnasamy, "Toward predictable performance in software packet-processing platforms," in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, ser. NSDI'12. Berkeley, CA, USA: USENIX Association, 2012, pp. 11–11. [Online]. Available: http://dl.acm.org/citation.cfm?id=2228298.2228313
[13] R. Bolla and R. Bruschi, "Linux Software Router: Data Plane Optimization and Performance Evaluation," Journal of Networks, vol. 2, no. 3, pp. 6–17, June 2007.
[14] J. L. García-Dorado, F. Mata, J. Ramos, P. M. Santiago del Río, V. Moreno, and J. Aracil, "Data Traffic Monitoring and Analysis," E. Biersack, C. Callegari, and M. Matijasevic, Eds. Berlin, Heidelberg: Springer-Verlag, 2013, ch. High-Performance Network Traffic Processing Systems Using Commodity Hardware, pp. 3–27.
[15] L. Deri, "nCap: Wire-speed Packet Capture and Transmission," in End-to-End Monitoring Techniques and Services. IEEE, 2005, pp. 47–55.
[16] "netmap(4)," in FreeBSD 11 Man Pages. The FreeBSD Project, 2014.
[17] S. Hemminger, "netmap: infrastructure (in staging)," http://lwn.net/Articles/548077/, last visited 2015-03-31.
[18] "The Click Modular Router Project," http://www.read.cs.ucla.edu/click/, last visited 2015-03-31.
[19] "PF_RING ZC API," http://www.ntop.org/pfring_api/pfring__zc_8h.html, last visited 2015-03-31.
[20] "Ntop," http://www.ntop.org, last visited 2015-03-31.
[21] "Data Plane Development Kit: Programmer's Guide, Revision 6," Intel Corporation, 2014.
[22] "UIO: user-space drivers," http://lwn.net/Articles/232575/, last visited 2015-03-31.
[23] "xDPd," https://www.xdpd.org, last visited 2015-03-31.
[24] S. Han, K. Jang, K. Park, and S. Moon, "PacketShader: a GPU-accelerated Software Router," SIGCOMM Computer Communication Review, vol. 40, no. 4, 2010. [Online]. Available: http://dl.acm.org/citation.cfm?id=2043164.1851207
[25] "PacketShader Packet-IO-Engine," https://github.com/PacketShader/Packet-IO-Engine, last visited 2015-03-31.
[26] N. Bonelli, A. Di Pietro, S. Giordano, and G. Procissi, "On Multi-Gigabit Packet Capturing With Multi-Core Commodity Hardware," in Passive and Active Measurement. Springer, 2012, pp. 64–73.
[27] N. Bonelli, S. Giordano, G. Procissi, and L. Abeni, "A purely functional approach to packet processing," in Proceedings of the Tenth ACM/IEEE Symposium on Architectures for Networking and Communications Systems, ser. ANCS '14. New York, NY, USA: ACM, 2014, pp. 219–230.
[28] "Snabb Switch: Fast open source packet processing," https://github.com/SnabbCo/snabbswitch, last visited 2015-03-31.
[29] G. Pongracz, L. Molnar, and Z. L. Kis, "Removing Roadblocks from SDN: OpenFlow Software Switch Performance on Intel DPDK," Second European Workshop on Software Defined Networks (EWSDN'13), pp. 62–67, 2013.
[30] P. Emmerich, D. Raumer, F. Wohlfart, and G. Carle, "Assessing Soft- and Hardware Bottlenecks in PC-based Packet Forwarding Systems," in Fourteenth International Conference on Networks (ICN 2015), Barcelona, Spain, Apr. 2015.
[31] A. Tedesco, G. Ventre, L. Angrisani, and L. Peluso, "Measurement of Processing and Queuing Delays Introduced by a Software Router in a Single-Hop Network," in IEEE Instrumentation and Measurement Technology Conference, May 2005, pp. 1797–1802.
[32] C. Rotsos, N. Sarrar, S. Uhlig, R. Sherwood, and A. W. Moore, "Oflops: An Open Framework for OpenFlow Switch Evaluation," in Passive and Active Measurement. Springer, 2012, pp. 85–95.
[33] S. Larsen, P. Sarangam, R. Huggahalli, and S. Kulkarni, "Architectural Breakdown of End-to-End Latency in a TCP/IP Network," International Journal of Parallel Programming, vol. 37, no. 6, pp. 556–571, 2009.
[34] PCI-SIG, "PCI Express Base Specification Revision 2.0," 2006.
[35] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. Elsevier, 2012.
[36] L. Braun, C. Diekmann, N. Kammenhuber, and G. Carle, "Adaptive Load-Aware Sampling for Network Monitoring on Multicore Commodity Hardware," in IFIP Networking, May 2013.
[37] "netmap," https://code.google.com/p/netmap/, last visited 2015-03-31.
[38] "DPDK," http://www.dpdk.org, last visited 2015-03-31.
[39] P. Emmerich, S. Gallenmüller, F. Wohlfart, D. Raumer, and G. Carle, "MoonGen: A Scriptable High-Speed Packet Generator," http://go.tum.de/276657, 2015, Draft, Conference tbd.
[40] G. Paoloni, "How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures," 2010.
[41] "Simulation algorithms for empirical evaluation of processor performance," https://github.com/gallenmu/sheep, last visited 2015-05-28.
[42] "Intel 64 and IA-32 Architectures Optimization Reference Manual," Intel Corporation, 2012.
