c∗packet = cIO + ctask + cbusy = n · fCPU / T = const    (5)
Under overload the CPU does not busy-wait any longer, i.e. cbusy = 0. This allows for the separation of the two components cbusy and cIO into two individual graphs, also depicted in Figure 2(b).

[Figure 2: Transmission efficiency measurements. Shown in (b): transmission cycles cIO and busy polling cycles cbusy over ctask [CPU cycles] for netmap (nm), PF_RING ZC (PR), and DPDK (DK).]

netmap becomes CPU bound with 50 cycles of additional workload per packet, DPDK and PF_RING ZC after 150 cycles. At this point cIO, which describes the cycles needed for a packet to be received and sent by the respective framework, reaches its lowest value and stays roughly constant for all higher packet rates. DPDK has the lowest CPU cost per packet forwarding operation with approximately 100 cycles.

We measured a cIO of approximately 900 cycles for forwarding applications based on the Linux network stack in previous work [30]. This means that the frameworks discussed in this paper can lead to a nine-fold performance increase over classical applications.

5.4 Influence of Caches

The forwarding scenarios in the previous section ignored the influence of caches, which can introduce a delay when accessing a data structure, e.g. the routing table. To imitate this behavior, the task emulator described in the preceding section was enhanced to access a data structure while transferring packets. This data structure is a linked list with randomized links between the list elements, while ensuring that the permutation contains a single cycle so that all memory locations are accessed when the list is traversed. This guarantees random access to RAM or cache by iterating one step through the list for each received packet. The size of the linked list can be varied to emulate different routing or flow table sizes. An implementation of this data structure is publicly available [41].
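For illustration, a minimal C sketch of such a structure follows. The element layout, naming, and shuffle are our own illustrative choices and not necessarily those of the published implementation [41]; shuffling the element indices and chaining them in shuffled order yields a permutation with exactly one cycle, so a traversal visits every element before repeating.

  #include <stdlib.h>
  #include <stddef.h>

  /* One list element per cache line; 'next' holds the index of the
   * successor in the permutation. */
  struct elem {
      size_t next;
      char pad[64 - sizeof(size_t)]; /* spread elements over cache lines */
  };

  /* Link all n elements into one random cycle: shuffle the indices
   * (Fisher-Yates) and chain them in shuffled order, closing the loop. */
  static void build_cycle(struct elem *list, size_t n)
  {
      size_t *perm = malloc(n * sizeof(*perm));
      for (size_t i = 0; i < n; i++)
          perm[i] = i;
      for (size_t i = n - 1; i > 0; i--) {
          size_t j = (size_t)rand() % (i + 1);
          size_t tmp = perm[i]; perm[i] = perm[j]; perm[j] = tmp;
      }
      for (size_t i = 0; i < n; i++)
          list[perm[i]].next = perm[(i + 1) % n];
      free(perm);
  }

  /* Per received packet: advance one step, forcing one memory access
   * whose latency depends on whether the list fits into a cache level. */
  static size_t emulate_lookup(const struct elem *list, size_t cur)
  {
      return list[cur].next;
  }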
Figure 3(a) depicts the throughput of the investigated frameworks in relation to the list size of our task simulator. For every packet processed, one emulated table lookup was performed. To investigate CPU-limited, rather than NIC-limited, throughput, a constant CPU load of 100 cycles was introduced; this is the point in Figure 2(a) where the throughput was beginning to decline for all three frameworks. This offset explains the lower throughput of netmap in Figure 3(a), as expected from the data in Figure 2(a).

The CPU in our test server has 3 cache levels: an L1-cache with 32 KB, an L2-cache with 256 KB, and an L3-cache with 8 MB [42]. Measurements showed that the average access time is 10 cycles for list sizes ≤ 32 KB, growing to 20 cycles for list sizes ≤ 256 KB, growing to 60 cycles for list sizes ≤ 8 MB, and finally reaching 250 cycles for list sizes larger than that.
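Such average access times can be obtained by timing a pointer chase; the following is our own simplified sketch (reusing struct elem from above), where __rdtsc() reads the CPU's time stamp counter and serialization and warm-up effects are deliberately ignored.

  #include <stdint.h>
  #include <stddef.h>
  #include <x86intrin.h> /* __rdtsc() */

  /* Average cycles per list access: chase 'steps' links and divide the
   * elapsed time stamp counter delta by the number of steps. */
  static uint64_t cycles_per_access(const struct elem *list, size_t steps)
  {
      volatile size_t sink; /* keeps the loop from being optimized away */
      size_t cur = 0;
      uint64_t start = __rdtsc();
      for (size_t i = 0; i < steps; i++)
          cur = list[cur].next;
      uint64_t end = __rdtsc();
      sink = cur;
      (void)sink;
      return (end - start) / steps;
  }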
The graph in Figure 3(a) shows no clear transition from L1 to L2 due to the low 10-cycle increase. The decline at around 256 KB is visible due to the larger speed difference between the L2-cache and the L3-cache. The next drop in the graph is the transition between the L3-cache and non-cached RAM accesses. DPDK is slightly slower than PF_RING ZC when the L2-cache is fully occupied by the data structure. This means that DPDK has a slightly higher cache footprint compared to PF_RING ZC.

Figure 3(b) plots the cache misses, obtained by reading the CPU's performance registers. Only the results for DPDK are given; the results for netmap and PF_RING ZC are similar and are not shown to improve the readability of the graph. The number of cache misses starts at a certain level and begins to rise as a cache fills up, until the size of the test data exceeds the respective cache size. This observation holds for every cache level.
[Figure 3: Cache measurements. (a) Performance with memory accesses: throughput [Mpps] over working set size [Bytes] (log2 scale) for DK, PR, and nm, with the L1/L2/L3 cache sizes marked. (b) Cache misses for DPDK: L1/L2/L3 misses [Hz] (log10 scale) over working set size [Bytes] (log2 scale).]

The data shown by Figure 3(a) can be used to test our model against a different problem. In contrast to the previously fixed load per packet, in this experiment the load per packet was determined by the cache access times. However, even under these circumstances the model provides a good estimation if the average per-packet costs are used. For instance, at a list size of 256 MB the average cost to access a list element is 250 cycles. Taking the 100 extra cycles into account, this leads to average costs of 350 cycles for ctask. For DPDK, cIO is roughly 100 cycles and fCPU is 3.3 GHz. The expected throughput is 7.3 Mpps with our model, and the measured value in Figure 3(a) is 7.1 Mpps. The minor difference can be explained by the fact that the test data structure also competes for cache space with data required by the framework, which results in additional overhead beyond the cache miss when sending or receiving packets. Therefore, the size of data structures required for routing also needs to be considered when designing a software router.
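This model check is easy to reproduce; a small sketch with the numbers from the text (n = 1 core, cbusy = 0 under full load; all values are the approximations given above):

  #include <stdio.h>

  /* Expected CPU-bound throughput: T = n * fCPU / (cIO + ctask + cbusy). */
  int main(void)
  {
      double f_cpu  = 3.3e9; /* CPU frequency [Hz] */
      double n      = 1.0;   /* cores */
      double c_io   = 100.0; /* DPDK IO cost [cycles/packet] */
      double c_task = 350.0; /* 250 (RAM access) + 100 (fixed load) [cycles] */

      double t = n * f_cpu / (c_io + c_task);
      printf("expected throughput: %.1f Mpps\n", t / 1e6); /* prints 7.3 */
      return 0;
  }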
5.5 Influence of Batch Sizes

In the following measurements, we analyze the influence of the batch size, i.e. the number of packets handled by one API call.

The tests shown in Figure 4 were conducted using different queue sizes with increasing CPU load generated by the task emulator. For each iteration of the test, the batch size was doubled, starting at a batch size of 8 up to a batch size of 256. The results show that each framework profits from larger batch sizes. PF_RING and DPDK reach their highest throughput at a batch size of 32. The larger batch sizes are therefore omitted for those frameworks in Figure 4 because they also do not have adverse effects on the throughput. netmap needs a batch size of at least 256 to reach a throughput performance close to the other two frameworks. This is due to the relatively expensive system calls required to send or receive a batch (cf. Section 2.3).

[Figure 4: Throughput influenced by batch sizes. Throughput [Mpps] over ctask [CPU cycles] for batch sizes 8 and 32 (nm, DK, PR) and batch size 256 (nm).]
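To make the notion of a batch concrete, the following hedged sketch shows a forwarding loop in the style of DPDK's burst API. rte_eth_rx_burst() and rte_eth_tx_burst() are DPDK's actual receive and transmit calls; initialization, error handling, and the exact integer types (which vary across DPDK releases) are omitted or simplified, and the surrounding code is our own.

  #include <rte_ethdev.h>
  #include <rte_mbuf.h>

  #define BATCH_SIZE 32 /* 8 to 256 in the experiments above */

  /* Forward packets from in_port to out_port, up to BATCH_SIZE packets
   * per API call, so the per-call overhead is amortized over the batch. */
  static void forward_loop(uint16_t in_port, uint16_t out_port)
  {
      struct rte_mbuf *batch[BATCH_SIZE];

      for (;;) {
          uint16_t nb_rx = rte_eth_rx_burst(in_port, 0, batch, BATCH_SIZE);
          uint16_t nb_tx = rte_eth_tx_burst(out_port, 0, batch, nb_rx);
          for (uint16_t i = nb_tx; i < nb_rx; i++)
              rte_pktmbuf_free(batch[i]); /* drop what TX did not accept */
      }
  }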
5.6 Latency

Increasing the batch size boosts throughput but raises latency, because the packets spend a longer time queued if processed in larger batches. Overloading a software forwarding application causes a worst-case behavior for the latency because all queues will fill up. So a high latency is expected for all cases where packets are dropped due to insufficient processing resources.
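A back-of-the-envelope example (our own estimate, not a measurement from this paper) illustrates the queueing cost of batching: at 10 GbE line rate with minimum-sized frames, roughly 14.88 Mpps, filling a single batch of 256 packets alone takes

  256 packets / 14.88 Mpps ≈ 17.2 µs,

and a packet can wait in both an RX and a TX batch, which is consistent in magnitude with the netmap latency of 34 µs at batch size 256 reported below.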
We used the IEEE 1588 hardware time stamping features of the Intel 82599 controller to measure the latency of the forwarding applications [9]. The packets are time stamped in hardware on the source and sink immediately before sending and after receiving them from the physical layer. The time stamps do not include any software latency or queuing delays on the source and sink. This achieves sub-microsecond accuracy [39].

Figure 5 shows the latency for different batch sizes under a packet rate of 99% of the line rate¹ and no additional workload. The latencies were acquired by sending time stamped packets periodically (up to 350 time stamped packets per second) at randomized intervals, using a different transmit queue on the load generator. The time stamped packets are indistinguishable from the normal load packets for the forwarding application.

¹ Using full line rate with constant bit rate traffic causes delays after a minor interruption (like printing statistics) because it is not possible to send faster than the incoming traffic.

[Figure 5: Latency by batch size. Latency [µs] (log10 scale) over batch sizes 8 to 256 for DK, PR, and nm.]

Both DPDK and PF_RING ZC are overloaded with a batch size of 8, netmap with all batch sizes smaller than 256, as described in the previous section. This causes all queues to fill up and the applications exhibit a worst-case behavior that is typical for a system that is overloaded. DPDK and PF_RING achieve a latency of 9 µs with a batch size of 16, and the latency then gradually increases with the batch size. PF_RING ZC gets slightly faster than DPDK for larger batch sizes. netmap achieves a forwarding latency of 34 µs with a batch size of 256.

These latencies can be compared to other forwarding methods and hardware switches: Rotsos et al. measured a latency of 35 µs for Open vSwitch under light load and 3-6 µs for hardware-based switches [32]. Bolla and Bruschi measured ∼15 µs to ∼80 µs for the Linux router in various scenarios without packet loss, and latencies in the order of 1000 µs for overload scenarios [13].

These results suggest choosing smaller batch sizes for applications sensitive to high latency, or larger batch sizes for applications where raw performance is critical.

If mere performance and latency figures are considered, DPDK and PF_RING ZC seem to be superior to netmap. However, netmap has its advantages: it uses well-known OS interfaces and modified system calls for packet IO, leading to increased performance while retaining a certain degree of interface continuity and system robustness by performing checks on the user-provided packet buffers. DPDK and PF_RING ZC favor more radical approaches by breaking with those concepts, resulting in even higher performance gains, but they lack the robustness and familiarity of the API. An application built on DPDK or PF_RING ZC can crash the system by misconfiguring the NIC, a scenario that is prevented by netmap's kernel driver.

Our conclusion is that the modification of the classical design of system interfaces results in higher performance. The more these interfaces are modified, the higher the packet rates that can be achieved. As a drawback, this requires applications to be ported to one of these frameworks.

Acknowledgment

This research was supported by the DFG as part of the MEMPHIS project (CA 595/5-2), the KIC EIT ICT Labs on SDN, and the BMBF under EUREKA-Project SASER (01BP12300A). We also want to express our sincere thanks to the anonymous reviewers, especially reviewer #1, for the constructive feedback.