Abstract
Many applications (routers, traffic monitors, firewalls, etc.) need to send and receive packets at line rate even on very fast links. In this paper we present netmap, a novel framework that enables commodity operating systems to handle the millions of packets per second traversing 1..10 Gbit/s links, without requiring custom hardware or changes to applications. In building netmap, we identified and successfully reduced or removed three main packet processing costs: per-packet dynamic memory allocations, removed by preallocating resources; system call overheads, amortized over large batches; and memory copies, eliminated by sharing buffers and metadata between kernel and userspace, while still protecting access to device registers and other kernel memory areas. Separately, some of these techniques have been used in the past. The novelty in our proposal is not only that we exceed the performance of most previous work, but also that we provide an architecture that is tightly integrated with existing operating system primitives, not tied to specific hardware, and easy to use and maintain. netmap has been implemented in FreeBSD for several 1 and 10 Gbit/s network adapters. In our prototype, a single core running at 900 MHz can send or receive 14.88 Mpps (the peak packet rate on 10 Gbit/s links). This is more than 20 times faster than conventional APIs. Large speedups (5x and more) are also achieved on userspace Click and other packet forwarding applications using a libpcap emulation library running on top of netmap.
1 Introduction
General purpose OSes provide a rich and flexible environment for running, among others, many packet processing, network monitoring, and testing tasks. The high rate raw packet I/O required by these applications is not the
intended target of general purpose OSes. Raw sockets, the Berkeley Packet Filter [12] (BPF), the AF_SOCKET family, and equivalent APIs have been used to build all sorts of network monitors, traffic generators, and generic routing systems. Performance, however, is not adequate for the millions of packets per second (pps) that can be present on 1..10 Gbit/s links. In search of better performance, some systems (see Section 3) either run completely in the kernel, or bypass the device driver and the entire network stack by exposing the NIC's data structures to user space applications. Efficient as they may be, many of these approaches depend on specific hardware features, give unprotected access to hardware, or are poorly integrated with the existing OS primitives. The netmap framework presented in this paper combines and extends some of the ideas presented in the past, trying to address their shortcomings. Besides giving huge speed improvements, netmap does not depend on specific hardware¹, has been fully integrated in the OS (FreeBSD) with minimal modifications, and supports unmodified libpcap clients through a compatibility library. One metric to evaluate our framework is performance: in our implementation, moving one packet between the wire and the userspace application takes less than 70 CPU clock cycles, which is at least one order of magnitude faster than standard APIs. In other words, a single core running at 900 MHz can source or sink the 14.88 Mpps achievable on a 10 Gbit/s link. The same core running at 150 MHz is well above the capacity of a 1 Gbit/s link. Other, equally important, metrics are safety of operation and ease of use. netmap clients cannot possibly crash the system, because device registers and critical kernel memory regions are not exposed to clients, and they cannot inject bogus memory pointers in the kernel (these are often vulnerabilities of schemes based on shared memory).
¹ netmap can give isolation even without hardware mechanisms such as IOMMU or VMDq, and is orthogonal to hardware offloading and virtualization mechanisms (checksum, TSO, LRO, VMDc, etc.)
Figure 1: Typical NIC's data structures and their relation with the OS data structures.

At the same time, netmap uses an extremely simple data model well suited to zero-copy packet forwarding; supports multi-queue adapters; and uses standard system calls (select()/poll()) for event notification. All this makes it very easy to port existing applications to the new mechanism, and to write new ones that make effective use of the netmap API. The rest of the paper is structured as follows. Section 2 completes the discussion of the motivations for this work, and gives some background on how network adapters are normally managed in operating systems. Section 3 presents related work, illustrating some of the techniques that netmap integrates and extends. Section 4 describes netmap in detail. Performance data are presented in Section 5. Finally, Section 6 discusses open issues and our plans for future work.
the CPU of these events. On the transmit side, the NIC expects the OS to fill buffers with data to be sent. The request to send new packets is issued by writing into the registers of the NIC, which in turn starts sending packets marked as available in the TX ring. At high packet rates, interrupt processing can be expensive and possibly lead to the so-called receive livelock [14], or inability to perform any useful work above a certain load. Polling device drivers [8, 14, 16] and the hardware interrupt mitigation implemented in recent NICs solve this problem. Some high speed NICs support multiple transmit and receive rings. This helps spread the load over multiple CPU cores, eases on-NIC traffic filtering, and helps decouple virtual machines sharing the same hardware.
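As a concrete illustration of the structures in Figure 1, the sketch below shows the kind of descriptor layout a typical NIC ring uses. The field names are generic placeholders chosen for this paper's discussion, not those of any specific adapter.

#include <stdint.h>

/*
 * Illustrative sketch of a NIC transmit/receive ring (cf. Figure 1).
 * Field names are generic; real adapters differ in layout and flags.
 */
struct nic_slot {                /* one DMA descriptor */
    uint64_t phy_addr;           /* physical address of the packet buffer */
    uint16_t len;                /* length of the data in the buffer      */
    uint16_t flags;              /* e.g. "descriptor done" status bits    */
};

struct nic_ring {
    struct nic_slot *slots;      /* descriptor array, shared with the NIC */
    uint32_t num_slots;
    uint32_t head, tail;         /* indexes exchanged via NIC registers   */
};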
ware architectures is that most systems barely reach 0.5..1 Mpps per core from userspace, and even remaining in the kernel yields only modest speed improvements, usually within a factor of 2.
nisms such as IOMMUs), and a misbehaving client can potentially trash data anywhere in the system. UIO-IXGBE [9] implements exactly what we have described above: buffers, hardware rings and NIC registers (see Figure 1) are directly accessible to user programs, with obvious risks for the stability of the system.

PF_RING [2] exports to userspace clients a shared memory region containing a ring of pre-allocated packet buffers. The kernel is in charge of copying data between skbufs and the shared buffers, so the system is safe and no driver modifications are needed. This approach amortizes the system call costs over batches of packets, but retains the data copy and skbuf management overhead. An evolution of PF_RING called DNA [3] avoids the copy because the memory mapped ring buffers are directly accessed by the NIC. As in UIO-IXGBE, DNA clients have direct access to the NIC's registers and rings.

The PacketShader [5] I/O engine (PSIOE) is one of the closest relatives to our proposal. PSIOE uses a custom device driver that replaces the skbuf-based API with a simpler one, using preallocated buffers. Custom ioctl()s are used to synchronize the kernel with userspace applications, and multiple packets are passed up and down through a memory area shared between the kernel and the application. The kernel is in charge of copying packet data between the shared memory and packet buffers. Unlike netmap, PSIOE only supports one specific network card, and does not support select()/poll(), requiring modifications to applications in order to let them use the new API.

Hardware solutions: Some hardware has been designed specifically to support high speed packet capture, or possibly generation, together with special features such as timestamping, filtering, and forwarding. Usually these cards come with custom device drivers and user libraries to access the hardware. As an example, DAG [1, 7] cards are FPGA-based devices for wire-rate packet capture and precise timestamping, using fast onboard memory for the capture buffers (at the time, the I/O bus was unable to sustain line rate). NetFPGA [10] is another example of an FPGA-based card, where the firmware of the card can be programmed to implement specific functions directly in the NIC, offloading some work from the main CPU.
encryption, TX Segmentation Offloading, Large Receive Offloading, are completely orthogonal to our proposal: they reduce some processing in the host stack but do not address the communication with the device. Similarly orthogonal are the features related to virtualization, such as support for multiple hardware queues and the ability to assign traffic to specific queues (VMDq) and/or queues to specific virtual machines (VMDc, SR-IOV). We expect to run netmap within virtual machines, although it might be worthwhile (but not the focus of this paper) to explore how the ideas used in netmap could be used within a hypervisor to help the virtualization of network hardware.
4 Netmap
The previous survey shows that most related proposals have identified, and tried to remove, the following high cost operations in packet processing: data copying, metadata management, and system call overhead. Our framework, called netmap, is a system to give user space applications very fast access to network packets, both on the receive and the transmit side, and including those from/to the host stack. Efficiency does not come at the expense of safety of operation: potentially dangerous actions such as programming the NIC are validated by the OS, which also enforces memory protection. Also, a distinctive feature of netmap is the attempt to design and implement an API that is simple to use, tightly integrated with existing OS mechanisms, and not tied to a specific device or hardware features. netmap achieves its high performance through several techniques:

- a lightweight metadata representation which is compact, easy to use, and hides device-specific features. The representation also supports processing of large numbers of packets in each system call, thus amortizing its cost;
- linear, fixed size packet buffers that are preallocated when the device is opened, thus saving the cost of per-packet allocations and deallocations;
- removal of data-copy costs by granting applications direct, protected access to the packet buffers. The same mechanism also supports zero-copy transfer of packets between interfaces;
- support of useful hardware features (such as multiple hardware queues).

Overall, we use each part of the system for the task it is best suited to: the NIC to move data quickly between the network and memory, and the OS to enforce protection and provide support for synchronization.
Figure 2: In netmap mode, the NIC rings are disconnected from the host network stack, and exchange packets through the netmap API. Two additional netmap rings let the application talk to the host stack.
Figure 3: User view of the shared memory area exported by netmap.

At a very high level, when a program requests to put an interface in netmap mode, the NIC is partially disconnected (see Figure 2) from the host protocol stack. The program gains the ability to exchange packets with the NIC and (separately) with the host stack, through circular queues of buffers (netmap rings) implemented in shared memory. Traditional OS primitives such as select()/poll() are used for synchronization. Apart from the disconnection in the data path, the operating system is unaware of the change, so it still continues to use and manage the interface as during regular operation.
netmap supports these features by associating, to each interface, three types of user-visible objects, shown in Figure 3: packet buffers, netmap rings, and netmap_if descriptors. All objects for all netmap-enabled interfaces in the system reside in one large memory region, allocated by the kernel in a non-pageable area, and shared by all user processes. The use of a single buffer pool is just for convenience and not a strict requirement of the architecture. With little effort and almost negligible runtime overhead we could modify the API to have separate memory regions for different interfaces, or even introduce shadow copies of the data structures to reduce data sharing. Since the memory region is mapped by processes and kernel threads in different virtual address spaces, any memory reference contained in that region must use relative addresses, so that pointers can be calculated in a position-independent way. The solution to this problem is to implement references as offsets between the parent and child data structures.

Packet buffers have a fixed size (2 Kbytes in the current implementation) and are shared by the NICs and user processes. Each buffer is identified by a unique index, which can be easily translated into a virtual address by user processes or by the kernel, and into a physical address used by the NIC's DMA engines. Buffers for all netmap rings are preallocated when the interface is put into netmap mode, so that during network I/O there is never the need to allocate memory, or a risk of running out of resources. The metadata describing the buffer (index, data length, some flags) are stored into slots which are part of the netmap rings described next. Each buffer is referenced by a netmap ring and by the corresponding hardware ring.

A netmap ring is a device-independent replica of the circular queue implemented by the NIC, and includes:

- ring_size, the number of slots in the ring;
- cur, the index of the current read or write position in the ring;
- avail, the number of available buffers, representing received packets in RX rings, or empty slots in TX rings;
- flags, to indicate special conditions such as an empty TX ring or errors;
- buf_ofs, the offset between the ring and the beginning of the array of (fixed-size) packet buffers;
- slots[], an array with ring_size entries. Each slot contains the index of the corresponding packet buffer, the length of the packet, and some flags used to request special operations on the buffer.
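To make the layout concrete, the following is a minimal sketch of the slot and ring structures just described. It follows the field names used in the text and in the examples later in the paper; the exact declarations in the netmap header may differ, for instance in integer widths or field order.

#include <stdint.h>

/*
 * Sketch of the shared-memory metadata described in the text.
 * Field names follow the paper; the actual netmap header may differ.
 */
struct netmap_slot {
    uint32_t buf_index;         /* index of the packet buffer               */
    uint16_t len;               /* length of the packet in the buffer       */
    uint16_t flags;             /* per-slot flags for special operations    */
};

struct netmap_ring {
    uint32_t ring_size;         /* number of slots in the ring              */
    uint32_t cur;               /* current read (RX) or write (TX) position */
    uint32_t avail;             /* received packets (RX) or free slots (TX) */
    uint32_t flags;             /* e.g. empty TX ring, error conditions     */
    int64_t  buf_ofs;           /* offset from the ring to the buffer array */
    struct netmap_slot slot[];  /* ring_size entries follow                 */
};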
Finally, a netmap_if contains read-only information describing the interface, such as the number of rings and an array with the memory offsets between the netmap_if and each netmap ring associated with the interface (once again, offsets are used to make addressing position-independent).

4.1.1 Data ownership and access rules

The netmap data structures are shared between the kernel and userspace, but the ownership of the various data areas is well defined, so that there are no races. In particular, the netmap ring is always owned by the userspace application except during the execution of a system call, when it is updated by the upper half of the kernel. The lower half of the kernel, which may be called by an interrupt, never touches a netmap ring. Packet buffers between cur and cur+avail-1 are owned by the userspace application, whereas the remaining buffers are owned by the kernel. The boundaries between these two regions are updated during system calls.
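As an illustration of these rules, a receiver only touches the avail slots starting at cur, and returns them to the kernel simply by advancing cur and decrementing avail before the next system call. This is a minimal sketch, assuming the structures sketched above and the macros used by the examples in Sections 4.4 and 4.6 (NETMAP_RXRING, NETMAP_BUF, NETMAP_RING_NEXT); process_packet() stands for whatever the application does with the data.

/*
 * Minimal sketch: consume the received packets the application owns,
 * i.e. the slots from cur to cur+avail-1, then hand them back to the
 * kernel by advancing cur/avail before the next poll()/NIOCRXSYNC.
 */
void
consume_rx(struct netmap_if *nifp, int ring_nr)
{
    struct netmap_ring *ring = NETMAP_RXRING(nifp, ring_nr);

    while (ring->avail > 0) {
        uint32_t i = ring->cur;
        char *buf = NETMAP_BUF(ring, ring->slot[i].buf_index);

        process_packet(buf, ring->slot[i].len);  /* application-defined */

        ring->cur = NETMAP_RING_NEXT(ring, i);   /* return the slot     */
        ring->avail--;
    }
    /* the kernel refills avail on the next system call */
}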
Both NIOC*SYNC ioctl()s are non-blocking, involve no data copying (except for the synchronization of the slots between the netmap and hardware rings), and can deal with multiple packets at once. These features are essential to reduce the per-packet overhead to very small values. The in-kernel part of these system calls does the following:

- locates the interface and rings involved, using a private kernel copy of the netmap_if which cannot be modified by user processes;
- validates the cur field and the content of the slots involved (lengths and buffer indexes, both in the netmap and hardware rings);
- synchronizes the content of the slots between the netmap and the hardware rings, and issues commands to the NIC to advertise new packets to send or newly received buffers;
- updates the avail field in the netmap ring.

The amount of work in the kernel is minimal, and the checks performed make sure that any value written to the shared data structure cannot cause system crashes.

4.2.1 Blocking primitives

Blocking I/O is supported through the select() and poll() system calls. netmap file descriptors can be passed to these functions, and are reported as ready (waking up the caller) when avail > 0. Before returning from a select()/poll(), the system updates the status of the rings, same as in the NIOC*SYNC ioctls. This way, applications spinning on an event loop require only one system call per iteration.

4.2.2 Multi-queue interfaces

For cards with multiple ring pairs, file descriptors (and the related ioctl() and poll()) can be configured in one of two modes, chosen through the ring_id field in the argument of the NIOCREG ioctl(). In the default mode, the file descriptor controls all rings, causing the kernel to check for available buffers on any of the rings. In the alternate mode, a file descriptor is associated with a single TX/RX ring pair. This allows multiple threads/processes to create separate file descriptors, bind them to different ring pairs, and operate independently on the card without interference or need for synchronization. Binding a thread to a specific core just requires a standard OS system call, setaffinity(), without the need for any new mechanism.
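The per-ring mode naturally maps to one thread per ring pair. The sketch below illustrates the idea; it assumes an nmreq-style NIOCREG argument carrying a ring_id field as described above (the field name is illustrative, not the definitive API), and uses FreeBSD's cpuset_setaffinity() to pin the calling thread.

#include <sys/param.h>
#include <sys/cpuset.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <string.h>

/*
 * Sketch: bind a file descriptor to one TX/RX ring pair and pin the
 * calling thread to one core. Structure/field names are illustrative.
 */
static int
open_ring(const char *ifname, int ring_nr, int core)
{
    struct nmreq nmr;
    cpuset_t mask;
    int fd = open("/dev/netmap", O_RDWR);

    memset(&nmr, 0, sizeof(nmr));
    strcpy(nmr.nm_name, ifname);
    nmr.ring_id = ring_nr;              /* bind this fd to one ring pair */
    ioctl(fd, NIOCREG, &nmr);

    CPU_ZERO(&mask);                    /* pin the calling thread        */
    CPU_SET(core, &mask);
    cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
                       sizeof(mask), &mask);
    return fd;                          /* caller then mmap()s and polls */
}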
4.2.3 Talking to the host stack

In addition to the netmap rings associated with the hardware ring pairs, each interface in netmap mode exports an additional ring pair, used to handle packets to/from the host stack. A file descriptor can be associated with the host stack netmap rings using a special value for the ring_id field in the NIOCREG call. In this mode, NIOCTXSYNC encapsulates buffers into mbufs and then passes them to the host stack. Packets coming from the host stack are instead copied to the buffers in the netmap rings, so that a subsequent NIOCRXSYNC will report them as received packets. Note that the network stack in the OS believes it has full control of and access to a network interface even when this operates in netmap mode. As a consequence, it will happily try to communicate over that interface with external systems. It is then the responsibility of the netmap client to make sure that packets are properly passed between the rings connected to the host stack and those connected to the NIC. Implementing this feature is straightforward, possibly even using the zero-copy technique shown in Section 4.5. This is also an ideal opportunity to implement functions such as firewalls, traffic shapers and NAT boxes, which are normally attached to packet filter hooks.
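Section 4.5 is not reproduced here, but the zero-copy idea it refers to can be sketched as follows: since buffers are identified by indexes into a shared pool, moving a packet from an RX ring to a TX ring only requires swapping buffer indexes (and copying the length) between the two slots, with no data copy. The per-slot flag used to tell the kernel that a slot's buffer changed is an assumption in this sketch, named NS_BUF_CHANGED purely for illustration.

/*
 * Sketch of zero-copy forwarding between two rings (cf. Section 4.5):
 * swap buffer indexes between RX and TX slots instead of copying data.
 * NS_BUF_CHANGED is an illustrative name for a flag marking slots whose
 * buffer index changed, so the kernel reprograms the hardware ring.
 */
static void
forward_zcopy(struct netmap_ring *rx, struct netmap_ring *tx)
{
    while (rx->avail > 0 && tx->avail > 0) {
        struct netmap_slot *rs = &rx->slot[rx->cur];
        struct netmap_slot *ts = &tx->slot[tx->cur];
        uint32_t tmp = ts->buf_index;            /* swap the buffers    */

        ts->buf_index = rs->buf_index;
        rs->buf_index = tmp;
        ts->len = rs->len;                       /* carry packet length */
        ts->flags = rs->flags = NS_BUF_CHANGED;

        rx->cur = NETMAP_RING_NEXT(rx, rx->cur);
        tx->cur = NETMAP_RING_NEXT(tx, tx->cur);
        rx->avail--;
        tx->avail--;
    }
    /* a subsequent poll() or NIOC*SYNC pushes out / refills the slots */
}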
Apart from the macros used to navigate through the data structures in the shared memory region, netmap clients do not need any library to use the system, and the code is extremely compact and readable.
struct pollfd fds;
struct nmreq nmr;
struct netmap_if *nifp;
struct netmap_ring *ring;
void *p;
char *buf;
int r, i;

fds.fd = open("/dev/netmap", O_RDWR);
strcpy(nmr.nm_name, "ix0");
ioctl(fds.fd, NIOCREG, &nmr);               /* put ix0 in netmap mode */
p = mmap(0, nmr.memsize, PROT_READ | PROT_WRITE,
         MAP_SHARED, fds.fd, 0);            /* map the shared region  */
nifp = NETMAP_IF(p, nmr.offset);
fds.events = POLLOUT;
for (;;) {
    poll(&fds, 1, -1);                      /* wait for free TX slots */
    for (r = 0; r < nmr.num_queues; r++) {
        ring = NETMAP_TXRING(nifp, r);
        while (ring->avail-- > 0) {
            i = ring->cur;
            buf = NETMAP_BUF(ring, ring->slot[i].buf_index);
            ... store the payload into buf ...
            ring->slot[i].len = ...;        /* set packet length */
            ring->cur = NETMAP_NEXT(ring, i);
        }
    }
}
int
pcap_inject(pcap_t *p, void *buf, size_t size)
{
    int si, idx;

    for (si = p->begin; si < p->end; si++) {
        struct netmap_ring *ring = NETMAP_TXRING(p->nifp, si);

        if (ring->avail == 0)               /* ring full, try the next one */
            continue;
        idx = ring->slot[ring->cur].buf_idx;
        bcopy(buf, NETMAP_BUF(ring, idx), size);
        ring->slot[ring->cur].len = size;   /* record the packet length */
        ring->cur = NETMAP_RING_NEXT(ring, ring->cur);
        ring->avail--;
        return size;
    }
    return -1;                              /* no room on any ring */
}

int
pcap_dispatch(pcap_t *p, int cnt, pcap_handler callback, u_char *user)
{
    int si, ret = 0;

    for (si = p->begin; si < p->end; si++) {
        struct netmap_ring *ring = NETMAP_RXRING(p->nifp, si);

        while ((cnt == -1 || ret != cnt) && ring->avail > 0) {
            int i = ring->cur;
            int idx = ring->slot[i].buf_idx;

            p->hdr.len = p->hdr.caplen = ring->slot[i].len;
            callback(user, &p->hdr, NETMAP_BUF(ring, idx));
            ring->cur = NETMAP_RING_NEXT(ring, i);
            ring->avail--;
            ret++;
        }
    }
    return ret;
}
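For completeness, this is how an unmodified pcap-style client sits on top of the emulation. The example assumes the library also provides the usual pcap_open_live() entry point, which is not shown in the listing above.

/* Hypothetical usage sketch: an unmodified pcap-style receive loop,
 * assuming the emulation library also implements pcap_open_live(). */
#include <pcap.h>

static void
count_pkt(u_char *arg, const struct pcap_pkthdr *h, const u_char *data)
{
    (*(unsigned long *)arg)++;      /* just count incoming packets */
}

int
main(void)
{
    char errbuf[PCAP_ERRBUF_SIZE];
    unsigned long count = 0;
    pcap_t *p = pcap_open_live("ix0", 2048, 1, 100, errbuf);

    for (;;)
        pcap_dispatch(p, -1, count_pkt, (u_char *)&count);
}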
4.7 Implementation
In the design and development of netmap, a fair amount of work has been put into making the system maintainable and performant. The current version, included in FreeBSD, consists of about 2000 lines of code for system call (ioctl, select/poll) and driver support. There is no need for a userspace library: a small C header (200 lines) defines all the structures, prototypes and macros used by netmap clients. To keep device driver modifications small (a must, if we want the API to be implemented on new hardware), most of the functionality is implemented in common code, and each driver only needs to implement two functions for the core of the NIOC*SYNC routines, one function to reinitialize the rings in netmap mode, and one function to export device driver locks to the common code. This reduces individual driver changes, mostly mechanical, to about 500 lines each (a typical device driver has 4k..10k lines of code). In the netmap architecture, device drivers do most of their work in the context of the userspace process. This simplifies resource management (e.g. binding processes to specific cores), and makes the system more controllable and robust, as we do not need to worry about executing too much code in non-interruptible contexts.
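The per-driver glue just described can be pictured as a small table of hooks. The names and signatures below are purely illustrative, a sketch of the shape of the interface rather than the actual netmap internals.

/*
 * Illustrative sketch of the per-driver hooks described in the text.
 * Names and signatures are hypothetical, not the real netmap internals.
 */
struct netmap_driver_hooks {
    /* cores of NIOCTXSYNC / NIOCRXSYNC for one hardware ring */
    int  (*txsync)(void *adapter, unsigned int ring_nr);
    int  (*rxsync)(void *adapter, unsigned int ring_nr);
    /* rebuild the hardware rings when entering/leaving netmap mode */
    int  (*ring_reinit)(void *adapter);
    /* expose the driver's locks to the device-independent code */
    void (*lock)(void *adapter, int what, unsigned int ring_nr);
};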
We generally modify drivers so that the interrupt service routine does no work except for waking up any sleeping process. This means that interrupt mitigation delays are directly passed to user processes. Synchronization between the NIC and netmap rings is done in the upper half of the system calls, and in a way that avoids expensive operations. As an example, we don't reclaim transmitted buffers or look for more incoming packets if a system call is invoked with avail > 0. This provides huge speedups for applications that unnecessarily invoke system calls on every packet. Two more optimizations (pushing out any packets queued for transmission even if POLLOUT is not specified, and updating a timestamp within the netmap ring before poll() returns) reduce from 3 to 1 the number of system calls in each iteration of the typical event loop, once again a significant performance enhancement for certain applications. To date we have not tried to add special performance optimizations to the code, such as aggressive use of prefetch instructions, or data placements to improve cache behaviour. netmap support is currently available for the Intel 10 Gbit/s adapters (ixgbe driver), and for various 1 Gbit/s adapters (Intel, RealTek).
ii) Per-packet costs have multiple origins. At the very least, the CPU must update a slot in the NIC ring for each packet. Additionally, depending on the software architecture, each packet might require additional work, such as memory allocations, system calls, programming the NIC's registers, updating statistics and the like. In some cases, part of the operations in the second set can be removed or amortized over multiple packets. Given that in most cases (and certainly this is true for netmap) per-packet costs are the dominating component, the most challenging situation in terms of system load is when the link is traversed by the smallest possible packets. This explains why we run most of our tests with 64-byte packets (60+4 CRC)². Of course, in order to exercise the system and measure its performance we need to run some test code, but we want it to be as simple as possible in order to reduce the interference on the measurement. Our initial tests then use two very simple programs that make application costs almost negligible: a packet generator which streams pre-generated packets, similar to the one presented in Section 4.4, and a packet receiver which just counts incoming packets.
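For reference, the 14.88 Mpps figure used throughout the paper is simply the 10 Gbit/s line rate for minimum-size Ethernet frames, which occupy 84 bytes on the wire once preamble, start-of-frame delimiter and inter-frame gap are accounted for:

\[ (7_{\mathrm{preamble}} + 1_{\mathrm{SFD}} + 64_{\mathrm{frame}} + 12_{\mathrm{IFG}}) \times 8 = 672~\mathrm{bits}, \qquad \frac{10 \times 10^{9}~\mathrm{bit/s}}{672~\mathrm{bit/pkt}} \approx 14.88~\mathrm{Mpps} \]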
Figure 4: Transmit performance with 64-byte packets, variable clock rates and number of cores. netmap reaches line rate near 900 MHz. The two lines at the bottom represent pktgen (a specialised, in-kernel generator available on Linux, peaking at about 4 Mpps at 2.93 GHz) and netsend (a FreeBSD userspace generator, peaking at 790 Kpps).

Section 4.4. The packet size can be configured at runtime, as well as the number of queues and threads/cores used to send traffic. Packets are prepared in advance so that we can run the tests with close to zero per-byte costs. The test program loops around a poll(), sending at most B packets (batch size) per ring at each round. On the receive side we use a similar program, except that this time we poll for read events and only count packets.
the NIC ring) is amortized among all descriptors that fit into a cache line, and other costs (such as reading/writing the NIC's registers) are amortized over the entire batch. The measurements exhibit a slightly superlinear behaviour (e.g. when doubling the clock speed between 150 and 300 MHz), though in absolute terms the deviation is very small (3-4 clock cycles per packet), and could be explained by different ratios between CPU, memory and I/O bus speeds at different clock rates. Not shown by the graphs, but measured in the experiments, is the fact that once the system reaches line rate, increasing the clock speed reduces the CPU usage. The generator in fact sleeps until an interrupt from the NIC reports the availability of new buffers, and wakes up the process. Experiments using 4 cores show a speedup of about 2.5 times over the 1-core case. The speedup is modest, likely because the clock reduction also affects shared components (caches, bus) which are heavily exercised by these tests. It is useful to compare the performance of our netmap-based generator with similar applications using conventional APIs. Figure 4 also reports the maximum throughput of two packet generators representative of the performance achievable using standard APIs. The line at the bottom represents netsend, a FreeBSD userspace application running on top of a raw socket. netsend peaks at 790 Kpps at the highest clock speed (2.93 GHz plus turbo boost), and does 390 Kpps at 1.2 GHz. The 40x difference in speed can be explained by the many additional operations that the raw socket API requires: data copies, one system call per packet, in-kernel buffer allocations, and no chance to amortize the cost of accessing the NIC's registers. The other line in the graph is pktgen, an in-kernel packet generator available in Linux, which reaches almost 4 Mpps at maximum clock speed, and 2 Mpps at 1.2 GHz (the minimum speed we could set in Linux). In this case the system call and memory copy costs are removed, but there are still device driver overheads (skbuf allocations and deallocations, NIC programming) that make the system at least 7 times slower than netmap. The speed vs. clock rate experiments on the receive path give results similar to the transmit case. The system can do line rate even below 1 GHz and with just one receive ring, at least for packet sizes that are multiples of 64 bytes.
Figure 5: Actual transmit and receive speed with variable packet sizes (excluding Ethernet CRC). The top curve is the transmit rate, the bottom curve is the receive rate. See Section 5.4 for explanations.

on the performance of the system, both on the transmit and the receive side. Transmit speeds with variable packet sizes exhibit the expected 1/size behaviour, as shown by the upper curve in Figure 5. The receive side, instead, shows some surprises, as indicated by the bottom curve in Figure 5. The maximum rate, irrespective of CPU speed and number of rings, is achieved only for certain packet sizes such as 64+4, 128+4 and above 144+4 bytes (in all cases, the extra 4 bytes corresponding to the Ethernet CRC are not written to memory). At other packet sizes, performance drops (e.g. we see 11.6 Mpps at 60 bytes, and a plateau around 8.5 Mpps between 65 and 127 bytes). Investigation suggests that the NIC is issuing additional read cycles to preserve the content of the remaining parts of 64-byte blocks, preventing it from reaching line rate. We have encountered several similar hardware bugs while testing netmap on a variety of network adapters.
Figure 6: Transmit performance with 1 core, 2.93 GHz, 64-byte packets, and varying rings and burst size.

As we see from the data, performance grows almost linearly with the batch size, up to the maximum value, which is approximately 12.5 Mpps with one queue, and 14.88 Mpps with 2 or more queues. The low speed achieved with a batch size of 1 shows that the cost of the poll() system call is much larger than the per-packet cost measured in Section 5.3, so it is important to amortize it over large batches. To our surprise, we found that using a single queue (an experiment which was not done previously) we could not reach line rate, but saturated near 12.5 Mpps. Further tests with lower clock speeds reached the same maximum rate, suggesting that the bottleneck is not the CPU but must be searched for in the NIC and its interface to the PCI-Express bus. This throughput limitation is not present with 2 and 4 queues, which allow the generator to reach the maximum packet rate achievable on the link.
Figure 7 (table): the Configuration column lists netmap-fwd (1.6 GHz), netmap-fwd + pcap, click-fwd + netmap, click-etherswitch + netmap, click-fwd + native pcap, openvswitch + netmap, openvswitch + native pcap, and bsd-bridge; the corresponding forwarding rates are discussed in the text.
the libpcap emulation library adds a significant overhead to the previous case (7.5 Mpps at full clock vs. 14.2 Mpps at 1.6 GHz means a difference of about 80-100 ns per packet). We have not yet investigated whether/how this can be improved (e.g. by reducing stalls, optimizing the scan of the rings, etc.); replacing the native API with the netmap-based libpcap emulation gives a speedup between 4 and 8 times for OpenvSwitch and Click, despite the fact that pcap_inject() does use data copies. This is also an important result because it means that real-life applications can actually benefit from our API.
Figure 7: Forwarding performance of our test hardware with various software configurations.

We have then explored how a few packet forwarding applications behave when using the new API, either directly or through the libpcap compatibility library described in Section 4.6. The test cases are the following:
- netmap-fwd, a simple application that forwards packets between interfaces using the zero-copy technique shown in Section 4.5. In this case we reach line rate (14.2 Mpps, due to the limitations in the NIC) at just 1.6 GHz;
- netmap-fwd + pcap, as above but using the libpcap emulation (Section 4.6) instead of the zero-copy code;
- click-fwd, a simple Click [8] configuration, shown below, that passes packets between interfaces:
      FromDevice(ix0) -> Queue -> ToDevice(ix1)
      FromDevice(ix1) -> Queue -> ToDevice(ix0)
  The experiment has been run using userspace Click with the system's libpcap, and on top of netmap with the libpcap emulation library;
- click-etherswitch, as above but replacing the two queues with an EtherSwitch element;
- openvswitch, the OpenvSwitch software with userspace forwarding, both with the system's libpcap and on top of netmap;
- bsd-bridge, in-kernel FreeBSD bridging, using the mbuf-based device driver.
5.7 Discussion
In the presence of huge performance improvements such as those presented in Figure 4 and Figure 7, which show that netmap is 4 to 40 times faster than similar applications using the standard APIs, one might wonder about two things: i) are we doing a fair comparison, and ii) what is the contribution of the various mechanisms to the performance improvement. The answer to the first question is that the comparison is indeed fair. All the various generators in Figure 4 do exactly the same thing and each one tries to do it in the most efficient way, constrained only by the underlying APIs they use. The answer is even more obvious for Figure 7, where in many cases we just use the same unmodified binary on top of two different libpcap implementations. The results measured in different configurations also let us answer the second question, i.e. evaluate the impact of different optimizations on netmap's performance. Data copies, as shown in Section 5.6, are moderately expensive, but they do not prevent significant speedups (such as the 7.5 Mpps achieved forwarding packets on top of libpcap+netmap). This is an interesting result, in light of the fact that we may want to reintroduce data copies to reduce the interference among netmap clients (see Section 4.3). Per-packet system calls certainly play a major role, as witnessed by the difference between netsend and pktgen (albeit on different platforms), or by the low performance of the packet generator when using small batch sizes. Finally, interesting information on the cost of the skbuf/mbuf-based API comes from the comparison of pktgen (taking about 250 ns/pkt) and the netmap-based packet generator, which only takes 20-30 ns per packet (mostly spent programming the NIC). These two applications essentially differ only in the way packet buffers are managed, because the amortized cost of system calls and memory copies is negligible in both cases.
Figure 7 reports the measured performance. All experiments have been run on a single core with two 10 Gbit/s interfaces, and at the maximum clock rate except for the first case, where we saturated the link at just 1.6 GHz⁴. From the experiment we can draw a number of interesting observations: native netmap forwarding with no data-touching operations easily reaches line rate. This is interesting because it means that full rate bidirectional operation is possible;
⁴ A previous revision of this paper reported a lower value, but we recently removed a useless and expensive function call in the critical path of packet forwarding.
References

[1] The DAG project. Tech. rep., University of Waikato, 2001.
[2] Deri, L. Improving passive packet capture: beyond device polling. In SANE 2004, Amsterdam.
[3] Deri, L. ncap: Wire-speed packet capture and transmission. In Workshop on End-to-End Monitoring Techniques and Services (2005), IEEE, pp. 47-55.
[4] Dobrescu, M., Egi, N., Argyraki, K., Chun, B., Fall, K., Iannaccone, G., Knies, A., Manesh, M., and Ratnasamy, S. RouteBricks: Exploiting parallelism to scale software routers. In ACM SOSP (2009), pp. 15-28.
[5] Han, S., Jang, K., Park, K., and Moon, S. PacketShader: a GPU-accelerated software router. ACM SIGCOMM Computer Communication Review 40, 4 (2010), 195-206.
[6] Handley, M., Hodson, O., and Kohler, E. XORP: An open platform for network research. ACM SIGCOMM Computer Communication Review 33, 1 (2003), 53-57.
[7] Heyde, A., and Stewart, L. Using the Endace DAG 3.7GF card with FreeBSD 7.0.
[8] Kohler, E., Morris, R., Chen, B., Jannotti, J., and Kaashoek, M. The Click modular router. ACM Transactions on Computer Systems (TOCS) 18, 3 (2000), 263-297.
[9] Krasnyansky, M. UIO-IXGBE. Qualcomm, https://opensource.qualcomm.com/wiki/UIO-IXGBE.
[10] Lockwood, J. W., McKeown, N., Watson, G., et al. NetFPGA: an open platform for gigabit-rate network switching and routing. IEEE Conf. on Microelectronics Systems Education (2007).
[11] Marian, T. Operating systems abstractions for software packet processing in datacenters. PhD Dissertation, Cornell University (2010).
[12] McCanne, S., and Jacobson, V. The BSD packet filter: A new architecture for user-level packet capture. In USENIX Winter Conference (1993), USENIX Association.
[13] McKeown, N., Anderson, T., Balakrishnan, H., Parulkar, G., Peterson, L., Rexford, J., Shenker, S., and Turner, J. OpenFlow: enabling innovation in campus networks. ACM SIGCOMM Computer Communication Review 38 (March 2008), 69-74.
[14] Mogul, J., and Ramakrishnan, K. Eliminating receive livelock in an interrupt-driven kernel. ACM Transactions on Computer Systems (TOCS) 15, 3 (1997), 217-252.
[15] Rizzo, L. Netmap home page. Università di Pisa, http://info.iet.unipi.it/~luigi/netmap/.
[16] Rizzo, L. Polling versus interrupts in network device drivers. BSDConEurope 2001 (2001).
[17] Rizzo, L., Carbone, M., and Catalli, G. Transparent acceleration of software packet forwarding using netmap. INFOCOM'12, Orlando, FL, March 2012 (to appear), http://info.iet.unipi.it/~luigi/netmap/.
[18] Stevens, W., and Wright, G. TCP/IP Illustrated (vol. 2): The Implementation. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1995.