0% found this document useful (0 votes)
12 views12 pages

PLUG Flexible Lookup Modules For Rapid Deployment of

Download as pdf or txt
Download as pdf or txt
Download as pdf or txt
You are on page 1/ 12

PLUG: Flexible Lookup Modules for Rapid Deployment of

New Protocols in High-speed Routers

Lorenzo De Carli† Yi Pan† Amit Kumar† Cristian Estan‡1 Karthikeyan Sankaralingam†


† ‡
University of Wisconsin-Madison NetLogic Microsystems
{lorenzo,yipan,akumar,karu}@cs.wisc.edu [email protected]

ABSTRACT bridging, virtual LANs, tunnels, classless addressing, access con-


New protocols for the data link and network layer are being pro- trol lists, network address translation, to name a few. Yet, the exist-
posed to address limitations of current protocols in terms of scala- ing infrastructure has many acute shortcomings and new protocols
bility, security, and manageability. High-speed routers and switches are sorely needed. Data link and network layer protocols have been
that implement these protocols traditionally perform packet pro- proposed recently to improve scalability [23, 29], security [51, 53,
cessing using ASICs which offer high speed, low chip area, and 15, 14], reduce equipment cost [1, 25], to ease management [23,
low power. But with inflexible custom hardware, the deployment 28, 14], or to offer users more control over their traffic [44, 52].
of new protocols could happen only through equipment upgrades. The two main factors that slow down the deployment of new pro-
While newer routers use more flexible network processors for data tocols are the inevitable tussle of the various stakeholders [19] and
plane processing, due to power and area constraints lookups in for- the need for physical equipment upgrades.
warding tables are done with custom lookup modules. Thus most The use of new equipment is a necessity for changes at the physi-
of the proposed protocols can only be deployed with equipment cal layer such as switching to new media or upgrades to higher link
upgrades. speeds. But data link layer and network layer changes do not nec-
To speed up the deployment of new protocols, we propose a essarily require changes to the hardware and can be accomplished
flexible lookup module, PLUG (Pipelined Lookup Grid). We can through software alone if the hardware is sufficiently flexible. Our
achieve generality without loosing efficiency because various cus- goal is to enable such deployment of innovative new data plane
tom lookup modules have the same fundamental features we retain: protocols without having to change the hardware.
area dominated by memories, simple processing, and strict access Sustained efforts by academia and industry have produced ma-
patterns defined by the data structure. We implemented IPv4, Eth- ture systems that take us closer to this goal. The NetFPGA project’s
ernet, Ethane, and SEATTLE in our dataflow-based programming [39] FPGA-based architecture allows experimental deployment of
model for the PLUG and mapped them to the PLUG hardware changes to the data plane into operational backbones [10]. While
which consists of a grid of tiles. Throughput, area, power, and la- FPGAs are an ideal platform for building high-speed prototypes
tency of PLUGs are close to those of specialized lookup modules. for new protocols, high power and low area efficiency make them
less appealing for commercial routers. Many equipment manufac-
Categories and Subject Descriptors: B.4.1 [Data Communica- turers have taken the approach of developing network processors
tion Devices]: Processors; C.2.2 [Network Protocols]: Protocol that are more efficient than FPGAs. For example Cisco’s Silicon
Architecture (OSI model) Packet Processor [22] and QuantumFlow [18] network processors
General Terms: Algorithms, Design, Performance can handle tens of gigabits of traffic per second with packet pro-
Keywords: lookup, flexibility, forwarding, dataflow, tiled architec- cessing done by fully programmable 32-bit RISC cores. But for
tures, high-speed routers efficiency they implement forwarding table lookup with separate
hardware modules customized to the protocol. Many of the pro-
1. INTRODUCTION posed new protocols use different forwarding table structures and
on these platforms, they cannot be deployed with software updates
The current Internet relies extensively on two protocols designed (without degrading throughput). Thus we are left with the original
in the mid-’70s: Ethernet and IPv4. With time, due to needs not problem of hardware upgrades for deploying new protocols.
anticipated by the designs of these protocols, a number of new pro- In this paper, we propose replacing custom lookup with flexible
tocols, techniques and protocol extensions have been deployed in lookup modules that can accommodate new forwarding structures
routers and switches changing how they process packets: Ethernet thus removing an important impediment to the speedy deployment
1
Work done while at University of Wisconsin-Madison. of new protocols. The current lookup modules for different pro-
tocols have many fundamental similarities: they consist mostly of
memories accessed according to strict access patterns defined by
the data structure, they perform simple processing during lookup,
Permission to make digital or hard copies of all or part of this work for and they function like deep pipelines with fixed latency and pre-
personal or classroom use is granted without fee provided that copies are dictable throughput. We present a design for a general lookup mod-
not made or distributed for profit or commercial advantage and that copies ule, PLUG (Pipelined Lookup Grid) which builds on these proper-
bear this notice and the full citation on the first page. To copy otherwise, to ties to achieve performance and efficiency close to those of exist-
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
ing custom lookup modules. Instead of using a completely flexible
SIGCOMM’09, August 17–21, 2009, Barcelona, Spain. hardware substrate like an FPGA, PLUGs contain lightweight pro-
Copyright 2009 ACM 978-1-60558-594-9/09/08 ...$10.00.

207
grammable microcores that perform the simple processing needed two match, it sends the port number onto the result bus. To increase
during lookup. Data structured are spatially laid out and PLUGs throughput these operations can be pipelined.
move the computation required by lookups to the relevant portions IP Lookup: IP lookup requires the longest matching prefix op-
of the data structure. The contributions of this paper are: eration which can be performed using a multibit trie as shown in
• Programmable lookup modules that enable changing data plane Figure 1e. The trie is traversed from root to leaves and at each level
protocols in high-speed routers and switches without hardware two bits of the IP address are used to index into the current node2 .
upgrades (Section 2); At the location identified by these bits we may find the final result
of the lookup (a port number) or a pointer to a node at the next
• A dataflow-based programming model for representing forward- level. Figure 1f shows how a custom lookup module can be orga-
ing tables (Section 3); nized: three memories, each with the nodes from a given level of
• An implementation of the forwarding tables for many existing the trie. Local processing elements read the correct node from the
and new protocols by using this model and a discussion of other memory, perform the required processing and generate the data to
possible uses (Section 4). be sent on the next link. The input link carries the IP address, the
In Section 5, we outline a scalable tiled architecture that imple- next one carries the remaining bits of the IP address together with
ments the programming model. Section 6 presents a static schedul- either the final result if a match was found by the first processing
ing approach that avoids internal resource conflicts, simplifies hard- element or the address of the node to be accessed at the next level.
ware, and guarantees predictable throughput. Section 7 has a pre- If the result is found earlier than the third level, later processing
liminary evaluation of the efficiency of the proposed architecture. elements just pass it through without performing memory reads.
Similarities: The two custom lookup modules are fundamentally
similar. Each have large memories connected to local processing
2. A CASE FOR FLEXIBILITY IN LOOKUP elements. The processing elements perform simple operations, and
MODULES each lookup follows an orderly succession of steps until it produces
Since network processors are programmable, it is plausible to a result after a fixed number of cycles. But their overall structure
implement forwarding lookup for new protocols and even existing differs because they implement different lookup algorithms. The
ones like Ethernet and IP directly on the network processor shar- custom lookup modules lack generality in three key respects: the
ing a single in-memory copy of the forwarding table between all number and size of memories, the specific processing performed
cores. While appealingly simple, such a solution has performance and the communication patterns supported.
and cost disadvantages: it requires high on-chip interconnect band- 2.2 PLUG: A Universal Lookup Module
width resulting in high area and power overheads [13].
Hence, high-speed network processors do not adopt this approach In developing the PLUG we first developed a programming model
and instead use separate lookup modules for forwarding tables. The that enables the direct expression of the inherent structure in the
lookup module receives the address from the network processor lookup operation. Lookup objects describe this logical structure of
core and returns the lookup result a few cycles later. By using a the data in the forwarding table and the associated algorithms for
lookup module the on-chip network connecting the cores to other lookups and updates. Conceptually the lookup objects are specified
resources has less traffic and thus a network with a smaller bisection with data flow graphs (Figures 1c and 1g). The nodes of these data
bandwidth suffices. The local caches of the cores are not “polluted” flow graphs are logical pages which represent portions of the data
with portions of the forwarding tables, so the cores can use smaller structure and the local processing steps for the data during lookups
caches and less area. Also the overall latency of the lookup is re- and updates. The directed edges of the data flow graphs denote the
duced as a single roundtrip over the on-chip network is sufficient. communication patterns. By extracting the structure, this program-
The biggest drawback of this approach is the lack of flexibility. A ming model simplifies the architecture and programming.
new lookup module must be implemented for each protocol. We ad- The PLUG architecture implements the programming model us-
dress this drawback by proposing PLUG, a programmable lookup ing a modular design outlined in Figures 1d and 1h. More details
module. of the architecture are in Section 5. It addresses the memory gener-
ality problem by having a large number of small memories that can
2.1 Two Examples be grouped together to implement memory regions of the desired
sizes. Processing generality is achieved by using use lightweight
To illustrate the state-of-art and show the potential for a flexible
16-bit programmable processors. To accommodate any communi-
lookup module, we sketch how the forwarding tables for Ethernet
cation pattern, PLUGs use a multi-hop internal network that con-
and IP are implemented today (Figure 1). Based on the similari-
sists of communication modules at each tile and multiple direct
ties and the differences we derive the mechanisms required for a
links between all pairs of neighboring tiles. Figures 1d and 1h
universal lookup module.
show how the two data flow graphs can be mapped to the same
Ethernet Forwarding: For Ethernet we need a lookup in a hash
2-tile PLUG. Multiple logical pages can be mapped to the same tile
table using 48-bit Ethernet addresses as keys and port numbers as
by allocating separate memories to each of them. If a logical page
values associated with the keys. Figure 1a shows a simple hash ta-
requires more memory than available on a tile, it can be mapped
ble with 4 entries in each bucket. A key being looked up can be in
to multiple tiles. Through the programs running on the lightweight
any of the entries of the bucket it hashes to, so it must be compared
processors, we not only control the processing performed during
against the keys stored in 4 entries. To reduce latency these compar-
lookups, but also the communication patterns and the number and
isons are done in parallel by the custom lookup module from Figure
location of memories implementing each logical page.
1b. The entries are divided among 4 memories, each holding one
One of the distinguishing characteristics of the PLUG architec-
entry of each bucket. The memories are connected to local pro-
ture is that it is globally statically scheduled to avoid resource con-
cessing elements. During a lookup, the key and the number of the
flicts. No two processing elements contend for the same memory
bucket it hashes to are broadcast to the four processing elements.
Each one of them reads the entry corresponding to the bucket, com- 2
Since the trie in Figure 1e has three levels each corresponding to
pares the key stored there with the key being looked up and if the the two bits of the IP address, it holds prefixes of length up to 6.

208
Figure 1: Data structures used by the forwarding tables for Ethernet and IP, separate custom lookup modules for them, data flow
graphs describing both algorithms and the mapping of both of them to the same 2-tile PLUG.

or for the same communication ports. Three important benefits are: to the various methods of the lookup object. While lookup methods
a) simplification of the hardware, b) fixed latency guarantees for all are the most frequent, additional methods are required to build and
lookup operations, and c) processing a new lookup or update every maintain the data structure.
cycle. The programming model rules explained in the next section Communication: Messages are used to communicate between
allow such a statically scheduled architecture. code-blocks and the data flow graph edges represent these mes-
sages. The execution of a code-block starts when the page receives
a message which also indicates the type of code-block to execute.
3. PROGRAMMING THE PLUG The entire context of the code-block consists of the data in the mes-
A PLUG lookup object implements the main data structures of sage and that in the data-block read from memory. Each data-block
the forwarding table and the methods for accessing them. In this can send one or more messages to other pages and these must carry
section we describe the primitives of the programming model: data- the entire context needed for the code-blocks they trigger. The exe-
blocks, code-blocks, messages, and logical pages. We conclude cution of the lookup object is started by sending messages from the
with examples of hash table implementations using the model. input interface of the PLUG. If the method produces results they
Data: Data-blocks are the primitive building blocks for logical emerge as messages at the output interface.
pages and are small chunks of data that can be read with a single The communication patterns between the logical pages of the
memory access. A logical page is a collection of data-blocks with lookup object are described by the data flow graph which is a di-
similar roles. No method accesses more than one data-block in each rected acyclic graph. The nodes in this graph are logical pages with
page which rules out conflicts for memory accesses. two special nodes for the input and the output interface. Each edge
Processing: The methods of the lookup object are broken into in the graph represents one message. In practice, a small num-
code-blocks. Each code-block is associated with one logical page ber of patterns can be used to synthesize the complex data flow
and it can perform one memory access to read or write a data-block graphs required for real protocols. Figure 2 shows these commu-
from that page. Each page has multiple code-blocks corresponding nication patterns, each of which requires some support in our soft-

209
Figure 2: Communication patterns in data flow graphs.

ware toolchain to implement. Simple features of the architecture


are used to implement these communication patterns: multiple par- Figure 3: Data flow graphs corresponding to changes to the
allel network links (for the divide and combine patterns), multi-hop basic hash table design.
multicast messages (for the spread pattern), and static arbitration
for deciding which message gets discarded on a conflict (for the
discard pattern). utilization without increasing the number of pages. We use two ta-
Implementing the model: To simplify the hardware, the PLUG bles with different hash functions and on insertion hash to both and
programming model imposes some limitations: single memory ac- insert in the table where the bucket the key hashes to is emptiest.
cess per page, strict adherence to the acyclic data flow graph, and This reduces the likelihood of buckets having significantly more
limits on the sizes of data-blocks, code-blocks, and messages. entries occupied than average. The example from Figure 3b imple-
Limitations: Some complex but infrequent operations (some up- ments d-left hashing. It uses two separate messages for the groups
dates, expiration of old entries, memory compaction, etc.) cannot of pages implementing the two tables because different hash func-
be implemented as a single method of the lookup object. We sepa- tions are used and the two buckets have different positions.
rate such operations into an arbitrarily complex routine that runs on Splitting: In some cases (e.g. Ethane [14]) the size of the key is
network processor cores and invokes one or more methods of the larger than the maximum data-block size we can support. We can
lookup object. These methods need to follow the same data-flow accommodate such hash tables by splitting the entries among mul-
graph as the lookup methods. Thus an update to a page that is not tiple pages. In the example from Figure 3c we split the keys into a
directly connected to the input interface by an edge in the graph will high key and low key stored on different pages. We send the two
have to go through one or more intermediate pages where simple portions of the key to the two separate groups of pages and each
code-blocks will just relay the update message. This is a gener- page performs a check for its half of the key only. The last page
alization of the idea of “write bubbles” [7] used for updating IP keeps the values for all entries and it reads the value associated
lookup pipelines without disrupting lookups. with the key only if both halves of the key successfully match indi-
cated by it receiving two messages. There is one subtle possibility
3.1 Hash Tables for false positives: the two halves of the key may have matched
The lookup objects we implemented use many hash tables with different entries of the bucket, and in this case the lookup must fail.
wide ranges of values for parameters such as the number of entries, Merging: In other cases (e.g. Ethernet) the size of a hash table
the size of the keys and the size of the values stored in entries. To entry is smaller than half the size we can afford for a data-block.
simplify the presentation of the individual lookup objects we give Combining two entries in a data-block (Figure 3d) reduces the num-
below an overview of the four changes to the basic hash table we ber of pages (and hence power), but increases latency slightly be-
used. Figure 3 presents the data flow graphs corresponding to the cause the code blocks need to perform two key comparisons.
basic hash table and the four modifications. Table 1 summarizes Fingerprints: A further opportunity for power saving is the use
their advantages, disadvantages and applicability. of fingerprints [11] together with checks for exact matches for the
Multiple hash functions: In the basic hash table (Figure 3a), due key as shown in Figure 3e. A first page contains small fingerprints
to the randomness of the hash function, the buckets do not have (say 8 bits) which are computed together with the hash function
the same number of elements in them. This presents a problem as and must be different for each of the keys in a bucket. All finger-
when the number of used entries approaches the size of the hash prints for a bucket are read and compared against the fingerprint of
table, some of the buckets will be full much before others. If a new the key being looked up. In the next page we can read directly the
key hashes to such a bucket it cannot be inserted, even though there only entry for which the fingerprint matches (if any) to perform the
are many empty entries in the hash table. To make such a situation check whether the keys match. This technique reduces power re-
unlikely, the hash table has to be run at a utilization of less than quirements significantly because we do not compare against more
100%. Using fewer, larger buckets allows better memory utiliza- than one key in a bucket, but it increases latency because of the
tion, but it increases power consumption as we need more logical processing in the fingerprint page. The use of fingerprints intro-
pages. D-left hashing [11, 12, 49] allows us to increase the memory duces new conflicts as entries with the same fingerprint cannot be

210
Modification Applicability Advantage Main disadvantages
Larger buckets always better memory utilization higher power
D-left hashing (b) always better memory utilization multiple input messages
Split entries (c) large entries can fit keys larger than data-block higher power, latency
Combined entries (d) small entries lower power higher latency
Fingerprints (e) with d-left hashing lower power higher latency

Table 1: The advantages and disadvantages of various modifications to the basic hash table.

4) uses two separate hash tables for the two types of entries. If
the lookups in both the hash tables succeed, only the result from
the misbehaving host table reaches the output interface. We im-
plemented the specific types of hash tables proposed in the original
Ethane paper: both use d-left hashing with two tables and one entry
per bucket. The key in the flow table is more than 26 bytes long, so
we used the split entries (Figure 3c) since the specific instantiation
of the architecture we consider limits the size of data blocks to 16
bytes. The lookup object has a total of 8 pages. By using larger
buckets and fingerprints (Figure 3e) the memory utilization could
be further improved without using more pages.
Figure 4: Data flow graph for Ethane lookup object. The Ethane flow table entries contain byte and packet counters
which we store in arrays outside the PLUG. The reason is that the
basic PLUG programming model mandates that each code block
stored in the same bucket even if the keys differ. When fingerprints perform a single memory access, but incrementing counters re-
are used in combination with d-left hashing this is not that big a quires two: a read and a write. The lookup method identifies the
problem as the conflicting entry can be moved to the other table. position of the entry and no further searching is required to locate
the counters to increment. In Section 4.6 we show how the con-
4. LOOKUP OBJECTS FOR PROTOCOLS straint of a single memory access per code block can be relaxed
if lookups arrive at a rate lower than one per cycle. With a rate of
4.1 Ethernet one lookup every 6 cycles we could accommodate two counters per
entry which translates to 167 million lookups per second. This ex-
We implemented Ethernet forwarding with learning, but without
ceeds the rate required for forwarding 64-byte packets at line rate
VLAN support. The lookup object is a hash table with 4 entries in
for eight 10Gbps links.
each bucket, d-left hashing (2 tables) and combined entries (2 per
For multicast, the flow table entries store the set of ports through
data-block). It uses a total of 4 logical pages. Each entry has 64
which the packets of a given flow should be forwarded (note that
bits: a 48-bit Ethernet address, a valid bit, a 12-bit port number and
the flow identifier also includes the input port). Our implementa-
a 3-bit timestamp counter. If the key passed to the lookup method
tion can be easily extended to support multicast for switches with
matches a valid entry, the port number and timestamp are returned.
a small number of ports (the Ethane paper reports implementing
For each unicast Ethernet packet two instances of the lookup meth-
switches with 4 ports) by storing in the flow table a bitmap with
ods are invoked for destination and the source. If the lookup on the
the ports to which the packets should be sent. But since the size of
source finds no match, we insert the address in the hash table and
the bitmap is the number of ports, for switches with more than a
thus learn the port through which it is reachable. We keep outside
few dozen ports more radical changes to the lookup object would
the PLUG a secondary data structure summarizing which entries
be required to avoid a big increase in memory usage.
are used (valid) and which are not and use it to determine in which
position to insert a new entry in.
We manage the expiration of the entries using a coarse 3-bit “cur- 4.3 Seattle
rent time” variable whose value is stored in every newly created SEATTLE [29] switches participate in a routing protocol to com-
entry. When the lookup on the source address of a packet returns pute shortest paths between each pair of switches. The data plane
a timestamp that is older than the current time, we invoke an up- forwarding table of each switch has a table with the next hop to
date method for updating only the timestamp in that entry. A back- use for reaching any of the other switches. Endhost interfaces are
ground task periodically reads all valid entries and invalidates those reachable through a single switch that connects them to the SEAT-
whose timestamp is too old. Our coarse 3-bit timestamps general- TLE network. Individual switches do not store forwarding entries
ize the “activity bit” [14] used for garbage-collecting inactive en- for each remote host, they just keep a cache of popular destina-
tries and allow more accurate control of the actual amount of time tions storing the switch they are reachable through (not the port
an entry stays inactive before removal. they are reachable through). This way the cache need not be in-
validated when topology changes affect the path to switches that
4.2 Ethane connect cached destinations. To locate unknown hosts, switches
In Ethane [14] a controller host explicitly builds the forwarding use a one-hop DHT in which each switch is responsible for storing
tables of all switches. The forwarding tables consist of two types indefinitely entries for all hosts hashing to a certain portion of the
of entries: flow entries which indicate how the switch should for- DHT ring. If a switch receives a packet whose destination is not in
ward each allowed flow and per host entries indicating misbehaving its cache, it needs to forward it to the resolver switch responsible
hosts whose packets should be dropped. The lookup object (Figure for the portion of the hash space it maps to.

211
The lookup object organizes the IPv4 forwarding table as a three-
level multi-bit trie with the root node covering the first 16 bits of
the IP address and the nodes at the other two levels covering 8 bits.
Uncompressed trie nodes consist of two arrays of size 65536 for
the root and 256 for the other nodes. The first array specifies the
port associated with the longest matching prefix (from among those
covered by the node) and the second holds pointers to children. For
nodes without children, we omit the second array. In the result ar-
ray there are often port numbers that repeat (covered by the same
Figure 5: Data flow graph for SEATTLE lookup object. prefix from the forwarding table), and in the child array there are
many entries with a NULL pointer. The compression algorithm
Our lookup object for SEATTLE (Figure 5) implements three saves space by removing repetitions from the first array and re-
components: a hash table mapping host addresses to their locations, moving NULL pointers from the second. We use two compression
a next hop array with the next hop port for each switch and a DHT techniques (Figure 6): bitmap compression for “dense” nodes and
ring for looking up the resolver switch implemented as an array of value-list compression for “sparse” nodes. A node is sparse if the
two-level B-trees. The structure of the lookup object is such that compressed array is below a certain size (8 in our implementation,
one lookup per packet is sufficient. In rare cases we also need an 4 in Figure 6). Bitmap compression breaks the arrays into chunks
update of the timestamps in the location table. To minimize latency, of 32 values (8 in Figure 6) and builds 32-bit bitmaps to find the
the DHT ring and the location table are looked up in parallel. The right entry within these arrays during lookup. In the first bitmap
DHT ring always produces a result, but if the lookup in the location the bit is set for positions which differ from the next one and in
table is successful it overrides it. Hence we always read the next the second the bit is set for non-NULL children. The lookup code-
hop array, either to find the next hop to the switch connecting the block processing the summary counts the number of bits set before
destination of the packet, or to its resolver. the position to which the IP address being looked up is mapped
The next hop table is a simple array with the addresses and the to. With value-list compression (not shown in Figure 6) we keep a
next hops for all the switches. We do not need a hash table because list of values for the 8 bits covered by the node for which the port
the location table and DHT ring store the position in the array at number differs from that for the previous value. For the child ar-
which the information associated with a switch is. We also need ray we keep a list of values that correspond to children. The lookup
to store the address of the switch in this array because it is needed code-block performs binary search in the value-list array to find the
to build the outer header when the packet is encapsulated to be correct index in the two compressed arrays.
tunneled to the destination switch.
The location table uses MAC addresses as keys. It stores many 4.5 Other Protocols
types of entries: the cache of popular remote destinations, addresses IPv6: Preliminary analysis shows that IPv6 also maps well to PLUG.
for which the switch is the resolver in the DHT, locally connected Since the last 64 bits of IPv6 addresses are the host part, prefixes
addresses and the addresses for all switches. Each entry contains with lengths from 65 to 127 are invalid and we use a hash table
the position in the next hop array for the connecting switch. It also for the last 64 bits. To reduce the latency of the lookup we divide
stores a timestamp and a 2-bit type field which identifies which of prefixes into two pipelines that are looked up in parallel: one for
these four categories the entry belongs to. There are different poli- prefix lengths of up to 48 and the other for prefix lengths of 49 to
cies for updating entries of different types. For example entries are 64 and 128. The root of this second pipeline is a node with stride
removed from the cache when a timeout occurs, whereas entries for 48 implemented as a hash table. The latency of the IPv6 lookup is
which the switch is the resolver are kept until the DHT changes. only 70% larger than than for IPv4, but it uses more than 3 times
The DHT ring needs to find the switch with the hashed identi- the number of pages (27) increasing power consumption.
fier nearest, but no larger than that of the address being looked up. AIP: The Accountable Internet Protocol [2] uses addresses com-
We divide the hash space into 4096 intervals of equal size and for posed of 160-bit cryptographic accountability domain (AD) ad-
each interval we keep a B-tree with the hashed identifiers of all the dresses and 160-bit endpoint identifiers (EIDs). This helps solve
switches that map to it. We chose the number of intervals so that many security problems that stem from the fundamental lack of ac-
we can use B-trees of depth 2. The size of B-tree nodes is 128 countability in the traditional IP. The core forwarding operation on
bits: five 8-bit keys, five 16-bit values pointing to the entries for AIP addresses requires looking up large hash tables with AD ad-
the switches associated with the keys and an 8-bit counter for the dresses or EIDs as keys and PLUG can implement this efficiently.
number of keys actually used. We use one of the values in the root Since 144 of the 160 bits are generated by cryptographic hash func-
node to point to the entry of the switch with the largest ID in earlier tions we can exploit their randomness to reduce the size of the ac-
intervals. The second level of each B-tree has 5 nodes statically tual keys stored in the hash table buckets. We can use some of these
assigned to it, but the number of children actually used varies. bits as the bucket identifier and others as fingerprint, thus the actual
entry only needs to store the remaining bits.
4.4 IP Version 4 PLayer: PLayer [28] is a data link protocol for data centers achiev-
For IP version 4 we implemented two lookup objects that per- ing more robust control of the traffic by using policy-aware switches
form the “longest matching prefix” operation. The first one is a to direct traffic through the right sequence of middle boxes. For-
straightforward adaptation of the “Lulea” algorithm [20] which uses warding is based on rules on the source of the packet and the flow
compressed multibit tries and achieves very compact sizes for the 5-tuple. This is similar to the packet classification operation dis-
forwarding table, but updates are relatively slow. The second one is cussed in Section 4.6.
an algorithm using similar techniques that results in slightly larger NAT: PLUGs are well suited for implementing network address
forwarding tables, but supports fast updates. For brevity we only translation with hash tables and arrays. We also considered a NAT-
describe here the second lookup object, but we note that the main based architecture, IPNL [23]. Most of the required operations can
difference from Lulea is that it does not perform leaf pushing. be implemented by PLUGs. The one operation that is hard to im-

212
Figure 6: Compression technique and data flow graph for the IPv4 lookup object.

plement directly on PLUG is lookups when keys are variable-sized each code block accesses the memory once. In fact each code block
such as for a cache of mappings from fully qualified domain names has a window of X cycles in which it can perform multiple memory
to IP addresses. A hash table with collision-resistant fingerprints of accesses, for example to read increment and then write counters.
the FQDNs as keys would be straightforward. The semantics of concurrent execution of multiple methods in the
pipeline is still that of atomic sequential execution.
4.6 Discussion Scaling up: If the desired throughput is larger than one lookup per
PLUGs can be used for other functions and they can be used in cycle, multiple independent pipelines with separate copies of the
scaled up or scaled down modes which we discuss below. forwarding table can be used. The cost of the increase in through-
Packet classification: Another throughput-critical data plane oper- put is an increase in memory requirements, but it can be reduced
ation involves matching packet headers against rules with ranges or by taking advantage of the existence of popular destinations in the
prefixes for each of the header fields (source, destination and proto- forwarding tables. We can split the forwarding tables into a small
col for IP and source and destination port). TCAM-based solutions group of trie nodes or hash table entries that store popular destina-
for the packet classification problem cannot be implemented on our tions and a larger group of unpopular ones. The pipelines can have
low-power SRAM-based architecture, but many of the algorithmic separate copies of the popular entries but share one copy of the un-
solutions could work well. Algorithms that rely on processing the popular ones. Collisions could still occur in the shared pipeline,
fields independently and combining the results through bitmap op- but they should be rare and one can recover by submitting again
erations [35, 6] can make use of the parallelism of the PLUG pro- the lookup that did not complete due to the collision. Hence we
gramming model, the wide memory reads, and the ability to select can achieve performance benefits akin to those of caching in this
the highest priority answers by discarding lower priority results. statically scheduled architecture.
Other approaches relying on decision trees [26, 43] or forests of Limitations: PLUGs are not suitable for all data plane processing.
tries [4] can be naturally adapted to a pipeline and the flexibility For some tasks a traditional architecture (multiple processor cores
of the programmable µcores makes it easy to support the required with multiple levels of caches backed by DRAM) is better suited.
operations on the tree nodes. While we have not implemented any There are three fundamental characteristics that distinguish tasks
of these algorithms on PLUG, we believe that packet classification suitable for a PLUG: the ability to describe the task with an acyclic
can be one of the important uses for it. data flow graph, use of a data structure that can fit into SRAM,
Signature matching: Deep packet inspection often relies on DFA- and poor locality in this data structure. Signature matching using
based signature matching and compression techniques [33, 34, 8, DFAs involves cyclic data dependencies and hence the task cannot
31] are often applied to reduce the memory used by transition ta- be described as an acyclic data flow graph; PLUG can be used as
bles. Lookup in these compressed transition tables is well suited a lookup module invoked separately for each byte of input, but it
for PLUG and we have an implementation using some of the pro- cannot implement the whole computation. The second condition
posed techniques. The latency of the PLUG can be a problem, but may not hold for equipment where per flow state is too large and a
algorithms consuming multiple bytes at a time [9] may alleviate it. PLUG is able to accommodate only a fraction of the flow records.
Per flow state: Features such as accounting, application identifi- Examples of where the third condition does not hold are tasks such
cation and intrusion prevention are supported by modern network as parsing protocol headers where no data structure is used or pro-
equipment and require per flow state. Hash tables with per flow tocols for which the data structure fits in the L1 cache [1].
state can be placed on the PLUG. With current technology PLUGs
accommodating on the order of one million flows are feasible.
Scaling down: In settings where the target throughput is less than 5. PLUG ARCHITECTURE
one lookup per cycle, the programming model can be extended in Figure 7 shows the high level overview of the PLUG architec-
powerful ways. If the pipeline is only required to accept a new ture. It is a tiled multi-core multi-threaded architecture with a very
method invocation every X cycles, we can relax the limitation that simple on-chip network. Tiles provide three types of resources:

213
that µcore is free. The resource constraints dictate that no more
than M logical pages can be assigned to a single tile and thus the
maximum number of virtual tiles is M . Depending on the length
of the code-blocks, different number of µcores are assigned to each
logical page. The number of networks assigned to each virtual tile
depends on the number of messages generated.
µCores: Each µcore is a simple 16-bit single-issue, in-order pro-
cessor with only integer computation support, simple control-flow
support (no branch prediction), and simple instruction memory (256
entries per µcore). The register file is quite small, only sixteen 16-
bit entries and it requires four consecutive entries to be read for
feeding the router.
Router: Each tile also includes simple routers that implement a
lightweight on-chip network (OCN). Compared to conventional OCNs,
this network requires no buffering or flow control as the OCN traf-
fic is guaranteed to be conflict-free. The routers implement simple
ordered routing and the arbiters apply a fixed priority scheme. Each
network is 64-bits wide, and the software mapper assigns networks
to the virtual tiles. The network message is a total of 80 bits, 16
bits of header information and 64 bits of data. The header infor-
mation contains five fields: destination encoded as a X coordinate
and Y coordinate, a 4-bit type field to encode 16 possible types of
messages, a multicast bit which indicates the message must be de-
livered to some intermediate hops en route to the final destination
and a 3-bit selector field. This field is used to select between vir-
tual tiles and to control which of the tiles on the path of a multicast
message actually process it.

5.1 Implementation Specification


To derive a PLUG configuration, we examined different proto-
cols and current and projected future data sets. Our target was 1
Figure 7: PLUG architecture overview showing three virtual billion lookups per second (depending on minimum packet size and
tiles to map three logical pages. on the number of lookups/packet, maximum traffic volume is be-
tween 160 Gbps and 600 Gbps) and less than 5 watts worst case
computation shown by the grid of computation cores (called µcores), power. We picked a technology design point for 2010 and chose
storage shown by the grid of SRAMs, and routers shown by the a 32nm design process. Many of our data-sets required 2MB and
array of routers. The routers form an on-chip interconnection net- to account for future expansion, we provisioned for 4MB on-chip
work connecting multiple tiles together to form the full PLUG chip. storage. Based on area models we developed (explained below)
External interfaces to provide input and read output from the PLUG a 21mm2 chip provides 16 tiles and a total storage of 4MB. Our
are extensions of this on-chip network. The architecture can sup- workload analysis showed 32 cores, four 64KB banks, and 8 net-
port multiple external interfaces and PLUGs can be standalone chips works per tile meets our design goals. This 16-tile 4Mx32Cx8R
or integrated with other chips. configuration is the specification evaluated in this paper.
A key simplification of the architecture is that it is globally stat-
ically scheduled which is achieved by adhering to a set of rules in 5.2 Modeling and Physical Design
generating the code-blocks. We use a RISC ISA with specialization We constructed area and power models for the PLUG architec-
for bit manipulation. In the remainder of this section, we describe ture using widely accepted design tools and methodology. Our
the tile organization, the µcores, and the on-chip network. models include three components: SRAM, the µcores, and inter-
Tile: The tile’s resources are virtualized as a memory array with connection network.
M ports and thus allowing up to M accesses at a time, a router ar- Area: The SRAMs were modeled using CACTI 5.0 [48] - a stan-
ray with R ports thus allowing R input/output messages at a time, dard SRAM modeling tool. We used single-ported SRAMs and to
and a computation array with C cores. The PLUG architecture can save power used LSTP memory cells which have low static power
be scaled along any of these dimensions, and the figure shows a but are slower. For modeling the µcore array, we used published
4Mx16Cx6R configuration. This virtualization allows the several processor data-sheets and used the Tensilica Xtensa LX2 [47] as a
logical pages and their associated code-blocks to be mapped to a baseline for our µcore. This processor is a simple 32-bit, 5-stage
tile. Thus one physical tile can be viewed as multiple logical or vir- in-order processor and occupies 0.206mm2 built at 90nm. Project-
tual tiles, where each such virtual tile has a subset of the resources. ing for 32nm technology and simplifying to a 16-bit data path, we
An example assignment is indicated by the coloring shown in the scale down its area and our models project 0.013mm2 . We conser-
figure, where three logical pages are mapped to a single tile by con- vatively assumed the interconnect’s area is 10% of processor area.
structing three virtual tiles colored blue, green, and orange. Based on this model, a single PLUG tile’s area is 1.29 mm2 of
When a message arrives through a network, it triggers the execu- which 74% is SRAM.
tion of the code-block it refers to. The next available free µcore starts Power: CACTI provides power measurements for SRAMs and
executing the code-block. If another message arrives in the next cy- we estimate worst case power by assuming the SRAM is accessed
cle, another µcore is used. When a code-block finishes executing, every cycle. We used processor data-sheet information about the

214
Xtensa LX2 processor to derive the power consumption of our 32
µcore array. We model interconnect power per tile as 15% of pro-
cessor (µcore array in our case) dynamic power for an active link,
adapting the findings in [50]. The worst case power for a tile is 990
milliwatts.
Dynamic chip power for the PLUG is derived by considering
the (maximum) number of tiles that will be activated during the
execution of a method (activity number (A)) and considering and
average links active (L). Thus, worst case power can be modeled as
A tiles executing instructions in all µcores and one memory access
every cycle. We compute L based on the mappings of the lookup
objects to the grid discussed in Section 6. The final chip power = [
( (memory leakage power per tile) + µcore leakage power per tile))
* (total number of tiles) ] + [ ( (dynamic memory power per tile) +
µcore power per tile) ) * (activity number) ] + [ (interconnect power
per active link) * (average links active) ]. This model is seeded with
the activity number (A) for different protocols.

6. SCHEDULING AND MAPPING


Earlier sections describe the lookup objects, we show here how
the logical pages and code-blocks can be mapped to the actual
physical resources. While we currently perform these operations
manually, we envision that the compiler could eventually make all
mapping decisions. A first step is to convert logical pages into
smaller physical pages that can fit on an SRAM bank. Different
logical pages can get mapped to a single physical tile and each log-
ical page has its own set of SRAM banks and µcores. Second, the
mapper also checks for resource conflicts - the number of instruc-
tions in the code-blocks that have been mapped to a tile must not Figure 8: Mapping lookup objects to the PLUG grid.
exceed the total number of µcores available. Third, the mapper also
enforces fixed delays as explained in detail below.
the same edge do not collide, but if messages corresponding to dif-
Code-blocks: Each tile can be executing multiple instances of
ferent edges can occur in the same area of the grid the edges must
code-blocks for the same logical page. While memories and net-
be mapped to different networks.
works are not shared between different pages mapped to the same
Multicast: Hash tables use “spread” edges (Figure 2a) to distribute
tile, the cores that run code-blocks for the same page could get
the lookup to multiple logical pages. We implement these using
into conflicts when accessing memory or sending messages. We
multicast messages. Each multicast message takes a single path
ensure that each code-block for a page performs the memory oper-
through the grid, but it may trigger code blocks at more than one
ation the same number of cycles after its start. Since the code-block
tile. In Figure 8b we show the mapping of a 2-page hash table
instances running concurrently are started in different cycles, this
to a 16-tile grid. Each logical page is mapped to 8 tiles and the
ensures that no memory conflicts arise. Similarly the sending of
mapping ensures that each lookup will need to activate entries from
messages occurs a fixed number of cycles after the code-block is
the two pages mapped to neighboring tiles. To avoid triggering
started.
code blocks on intermediate tiles (e.g. tiles A and E for lookup
Different paths: Two lookup instances can take different paths
1) we use the selector field from the the message header. Code-
through the grid. The time it takes for the result to emerge is the
blocks are triggered only if the value of this field matches a local
sum of the processing delays (which are the same for all lookups)
configuration register (similar to how multicast is implemented by
and the total propagation delay. If the paths have different lengths
Ethernet endhosts).
the total latencies differ and conflicts can occur at the output inter-
face (lookups initiated in different cycles emerge at the same time)
or inside the grid. Figure 8a shows how adopting a propagation 7. EVALUATION
discipline can ensure that all paths have equal length. Since mes-
sages propagate only to the right or down, the propagation latency In this section, we evaluate PLUGs and their suitability and effi-
to every tile is the Manhattan distance from the tile connected to ciency in supporting our suite of four protocols. We implemented
the input interface. The two lookups reach tile J through differ- each of the protocols using our software development stack. First,
ent paths, but their propagation delays are the same. Note that a we present a quantitative characterization of the different protocols
conflict on the link between tile O and tile P is still possible: the to demonstrate feasibility. We then describe the mapping of these
propagation delays are the same, but the processing delays differ as protocols to the PLUG architecture and evaluate performance in
lookup 1 has been processed by the first two pages, while lookup terms of latency, area, and energy efficiency. For all experiments
3 has been processed by all three. To avoid these types of conflicts we use a 21 mm2 4Mx32Cx8R 16-tile PLUG chip with 64KB
we use two separate networks. More generally this type of conflict memory banks.
arises because the two messages correspond to two different edges
in the data flow graph: for lookup 1 the edge between pages 2 and 7.1 Software Toolchain and Methodology
3 and for lookup 2 the edge between page 3 and the output. The We implement PLUG lookup objects as C++ objects. For each
propagation disciplines can ensure that messages corresponding to lookup object, we first developed a reference implementation which

215
Protocol Memory # of # of Lines of code # of Protocol Memory (MB) Area (mm2 ) Power (Watts)
size logical code- tiles PLUG Ideal PLUG Ideal PLUG Ideal
(MB) pages blocks PLUG Reference Ethernet 2.00 2.0 9.8 6.6 0.72 0.22
Ethernet 2 4 4 243 51 8 IPv4 2.00 1.4 7.4 4.6 1.06 0.22
IPv4 1.4 8 26 450 330 8 Seattle 2.75 2.0 13.5 7.2 1.68 0.33
Seattle 2 11 18 390 347 11 Ethane 2.00 2.0 9.8 6.6 1.36 0.32
Ethane 2 8 15 1120 200 8

Table 2: Characterization of the lookup objects implemented Table 4: Memory, area, and power characteristics.
on a 16-tile PLUG.
Table 3 describes in detail the characteristics of individual logical
Protocol Logical page bytes page mem. Latency
pages. It shows the size of the data-blocks in each logical page and
name per size banks (cycles)
data- (KB) per Code Total how many physical memory banks are used to schedule this page
block page block to the PLUG hardware.
Ethernet Buckets 16 512*4 8 10 10 Latency: Column 6 in Table 3 shows the latency in cycles of the
L1 summaries 12 24 1 10 largest code-block associated with each logical page of an appli-
L1 results 2 34 1 6 cation in the last column. These numbers vary between 6 and 20
L1 children 2 27 1 9 82
cycles. The total latency is simply the sum of the code-block laten-
IPv4 L2 summaries 8-14 609 10 20
L2 results 2 685 11 6
cies along the data flow graph’s critical path and it varies between
L2 children 2 7 1 9 10 and 82 cycles. These numbers do not include the propagation
L3 summaries 6-9 36 1 16 latency which is 8 cycles on our 16-tile grid.
L3 results 2 26 1 6 Memory and area: Table 4 characterizes the memory occupied by
HostLoc 9 144*8 3 14 the different implementations. During the mapping process logical
DHT Ring 0 16 64 1 20 49 pages are broken down into physical pages that will fit on a mem-
Seattle
DHT Ring 1 16 320 5 20
ory bank. Then, we decide which page is mapped to which tile and
NextHop 8 512 8 9
Misbehaving 16 256*2 4 21 which bank. To avoid conflicts, certain banks may not get com-
Ethane Flow table 1 16 256*4 4 10 31 pletely full because they contain a physical page smaller than the
Flow table 2 16 256*2 4 10 memory available in the bank. Occasionally we leave banks un-
used on tiles where we cannot map a new page because too many
µcores are used by the pages already mapped to the tile. Column
Table 3: Logical page characterization of the protocols. 2 in Table 4, shows the total memory (sum of SRAM banks) used
in the different protocols accounting for these fragmentation and
scheduling losses and compares to the actual memory required by
directly implements the application without transforming it to the the application (column 3). For Ethernet forwarding and Ethane
PLUG model but implements the same routines (lookup, updates, which both consist of very regular hash tables, there is no overhead.
expiration of old entries, etc.). We use a C++ framework, which In the other two protocols, the overheads are acceptable: 38% and
provides the following: 1) a logical-page data structure as an array 43%.
data-type, 2) explicitly defined code-block functions to access these The PLUG chip devotes 74% of the area to memories and the
data structures, and 3) network messages that are routed between remaining 26% to computation cores and routers. Column 4 in Ta-
physical pages. Applications are implemented using this frame- ble 4 shows the area occupied by a PLUG chip if sized to match
work and executed by invoking methods and passing input network the needs of the protocol alone. If a protocol required only 4 tiles,
messages. Hash functions are computed by the network processor. then we report the area of a 4-tile PLUG in this column. We are
Our framework executes the corresponding code-blocks on differ- not counting the area of unused tiles as overhead because they can
ent pages and finally provides the result message which is verified be used by other lookup objects. Column 5 shows an aggressive
by comparing to the reference implementation. For performance estimate of the minimum area required for a specialized lookup
analysis, we hand-assembled code-block programs, based on the module. For this aggressive estimate, we assumed no area for any
C++ implementation. of the processing required and count only the area of memories as-
suming no area losses due to alignment problems when laying out
7.2 Quantitative Results the memories corresponding to the logical pages. IPv4 for example
We evaluate the cost of the generality provided by PLUG by would require eight individual SRAMs whose sizes match the size
comparing against idealized implementations for lookup modules of the logical pages listed in Table 4. We notice that the area over-
for all the protocols we implemented. We also compare against heads, introduced due to the generality of PLUG can be as high as
NetLogic’s NLA9000, a popular lookup module for IPv4 that im- 88%. However, this comparison is to an idealized implementation.
plements a proprietary algorithmic pipelined lookup. The difference will be smaller when comparing PLUGs to realistic
Table 2 describes the characteristics of the four lookup objects specialized implementations.
we implemented. For each protocol we used data-sets and traces Power: Column 6 in Table 4 shows the power consumed by the
that reflect deployed scenarios and/or future needs. Ethernet for- PLUG for different protocols derived using our power model. Col-
warding uses random 100K addresses, IPv4 uses an actual routing umn 7 shows the power consumed by an ideal implementation which
table with 280K prefixes, Seattle has been dimensioned to support requires no processing power and consumes only memory read/write
60K switches and for Ethane we used the guidelines from [14]. power. We see that PLUGs are within 5X of this oracle implemen-
For all protocols, we see that the number of logical pages is quite tation that simply cannot be physically constructed.
small and that the number of lines of code to develop the PLUG Comparison to state-of-the-art: The NLA9000 from NetLogic
implementations is similar to the reference implementation. The is a chip widely used for IP lookups and it can be configured as
last column in the table also shows the number of PLUG tiles used. TCAM or as a low-power algorithmic IP lookup engine using a

216
Metric PLUG NLA9000 nal communication and processing) and achieves higher throughput
Area (mm2 ) 140 - for lookups than existing tiled architectures because we can fully
Power (Watts) 1.4 6.5
pipeline processing in each tile. The MIT RAW tiled micropro-
Throughput (billion lookups/second) 633 300
Latency of lookup (ns) 148 160 cessor has been evaluated for implementing a full IP router [40].
PLUGs are less general in that they provide support only for the
lookup task, but through this specialization they can achieve better
Table 5: PLUG comparison to NetLogic NLA9000 (both at efficiency.
55nm technology, interface overheads ignored for PLUG).
9. CONCLUSIONS AND FUTURE WORK
proprietary lookup algorithm. It can fit 1.5 million IPv4 prefixes, High speed network processors use custom modules for lookups
it does 300 million lookups per second at a latency of 160ns con- in forwarding tables because of area, power, and performance ad-
suming 6.5 Watts. Since this chip is built at 55nm technology we vantages. Since most new protocols require different structures for
compare to a PLUG chip also at 55nm technology for this discus- their forwarding tables, inflexible lookup modules customized to
sion. Because the SRAMs are slower, the highest frequency for the current protocols slow down the deployment of new ones by mak-
PLUG is 633 MHz which reduces throughput. We can accommo- ing hardware upgrades necessary. We propose PLUG, an architec-
date 1.5 million IPv4 prefixes with a 140 mm2 PLUG providing ture for a flexible lookup module and we show how the forwarding
9 MB of storage using 36 tiles. In this configuration the PLUG tables of four different protocols can be mapped to it. We anticipate
would consume 1.4 Watts and provide a latency of 148ns. Table 5 that PLUGs can also support packet classification and deep packet
shows a summary of this comparison. Our numbers for PLUG do inspection.
not account for the power and latency of the interfaces to the rest of The forwarding tables are expressed as lookup objects that break
the system, but the numbers for NLA9000 include these interface the data structure into logical pages, break the processing into code
overheads also. We conclude that in this case the PLUG is actually blocks each local to one page and use explicit messages for all data
more efficient than a specialized lookup chip. transfers. The data flow graph describes the internal communica-
tion patterns of the lookup object. The data, processing, and com-
munication of the lookup object are mapped to the tiled PLUG
8. RELATED WORK architecture in a way that avoids resource conflicts producing a
Lookup modules: Ternary content addressable memories (TCAMs) pipeline with a fixed throughput of one lookup per cycle. Each
use hardware parallelism to match the search key against all entries. tile has multiple µcores to process one new request every cycle.
They have been used for IP lookup [38, 42]. Their main problem The static avoidance of resource conflicts makes the routers used
is their large power consumption due in large part to the fact that for internal communication and the µcores very simple.
all entries are activated in parallel. While selective activation of We have shown that PLUGs compare favorably to specialized
TCAM blocks reduces the power consumption [54], SRAM-based lookup modules. Future work to refine the architecture, compiler,
algorithmic lookup modules are the preferred solution for large for- and system software is required to definitely answer the question of
warding tables. Pipelined tries [7, 17, 5, 32, 27] are used by algo- how effective PLUGs are in deployed systems. Further evaluation
rithmic lookup modules. The PLUG solution for IP lookup falls in a product environment can best answer some of the questions
into this category. Casado et al. [16] propose a flexible lookup that relate to the economics of high-end routers.
module that can accommodate protocol changes. Their design is
based on a TCAM cache that hands packets over to software when
it cannot make a forwarding decision. Unlike PLUG which can
Acknowledgments
deliver predictable throughput irrespective on the traffic mix, this This work is sponsored by NSF grants 0546585, 0627102, and
solution depends on a high cache hit rate to achieve good perfor- 0716538 and by a gift from the Cisco University Research Program
mance. Fund at Silicon Valley Community Foundation. We thank Pere
Dataflow: The PLUG programming model is inspired originally Monclus, Mike Swift, Mike Ichiriu, Randy Smith, Matt Fredrik-
by the SDF model [36] and by the dataflow model of Click [30]. son and the anonymous reviewers for suggestions that improved
The core differences are that we focus on lookups, not on the entire this paper.
functionality of the router and that our goal is to map the objects to
a module specialized for lookups, not to a general-purpose proces- 10. REFERENCES
sor. Gordon et al. discuss a general dataflow like approach called [1] M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable,
StreamIt [24]. Similar to historical dataflow machines [21, 3] the commodity data center network architecture. In SIGCOMM,
PLUG architecture implements dataflow execution but in a coarse pages 63–74, Aug. 2008.
[2] D. G. Andersen, H. Balakrishnan, N. Feamster, T. Koponen,
granularity of code blocks and network messages. The conflict- D. Moon, and S. Shenker. Accountable internet protocol
free operation of PLUG pipelines is similar to systolic arrays that (AIP). In Proceedings of the ACM SIGCOMM, Aug. 2008.
execute in a very regular fashion. [3] Arvind and D. E. Culler. Dataflow Architectures. Annual
Tiled architectures: The PLUG architecture is inspired by re- Review of Computer Science, 1:225–253, 1986.
cently proposed tiled architectures [46, 37, 41, 45]. The key dis- [4] F. Baboescu, S. Singh, and G. Varghese. Packet classification
tinction is that these architectures are targeting general-purpose pro- for core routers: Is there an alternative to CAMs. In
cessing and thus include area- and power-consuming features such INFOCOM, Apr. 2003.
as memory disambiguation, control speculation, networks with flow [5] F. Baboescu, D. Tullsen, G. Rosu, and S. Singh. A tree based
router search engine architecture with single port memories.
control and buffering and are too inefficient for workloads domi- In ISCA, June 2005.
nated by memory accesses such as the lookups. Targeting lookups [6] F. Baboescu and G. Varghese. Scalable packet classification.
allows us to statically eliminate all resource conflicts in PLUG. In SIGCOMM, pages 199–210, Aug. 2001.
Thus our architecture spends less area and power on “overhead” [7] A. Basu and G. Narlikar. Fast incremental updates for
beyond the memories required to store the forwarding table (inter- pipelined forwarding engines. In INFOCOM, Apr. 2003.

217
[8] M. Becchi and P. Crowley. An improved algorithm to
accelerate regular expression evaluation. In ANCS, Dec. [31] S. Kong, R. Smith, and C. Estan. Efficient signature
2007. matching with multiple alphabet compression tables. In
[9] M. Becchi and P. Crowley. Efficient regular expression SecureComm, Sept. 2008.
evaluation: Theory to practice. In ANCS, December 2008. [32] S. Kumar, M. Becchi, P. Crowley, and J. Turner. CAMP: fast
[10] N. Beheshti, Y. Ganjali, M. Ghobadi, N. McKeown, and and efficient IP lookup architecture. In ANCS, pages 51–60,
G. Salmon. Experimental study of router buffer sizing. In Dec. 2006.
Internet Measurement Conference, Oct. 2008. [33] S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, and
[11] F. Bonomi, M. Mitzenmacher, R. Panigraphy, S. Singh, and J. Turner. Algorithms to accelerate multiple regular
G. Varghese. Beyond Bloom filters: From approximate expressions matching for deep packet inspection. In
membership checks to approximate state machines. In SIGCOMM, Sept. 2006.
SIGCOMM, Sept. 2006. [34] S. Kumar, J. Turner, and J. Williams. Advanced algorithms
[12] A. Broder and M. Mitzenmacher. Using multiple hash for fast and scalable deep packet inspection. In ANCS 2006,
functions to improve IP lookups. In INFOCOM, pages pages 81–92.
1454–1463, Apr. 2001. [35] T. V. Lakshman and D. Stiliadis. High-speed policy-based
[13] L. D. Carli, Y. Pan, A. Kumar, C. Estan, and packet forwarding using efficient multi-dimensional range
K. Sankaralingam. Flexible Lookup Modules for Rapid matching. In SIGCOMM, pages 203–214, Sept. 1998.
Deployment of New Protocols in High-speed Routers. [36] E. A. Lee and D. G. Messerschmitt. Synchronous data flow.
Technical Report TR1658, UW-Madison, Computer Science Proceedings of the IEEE, 75(9):1235–1245, 1987.
Department, May 2009. [37] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. J. Dally, and
[14] M. Casado, M. J. Freedman, J. Pettit, J. Luo, N. McKeown, M. Horowitz. Smart Memories: A modular reconfigurable
and S. Shenker. Ethane: taking control of the enterprise. In architecture. In ISCA, pages 161–171, June 2000.
SIGCOMM, Aug. 2007. [38] A. J. McAuley and P. Francis. Fast routing table lookup
[15] M. Casado, T. Garfinkel, A. Akella, M. Freedman, D. Boneh, using CAMs. In INFOCOM, pages 1382–1391, Apr. 1993.
N. McKeown, and S. Shenker. SANE: A protection [39] N. McKeown. The NetFPGA project.
architecture for enterprise networks. In USENIX Security http://www.netfpga.org/.
Symposium, Aug. 2006. [40] U. Saif, J. W. Anderson, A. Degangi, and A. Agarwal.
[16] M. Casado, T. Koponen, D. Moon, and S. Shenker. Gigabit routing on a software-exposed tiled-microprocessor.
Rethinking packet forwarding hardware. In HotNets-VII, In ANCS, pages 51–60, Oct. 2005.
Oct. 2008. [41] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh,
[17] F. Chung, R. Graham, and G. Varghese. Parallelism versus S. W. Keckler, D. Burger, and C. R. Moore. Exploiting ILP,
memory allocation in pipelined router forwarding engines. In TLP and DLP with the Polymorphous TRIPS Architecture.
SPAA, pages 103–111, June 2004. In ISCA ’03, pages 422–433, June 2003.
[18] Cisco Public Information. The cisco quantumflow processor: [42] D. Shah and P. Gupta. Fast updating algorithms for TCAMs.
Cisco’s next generation network processor. IEEE Micro, 21(1):36–47, Jan. 2001.
http://www.cisco.com/en/US/prod/collateral/routers/ [43] S. Singh, F. Baboescu, G. Varghese, and J. Wang. Packet
ps9343/solution_overview_c22-448936.html, 2008. classification using multidimensional cutting. In SIGCOMM,
[19] D. Clark, J. Wroclawski, K. Sollins, and R. Braden. Tussle in 2003.
cyberspace: Defining tomorrow’s internet. In SIGCOMM, [44] I. Stoica, D. Adkins, S. Zhuang, S. Shenker, and S. Surana.
August 2002. Internet indirection infrastructure. In SIGCOMM, Aug. 2002.
[20] M. Degermark, A. Brodnik, S. Carlsson, and S. Pink. Small [45] S. Swanson, K. Michelson, A. Schwerin, and M. Oskin.
forwarding tables for fast routing lookups. In SIGCOMM, Wavescalar. In MICRO ’03, pages 291–302, December 2003.
pages 3–14, Oct. 1997. [46] M. B. Taylor, J. Kim, J. Miller, D. W. laff, F. Ghodrat,
[21] J. Dennis. A preliminary architecture for a basic data-flow B. Greenwald, H. Hoffman, P. Johnson, W. L. Jae-Wook Lee,
processor. In ISCA ’75, pages 126–132, January 1975. A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen,
[22] W. Eatherton. The push of network processing to the top of M. Frank, S. Amarasinghe, and A. Agarwal. The RAW
the pyramid. Keynote Address at ANCS, Oct. 2005. Microprocessor: A Computational Fabric for Software
[23] P. Francis and R. Gummadi. IPNL: A NAT-extended internet Circuits and General-Purpose Programs. IEEE Micro,
architecture. In SIGCOMM, pages 69–80, Aug. 2001. 22(2):25–35, March 2002.
[24] M. Gordon, W. Thies, M. Karczmarek, J. Lin, A. S. Meli, [47] Xtensa lx2: The fastest processor core ever,
C. Leger, A. A. Lamb, J. Wong, H. Hoffman, D. Z. Maze, http://www.tensilica.com/products/xtensa_lx.htm.
and S. Amarasinghe. A stream compiler for [48] S. Thoziyoor, N. Muralimanohar, and N. Jouppi. Cacti 5.0.
communication-exposed architectures. In ASPLOS 2002, San Technical Report HPL-2007-167, HP Research Labs, 2007.
Jose, CA USA, Oct. 2002. [49] B. Vöcking. How asymmetry helps load balancing. In
[25] C. Guo, H. Wu, K. Tan, L. Shiy, Y. Zhang, and S. Luz. IEEE-FOCS, pages 131–140, Oct. 1999.
DCell: A scalable and fault-tolerant network structure for [50] H.-S. Wang, X. Zhu, L.-S. Peh, and S. Malik. Orion: a
data centers. In SIGCOMM, pages 75–86, Aug. 2008. power-performance simulator for interconnection networks.
[26] P. Gupta and N. McKeown. Packet classification using In MICRO 35, pages 294–305, 2002.
hierarchical intelligent cuttings. In Hot Interconnects VII, [51] A. Yaar, A. Perrig, and D. Song. SIFF: A stateless internet
Aug. 1999. flow filter to mitigate DDoS flooding attacks. In Proceedings
[27] W. Jiang, Q. Wang, and V. K. Prasanna. Beyond TCAMs: An of the IEEE Symposium on Security and Privacy, May 2004.
SRAM-based parallel multi-pipeline architecture for terabit [52] X. Yang, D. Clark, and A. W. Berger. NIRA: a new
IP lookup. In INFOCOM, Apr. 2008. inter-domain routing architecture. IEEE/ACM Transactions
[28] D. A. Joseph, A. Tavakoli, and I. Stoica. A policy-aware on Networking, 15(4):775–788, Aug. 2007.
switching layer for data centers. In SIGCOMM, pages 51–62, [53] X. Yang, D. Wetherall, and T. Anderson. A DoS-limiting
Aug. 2008. network architecture. In SIGCOMM, Aug. 2005.
[29] C. Kim, M. Caesar, and J. Rexford. Floodless in seattle: A [54] F. Zane, G. Narlikar, and A. Basu. CoolCAMs:
scalable ethernet architecture for large enterprises. In Power-efficient TCAMs for forwarding engines. In
SIGCOMM, pages 3–14, Aug. 2008. INFOCOM, Apr. 2003.
[30] E. Kohler, R. Morris, B. Chen, J. Jannotti, and F. Kaashoek.
The Click modular router. ACM Trans. Comput. Syst.,
18(3):263–297, Aug. 2000.

218

You might also like