
RouterBricks

Scaling Software Routers with Modern Servers


Kevin Fall, Intel Labs Berkeley
Feb 24, 2010
Ericsson, San Jose, CA

Project Participants
Intel Labs
Gianluca Iannaccone (co-PI, researcher)
Sylvia Ratnasamy (co-PI, researcher)
Kevin Fall (principal engineer)
Allan Knies (principal engineer)
Maziar Manesh (research engineer)
Eddie Kohler (Click expert)
Dan Dahle (tech strategy)
Badarinath Kommandur (tech strategy)

École Polytechnique Fédérale de Lausanne (EPFL), Switzerland


Katerina Argyraki (faculty)
Mihai Dobrescu (student)
Diaqing Chu (student)
2

Outline
Introduction
Approach: cluster-based router
RouteBricks implementation
Performance results
Next steps

RouterBricks: in a nutshell
A high-speed router using IA server components
fully programmable: control and data plane
extensible: evolve networks via software upgrade
incrementally scalable: flat cost per bit

Motivation
Network infrastructure is doing more than ever before
Packet-pushing (routing) no longer the whole story
security, data loss protection, application optimization, etc.

This has led to a proliferation of special appliances, and to the notion that perhaps routers could do more
Cisco, Juniper supporting open APIs
OpenFlow consortium: Stanford, HP, Broadcom, Cisco

But these platforms weren't born programmable


5

Motivation
If flexibility ultimately implies programmability...
Hard to beat IA platforms and their ecosystem
Or their price

However, must deal with persistent folklore:

IA can't do high-speed packet processing

But today's IA isn't the IA you know from your youth
multicore, multiple integrated memory controllers, PCIe, multi-queue NICs, ...

Motivation
Combine a desire for more programmability... with new, router-friendly server trends
a new opportunity for IA servers?

RouterBricks: how might we build a big (~1 Tbps) IA-based software router?
7

Challenge
traditional software routers
research prototypes (2007): 1-2 Gbps
Vyatta* datasheet (2009): 2-4 Gbps

current carrier-grade routers
line speeds: 10/40 Gbps
aggregate switching speeds: 40 Gbps to 92 Tbps!

8
* Other names and brands may be claimed as properties of others

Strategy
1. A cluster-based router architecture
each server need only scale to line speeds (10-40 Gbps), rather than aggregate speeds (40 Gbps to 92 Tbps)

2. Understand whether modern server architectures can scale to line speeds (10-40Gbps)
if not, why?

3. Leverage open-source control plane implementations


xorp, quagga, etc. [but we focus on data plane here]

Broader Benefits
1. infrastructure that is well-known and cheaper to evolve
familiar programming environment
separately-evolvable network software and hardware
reduced cost -> more frequent upgrade opportunity

2. networks with the benefits of the PC ecosystem


high-volume manufacturing
widespread supply/support
state-of-the-art process technologies (ride Moore's Law)
evolving PC platform features (power mgmt, crypto, etc.)
10

Outline
Introduction
Approach: cluster-based router
RouteBricks implementation
Performance results
Next steps

11

Traditional router architecture

[Figure: a router with N ports (#1 through N), each port running at R bps in each direction]

12

Traditional router architecture


[Figure: traditional router architecture]
control processor (runs IOS/quagga/xorp, etc.): runs at << R bps
switch scheduler and switch fabric: run at N*R
per-port linecards, each running at R bps: addr tables, FIB, ACLs; IP address lookup; queue mgmt, shaping, etc.
R bps ports #1, 2, 3, ... N

13

Moving to a cluster-router
[Figure: same traditional router architecture: control processor, switch scheduler, switch fabric, and per-port linecards (addr tables, FIB, ACLs; IP address lookup; queue mgmt, shaping), R bps ports #1 through N]

step 1: a single server implements one port; N ports -> N servers

14

Moving to a cluster-router
[Figure: the per-port linecard (addr tables, FIB, ACLs; IP address lookup; queue mgmt, shaping) is now implemented in software on a server; the control processor, switch scheduler, and switch fabric remain]

Each server must process at least 2R of traffic (in + out)
R bps ports #1, 2, ... N

step 1: a single server implements one port; N ports -> N servers

15

Moving to a cluster-router
[Figure: servers implement the ports; the control processor, switch scheduler, and switch fabric remain]
R bps ports #1, 2, ... N

step 2: replace the switch fabric and scheduler with a distributed, software-based solution

16

Moving to a cluster-router
control processor (runs IOS/quagga/xorp, etc)

distributed scheduling algorithms, based on Valiant Load Balancing (VLB)

server-to-server interconnect topology

R bps ports #1, 2, ... N

step 2: replace switch fabric and scheduler with a distributed, software-based solution

17

Example: VLB over a mesh*


* other topologies offer different tradeoffs
[each direction]

R bps

N ports, R bps port rate


# servers N N-1 2R N-1 3R (2R)*

2
N
internal fanout internal link capacity (RN/[N(N-1)/2]) processing/server
[out+in+through]

N servers can achieve switching speeds of N R bps, provided each server can process packets at 3R (*2R for Direct-VLB avg case) 18
18
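As a sanity check on these numbers, here is my own back-of-envelope working of the VLB argument above (not taken verbatim from the deck):

```latex
% VLB over a full mesh of N servers, one external port of R bps per server.
% Each packet crosses at most two internal hops (ingress -> intermediate -> egress),
% so total internal traffic is at most 2NR, spread over N(N-1) link directions.
\begin{align*}
  \text{internal fanout}        &= N - 1 \\
  \text{internal link capacity} &= \frac{2NR}{N(N-1)} = \frac{2R}{N-1} \\
  \text{processing per server}  &= \underbrace{R}_{\text{in}} + \underbrace{R}_{\text{out}}
                                   + \underbrace{R}_{\text{through}} = 3R
  \qquad (\approx 2R \text{ on average with Direct-VLB})
\end{align*}
```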

Outline
Introduction
Approach: cluster-based router
RouteBricks implementation
  RB4 prototype
  Click overview
Performance results
Next steps

19

RB4: hardware architecture


[Figure: four servers in a full mesh, each with a 10 Gbps external port]

4 dual-socket NHM-EP servers
  8x 2.8 GHz cores (no SMT)
  8 MB L3 cache
  6x 1 GB DDR3
  2 PCIe 2.0 slots (8 lanes)
  default BIOS settings

2x 10 Gbps Oplin cards per server
  dual-port, PCIe 1.1 (now using Niantic / PCIe 2.0)

20

RB4: software architecture


[Figure: software stack on each server, with 10 Gbps ports]
user space: hooks for new services; place for value-added services (e.g., monitoring, energy proxy, management, etc.)
Linux 2.6.24 kernel: RB data plane = packet processing (linecard) + RB VLB, implemented in Click (unmodified), running on the Click runtime and the RB device driver over the NICs


21

Click Overview
Modular, extensible software router
built on Linux as a kernel module
combines versatility and high performance

Architecture consists of:
elements that implement packet processing functions
a configuration language that connects elements into a packet data flow
an internal scheduler that decides which element to run

Large open-source library (200+ elements) means new routing applications can often be written with just a configuration script (see the element sketch below)
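To make the element/configuration split concrete, here is a minimal pass-through element sketched from the public Click element API; the PktCounter name and the config fragment in the comment are illustrative, not part of RouteBricks, and API details are approximate:

```cpp
// Minimal Click element sketch: counts packets and passes them through.
#include <click/config.h>
#include <click/element.hh>
CLICK_DECLS

class PktCounter : public Element {
public:
    PktCounter() : _count(0) {}
    const char *class_name() const { return "PktCounter"; }
    const char *port_count() const { return PORTS_1_1; }   // one input, one output

    Packet *simple_action(Packet *p) {   // invoked for each packet flowing through
        _count++;
        return p;                        // forward the packet unchanged
    }

private:
    unsigned long long _count;
};

CLICK_ENDDECLS
EXPORT_ELEMENT(PktCounter)

// Wired up from a Click configuration script, e.g.:
//   FromDevice(eth0) -> PktCounter -> ToDevice(eth1)
```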

slide material courtesy E.Kohler, UCLA

22

RB4: software architecture


[Figure: same software stack: value-added services and hooks for new services in user space; packet processing (linecard), RB VLB, Click runtime, RB device driver, and NICs in the Linux 2.6.24 kernel; implemented in Click (unmodified)]

Intel 10G driver (sketched schematically below):
  polling-only operation (no interrupts)
  transfers packets to memory in batches of k (we use k = 16)
  RSS with up to 32/64 rx/tx NIC queues
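A schematic of what this batched, polling-only receive path looks like; the RxQueue and process_batch interfaces are stand-ins purely for illustration, not the RB or Intel driver API:

```cpp
// Schematic sketch of a polling, batched receive path over multiple RSS queues.
#include <cstddef>
#include <vector>

struct Packet;                        // opaque packet descriptor
constexpr std::size_t kBatch = 16;    // batch size from the slide (k = 16)

struct RxQueue {
    // Fill 'out' with up to 'max' received packets; return how many.
    // Stub here; a real driver would read the NIC's descriptor ring.
    std::size_t poll(Packet **out, std::size_t max) { (void)out; (void)max; return 0; }
};

void process_batch(Packet **pkts, std::size_t n) { (void)pkts; (void)n; }  // e.g., hand off to Click

void poll_loop(std::vector<RxQueue> &queues) {
    Packet *batch[kBatch];
    for (;;) {                                    // busy-poll: no interrupts
        for (auto &q : queues) {                  // one RSS rx queue per core in practice
            std::size_t n = q.poll(batch, kBatch);
            if (n > 0)
                process_batch(batch, n);          // amortize per-packet overheads over the batch
        }
    }
}
```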


23

Outline
Introduction
Approach: cluster-based router
RouteBricks implementation
  RB4 prototype
  Click overview
Performance results
  cluster scalability
  single server scalability
Next steps
24

Cluster Scalability
recall: VLB over a mesh (N ports, R bps per port)
  # servers: N
  internal fanout: N-1
  internal link capacity: 2R/(N-1)
  processing/server: 3R (2R for Direct-VLB)

25

Cluster Scalability
[Plot: cost in #servers (y) vs. number of ports (x), log-log scale from 1 to 10,000 on both axes, with the y = x line for reference]
10 Gbps ports; typical server fanout = 5 PCIe slots (2x 10G or 8x 1G ports/slot)

26

Cluster Scalability
[Plot: same axes; adds a curve for one server scaling to 20 Gbps with typical fanout]
10 Gbps ports; typical server fanout = 5 PCIe slots (2x 10G or 8x 1G ports/slot)

27

Cluster Scalability
[Plot: same axes; adds a curve for 20 Gbps per server with higher fanout]
10 Gbps ports; typical server fanout = 5 PCIe slots (2x 10G or 8x 1G ports/slot)

28

Cluster Scalability
[Plot: same axes; adds a curve for a server scaling to 40 Gbps with higher fanout]
10 Gbps ports; typical server fanout = 5 PCIe slots (2x 10G or 8x 1G ports/slot)

29

Cluster Scalability
[Plot: same cost-vs-ports chart as the previous slides]

Conclusions so far:
(1) a VLB-based server cluster scales well and is cost-effective
(2) feasible if a single server can scale to at least 20 Gbps (2R)

10 Gbps ports; typical server fanout = 5 PCIe slots (2x 10G or 8x 1G ports/slot)

30

Outline
Introduction
Approach: cluster-based router
RouteBricks implementation
  RB4 prototype
  Click overview
Performance results
  cluster scalability
  single server scalability
Next steps
31

RB4: software architecture


[Figure: same software stack as before: value-added services and hooks for new services in user space; packet processing (linecard), RB VLB, Click runtime, RB device driver, and NICs in the Linux 2.6.24 kernel; implemented in Click (unmodified)]

Tested 3 packet processing functions (so far):
1. simple forwarding (fwd)
2. IPv4 forwarding (rtr)
3. AES-128 encryption (ipsec)


32

Test Configuration
[Figure: traffic generation server -> test server -> traffic sink; the test server runs packet processing on the Click runtime and RB device driver, with two NICs]

packet processing functions:
  simple forwarding (no header processing; ~ bridging)
  IPv4 routing (longest-prefix destination lookup, 256K-entry routing table; see the sketch below)
  AES-128 packet encryption

test traffic:
  fixed-size packets (64B-1024B)
  abilene: a real-world packet trace from the Abilene/Internet2 backbone
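Longest-prefix match is the core of the IPv4 routing workload. The deliberately naive sketch below only illustrates the matching rule; RB4 uses Click's lookup elements, not this code, and the FibEntry/lookup names are made up for the example:

```cpp
// Naive longest-prefix match over an IPv4 FIB, for illustration only.
#include <cstdint>
#include <optional>
#include <vector>

struct FibEntry {
    uint32_t prefix;     // prefix as a host-order 32-bit value, for simplicity
    uint8_t  len;        // prefix length, 0..32
    uint16_t next_hop;   // index of the output port / next hop
};

std::optional<uint16_t> lookup(const std::vector<FibEntry> &fib, uint32_t dst) {
    int best_len = -1;
    std::optional<uint16_t> best;
    for (const auto &e : fib) {
        uint32_t mask = e.len == 0 ? 0 : ~uint32_t(0) << (32 - e.len);
        if ((dst & mask) == (e.prefix & mask) && int(e.len) > best_len) {
            best_len = e.len;            // keep the longest matching prefix
            best = e.next_hop;
        }
    }
    return best;                         // empty if no route matched
}
```

With a 256K-entry table, a real implementation would use a multibit trie or a DIR-24-8-style table rather than this linear scan; the scan just makes the matching rule explicit.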

33

Performance versus packet size


Performance for simple forwarding under different input traffic workloads; results in bits per second (top) and packets per second (bottom)

In all our tests, the real-world Abilene and 1024B packet workloads achieve similar performance; hence, from here on, we only consider the two extreme traffic workloads: 64B and 1024B packets.

34

Performance with different packet processing functions (64, 1KB pkts)

[Chart: throughput for simple forwarding, IPv4 forwarding, and encrypted forwarding, with 64B and 1024B packets]

Simple forwarding and IPv4 forwarding for (realistic) traffic workloads with larger packets achieve ~25 Gbps; this is limited by traffic generation, due to the number of PCIe slots.
Encryption is CPU-limited.

35

Memory Loading
64B workload, NHM
"nom" and "benchmark" represent upper bounds on available memory bandwidth, normalized by packet rate to compare with the actual applications; "nom" is based on nominal rated capacity, while "benchmark" refers to the load observed empirically with a stream-like read/write random-access workload.

[Chart: per-packet memory load vs. packet rate (Mpps)]

All applications are well below estimated upper bounds. Per-packet memory load is constant as a function of packet rate.
36

QuickPath (inter-socket) Loading


64B workload, NHM; "benchmark" refers to the maximum load on the inter-socket QuickPath link under a stream-like workload

All applications are well below estimated upper bound. Per-packet inter-socket load is constant versus packet rate.
37

QuickPath (I/O) Loading


64B workload, NHM; "benchmark" refers to the maximum load on the I/O QuickPath link that we have been able to generate with a NIC

All applications are well below estimated upper bound. Per-packet I/O load is constant versus packet rate.
38

Per-packet load on CPU


64B workload, NHM

application          instr/pkt (CPI)
simple forwarding    1,033 (1.19)
IPv4 forwarding      1,595 (1.01)
encryption           14,221 (0.55)

[Chart: CPU saturation]

All applications reach CPU cycles upper bound. CPU load is (fairly) constant as a function of packet rate.
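A rough cycle-budget check of these numbers (my own arithmetic, using the 8x 2.8 GHz cores listed on the RB4 hardware slide, and ignoring any overheads not captured in instr/pkt):

```latex
% Simple forwarding: 1{,}033 instructions/packet at CPI 1.19
\begin{align*}
  \text{cycles/packet} &\approx 1033 \times 1.19 \approx 1230 \\
  \text{packets/s per core} &\approx \frac{2.8\times10^{9}}{1230} \approx 2.3\ \text{Mpps} \\
  \text{packets/s per server (8 cores)} &\approx 18\ \text{Mpps}
  \;\approx\; 9.3\ \text{Gbps of 64\,B packets}
\end{align*}
```

This is consistent with the next slide's conclusion that CPUs are the bottleneck for 64B packet workloads.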
39

Single server scalability


Key results:
(1) NHM server performance is sufficient to enable VLB clustering for realistic input traffic
(2) falls short for worst-case traffic
(3) CPUs are the bottleneck for 64B packet workloads
(4) scaling: constant per-packet load with increasing packet rate

40

Outline
Introduction
Approach: cluster-based router
RouteBricks implementation
  RB4 prototype
  Click overview
Performance results
  cluster scalability
  single server scalability
Next steps
41

Next Steps
RB prototype
control plane
additional packet processing functions
new hardware when available
management interface
reliability / robustness improvements

power, packaging

42

Thanks
Also: see paper in SOSP 2009

http://routebricks.org

43

Backups

44

Click on multicore
Each core (or HW thread) runs one instance of Click
instance is statically scheduled and pinned to the core
best performance when one core handles the entire data flow of a packet

Click runs internal scheduler to decide which element to run
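For the user-level case, pinning a thread to a core looks roughly like the sketch below (a generic Linux sketch, not RouteBricks code; in-kernel Click instances are pinned via the kernel's own scheduling machinery instead):

```cpp
// Sketch: pin the calling thread to one core (Linux; pthread_setaffinity_np
// is a GNU extension, defined by default when compiling with g++).
// Illustrates the "one Click instance pinned per core" idea.
#include <pthread.h>
#include <sched.h>

bool pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}
```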

45

More generally
Different topologies offer tradeoffs between:
per-server packet processing capability (2R, 4R, 6R, ...)
per-server fanout (#slots/server, ports/slot)
number of servers required

For example

46

4-port vs. 8-port VLB mesh, 10 Gbps ports

[Figure: 4-port mesh with 10 Gbps external ports and 5 Gbps internal links; 8-port mesh with 10 Gbps external ports and 2.5 Gbps internal links]

4-port mesh: each server has internal fanout = 3; each server runs at avg. 20 Gbps
8-port mesh: each server has internal fanout = 7; each server runs at avg. 20 Gbps

47

8-port VLB mesh, server @ 20 Gbps

[Figure: two configurations; labels: 10 Gbps external ports, 2.5 Gbps and 5 Gbps internal links, 40 Gbps per server]

Each server has internal fanout = 7; each server runs at avg. 20 Gbps
Each server has internal fanout = 3; each server runs at avg. 40 Gbps

48

8-port VLB mesh, server @ 20 Gbps

[Figure: labels: 1000, 10 Gbps, 2.5 Gbps]

Each server has internal fanout = 7; each server runs at avg. 20 Gbps
And each server has a max internal fanout = 32 (1 Gbps ports)

49

8-port VLB mesh, server @ 20 Gbps

[Figure: 1000-port configuration; labels: 10 Gbps, 40 Gbps, 1000]

1000 servers, each with a 10 Gbps external port
Plus (lg32(1000) - 1) * 1000 servers, interconnected by a 32-ary 1000-fly topology (total 2000 servers)
Each server has fanout = 32
Each internal link runs at 0.625 Gbps (= 2*10/32)
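Checking the arithmetic on this slide (my own working of the figures quoted above):

```latex
\begin{align*}
  \lceil \log_{32} 1000 \rceil &= 2,\quad
    (2-1)\times 1000 = 1000\ \text{additional interior servers}
    \;\Rightarrow\; 2000\ \text{servers total} \\
  \text{per internal link} &= \frac{2 \times 10\ \text{Gbps}}{32} = 0.625\ \text{Gbps}
\end{align*}
```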
50

More generally
Different topologies offer tradeoffs between:
per-server forwarding capability (an input, for us)
per-server fanout (#slots/server, ports/slot) (an input, for us)
number of servers required (dominates router cost)

51
