RouteBricks: Scaling Software Routers With Modern Servers
Project Participants
Intel Labs
Gianluca Iannaccone (co-PI, researcher)
Sylvia Ratnasamy (co-PI, researcher)
Kevin Fall (principal engineer)
Allan Knies (principal engineer)
Maziar Manesh (research engineer)
Eddie Kohler (Click expert)
Dan Dahle (tech strategy)
Badarinath Kommandur (tech strategy)
Outline
Introduction
Approach: cluster-based router
RouteBricks implementation
Performance results
Next steps
RouteBricks: in a nutshell
A high-speed router using IA server components
fully programmable: control and data plane
extensible: evolve networks via software upgrade
incrementally scalable: flat cost per bit
Motivation
Network infrastructure is doing more than ever before
Packet-pushing (routing) no longer the whole story
security, data loss protection, application optimization, etc.
This has led to a proliferation of special-purpose appliances, and to the notion that perhaps routers could do more
  Cisco and Juniper supporting open APIs
  OpenFlow consortium: Stanford, HP, Broadcom, Cisco
Motivation
If flexibility ultimately implies programmability...
Hard to beat IA platforms and their ecosystem
Or their price
Motivation
Combine a desire for more programmability... with new router-friendly server trends
a new opportunity for IA servers?
Challenge
traditional software routers
  research prototypes (2007): 1-2 Gbps
  Vyatta* datasheet (2009): 2-4 Gbps
current carrier-grade routers
  line speeds: 10/40 Gbps
  aggregate switching speeds: 40 Gbps to 92 Tbps!
* Other names and brands may be claimed as properties of others
Strategy
1. A cluster-based router architecture
each server need only scale to line speeds (10-40 Gbps), rather than aggregate speeds (40 Gbps to 92 Tbps)
2. Understand whether modern server architectures can scale to line speeds (10-40Gbps)
if not, why?
Broader Benefits
1. infrastructure that is well-known and cheaper to evolve
familiar programming environment
separately evolvable network software and hardware
reduced cost -> more frequent upgrade opportunities
Outline
Introduction
Approach: cluster-based router
RouteBricks implementation
Performance results
Next steps
[Figure: a router today: N linecards, each attached to an R bps port (R in each direction) and running at R bps; each linecard implements the per-port functions (address tables, FIB, ACLs, IP address lookup, shaping, queue management); a switch fabric running at N x R interconnects the linecards]
Moving to a cluster-router
control processor (runs IOS/quagga/xorp, etc.)
switch scheduler
switch fabric
linecards (addr tables, FIB, ACLs, IP address lookup, shaping, queue mgmt), one per R bps port
Moving to a cluster-router
control processor (runs IOS/quagga/xorp, etc.)
switch scheduler
switch fabric
step 1: replace each linecard with a server, one per R bps port (#1 ... N)
Moving to a cluster-router
control processor (runs IOS/quagga/xorp, etc.)
switch scheduler
switch fabric
R bps ports #1 ... N
step 2: replace the switch fabric and scheduler with a distributed, software-based solution
Moving to a cluster-router
control processor (runs IOS/quagga/xorp, etc.)
R bps ports #1 ... N
step 2: replace the switch fabric and scheduler with a distributed, software-based solution
The resulting cluster: N servers, each with one external R bps port, directly interconnected and load-balanced with VLB
  internal fanout: N-1
  internal link capacity: NR / [N(N-1)/2] = 2R/(N-1)
  processing/server: out + in + through
N servers can achieve switching speeds of N x R bps, provided each server can process packets at 3R (2R for the Direct-VLB average case)
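Writing out the figure's annotations as formulas (this just restates the slide's numbers for a full mesh of N servers, each terminating one external port of rate R; no assumptions beyond that):

% Total external traffic N*R is spread over the N(N-1)/2 internal links of the full mesh:
\[
  C_{\text{link}} \;=\; \frac{NR}{N(N-1)/2} \;=\; \frac{2R}{N-1},
  \qquad \text{internal fanout} = N-1 .
\]
% Each server handles traffic entering its own port, traffic leaving its own port,
% and traffic it relays as a VLB intermediate:
\[
  C_{\text{server}} \;=\; \underbrace{R}_{\text{in}} + \underbrace{R}_{\text{out}} + \underbrace{R}_{\text{through}} \;=\; 3R .
\]

With Direct-VLB, traffic that is already uniform goes to its destination server in a single internal hop, which removes most of the through term and brings the average requirement down to roughly 2R.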
Outline
Introduction
Approach: cluster-based router
RouteBricks implementation
  RB4 prototype
  Click overview
RB4 prototype
4 dual-socket NHM-EP servers
  8x 2.8 GHz cores (no SMT)
  8 MB L3 cache
  6x 1 GB DDR3
  2 PCIe 2.0 slots (8 lanes)
  default BIOS settings
RB data plane
[Figure: RB data plane architecture: packet processing (linecard) and RB VLB components, with the user-space boundary marked]
Click Overview
Modular, extensible software router
built on Linux as a kernel module
combines versatility and high performance
Architecture consists of
  elements that implement packet processing functions
  a configuration language that connects elements into a packet data flow
  an internal scheduler that decides which element to run
Large open source library (200+ elements) means new routing applications can often be written with just a configuration script
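To make the last point concrete, here is a minimal illustrative configuration in the Click language. It is a sketch only: the element classes (FromDevice, Strip, CheckIPHeader, StaticIPLookup, DecIPTTL, EtherEncap, Queue, ToDevice, Discard) come from the public Click distribution, but the device names, MAC addresses, route entries, and queue size are placeholders, not the RB4 configuration.

// Minimal IPv4 forwarding sketch in the Click configuration language.
// Device names, addresses, and routes below are placeholders.
rt :: StaticIPLookup(10.0.0.0/24 0,    // this prefix -> output 0 (forward)
                     0.0.0.0/0 1);     // everything else -> output 1 (drop)

FromDevice(eth0)
    -> Strip(14)          // remove the Ethernet header
    -> CheckIPHeader      // validate the IP header
    -> rt;

rt[0] -> DecIPTTL         // decrement TTL, update checksum
      -> EtherEncap(0x0800, 00:15:17:00:00:01, 00:15:17:00:00:02)
      -> Queue(1024)
      -> ToDevice(eth1);

rt[1] -> Discard;

The userlevel click driver takes such a file as its argument; the kernel module is driven by the same configuration language.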
[Figure: per-server data plane: packet processing (linecard) and RB VLB components above four NICs, with the user-space boundary marked]
Intel 10G driver
  polling-only operation (no interrupts)
  transfers packets to memory in batches of k (we use k=16)
  RSS with up to 32/64 rx/tx NIC queues
Outline
Introduction
Approach: cluster-based router
RouteBricks implementation
  RB4 prototype
  Click overview
Performance results
  cluster scalability
  single server scalability
Next steps
Cluster Scalability
[Figure: full mesh of N servers (#1, 2, ... N), each with one external R bps port]
internal fanout: N-1
internal link capacity: 2R/(N-1)
processing/server: 3R (2R with Direct-VLB)
Cluster Scalability
[Plot: cost in #servers (log scale, 10-10000) vs. number of ports, with the y=x line for reference; assumes 10 Gbps ports and a typical server fanout of 5 PCIe slots (2x10G or 8x1G ports/slot)]
Cluster Scalability
[Plot: cost in #servers vs. number of ports, adding the curve for a single server that scales to 20 Gbps with typical fanout; same assumptions: 10 Gbps ports, typical server fanout of 5 PCIe slots (2x10G or 8x1G ports/slot)]
Cluster Scalability
[Plot: cost in #servers vs. number of ports, with curves for a single server scaling to 20 Gbps at typical fanout and at higher fanout; assumes 10 Gbps ports and a typical server fanout of 5 PCIe slots (2x10G or 8x1G ports/slot)]
Conclusions so far:
(1) a VLB-based server cluster scales well and is cost-effective
(2) feasible if a single server can scale to at least 20 Gbps (2R)
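As an illustrative back-of-the-envelope check of conclusion (2), using only the assumptions stated in the plot caption (R = 10 Gbps external ports, one external port per server, and the remaining four slots filled with 8x1G NICs); the numbers are mine, not from the deck:

% Illustrative only: one external 10 Gbps port per server leaves
% 4 slots x 8 x 1 Gbps = 32 internal links, so a full mesh fits while N - 1 <= 32.
\[
  N \le 33, \qquad
  C_{\text{link}} = \frac{2R}{N-1} = \frac{20\ \text{Gbps}}{32} \approx 0.63\ \text{Gbps} \le 1\ \text{Gbps}, \qquad
  C_{\text{server}} \approx 2R = 20\ \text{Gbps}.
\]

So, under these assumptions, a full-mesh cluster of up to roughly 33 ports is feasible exactly when one server sustains about 2R = 20 Gbps; larger port counts need higher fanout or a different interconnect topology (see the backup slides).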
Outline
Introduction
Approach: cluster-based router
RouteBricks implementation
  RB4 prototype
  Click overview
Performance results
  cluster scalability
  single server scalability
Next steps
[Figure: per-server data plane: packet processing (linecard) and RB VLB components above four NICs, with the user-space boundary marked]
Tested 3 packet processing functions (so far):
  1. simple forwarding (fwd)
  2. IPv4 forwarding (rtr)
  3. AES-128 encryption (ipsec)
Test Configuration
[Figure: a traffic generation server sends test traffic to the test server running the packet processing under test; forwarded packets go to a traffic sink]
test traffic:
  fixed-size packets (64B-1024B)
  abilene: real-world packet trace from the Abilene/Internet2 backbone
In all our tests, the real-world Abilene and 1024B packet workloads achieve similar performance; hence, from here on, we only consider two extreme traffic workloads: 64B and 1024B packets.
Simple forwarding and IPv4 forwarding achieve ~25 Gbps for (realistic) traffic workloads with larger packets; this is limited by traffic generation due to the number of PCIe slots. Encryption is CPU limited.
[Plots: single-server throughput for Simple Forwarding, IPv4 Forwarding, and Encrypted Forwarding]
Memory Loading
[Plot: per-packet memory load vs. packet rate, 64B workload, NHM]
"nom" and "benchmark" represent upper bounds on the available memory bandwidth, normalized by packet rate so they can be compared with the actual applications: nom is based on the nominal rated capacity; benchmark is the load observed empirically under a stream-like read/write random-access workload.
All applications are well below the estimated upper bounds. Per-packet memory load is constant as a function of packet rate.
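For clarity, the normalization used in this and the following plots can be read as follows (my restatement of "normalized by packet rate", not a formula from the deck):

% Per-packet load: measured (or bounding) bandwidth divided by the packet rate.
\[
  \text{per-packet load} \;=\; \frac{\text{bandwidth on the resource of interest (bytes/s)}}{\text{packet rate (packets/s)}}
  \qquad [\text{bytes per packet}]
\]

A flat curve therefore means the cost per packet does not grow as the server approaches its peak forwarding rate.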
[Plot: per-packet inter-socket load vs. packet rate]
All applications are well below the estimated upper bound. Per-packet inter-socket load is constant versus packet rate.
[Plot: per-packet I/O load vs. packet rate]
All applications are well below the estimated upper bound. Per-packet I/O load is constant versus packet rate.
CPU Saturation
[Plot: per-packet CPU load vs. packet rate]
All applications reach the upper bound on CPU cycles. Per-packet CPU load is (fairly) constant as a function of packet rate.
Outline
Introduction
Approach: cluster-based router
RouteBricks implementation
  RB4 prototype
  Click overview
Performance results
  cluster scalability
  single server scalability
Next steps
Next Steps
RB prototype
control plane
additional packet processing functions
new hardware when available
management interface
reliability / robustness improvements
power
packaging
Thanks
Also: see our paper at SOSP 2009
http://routebricks.org
Backups
Click on multicore
Each core (or HW thread) runs one instance of Click
each instance is statically scheduled and pinned to its core
best performance when one core handles the entire data flow of a packet
More generally
Different topologies offer tradeoffs between:
per-server packet processing capability (2R, 4R, 6R, ...)
per-server fanout (#slots/server, ports/slot)
number of servers required
For example:
[Figures: example topologies for 5 Gbps, 10 Gbps, and 40 Gbps external ports, annotated with the resulting per-server requirements]
Each server has internal fanout = 3 and runs at avg. 20 Gbps
Each server has internal fanout = 7 and runs at avg. 20 Gbps
Each server has internal fanout = 3 and runs at avg. 40 Gbps
More generally
Different topologies offer tradeoffs between:
per-server forwarding capability
per-server fanout (#slots/server, ports/slot)
number of servers required