
RouterBricks

Scaling Software Routers with Modern Servers


Kevin Fall, Intel Labs Berkeley
Feb 24, 2010
Ericsson, San Jose, CA

Project Participants
Intel Labs
Gianluca Iannaccone (co-PI, researcher)
Sylvia Ratnasamy (co-PI, researcher)
Kevin Fall (principal engineer)
Allan Knies (principal engineer)
Maziar Manesh (research engineer)
Eddie Kohler (Click expert)
Dan Dahle (tech strategy)
Badarinath Kommandur (tech strategy)

École Polytechnique Fédérale de Lausanne (EPFL), Switzerland


Katerina Argyraki (faculty)
Mihai Dobrescu (student)
Diaqing Chu (student)
2

Outline
Introduction
Approach: cluster-based router
RouteBricks implementation
Performance results
Next steps

RouterBricks: in a nutshell
A high-speed router using IA server components
fully programmable: control and data plane
extensible: evolve networks via software upgrade
incrementally scalable: flat cost per bit

Motivation
Network infrastructure is doing more than ever before
Packet-pushing (routing) no longer the whole story
security, data loss protection, application optimization, etc.

This has led to a proliferation of special appliances, and to the notion that perhaps routers could do more
Cisco, Juniper supporting open APIs
OpenFlow consortium: Stanford, HP, Broadcom, Cisco

But these platforms weren't born programmable


5

Motivation
If flexibility ultimately implies programmability...
Hard to beat IA platforms and their ecosystem
Or their price

However, must deal with persistent folklore:

IA can't do high-speed packet processing

But today's IA isn't the IA you know from your youth
multicore, multiple integrated memory controllers, PCIe, multi-queue NICs, ...

Motivation
Combine a desire for more programmability... with new, router-friendly server trends
a new opportunity for IA servers?

RouterBricks: how might we build a big (~1 Tbps) IA-based software router?
7

Challenge
traditional software routers
research prototypes (2007): 1-2 Gbps
Vyatta* datasheet (2009): 2-4 Gbps

current carrier-grade routers
line speeds: 10/40 Gbps
aggregate switching speeds: 40 Gbps to 92 Tbps!

8
* Other names and brands may be claimed as properties of others

Strategy
1. A cluster-based router architecture
each server need only scale to line speeds (10-40 Gbps), rather than aggregate speeds (40 Gbps to 92 Tbps)

2. Understand whether modern server architectures can scale to line speeds (10-40Gbps)
if not, why?

3. Leverage open-source control plane implementations


xorp, quagga, etc. [but we focus on data plane here]

Broader Benefits
1. infrastructure that is well-known and cheaper to evolve
familiar programming environment
separately-evolvable network software and hardware
reduced cost -> more frequent upgrade opportunity

2. networks with the benefits of the PC ecosystem


high-volume manufacturing
widespread supply/support
state-of-the-art process technologies (ride Moore's Law)
evolving PC platform features (power mgmt, crypto, etc.)
10

Outline
Introduction
Approach: cluster-based router
RouteBricks implementation
Performance results
Next steps

11

Traditional router architecture

[Figure: a router with N ports (#1 through N), each port running at R bps in each direction]

12

Traditional router architecture


[Figure: traditional router architecture]
control processor (runs IOS/quagga/xorp, etc.): runs at << R bps
switch scheduler and switch fabric: run at N*R
per-port linecards, each running at R bps: addr tables, FIB, ACLs; IP address lookup; queue mgmt, shaping, etc.
R bps ports #1, 2, 3, ... N

13

Moving to a cluster-router
[Figure: same traditional router architecture: control processor, switch scheduler, switch fabric, and per-port linecards (addr tables, FIB, ACLs; IP address lookup; queue mgmt, shaping), R bps ports #1 through N]

step 1: a single server implements one port; N ports -> N servers

14

Moving to a cluster-router
[Figure: the per-port linecard (addr tables, FIB, ACLs; IP address lookup; queue mgmt, shaping) is now implemented in software on a server; the control processor, switch scheduler, and switch fabric remain]

Each server must process at least 2R of traffic (in + out)
R bps ports #1, 2, ... N

step 1: a single server implements one port; N ports -> N servers

15

Moving to a cluster-router
[Figure: servers implement the ports; the control processor, switch scheduler, and switch fabric remain]
R bps ports #1, 2, ... N

step 2: replace the switch fabric and scheduler with a distributed, software-based solution

16

Moving to a cluster-router
control processor (runs IOS/quagga/xorp, etc)

distributed scheduling algorithms, based on Valiant Load Balancing (VLB)

server-to-server interconnect topology

R bps ports #1, 2, ... N

step 2: replace switch fabric and scheduler with a distributed, software-based solution

17

Example: VLB over a mesh*


* other topologies offer different tradeoffs
[each direction]

R bps

N ports, R bps port rate


# servers N N-1 2R N-1 3R (2R)*

2
N
internal fanout internal link capacity (RN/[N(N-1)/2]) processing/server
[out+in+through]

N servers can achieve switching speeds of N R bps, provided each server can process packets at 3R (*2R for Direct-VLB avg case) 18
18
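As a sanity check on these numbers, here is my own back-of-envelope working of the VLB argument above (not taken verbatim from the deck):

```latex
% VLB over a full mesh of N servers, one external port of R bps per server.
% Each packet crosses at most two internal hops (ingress -> intermediate -> egress),
% so total internal traffic is at most 2NR, spread over N(N-1) link directions.
\begin{align*}
  \text{internal fanout}        &= N - 1 \\
  \text{internal link capacity} &= \frac{2NR}{N(N-1)} = \frac{2R}{N-1} \\
  \text{processing per server}  &= \underbrace{R}_{\text{in}} + \underbrace{R}_{\text{out}}
                                   + \underbrace{R}_{\text{through}} = 3R
  \qquad (\approx 2R \text{ on average with Direct-VLB})
\end{align*}
```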

Outline
Introduction
Approach: cluster-based router
RouteBricks implementation
  RB4 prototype
  Click overview
Performance results
Next steps

19

RB4: hardware architecture


[Figure: four servers in a full mesh, each with a 10 Gbps external port]

4 dual-socket NHM-EP servers
  8x 2.8 GHz cores (no SMT)
  8 MB L3 cache
  6x 1 GB DDR3
  2 PCIe 2.0 slots (8 lanes)
  default BIOS settings

2x 10 Gbps Oplin cards per server
  dual-port, PCIe 1.1 (now using Niantic / PCIe 2.0)

20

RB4: software architecture


[Figure: software stack on each server, with 10 Gbps ports]
user space: hooks for new services; place for value-added services (e.g., monitoring, energy proxy, management, etc.)
Linux 2.6.24 kernel: RB data plane = packet processing (linecard) + RB VLB, implemented in Click (unmodified), running on the Click runtime and the RB device driver over the NICs


21

Click Overview
Modular, extensible software router
built on Linux as a kernel module
combines versatility and high performance

Architecture consists of:
elements that implement packet processing functions
a configuration language that connects elements into a packet data flow
an internal scheduler that decides which element to run

Large open-source library (200+ elements) means new routing applications can often be written with just a configuration script (see the element sketch below)
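To make the element/configuration split concrete, here is a minimal pass-through element sketched from the public Click element API; the PktCounter name and the config fragment in the comment are illustrative, not part of RouteBricks, and API details are approximate:

```cpp
// Minimal Click element sketch: counts packets and passes them through.
#include <click/config.h>
#include <click/element.hh>
CLICK_DECLS

class PktCounter : public Element {
public:
    PktCounter() : _count(0) {}
    const char *class_name() const { return "PktCounter"; }
    const char *port_count() const { return PORTS_1_1; }   // one input, one output

    Packet *simple_action(Packet *p) {   // invoked for each packet flowing through
        _count++;
        return p;                        // forward the packet unchanged
    }

private:
    unsigned long long _count;
};

CLICK_ENDDECLS
EXPORT_ELEMENT(PktCounter)

// Wired up from a Click configuration script, e.g.:
//   FromDevice(eth0) -> PktCounter -> ToDevice(eth1)
```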

slide material courtesy E.Kohler, UCLA

22

RB4: software architecture


[Figure: same software stack: value-added services and hooks for new services in user space; packet processing (linecard), RB VLB, Click runtime, RB device driver, and NICs in the Linux 2.6.24 kernel; implemented in Click (unmodified)]

Intel 10G driver (sketched schematically below):
  polling-only operation (no interrupts)
  transfers packets to memory in batches of k (we use k = 16)
  RSS with up to 32/64 rx/tx NIC queues
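A schematic of what this batched, polling-only receive path looks like; the RxQueue and process_batch interfaces are stand-ins purely for illustration, not the RB or Intel driver API:

```cpp
// Schematic sketch of a polling, batched receive path over multiple RSS queues.
#include <cstddef>
#include <vector>

struct Packet;                        // opaque packet descriptor
constexpr std::size_t kBatch = 16;    // batch size from the slide (k = 16)

struct RxQueue {
    // Fill 'out' with up to 'max' received packets; return how many.
    // Stub here; a real driver would read the NIC's descriptor ring.
    std::size_t poll(Packet **out, std::size_t max) { (void)out; (void)max; return 0; }
};

void process_batch(Packet **pkts, std::size_t n) { (void)pkts; (void)n; }  // e.g., hand off to Click

void poll_loop(std::vector<RxQueue> &queues) {
    Packet *batch[kBatch];
    for (;;) {                                    // busy-poll: no interrupts
        for (auto &q : queues) {                  // one RSS rx queue per core in practice
            std::size_t n = q.poll(batch, kBatch);
            if (n > 0)
                process_batch(batch, n);          // amortize per-packet overheads over the batch
        }
    }
}
```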


23

Outline
Introduction
Approach: cluster-based router
RouteBricks implementation
  RB4 prototype
  Click overview
Performance results
  cluster scalability
  single server scalability
Next steps
24

Cluster Scalability
recall: VLB over a mesh (N ports, R bps per port)
  # servers: N
  internal fanout: N-1
  internal link capacity: 2R/(N-1)
  processing/server: 3R (2R for Direct-VLB)

25

Cluster Scalability
[Plot: cost in #servers (y) vs. number of ports (x), log-log scale from 1 to 10,000 on both axes, with the y = x line for reference]
10 Gbps ports; typical server fanout = 5 PCIe slots (2x 10G or 8x 1G ports/slot)

26

Cluster Scalability
[Plot: same axes; adds a curve for one server scaling to 20 Gbps with typical fanout]
10 Gbps ports; typical server fanout = 5 PCIe slots (2x 10G or 8x 1G ports/slot)

27

Cluster Scalability
[Plot: same axes; adds a curve for 20 Gbps per server with higher fanout]
10 Gbps ports; typical server fanout = 5 PCIe slots (2x 10G or 8x 1G ports/slot)

28

Cluster Scalability
[Plot: same axes; adds a curve for a server scaling to 40 Gbps with higher fanout]
10 Gbps ports; typical server fanout = 5 PCIe slots (2x 10G or 8x 1G ports/slot)

29

Cluster Scalability
[Plot: same cost-vs-ports chart as the previous slides]

Conclusions so far:
(1) a VLB-based server cluster scales well and is cost-effective
(2) feasible if a single server can scale to at least 20 Gbps (2R)

10 Gbps ports; typical server fanout = 5 PCIe slots (2x 10G or 8x 1G ports/slot)

30

Outline
Introduction
Approach: cluster-based router
RouteBricks implementation
  RB4 prototype
  Click overview
Performance results
  cluster scalability
  single server scalability
Next steps
31

RB4: software architecture


[Figure: same software stack as before: value-added services and hooks for new services in user space; packet processing (linecard), RB VLB, Click runtime, RB device driver, and NICs in the Linux 2.6.24 kernel; implemented in Click (unmodified)]

Tested 3 packet processing functions (so far):
1. simple forwarding (fwd)
2. IPv4 forwarding (rtr)
3. AES-128 encryption (ipsec)


32

Test Configuration
[Figure: traffic generation server -> test server -> traffic sink; the test server runs packet processing on the Click runtime and RB device driver, with two NICs]

packet processing functions:
  simple forwarding (no header processing; ~ bridging)
  IPv4 routing (longest-prefix destination lookup, 256K-entry routing table; see the sketch below)
  AES-128 packet encryption

test traffic:
  fixed-size packets (64B-1024B)
  abilene: a real-world packet trace from the Abilene/Internet2 backbone
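Longest-prefix match is the core of the IPv4 routing workload. The deliberately naive sketch below only illustrates the matching rule; RB4 uses Click's lookup elements, not this code, and the FibEntry/lookup names are made up for the example:

```cpp
// Naive longest-prefix match over an IPv4 FIB, for illustration only.
#include <cstdint>
#include <optional>
#include <vector>

struct FibEntry {
    uint32_t prefix;     // prefix as a host-order 32-bit value, for simplicity
    uint8_t  len;        // prefix length, 0..32
    uint16_t next_hop;   // index of the output port / next hop
};

std::optional<uint16_t> lookup(const std::vector<FibEntry> &fib, uint32_t dst) {
    int best_len = -1;
    std::optional<uint16_t> best;
    for (const auto &e : fib) {
        uint32_t mask = e.len == 0 ? 0 : ~uint32_t(0) << (32 - e.len);
        if ((dst & mask) == (e.prefix & mask) && int(e.len) > best_len) {
            best_len = e.len;            // keep the longest matching prefix
            best = e.next_hop;
        }
    }
    return best;                         // empty if no route matched
}
```

With a 256K-entry table, a real implementation would use a multibit trie or a DIR-24-8-style table rather than this linear scan; the scan just makes the matching rule explicit.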

33

Performance versus packet size


Performance for simple forwarding under different input traffic workloads; results in bits per second (top) and packets per second (bottom)

In all our tests, the real-world Abilene and 1024B packet workloads achieve similar performance; hence, from here on, we only consider the two extreme traffic workloads: 64B and 1024B packets.

34

Performance with different packet processing functions (64, 1KB pkts)

[Chart: throughput for simple forwarding, IPv4 forwarding, and encrypted forwarding, with 64B and 1024B packets]

Simple forwarding and IPv4 forwarding for (realistic) traffic workloads with larger packets achieve ~25 Gbps; this is limited by traffic generation, due to the number of PCIe slots.
Encryption is CPU-limited.

35

Memory Loading
64B workload, NHM
"nom" and "benchmark" represent upper bounds on available memory bandwidth, normalized by packet rate to compare with the actual applications; "nom" is based on nominal rated capacity, while "benchmark" refers to the load observed empirically with a stream-like read/write random-access workload.

[Chart: per-packet memory load vs. packet rate (Mpps)]

All applications are well below estimated upper bounds. Per-packet memory load is constant as a function of packet rate.
36

QuickPath (inter-socket) Loading


64B workload, NHM; "benchmark" refers to the maximum load on the inter-socket QuickPath link under a stream-like workload

All applications are well below estimated upper bound. Per-packet inter-socket load is constant versus packet rate.
37

QuickPath (I/O) Loading


64B workload, NHM; "benchmark" refers to the maximum load on the I/O QuickPath link that we have been able to generate with a NIC

All applications are well below estimated upper bound. Per-packet I/O load is constant versus packet rate.
38

Per-packet load on CPU


64B workload, NHM

application          instr/pkt (CPI)
simple forwarding    1,033 (1.19)
IPv4 forwarding      1,595 (1.01)
encryption           14,221 (0.55)

[Chart: CPU saturation]

All applications reach CPU cycles upper bound. CPU load is (fairly) constant as a function of packet rate.
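A rough cycle-budget check of these numbers (my own arithmetic, using the 8x 2.8 GHz cores listed on the RB4 hardware slide, and ignoring any overheads not captured in instr/pkt):

```latex
% Simple forwarding: 1{,}033 instructions/packet at CPI 1.19
\begin{align*}
  \text{cycles/packet} &\approx 1033 \times 1.19 \approx 1230 \\
  \text{packets/s per core} &\approx \frac{2.8\times10^{9}}{1230} \approx 2.3\ \text{Mpps} \\
  \text{packets/s per server (8 cores)} &\approx 18\ \text{Mpps}
  \;\approx\; 9.3\ \text{Gbps of 64\,B packets}
\end{align*}
```

This is consistent with the next slide's conclusion that CPUs are the bottleneck for 64B packet workloads.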
39

Single server scalability


Key results:
(1) NHM server performance is sufficient to enable VLB clustering for realistic input traffic
(2) falls short for worst-case traffic
(3) CPUs are the bottleneck for 64B packet workloads
(4) scaling: constant per-packet load with increasing packet rate

40

Outline
Introduction
Approach: cluster-based router
RouteBricks implementation
  RB4 prototype
  Click overview
Performance results
  cluster scalability
  single server scalability
Next steps
41

Next Steps
RB prototype
control plane
additional packet processing functions
new hardware when available
management interface
reliability / robustness improvements

power, packaging

42

Thanks
Also: see paper in SOSP 2009

http://routebricks.org

43

Backups

44

Click on multicore
Each core (or HW thread) runs one instance of Click
instance is statically scheduled and pinned to the core
best performance when one core handles the entire data flow of a packet

Click runs internal scheduler to decide which element to run
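For the user-level case, pinning a thread to a core looks roughly like the sketch below (a generic Linux sketch, not RouteBricks code; in-kernel Click instances are pinned via the kernel's own scheduling machinery instead):

```cpp
// Sketch: pin the calling thread to one core (Linux; pthread_setaffinity_np
// is a GNU extension, defined by default when compiling with g++).
// Illustrates the "one Click instance pinned per core" idea.
#include <pthread.h>
#include <sched.h>

bool pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}
```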

45

More generally
Different topologies offer tradeoffs between:
per-server packet processing capability (2R, 4R, 6R, ...)
per-server fanout (#slots/server, ports/slot)
number of servers required

For example

46

4-port vs. 8-port VLB mesh, 10 Gbps ports

[Figure: 4-port mesh with 10 Gbps external ports and 5 Gbps internal links; 8-port mesh with 10 Gbps external ports and 2.5 Gbps internal links]

4-port mesh: each server has internal fanout = 3; each server runs at avg. 20 Gbps
8-port mesh: each server has internal fanout = 7; each server runs at avg. 20 Gbps

47

8-port VLB mesh, server @ 20 Gbps

[Figure: two configurations; labels: 10 Gbps external ports, 2.5 Gbps and 5 Gbps internal links, 40 Gbps per server]

Each server has internal fanout = 7; each server runs at avg. 20 Gbps
Each server has internal fanout = 3; each server runs at avg. 40 Gbps

48

8-port VLB mesh, server @ 20 Gbps

[Figure: labels: 1000, 10 Gbps, 2.5 Gbps]

Each server has internal fanout = 7; each server runs at avg. 20 Gbps
And each server has a max internal fanout = 32 (1 Gbps ports)

49

8-port VLB mesh, server @ 20 Gbps

[Figure: 1000-port configuration; labels: 10 Gbps, 40 Gbps, 1000]

1000 servers, each with a 10 Gbps external port
Plus (lg32(1000) - 1) * 1000 servers, interconnected by a 32-ary 1000-fly topology (total 2000 servers)
Each server has fanout = 32
Each internal link runs at 0.625 Gbps (= 2*10/32)
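Checking the arithmetic on this slide (my own working of the figures quoted above):

```latex
\begin{align*}
  \lceil \log_{32} 1000 \rceil &= 2,\quad
    (2-1)\times 1000 = 1000\ \text{additional interior servers}
    \;\Rightarrow\; 2000\ \text{servers total} \\
  \text{per internal link} &= \frac{2 \times 10\ \text{Gbps}}{32} = 0.625\ \text{Gbps}
\end{align*}
```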
50

More generally
Different topologies offer tradeoffs between:
per-server forwarding capability (an input, for us)
per-server fanout (#slots/server, ports/slot) (an input, for us)
number of servers required (dominates router cost)

51
