High Performance Switches and Routers: Theory and Practice: Sigcomm 99 August 30, 1999 Harvard University
High Performance Switching and Routing
Telecom Center Workshop: Sept 4, 97
[email protected] [email protected]
Tutorial Outline
• Introduction:
What is a Packet Switch?
• Packet Lookup and Classification:
Where does a packet go next?
• Switching Fabrics:
How does the packet get there?
• Output Scheduling:
When should the packet leave?
[Figure: Generic router datapath - control functions (admission control, congestion control, reservation control, routing) sit above the per-packet datapath (policing, switching, output scheduling); each input consults a forwarding table to make a forwarding decision.]
[Figure: Router architecture - a CPU with buffer memory connected over a shared bus to line interfaces (each with MAC and DMA), evolving to a switched backplane.]
[Figure: ATM lookup - the incoming VCI directly addresses a memory whose entry holds the outgoing (Port, VCI) and associated data.]
[Figure: Exact match using associative memory - a 48-bit search key is compared against all entries at once; on a hit, the matching entry's log2(N)-bit address selects the associated data.]
Disadvantages:
• Slow
• High power
• Small
• Expensive
[Figure: Exact match using hashing - the 48-bit search key is hashed down to a 16-bit memory address; the entry there is checked for a hit and holds the associated data.]
[Figure: Hashing with collision chains - a CRC-16 hash of the 48-bit search key selects one of the linked lists (#1, #2, #3, ...), which is walked until the key matches.]
ER = (1/2) [ 1 + 1 / (1 - (1 - 1/N)^M) ]

Where:
ER = Expected number of memory references
M = Number of memory addresses in table
N = Number of linked lists
Disadvantages
• Non-deterministic lookup time
• Inefficient use of memory
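The hashing-with-chains scheme above can be sketched as follows. This is a minimal illustration, not the slide's hardware design; the 48-bit keys, bucket count, and modulo hash (standing in for CRC-16) are assumptions:

```python
# Sketch of exact-match lookup via hashing with per-bucket chains.
# Hypothetical 48-bit MAC-style keys; N buckets of linked lists.

class ChainedHashTable:
    def __init__(self, num_buckets):
        self.num_buckets = num_buckets            # N linked lists
        self.buckets = [[] for _ in range(num_buckets)]

    def _hash(self, key):
        # Stand-in for the CRC-16 hashing function on the slide.
        return key % self.num_buckets

    def insert(self, key, data):
        self.buckets[self._hash(key)].append((key, data))

    def lookup(self, key):
        # Walk the chain: memory references grow with chain length,
        # hence the non-deterministic lookup time noted above.
        for k, data in self.buckets[self._hash(key)]:
            if k == key:
                return data                       # Hit
        return None                               # Miss

table = ChainedHashTable(num_buckets=1 << 16)     # 16-bit hash output
table.insert(0x0060973E4F11, "port 3")            # 48-bit key -> port
assert table.lookup(0x0060973E4F11) == "port 3"
assert table.lookup(0x0000DEADBEEF) is None
```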
[Figure: Binary trie over N entries - branch left on 0, right on 1; example prefixes 010 and 111.]
[Figure: Fast path - per-packet forwarding bypasses the CPU and buffer memory.]
IP Address Space
Classful: the space is divided into Class A, B, C, and D blocks. The routing table holds network addresses and lookup is an exact match, e.g. 212.17.9.4 matches the Class C entry 212.17.9.0 -> Port 4.
Classless (CIDR): prefixes of arbitrary length, e.g. 65/8, 128.9/16, 142.12/19, laid out on the address line from 0 to 2^32 - 1. The address 128.9.16.14 falls inside the 128.9/16 block of 2^16 addresses.
Copyright 1999. All Rights Reserved
IP Routers: CIDR
Prefixes can nest: 128.9/16 contains 128.9.16/20 and 128.9.176/20, and 128.9.16/20 in turn contains 128.9.19/24 and 128.9.25/24. An address such as 128.9.16.14 can match several prefixes; the router must find the most specific (longest) match.
Prefix        Port
65/8          3
128.9/16      5
128.9.16/20   2
128.9.19/24   7
128.9.25/24   10
128.9.176/20  1
142.12/19     3

Incoming packet 128.9.16.14: the longest matching prefix is 128.9.16/20, so the packet goes to Port 2.

Metrics for a lookup scheme:
• Lookup time
• Storage space
• Update time
• Preprocessing time
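The longest-prefix match this table illustrates can be sketched as a linear scan over the entries. This is the slowest possible scheme, shown only to pin down the semantics; real routers use tries or the hardware schemes below:

```python
import ipaddress

# Longest-prefix match over the slide's table by linear scan.
TABLE = [
    ("65.0.0.0/8", 3),
    ("128.9.0.0/16", 5),
    ("128.9.16.0/20", 2),
    ("128.9.19.0/24", 7),
    ("128.9.25.0/24", 10),
    ("128.9.176.0/20", 1),
    ("142.12.0.0/19", 3),
]

def longest_prefix_match(dst):
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, port in TABLE:
        net = ipaddress.ip_network(prefix)
        # Keep the matching prefix with the greatest length.
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, port)
    return best[1] if best else None

# 128.9.16.14 matches 128.9/16 and 128.9.16/20; the longest wins.
assert longest_prefix_match("128.9.16.14") == 2
assert longest_prefix_match("128.9.19.5") == 7
```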
• IPv6
• 128-bit destination address field
• Exact address architecture not yet known
Source: http://www.telstra.net/ops/bgptable.html
Ternary CAMs
Associative memory with a value and a mask per entry:

Value      Mask             Next Hop
10.0.0.0   255.0.0.0        R1
10.1.0.0   255.255.0.0      R2
10.1.1.0   255.255.255.0    R3
10.1.3.0   255.255.255.0    R4
10.1.3.1   255.255.255.255  R4

All entries are compared in parallel; a priority encoder selects the highest-priority (longest) matching entry.
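The TCAM's behavior can be sketched in software: every row's (value, mask) comparison happens in parallel in hardware, while here a scan in priority order plays the role of the priority encoder:

```python
# Behavioral sketch of a ternary CAM for the table above.

def ip_to_int(s):
    a, b, c, d = map(int, s.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

# Rows ordered by decreasing prefix length, so the first match
# (the priority encoder's output) is the longest match.
TCAM = [
    ("10.1.3.1", "255.255.255.255", "R4"),
    ("10.1.1.0", "255.255.255.0", "R3"),
    ("10.1.3.0", "255.255.255.0", "R4"),
    ("10.1.0.0", "255.255.0.0", "R2"),
    ("10.0.0.0", "255.0.0.0", "R1"),
]

def tcam_lookup(dst):
    key = ip_to_int(dst)
    for value, mask, next_hop in TCAM:    # parallel in real hardware
        m = ip_to_int(mask)
        if key & m == ip_to_int(value) & m:
            return next_hop               # priority encoder output
    return None

assert tcam_lookup("10.1.3.1") == "R4"
assert tcam_lookup("10.1.1.77") == "R3"
assert tcam_lookup("10.2.0.1") == "R1"
```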
[Figure: Binary search on prefix lengths - hash tables at levels, e.g. level 8 and level 29.]
33K entries: 1.4MB data structure at 1.2-2.2 Mpps [O(log W) hashed lookups].
Copyright 1999. All Rights Reserved 49
Compacting Forwarding Tables
Advantages:
• Extremely small data structure - can fit in cache.
Disadvantages:
• Scalability to larger tables?
• Updates are complex.
[Chart: Number of prefixes vs. prefix length - most prefixes are 24 bits or shorter.]
Routing Lookups in Hardware
Prefixes up to 24 bits: a first table with 2^24 = 16M entries, indexed directly by the top 24 bits of the destination address. For 142.19.6.14, the entry at index 142.19.6 has its flag bit set to 1 and holds the next hop directly.
Prefixes longer than 24 bits: for 128.3.72.44, the entry at index 128.3.72 has flag 0 and holds a pointer to a block in a second table; the low 8 bits of the address (44) are the offset into that block, whose entry holds the next hop.
[Figure: Generalization - a first table indexed by the top N bits handles prefixes up to N bits; prefixes longer than N bits are resolved in a second table indexed by the next M bits.]
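The two-table hardware lookup can be sketched as below. Dictionaries stand in for the 16M-entry and second-stage memories, and the example prefixes and next hops are hypothetical:

```python
# Sketch of the two-table hardware lookup: the first table is indexed
# by the top 24 bits; an entry either carries a next hop directly
# (flag 1) or points at a block of 256 second-table entries indexed
# by the low 8 bits (flag 0).

def ip_to_int(s):
    a, b, c, d = map(int, s.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

tbl24 = {}        # top-24-bits -> (flag, next hop or pointer)
tbl_long = {}     # (pointer, low-8-bit offset) -> next hop

# Prefix <= 24 bits: e.g. 142.19.6/24 -> next hop "A"
tbl24[ip_to_int("142.19.6.0") >> 8] = (1, "A")

# Prefix > 24 bits: e.g. 128.3.72.44/32 -> next hop "B" via pointer 0
tbl24[ip_to_int("128.3.72.0") >> 8] = (0, 0)
tbl_long[(0, 44)] = "B"

def lookup(dst):
    addr = ip_to_int(dst)
    flag, val = tbl24[addr >> 8]            # first memory access
    if flag == 1:
        return val                          # next hop held directly
    return tbl_long[(val, addr & 0xFF)]     # second access, offset = low 8 bits

assert lookup("142.19.6.14") == "A"
assert lookup("128.3.72.44") == "B"
```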
[Figure: Packet classification viewed geometrically - each rule, e.g. (144.24/16, 64/24), is a rectangle in the space of two header fields; a packet, e.g. (128.16.46.23, *), is a point, and the classifier returns the highest-priority rule (R1...R5) whose rectangle contains it.]
Scheme                 Pros                                  Cons
Sequential Evaluation  Small storage, scales well with       Slow classification rates
                       number of fields
Ternary CAMs           Single cycle classification           Cost, density, power consumption
Grid of Tries          Small storage requirements and fast   Not easily extendible to more
(Srinivasan et al      lookup rates for two fields.          than two fields.
[Sigcomm 98])          Suitable for big classifiers
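The "Sequential Evaluation" row can be made concrete with a short sketch: scan the rules in priority order and return the first one that every field matches. The two-field rules below are hypothetical:

```python
import ipaddress

# Sequential evaluation of a two-field classifier: linear scan in
# priority order; first fully matching rule wins. Simple and small,
# but lookup time grows with the number of rules.
RULES = [
    ("144.24.0.0/16", "64.0.0.0/24", "R3"),
    ("0.0.0.0/0", "0.0.0.0/0", "default"),   # catch-all rule
]

def classify(src, dst):
    s, d = ipaddress.ip_address(src), ipaddress.ip_address(dst)
    for src_prefix, dst_prefix, action in RULES:
        if (s in ipaddress.ip_network(src_prefix)
                and d in ipaddress.ip_network(dst_prefix)):
            return action
    return None

assert classify("144.24.9.1", "64.0.0.7") == "R3"
assert classify("128.16.46.23", "64.0.0.7") == "default"
```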
Grid of Tries
Advantages:
• Good solution for two dimensions
Disadvantages:
• Static solution
• Not easy to extend to higher dimensions
Performance: 20K entries in a 2MB data structure with 9 memory accesses per lookup [at most 2W]; 512 rules at 1 Mpps with a single FPGA and five 128KB SRAM chips.
[Figure: A classification rule - packet header fields (F2, F3, ..., Fn) mapped to an Action.]
[Figure: Shared memory switch - all N input and output ports read and write a single shared memory (e.g. 5ns SRAM).]
[Plots: Average delay vs. offered load - input queueing with FIFO queues saturates at 58.6% load, while output queueing sustains up to 100%.]
[Figure: Virtual output queueing - input m keeps a separate queue Q(m,n) for each output n, with arrivals Am(t) and departures Dn(t); a scheduler (which can be quite complex!) matches inputs to outputs each cell time.]
Input Queueing
Longest Queue First or Oldest Cell First: take the weight of each VOQ to be its queue length (LQF) or the waiting time of its head cell (OCF). A maximum weight matching over these weights achieves 100% throughput.
[Figure: 4x4 example - a maximum weight matching over the VOQs, including queues of weight 10.]
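A weight-based matching can be sketched greedily: repeatedly pick the heaviest remaining VOQ and match that input-output pair. Note this greedy pass yields a maximal-weight approximation, not the true maximum weight matching the theorem requires (which needs e.g. the Hungarian algorithm):

```python
# Greedy approximation of Longest Queue First scheduling.

def greedy_lqf(queue_len):
    """queue_len[i][j] = cells waiting at input i for output j."""
    n = len(queue_len)
    edges = [(queue_len[i][j], i, j)
             for i in range(n) for j in range(n) if queue_len[i][j] > 0]
    edges.sort(reverse=True)                 # heaviest VOQs first
    used_in, used_out, match = set(), set(), {}
    for w, i, j in edges:
        # Match the pair only if both ports are still free.
        if i not in used_in and j not in used_out:
            match[i] = j
            used_in.add(i)
            used_out.add(j)
    return match

Q = [[10, 1, 0],
     [ 2, 0, 3],
     [ 0, 9, 0]]
m = greedy_lqf(Q)
# The two longest queues (10 and 9) are served, then input 1's next best.
assert m == {0: 0, 2: 1, 1: 2}
```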
Input Queueing
Why is serving long/old queues better than
serving maximum number of queues?
• When traffic is uniformly distributed, servicing the
maximum number of queues leads to 100% throughput.
• When traffic is non-uniform, some queues become
longer than others.
• A good algorithm keeps the queue lengths matched, and
services a large number of queues.
[Plots: Average VOQ occupancy - under uniform traffic the occupancies stay matched; under non-uniform traffic some VOQs grow much longer than others.]
Input Queueing
Practical Algorithms
• Maximal Size Algorithms
– Wave Front Arbiter (WFA)
– Parallel Iterative Matching (PIM)
– iSLIP
• Maximal Weight Algorithms
– Fair Access Round Robin (FARR)
– Longest Port First (LPF)
[Figure: A 4x4 example - the set of requests and the match produced by the arbiter.]
Wave Front Arbiter
[Figure: An array of combinational logic blocks, one per (input, output) pair from (1,1) to (4,4); grants ripple through the array as a diagonal wavefront.]
[Figure: Parallel Iterative Matching - iteration #1 shows requests, random grants, and accepts; iteration #2 matches the inputs and outputs left unmatched.]
Parallel Iterative Matching
Maximal is not Maximum
[Figure: A request pattern for which the maximal match found by PIM is smaller than the maximum match.]
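One PIM iteration can be sketched as follows: each unmatched output grants to a random requesting unmatched input, and each input that received grants accepts one at random. Iterating converges to a maximal (not necessarily maximum) match. The request pattern is hypothetical:

```python
import random

# Sketch of Parallel Iterative Matching.

def pim_iteration(requests, match, rng):
    """requests[i]: outputs input i has cells for; match: input -> output."""
    n = len(requests)
    matched_outputs = set(match.values())
    grants = {}                        # input -> outputs granting it
    for j in range(n):                 # grant phase (parallel in hardware)
        if j in matched_outputs:
            continue
        askers = [i for i in range(n) if i not in match and j in requests[i]]
        if askers:
            grants.setdefault(rng.choice(askers), []).append(j)
    for i, outs in grants.items():     # accept phase
        match[i] = rng.choice(outs)
    return match

rng = random.Random(1)
requests = [{0, 1}, {0}, {2, 3}, set()]
match = {}
for _ in range(len(requests)):         # at most N iterations to converge
    pim_iteration(requests, match, rng)

# The result is a valid matching over requested pairs...
assert len(set(match.values())) == len(match)
assert all(match[i] in requests[i] for i in match)
# ...and maximal: input 2 (sole contender for outputs 2,3) is matched,
# and contested output 0 is claimed by input 0 or 1.
assert 2 in match and 0 in match.values()
```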
Parallel Iterative Matching: Analytical Results
[Figure: Iterations #1 and #2 of the request / grant / accept process.]
iSLIP
Properties
• Random under low load
• TDM under high load
• Lowest priority to the most recently used (MRU) port
• 1 iteration: fair to outputs
• Converges in at most N iterations; on average in <= log2 N
• Implementation: N priority encoders
• Up to 100% throughput for uniform traffic
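The round-robin pointers that distinguish iSLIP from PIM can be sketched in a one-iteration version (hypothetical 2x2 example). Pointers advance one position past the chosen port only on an accepted grant, which desynchronizes the arbiters and produces the TDM behavior noted above:

```python
# One-iteration iSLIP sketch: grant and accept use round-robin
# pointers instead of PIM's random choices.

def islip(requests, grant_ptr, accept_ptr):
    """requests[i] = set of outputs input i has a cell for."""
    n = len(requests)
    grants = {}                                   # input -> granting outputs
    for j in range(n):
        askers = [i for i in range(n) if j in requests[i]]
        if not askers:
            continue
        # Grant the requesting input at or after this output's pointer.
        i = min(askers, key=lambda i: (i - grant_ptr[j]) % n)
        grants.setdefault(i, []).append(j)
    match = {}
    for i, outs in grants.items():
        # Accept the granting output at or after this input's pointer.
        j = min(outs, key=lambda j: (j - accept_ptr[i]) % n)
        match[i] = j
        grant_ptr[j] = (i + 1) % n                # advance past the match
        accept_ptr[i] = (j + 1) % n
    return match

gp, ap = [0, 0], [0, 0]
# Both inputs want both outputs: first slot serves one pair, then the
# desynchronized pointers let both pairs be served every slot.
m1 = islip([{0, 1}, {0, 1}], gp, ap)
m2 = islip([{0, 1}, {0, 1}], gp, ap)
assert m1 == {0: 0}
assert len(m2) == 2
```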
iSLIP Implementation
[Figure: N grant arbiters and N accept arbiters (programmable priority encoders); each takes N request bits, holds round-robin state, and outputs a log2(N)-bit decision.]
• Context
– input-queued switches
– output-queued switches
– the speedup problem
• Early approaches
• Algorithms
• Implementation considerations
A generic switch
Output queueing - main problem: requires high fabric speedup (S = N).
Input queueing - big advantage: a speedup of one is sufficient; main problem: can't guarantee delay due to input contention.
Numerical Methods
- use actual and simulated traffic traces
- run different algorithms
- set the “speedup dial” at various values
Robustness
- realistic, even adversarial, traffic - not friendly Bernoulli IID
Question: is a speedup << N enough?
The algorithm
- Men = outputs, Women = inputs
Matching process
- A variant of the stable marriage problem
- Worst-case number of iterations for SMP = N^2
- Worst-case number of iterations in switching = N
- With high probability, and on average, approximately log(N)
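The underlying stable marriage matching (Gale-Shapley) can be sketched with outputs proposing, per the men = outputs mapping above. The preference lists are hypothetical, and this is the generic O(N^2) SMP, not the switch-specific variant:

```python
# Gale-Shapley sketch: outputs ("men") propose to inputs ("women")
# in preference order; each input keeps its best proposal so far.

def stable_match(output_pref, input_rank):
    """output_pref[j]: inputs in output j's preference order;
       input_rank[i][j]: rank input i gives output j (lower = better)."""
    n = len(output_pref)
    next_choice = [0] * n             # next input each output will try
    engaged = {}                      # input -> output
    free = list(range(n))             # outputs without a partner
    while free:
        j = free.pop()
        i = output_pref[j][next_choice[j]]
        next_choice[j] += 1
        if i not in engaged:
            engaged[i] = j
        elif input_rank[i][j] < input_rank[i][engaged[i]]:
            free.append(engaged[i])   # displaced output is free again
            engaged[i] = j
        else:
            free.append(j)            # rejected; try next preference
    return engaged

prefs = [[0, 1], [0, 1]]              # both outputs prefer input 0
ranks = [[0, 1], [1, 0]]              # input 0 prefers output 0; input 1, output 1
assert stable_match(prefs, ranks) == {0: 0, 1: 1}
```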
- A.L. Gupta, N.D. Georganas, "Analysis of a packet switch with input and output buffers and speed constraints", Infocom '91.
- S.-T. Chuang et al., "Matching output queueing with a combined input and output queued switch", IEEE JSAC, vol. 17, no. 6, 1999.
- B. Prabhakar, N. McKeown, "On the speedup required for combined input and output queued switching", Automatica, vol. 35, 1999.
- P. Krishna et al., "On the speedup required for work-conserving crossbar switches", IEEE JSAC, vol. 17, no. 6, 1999.
- A. Charny, "Providing QoS guarantees in input buffered crossbar switches with speedup", PhD Thesis, MIT, 1998.
• The problem
• Switching with crossbar fabrics
• Switching with other fabrics
Copy networks
Fanout-splitting: higher throughput, but not as simple. Leaves "residue".
[Figure: The residue left across output ports 1-5 can be concentrated on a few ports or spread out.]
[Figure: Example multicast schedule - cells a-e with a recycle scheduler.]
FIFO vs. Fair Queueing
[Figure: Three flows sharing a link with weights WR = 1, WG = 5, WP = 2 - FIFO serves in arrival order; fair queueing serves in proportion to weight.]
• Theorem: good approximation of FQ
• Much simpler to implement (quantum size, e.g. 500)
• May still need queue management inside the network to meet class requirements
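A well-known scheme fitting this description - a simple FQ approximation driven by a per-queue quantum - is deficit round robin; the slide does not name the algorithm, so the sketch below is offered under that assumption, with weights expressed through the quanta (W = 1, 5, 2 scaled by a 100-byte base quantum):

```python
from collections import deque

# Deficit round robin sketch: each backlogged queue earns `quantum`
# bytes of credit per round; a packet is sent only when the queue's
# deficit counter covers its length.

def drr(queues, quanta, rounds):
    """queues: deques of packet sizes; returns list of sent sizes per queue."""
    deficit = [0] * len(queues)
    sent = [[] for _ in queues]
    for _ in range(rounds):
        for i, q in enumerate(queues):
            if not q:
                deficit[i] = 0               # idle queues keep no credit
                continue
            deficit[i] += quanta[i]
            while q and q[0] <= deficit[i]:
                pkt = q.popleft()
                deficit[i] -= pkt
                sent[i].append(pkt)
    return sent

queues = [deque([100] * 10), deque([100] * 10), deque([100] * 10)]
out = drr(queues, quanta=[100, 500, 200], rounds=2)
# Service is proportional to the 1 : 5 : 2 quanta.
assert [len(s) for s in out] == [2, 10, 4]
```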