Fastpass SIGCOMM14 Perry
Jonathan Perry, Amy Ousterhout, Hari Balakrishnan, Devavrat Shah, Hans Fugal
M.I.T. Computer Science & Artificial Intelligence Lab
http://fastpass.mit.edu/
ABSTRACT
An ideal datacenter network should provide several properties, including low median and tail latency, high utilization (throughput),
fair allocation of network resources between users or applications,
deadline-aware scheduling, and congestion (loss) avoidance. Current
datacenter networks inherit the principles that went into the design
of the Internet, where packet transmission and path selection decisions are distributed among the endpoints and routers. Instead, we propose that each sender should delegate control to a centralized arbiter of when each packet should be transmitted and what path it should follow.
This paper describes Fastpass, a datacenter network architecture
built using this principle. Fastpass incorporates two fast algorithms:
the first determines the time at which each packet should be transmitted, while the second determines the path to use for that packet. In
addition, Fastpass uses an efficient protocol between the endpoints
and the arbiter and an arbiter replication strategy for fault-tolerant
failover. We deployed and evaluated Fastpass in a portion of Facebook's datacenter network. Our results show that Fastpass achieves high throughput comparable to current networks at a 240× reduction in queue lengths (4.35 Mbytes reduced to 18 Kbytes), achieves much fairer and more consistent flow throughputs than the baseline TCP (5200× reduction in the standard deviation of per-flow throughput with five concurrent connections), scales from 1 to 8 cores in the arbiter implementation with the ability to schedule 2.21 Terabits/s of traffic in software on eight cores, and achieves a 2.5× reduction in the number of TCP retransmissions in a latency-sensitive service at Facebook.
1. INTRODUCTION
Current network architectures distribute packet transmission decisions among the endpoints (congestion control) and path selection among the network's switches (routing). The result is strong fault-tolerance and scalability, but at the cost of a loss of control over packet delays and paths taken. Achieving high throughput requires the network to accommodate bursts of packets, which entails the use of queues to absorb these bursts, leading to delays that rise and fall. Mean delays may be low when the load is modest, but tail (e.g., 99th percentile) delays are rarely low.
Instead, we advocate what may seem like a rather extreme approach: to exercise (very) tight control over when endpoints can send packets and what paths packets take. We propose that each packet's timing be controlled by a logically centralized arbiter, which also determines the packet's path (Fig. 1). If this idea works, then flow rates can match available network capacity over the time-scale of individual packet transmission times, rather than over multiple round-trip times (RTTs) with distributed congestion control. Not only will persistent congestion be eliminated, but packet latencies will not rise and fall, queues will never vary in size, tail latencies will remain small, and packets will never be dropped due to buffer overflow.
This paper describes the design, implementation, and evaluation of Fastpass, a system that shows how centralized arbitration of the network's usage allows endpoints to burst at wire-speed while eliminating congestion at switches. This approach also provides latency isolation: interactive, time-critical flows don't have to suffer queueing delays caused by bulk flows in other parts of the fabric.
The idea we pursue is analogous to a hypothetical road traffic control
system in which a central entity tells every vehicle when to depart
and which path to take. Then, instead of waiting in traffic, cars can
zoom by all the way to their destinations.
Fastpass includes three key components:
1. A fast and scalable timeslot allocation algorithm at the arbiter to determine when each endpoint's packets should be sent (§3). This algorithm uses a fast maximal matching to achieve objectives such as max-min fairness or to approximate minimizing flow completion times.
2. A fast and scalable path assignment algorithm at the arbiter to assign a path to each packet (§4). This algorithm uses a fast edge-coloring algorithm over a bipartite graph induced by the switches in the network, with two switches connected by an edge if they have a packet to be sent between them in a timeslot.
3. A replication strategy for the central arbiter to handle network and arbiter failures, as well as a reliable control protocol between endpoints and the arbiter (§5).
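As an illustrative sketch of component 1 (not the arbiter's actual pipelined implementation), a single timeslot can be allocated by greedily admitting (src, dst) demands; scanning demands in priority order yields a maximal matching:

```python
def allocate_timeslot(demands):
    """Greedily build a maximal matching for one timeslot.

    demands: (src, dst) pairs ordered by priority, e.g. least-recently
    allocated first to approximate max-min fairness. A pair is admitted
    only if neither endpoint is already busy in this timeslot, so every
    skipped demand conflicts with an admitted one: a maximal matching.
    """
    busy_src, busy_dst = set(), set()
    admitted = []
    for src, dst in demands:
        if src not in busy_src and dst not in busy_dst:
            busy_src.add(src)
            busy_dst.add(dst)
            admitted.append((src, dst))
    return admitted
```

In the pipelined design of §3, demands that are not admitted simply carry over to the allocator for the next timeslot.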
We have implemented Fastpass in the Linux kernel using high-precision timers (hrtimers) to time transmitted packets; we achieve sub-microsecond network-wide time synchronization using the IEEE 1588 Precision Time Protocol (PTP).
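The kernel timing code itself is not shown in this excerpt. Purely as an illustration of the timing discipline (the helper names are assumptions; the timeslot length is derived from one 1500-byte MTU at 10 Gbits/s), an endpoint granted a timeslot transmits at an absolute deadline on the synchronized clock:

```python
import time

MTU_BYTES = 1500
LINK_BPS = 10_000_000_000                                  # 10 Gbits/s
TIMESLOT_NS = MTU_BYTES * 8 * 1_000_000_000 // LINK_BPS    # 1200 ns per MTU

def timeslot_deadline_ns(epoch_ns, slot_index):
    """Absolute transmit time for a granted timeslot, measured on a
    clock that all endpoints share (via PTP synchronization)."""
    return epoch_ns + slot_index * TIMESLOT_NS

def wait_until(deadline_ns):
    """Userspace stand-in for a kernel hrtimer: yield coarsely, then
    spin for the final stretch to hit the deadline tightly."""
    while time.monotonic_ns() < deadline_ns - 50_000:
        time.sleep(0)
    while time.monotonic_ns() < deadline_ns:
        pass
```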
[Figure 1: Fastpass architecture. An endpoint's host networking stack sends its demands (destination and size) through an FCP client to the arbiter's FCP server; the arbiter's timeslot allocation and path selection components return timeslots and paths, which the endpoint NIC uses to transmit through the ToR and core switches.]
a central arbiter can compute the timeslots and paths in the network
to jointly achieve these different goals.
2. FASTPASS ARCHITECTURE
2.1
2.2 Deploying Fastpass
3. TIMESLOT ALLOCATION
3.1 A pipelined allocator
[Figure: pipelined timeslot allocation. The allocator for timeslot t=100 consumes newly received demands plus the remaining demands from t=99 and outputs a list of allocated (src, dst) pairs; its remaining demands feed the allocator for t=101, and so on.]
[Figure: timeslot allocator state. A table of active flows records src, dst, and last allocation for each flow, alongside the set of already-allocated srcs & dsts for the current timeslot.]
Figure 5: Path selection. (a) input matching (b) ToR graph (c)
edge-colored ToR graph (d) edge-colored matching.
3.2 Theoretical guarantees
4. PATH SELECTION
graph of ToRs (b), where the source and destination ToRs of every
packet are connected. Edge-coloring colors each edge ensuring that
no two edges of the same color are incident on the same ToR (c).
The assignment guarantees that at most one packet occupies the
ingress, and one occupies the egress, of each port (d).
Edge-coloring requires uniform link capacities; in networks with heterogeneous link capacities, we can construct a virtual network with homogeneous link capacities on which to assign paths. Here, we replace each physical switch that has high-capacity links with multiple switches that have low-capacity links and connect to the same components as the physical switch (e.g., one switch with 40 Gbits/s links would be replaced by four switches with 10 Gbits/s links). All packets assigned a path through the duplicate switches in the virtual topology are routed through the single high-capacity switch in the physical topology.
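This substitution can be sketched as a mechanical graph transformation (illustrative code; the data layout and names are assumptions, and capacities are assumed to be integer multiples of a base rate):

```python
def virtualize(switch_links, base_gbps):
    """Replace each switch whose links run at k*base_gbps with k virtual
    switches whose links all run at base_gbps, preserving neighbors.

    switch_links: dict switch -> (link_gbps, [neighbors])
    Returns the virtual adjacency plus a map from each virtual switch
    back to its physical switch, used to translate paths back.
    """
    virtual, phys_of = {}, {}
    for sw, (gbps, neighbors) in switch_links.items():
        k = gbps // base_gbps            # e.g. 40G / 10G -> 4 copies
        for i in range(k):
            v = (sw, i)
            virtual[v] = list(neighbors)
            phys_of[v] = sw
    return virtual, phys_of
```

Paths computed on the virtual topology are then mapped back by collapsing each virtual switch onto its physical switch.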
Edge-coloring also generalizes to oversubscribed networks and
networks with multiple tiers. Only traffic that passes through a
higher network tier is edge-colored (e.g., in a two-tier network, only
inter-rack traffic requires path selection). For a three-tier datacenter
with ToR, Agg, and Core switches (and higher-tier ones), paths
can be assigned hierarchically: the edge-coloring of the ToR graph
assigns Agg switches to packets, then an edge-coloring of the Agg
graph chooses Core switches [21, IV].
Fast edge-coloring. A network with n racks and d nodes per rack can be edge-colored in O(nd log d) time [12, 23]. Fast edge-coloring algorithms invariably use a simple and powerful building block, the Euler-split. An Euler-split partitions the edges of a regular graph, where each node has the same degree 2d, into two regular graphs of degree d. The algorithm is simple: (1) find an Eulerian cycle (a cycle that starts and ends at the same node and contains every edge exactly once, though nodes may repeat) of the original graph, (2) assign alternate edges of the cycle to the two new graphs, (3) repeat.
An Euler-split divides the edges into two groups that can be colored separately. d−1 Euler-splits can edge-color a graph with power-of-two degree d by partitioning it into d perfect matchings, each of which is assigned a different color. Graphs with non-power-of-two degree can be edge-colored using a similar method that incorporates one search for a perfect matching, and has only slightly worse asymptotic complexity [23].
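A compact sketch of this procedure for a bipartite multigraph whose degree d is a power of two (an illustrative reimplementation, not the paper's optimized bitmap-based code). It relies on the fact that every closed trail in a bipartite graph has even length, so assigning alternate edges of each trail to the two halves splits every node's degree exactly in half:

```python
from collections import defaultdict

def euler_split(edges):
    """Split a bipartite multigraph in which every node has even degree
    into two edge sets, halving each node's degree: walk closed trails
    and assign alternate edges to the two halves."""
    adj = defaultdict(list)                  # node -> stack of (edge_id, other)
    for i, (u, v) in enumerate(edges):
        adj[u].append((i, v))
        adj[v].append((i, u))
    used = [False] * len(edges)
    half_a, half_b = [], []
    for start in list(adj):
        node, parity = start, 0
        while True:
            while adj[node] and used[adj[node][-1][0]]:
                adj[node].pop()              # lazily discard used edges
            if not adj[node]:
                break                        # even degrees: stuck only at start
            eid, nxt = adj[node].pop()
            used[eid] = True
            (half_a if parity == 0 else half_b).append(eid)
            parity ^= 1
            node = nxt
    return half_a, half_b

def edge_color(edges, d):
    """Edge-color a d-regular bipartite multigraph (d a power of two)
    with exactly d colors, via recursive Euler-splits."""
    colors = [0] * len(edges)
    def rec(idx, deg, base):
        if deg == 1:                         # a perfect matching: one color
            for i in idx:
                colors[i] = base
            return
        a, b = euler_split([edges[i] for i in idx])
        rec([idx[j] for j in a], deg // 2, base)
        rec([idx[j] for j in b], deg // 2, base + deg // 2)
    rec(list(range(len(edges))), d, 0)
    return colors
```

Recursing log₂ d levels (d−1 splits in total) leaves d perfect matchings, one per color; in Fastpass each color corresponds to a core switch assignment for that timeslot.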
The Fastpass path selection implementation maintains the bipartite
graph in a size-optimized bitmap-based data structure that can fit
entirely in a 32 KB L1 cache for graphs with up to 6,000 nodes.
This data structure makes the graph walks needed for Euler-split fast,
and yields sufficiently low latencies for clusters with a few thousand nodes (§7.3).
5. HANDLING FAULTS
5.1 Arbiter failures
5.2 Network failures
5.3
Communication between endpoints and the arbiter is not scheduled and can experience packet loss. FCP must protect against such loss. Otherwise, if an endpoint request or the arbiter's response is dropped, a corresponding timeslot would never be allocated, and some packets would remain stuck in the endpoint's queue.
TCP-style cumulative ACKs and retransmissions are not ideal for
FCP. At the time of retransmission, the old packet is out of date: for
a request, the queue in the endpoint might be fuller than it was; for
an allocation, an allocated timeslot might have already passed.
FCP provides reliability by communicating aggregate counts; to
inform the arbiter of timeslots that need scheduling, the endpoint
sends the sum total of timeslots it has requested so far for that
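The cumulative-count mechanism can be sketched as follows (a simplified model; the field names are hypothetical, and FCP's actual wire format is not shown in this excerpt):

```python
class FcpEndpointState:
    """Endpoint-side demand bookkeeping for one destination.

    Instead of ACKing individual requests, the endpoint reports the
    cumulative number of timeslots it has ever requested; the arbiter
    reports the cumulative number it has allocated. A retransmitted
    report is therefore never stale: the latest totals always describe
    the current demand."""

    def __init__(self):
        self.demand_total = 0      # timeslots requested so far (cumulative)
        self.acked_alloc = 0       # allocations the arbiter has confirmed

    def enqueue(self, n_timeslots):
        self.demand_total += n_timeslots

    def make_request(self):
        # safe to resend at any time: totals, not deltas
        return {"demand": self.demand_total}

    def on_allocation(self, alloc_total):
        # duplicates and reordering are harmless: the count only grows
        self.acked_alloc = max(self.acked_alloc, alloc_total)

    def outstanding(self):
        return self.demand_total - self.acked_alloc
```

Because both sides exchange monotonically increasing totals rather than deltas, a lost or duplicated message never corrupts state: the next report supersedes it.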
6. IMPLEMENTATION
6.1 Client
6.2 Multicore Arbiter
The arbiter is made up of three types of cores: comm-cores communicate with endpoints, alloc-cores perform timeslot allocation,
and pathsel-cores assign paths.
The number of cores of each type can be increased to handle
large workloads: each comm-core handles a subset of endpoints, so
endpoints can be divided among more cores when protocol handling
becomes a bottleneck; alloc-cores work concurrently using pipeline
³On a switched network, MAC addresses could be used. However, in the presence of routing, IP addresses are required.
6.3 Timing
7. EVALUATION
7.1 Summary of Results
(A) Under a bulk transfer workload involving multiple machines, Fastpass reduces median switch queue length to 18 KB from 4351 KB, with a 1.6% throughput penalty.
(B) Interactivity: under the same workload, Fastpass's median ping time is 0.23 ms vs. the baseline's 3.56 ms, 15.5× lower with Fastpass.
Switch queue length (KB), baseline vs. Fastpass:
  Median: 4351 vs. 18
  90th %ile: 5097 vs. 36
  99th %ile: 5224 vs. 53
  99.9th %ile: 5239 vs. 305

Ping RTT (ms), baseline vs. Fastpass:
  Median: 3.56 vs. 0.23
  90th %ile: 3.89 vs. 0.27
  99th %ile: 3.92 vs. 0.32
  99.9th %ile: 3.95 vs. 0.38
Note that with Fastpass, ping packets are scheduled in both directions by the arbiter; but even with the added round-trips to the arbiter, end-to-end latency is substantially lower because queues are much
smaller. In addition, Fastpass achieves low latency for interactive
traffic without requiring the traffic to be designated explicitly as
interactive or bulk, and without using any mechanisms for traffic
isolation in the switches: it results from the fairness properties of
the timeslot allocator.
7.2
Standard deviation of per-connection throughput (Mbits/s), baseline vs. Fastpass:
  3 connections: 543.86 vs. 15.89 (34.2× improvement)
  4 connections: 628.49 vs. 0.146 (4304.7× improvement)
  5 connections: 459.75 vs. 0.087 (5284.5× improvement)
These results show that Fastpass exhibits significantly lower variability across the board: its standard deviation of throughput is over
5200 times lower than the baseline when there are five concurrent
connections.
Fastpass's pipelined timeslot allocation algorithm prioritizes flows based on their last transmission time, so when a new flow starts, it is immediately allocated a timeslot. From that point on, all flows contending for the bottleneck are allocated timeslots in sequence, yielding immediate convergence and perfect fairness over intervals as small as 1 MTU per flow (for 5 flows on 10 Gbits/s links, this yields fairness at the granularity of 6 μs).
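The 6 microsecond figure is just the serialization time of one MTU per contending flow (a quick check, assuming 1500-byte MTUs):

```python
def fairness_granularity_us(flows, mtu_bytes=1500, link_bps=10e9):
    """Interval over which each of `flows` contending flows sends
    exactly one MTU on a shared link, i.e. the granularity at which
    the allocator equalizes throughput."""
    return flows * mtu_bytes * 8 / link_bps * 1e6

# 5 flows on a 10 Gbits/s bottleneck -> 6.0 microseconds
```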
The benchmark shows low total throughput for one and two senders because of packet-processing overheads, which are usually reduced by TSO. (In contrast, Experiments A and B use many more connections, so they achieve high utilization.) Fastpass senders also require additional processing in the Fastpass qdisc, which is limited to using one core; NIC support (§8.3) or a multicore implementation will alleviate this bottleneck.
7.3 Arbiter performance
[Figure 9 residue: legends for per-connection throughput, senders 1-6 and 3-6 concurrent connections, fastpass vs. baseline.]
30 seconds, a new bulk flow arrives until all five are active for 30
seconds, and then one flow terminates every 30 seconds. The entire
experiment therefore lasts 270 seconds.
We calculate each connections throughput over 1-second intervals. The resulting time series for the baseline TCP and for Fastpass
are shown in Figure 9.
The baseline TCPs exhibit much larger variability than Fastpass. For instance, when the second connection starts, its throughput is about 20-25% higher than the first connection's throughout the 30-second interval; similarly, when there are two senders remaining between times 210 and 240 seconds, the throughputs cross over and are almost never equal. With more connections, the variation in throughput for TCP is more pronounced than with Fastpass.
To quantify this observation, we calculate the standard deviation
of the per-connection throughputs achieved in each 1-second interval,
in Mbits/s, when there are 3, 4, and 5 concurrent connections each
for the baseline TCP and Fastpass. We then compute the median over all standard deviations for a given number of connections (a median over 60 values when there are 3 or 4 connections and over 30 values when there are 5 connections). The results appear in the standard-deviation table above.
Arbiter throughput of scheduled traffic (Gbits/s):
  2 cores: 825.6
  4 cores: 1545.1
  6 cores: 1966.4
  8 cores: 2211.8
Figure 10: As more requests are handled, the NIC polling rate decreases. The resulting queueing delay can be bounded by distributing
request-handling across multiple comm-cores.
Figure 11: The arbiter requires 0.5 Gbits/s TX and 0.3 Gbits/s RX
bandwidth to schedule 150 Gbits/s: around 0.3% of network traffic.
[Figure residue: CDFs of arbiter TX and RX traffic for 8, 16, and 32 racks, over rates from 10 Mbits/s to 10 Gbits/s.]
7.4
8. DISCUSSION
8.1 Large deployments
[Figure 14 residue: rack throughput (1000s of queries per second) vs. server load, fastpass vs. baseline.]
Figure 14: 99th percentile web request service time vs. server load in production traffic. Fastpass shows a latency profile similar to the baseline's.
8.2
8.3
8.4
Figure 15: Live traffic server load as a function of time. Fastpass is shown in the middle, with baseline before and after. The offered load oscillates gently with time.
9. RELATED WORK
Several systems use centralized controllers to achieve better load balancing and network sharing, but they work at control-plane granularity, which doesn't provide control over packet latencies or allocations over small time scales.
Figure 16: Median server TCP retransmission rate during the live experiment. Fastpass (middle) maintains a 2.5× lower rate of retransmissions than the baseline (left and right).
10. CONCLUSION
Acknowledgements
We are grateful to Omar Baldonado and Sanjeev Kumar of Facebook
for their enthusiastic support of this collaboration, Petr Lapukhov
and Doug Weimer for their generous assistance with the Facebook
infrastructure, Garrett Wollman and Jon Proulx at MIT CSAIL for
their help and efforts in setting up environments for our early experiments, and David Oran of Cisco for his help. We thank John
Ousterhout, Rodrigo Fonseca, Nick McKeown, George Varghese,
Chuck Thacker, Steve Hand, Andreas Nowatzyk, Tom Rodeheffer,
and the SIGCOMM reviewers for their insightful feedback. This
work was supported in part by the National Science Foundation grant
IIS-1065219. Ousterhout was supported by a Jacobs Presidential
Fellowship and a Hertz Foundation Fellowship. We thank the industrial members of the MIT Center for Wireless Networks and Mobile
Computing for their support and encouragement.
11. REFERENCES
the arbiter has not already allocated another packet starting from i or destined to j in this timeslot. Therefore, at the end of processing the timeslot, the allocations correspond to a maximal matching in the bipartite graph between endpoints, where an edge is present from i to j if there are packets waiting at endpoint i destined for j. From the literature on input-queued switches, it is well known that any maximal matching provides a 50% throughput guarantee [13, 3]. Building upon these results as well as [30], we state the following property of our algorithm.
THEOREM 1. For any ρ < 1, there exists λ with ρ(λ) = ρ such that for any allocator,

    lim inf_{t→∞} E[ Σ_{i,j} Q_ij(t) ] ≥ Nρ / (2(1−ρ)).    (3)

Further, let V ≥ 1 be such that E[G_ij²] ≤ V·E[G_ij] for all i, j (bounded G_ij); if we allow the Fastpass arbiter to schedule (as well as transmit through the network) twice per unit timeslot,⁵ then the induced average queue-size satisfies

    lim sup_{t→∞} E[ Σ_{i,j} Q_ij(t) ] ≤ N(ρ + V) / (2(1−ρ)).    (4)
Proof Sketch. To establish the lower bound (3) for any scheduling algorithm, it is sufficient to consider a specific scenario of our setup. Concretely, let the traffic matrix be uniform, i.e., λ = [λ_ij] with λ_ij = ρ/(N−1) for all i ≠ j and 0 when i = j; p_ij = 1 for all i ≠ j; and let G_ij be Poisson random variables with parameter λ_ij. The network can be viewed as unit-sized packets arriving at each endpoint according to a Poisson arrival process of rate ρ and processed (transferred by the network) at unit rate. That is, the queue size for each endpoint j is bounded below by that of an M/D/1 queue with load ρ. Telescoping this inequality for t ≥ 0 and using the fact that the system reaches equilibrium due to ergodicity, we obtain the desired result.
Implications. Equation (3) says that there is some (worst-case) input workload for which any allocator will have an expected aggregate queue length at least as large as Nρ/(2(1−ρ)). Equation (4) says that with a speedup of 2 in the network fabric, for every workload, the expected aggregate queue length will be no larger than N(ρ + V)/(2(1−ρ)). Here V is effectively a bound on burst size; if it is small, say 1, then it is within a factor of 2 of the lower bound! There is, however, a gap between theory and practice here, as in switch scheduling; many workloads observed in practice seem to require only small queues even with no speedup.
5 Equivalent