6.888 Lecture 3: Data Center Congestion Control
Mohammad Alizadeh
Spring 2016
Transport inside the DC
[Figure: INTERNET (100Kbps–100Mbps links, ~100ms latency) connects through the data center Fabric (10–40Gbps links, ~10–100μs latency) to Servers running web, app, cache, database, map-reduce, HPC, and monitoring services]
What’s Different About DC Transport?
Network characteristics
– Very high link speeds (Gb/s); very low latency (microseconds)
Application characteristics
– Large-scale distributed computation
Cheap switches
– Single-chip shared-memory devices; shallow buffers
Data Center Workloads: Mice & Elephants
– Short messages (e.g., query, coordination) → Low Latency
– Large flows (e.g., data update, backup) → High Throughput
Incast
• Synchronized fan-in congestion: many workers answer an aggregator at once
[Figure: Worker 1, Worker 2, Worker 3 all sending to an Aggregator; losses trigger TCP timeouts with RTOmin = 300 ms]
Data Center Transport Requirements
1. Low Latency
   – Short messages, queries
2. High Throughput
   – Continuous data updates, backups
[Figure: senders sharing a bottleneck buffer B in front of the receiver; the buffer must be sized so throughput stays at 100%]
Reducing Buffer Requirements
• Appenzeller et al. (SIGCOMM '04):
  – With a large number N of flows, a buffer of C×RTT/√N is enough
[Figure: per-flow window size (rate) oscillations average out, so throughput stays at 100% with a much smaller buffer]
• Key Observation:
  – Low variance in sending rate → small buffers suffice
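A quick worked example of that rule (the numbers are illustrative, not from the slides):

\[
B \;=\; \frac{C \times RTT}{\sqrt{N}}
\;=\; \frac{10\ \text{Gb/s} \times 100\ \mu\text{s}}{\sqrt{100}}
\;=\; \frac{10^{6}\ \text{bits}}{10}
\;=\; 10^{5}\ \text{bits} \;\approx\; 12.5\ \text{KB},
\]

an order of magnitude less than the single-flow rule of thumb $C \times RTT = 125$ KB.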
DCTCP: Main Idea
Extract multi-bit feedback from single-bit stream of ECN marks
– Reduce window size based on fraction of marked packets.
[Figure: window size (bytes) over time, TCP vs. DCTCP]
Sender side:
– Maintain a running average of the fraction of packets marked (α):
  each RTT:  F = (# of marked ACKs) / (Total # of ACKs),   α ← (1 − g)·α + g·F
– Adaptive window decrease: cut the window in proportion to α, W ← (1 − α/2)·W
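To make the update concrete, here is a minimal Python sketch of the sender-side logic above. It is illustrative only: the class and method names are mine, the ACK-counting interface is an assumption, and real DCTCP lives inside the kernel TCP stack (the paper uses a small gain such as g = 1/16).

    # Illustrative Python sketch of DCTCP's sender-side update (names are mine).
    class DctcpSender:
        def __init__(self, cwnd=10.0, g=1.0 / 16):
            self.cwnd = cwnd    # congestion window, in packets
            self.alpha = 0.0    # running estimate of the fraction of marked packets
            self.g = g          # EWMA gain

        def on_rtt_end(self, marked_acks, total_acks):
            # F: fraction of ACKs carrying ECN-Echo in the RTT that just ended
            frac = marked_acks / max(total_acks, 1)
            # alpha <- (1 - g) * alpha + g * F
            self.alpha = (1 - self.g) * self.alpha + self.g * frac
            if marked_acks > 0:
                # Cut the window in proportion to the extent of congestion,
                # instead of TCP's fixed halving.
                self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
            else:
                self.cwnd += 1.0  # congestion-avoidance increase, as in TCP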
[Figure: queue length (packets) vs. time for TCP and DCTCP with 2 flows; the DCTCP buffer is mostly empty, which mitigates incast by creating buffer headroom]
Why it Works
1. Low Latency
Small buffer occupancies → low queuing delay
2. High Throughput
ECN averaging → smooth rate adjustments, low variance
DCTCP Deployments
Discussion
What You Said
Austin: "The paper's performance comparison to RED seems arbitrary, perhaps RED had traction at the time? Or just convenient as the switches were capable of implementing it?"
Evaluation
• Implemented in Windows stack
• Real hardware, 1Gbps and 10Gbps experiments
  – 90-server testbed
  – Broadcom Triumph: 48 1G ports, 4MB shared memory
  – Cisco Cat4948: 48 1G ports, 16MB shared memory
  – Broadcom Scorpion: 24 10G ports, 4MB shared memory
• Numerous micro-benchmarks
  – Throughput and Queue Length
  – Fairness and Convergence
  – Multi-hop
  – Incast
  – Queue Buildup
  – Static vs. Dynamic Buffer Management
  – Buffer Pressure
Bing Benchmark (scaled 10×)
[Figure: completion time (ms) for the Bing workload; annotations note incast and that deep buffers fix it]
A bit of Analysis
• How much buffering does DCTCP need for 100% throughput?
  – Need to quantify the queue size oscillations (stability).
[Figure: window sawtooth between (W*+1)(1−α/2) and W*+1; switch buffer B with marking threshold K]
• Result: a marking threshold K > (1/7) C×RTT suffices, versus a buffer of K > C×RTT for TCP.
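A compressed sketch of where the 1/7 comes from, following the synchronized-sawtooth analysis in the DCTCP paper (the steps below paraphrase that argument):

\[
W^{*} = \frac{C \times RTT + K}{N}, \qquad
\alpha \approx \sqrt{2/W^{*}} \ \ (\text{for small } \alpha),
\]

so each sawtooth cuts a window by $D=(W^{*}+1)\,\alpha/2$ and the queue oscillates with amplitude

\[
A \;=\; N\,D \;\approx\; \tfrac{1}{2}\sqrt{2N\,(C \times RTT + K)} .
\]

Keeping the queue from underflowing requires $K > A - N$; the right-hand side is largest at $N = (C\times RTT + K)/8$, where it equals $(C\times RTT + K)/8$, giving

\[
K \;>\; \tfrac{1}{8}\,(C\times RTT + K)
\;\Longleftrightarrow\;
K \;>\; \tfrac{1}{7}\,C\times RTT .
\]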
Convergence Time
• DCTCP takes at most ~40% more RTTs than TCP to converge
  – "Analysis of DCTCP: Stability, Convergence, and Fairness," SIGMETRICS 2011
[Figure: convergence of flow rates over time, TCP vs. DCTCP]
TIMELY
• RTT as the congestion signal
  – End-to-end metric: captures propagation and queuing delay along the path
  – Hardware acknowledgements and hardware (vs. software) timestamps avoid host processing overhead
[Figure: sender NIC timestamps paced data and HW ACKs from the RECEIVER; an RTT measurement engine feeds a gradient-based rate increase/decrease engine]
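A small Python sketch of the RTT computation this pipeline performs. The function name and units are assumptions of mine, but the idea of subtracting the segment's serialization time so that only propagation and queuing delay remain follows the TIMELY design.

    # Illustrative sketch: RTT from hardware timestamps, minus serialization delay.
    def timely_rtt_ns(t_send_ns, t_completion_ns, seg_bytes, line_rate_bps):
        serialization_ns = seg_bytes * 8 * 1e9 / line_rate_bps
        return (t_completion_ns - t_send_ns) - serialization_ns

    # Example: a 64 KB segment on a 10 Gb/s NIC adds ~52 us of serialization,
    # which must not be mistaken for queuing delay.
    # rtt = timely_rtt_ns(t_tx, t_hw_ack, 64 * 1024, 10e9)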
Algorithm Overview
• Gradient-based Increase / Decrease, driven by the slope of RTT over time:
  – gradient = 0: RTT flat (queue steady)
  – gradient > 0: RTT rising (queue building up)
  – gradient < 0: RTT falling (queue draining)
• Why react to the gradient? To navigate the throughput-latency tradeoff and ensure stability.
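A hedged Python sketch of that gradient-based core. Variable names loosely follow the paper's pseudocode (rtt_diff, beta, delta), but the parameter values and the omission of the low/high RTT thresholds are simplifications of mine.

    # Illustrative sketch of TIMELY's gradient-based rate update (simplified).
    def timely_update(rate_mbps, new_rtt_us, state, min_rtt_us,
                      ewma_gain=0.875, beta=0.8, delta_mbps=10.0):
        # Smooth the per-update RTT change with an EWMA.
        new_diff = new_rtt_us - state["prev_rtt_us"]
        state["prev_rtt_us"] = new_rtt_us
        state["rtt_diff_us"] = ((1 - ewma_gain) * state["rtt_diff_us"]
                                + ewma_gain * new_diff)

        # Normalize by the minimum RTT so the gradient is dimensionless.
        gradient = state["rtt_diff_us"] / min_rtt_us

        if gradient <= 0:
            # RTT flat or falling: queues are draining, probe for more bandwidth.
            return rate_mbps + delta_mbps                    # additive increase
        # RTT rising: queues are building, back off in proportion to the gradient.
        return rate_mbps * (1 - beta * min(gradient, 1.0))   # multiplicative decrease

    # Example use:
    # state = {"prev_rtt_us": 50.0, "rtt_diff_us": 0.0}
    # rate = timely_update(rate, measured_rtt_us, state, min_rtt_us=20.0)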
Why Does Gradient Help Stability?
• Each source reacts to delayed feedback about the queue; acting on e(t) + k·e′(t) rather than e(t) alone anticipates where the error is heading.
[Figure: sources reacting to congestion feedback after a delay]
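One standard way to see the benefit (my gloss, not from the slides): in the frequency domain, acting on $e(t)+k\,e'(t)$ multiplies the loop gain by $1+ks$, which adds phase lead

\[
\angle\bigl(1 + j k\omega\bigr) \;=\; \arctan(k\omega) \;>\; 0,
\]

partially cancelling the phase lag $e^{-j\omega\tau}$ contributed by the feedback delay $\tau$ (roughly one RTT), and so improving the stability margin at a given throughput.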
Algorithm Overview
• Gradient-based Increase / Decrease — to navigate the throughput-latency tradeoff and ensure stability.
• Better Burst Tolerance
• Keep tail latency within acceptable limits.
Discussion
Implementation Set-up
• TIMELY is implemented in the context of RDMA.
  – RDMA write and read primitives used to invoke NIC services.
[Figure: PFC PAUSE frames spreading hop-by-hop through the fabric]
TIMELY vs PFC
[Figures: experimental comparison of TIMELY against PFC alone]
What You Said
Amy: "I was surprised to see that TIMELY performed so much better than DCTCP. Did the lack of an OS-bypass for DCTCP impact performance? I wish that the authors had offered an explanation for this result."
Next time: Load Balancing