
6.888: Lecture 3
Data Center Congestion Control
Mohammad Alizadeh

Spring 2016
Transport inside the DC

[Figure: the INTERNET (100Kbps–100Mbps links, ~100ms latency) connects through the data center Fabric (10–40Gbps links, ~10–100μs latency) to the Servers]

The fabric is an interconnect for distributed compute workloads: web, app, cache, database, map-reduce, HPC, monitoring.
What’s Different About DC Transport?

Network characteristics
– Very high link speeds (Gb/s); very low latency (microseconds)

Application characteristics
– Large-scale distributed computation

Challenging traffic patterns
– Diverse mix of mice & elephants
– Incast

Cheap switches
– Single-chip shared-memory devices; shallow buffers
Data Center Workloads: Mice & Elephants

Short messages (e.g., query, coordination) → Low Latency
Large flows (e.g., data update, backup) → High Throughput
Incast

• Synchronized fan-in congestion: Workers 1–4 reply to an Aggregator simultaneously, overflowing the switch buffer and triggering a TCP timeout (RTOmin = 300 ms).

→ Vasudevan et al. (SIGCOMM ’09)


Incast in Bing

[Figure: MLA query completion time (ms) over the course of a morning; jittering is switched off around 8:30 am]

Requests are jittered over a 10ms window.
Jittering trades off the median for the high percentiles.
DC Transport Requirements

1. Low Latency
– Short messages, queries

2. High Throughput
– Continuous data updates, backups

3. High Burst Tolerance
– Incast

The challenge is to achieve these together.
High Throughput + Low Latency

Baseline fabric latency (propagation + switching): 10 microseconds

High throughput requires buffering for rate mismatches…
… but this adds significant queuing latency.
Data Center TCP
TCP in the Data Center

TCP [Jacobson et al. ’88] is widely used in the data center
– More than 99% of the traffic

Operators work around TCP problems
– Ad-hoc, inefficient, often expensive solutions
– TCP is deeply ingrained in applications

Practical deployment is hard
→ keep it simple!
Review: The TCP Algorithm

[Figure: Senders 1 and 2 share a bottleneck to the Receiver; the switch sets an ECN mark (1 bit); window size (rate) follows a sawtooth over time]

Additive Increase: W ← W + 1 per round-trip time
Multiplicative Decrease: W ← W/2 per drop or ECN mark

ECN = Explicit Congestion Notification
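To make the update rule concrete, here is a minimal AIMD sketch in Python (my own illustration, not code from the lecture; slow start and loss detection are omitted):

```python
# Minimal AIMD sketch: the window W is tracked in MSS-sized segments.
def on_rtt_elapsed(cwnd: float) -> float:
    """Additive increase: W <- W + 1 once per round-trip time."""
    return cwnd + 1.0

def on_congestion_signal(cwnd: float) -> float:
    """Multiplicative decrease: W <- W/2 on a packet drop or ECN mark."""
    return max(cwnd / 2.0, 1.0)  # never shrink below one segment
```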


TCP Buffer Requirement

Bandwidth-delay product rule of thumb:
– A single flow needs C×RTT of buffering for 100% throughput.

[Figure: throughput vs. buffer size B; with B < C×RTT throughput falls short of 100%, with B ≥ C×RTT it reaches 100%]
Reducing Buffer Requirements

Appenzeller et al. (SIGCOMM ’04):
– With a large number N of flows, a buffer of C×RTT/√N is enough.

[Figure: with many desynchronized flows, the aggregate window (rate) has low variance, so a small buffer sustains 100% throughput]

Can’t rely on this stat-mux benefit in the DC
– Measurements show typically only 1–2 large flows at each server

Key Observation:
Low variance in sending rate → small buffers suffice
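As a back-of-the-envelope check (my own worked example using illustrative DC numbers, not figures from the slides):

```python
import math

C = 10e9 / 8    # 10 Gb/s link capacity, in bytes/second
RTT = 100e-6    # 100 microsecond round-trip time

bdp = C * RTT                    # single-flow rule: C x RTT
print(f"{bdp / 1e3:.0f} KB")     # 125 KB of buffering

# Appenzeller et al.: with N desynchronized flows, C x RTT / sqrt(N)
# suffices -- a big win for Internet routers (N is huge), but not in
# the DC, where each server typically carries only 1-2 large flows.
N = 2
print(f"{bdp / math.sqrt(N) / 1e3:.0f} KB")   # ~88 KB for N = 2
```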
DCTCP: Main Idea

Extract multi-bit feedback from the single-bit stream of ECN marks
– Reduce window size based on the fraction of marked packets.

ECN Marks    | TCP               | DCTCP
1011110111   | Cut window by 50% | Cut window by 40%
0000000001   | Cut window by 50% | Cut window by 5%

[Figure: window size (bytes) vs. time (sec); TCP’s sawtooth is large, DCTCP’s oscillations are small]
DCTCP: Algorithm

Switch side:
– Mark packets when Queue Length > K (don’t mark below K).

Sender side:
– Maintain a running average of the fraction of packets marked (α).

Each RTT: F = (# of marked ACKs) / (Total # of ACKs), then α ← (1 − g)·α + g·F

– Adaptive window decrease: W ← (1 − α/2)·W
– Note: the window is divided by a factor between 1 (α = 0) and 2 (α = 1).
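A minimal sketch of the sender-side logic (my own Python rendering of the two update rules above; the gain g = 1/16 follows the DCTCP paper’s suggested value):

```python
class DctcpSender:
    """Sketch of DCTCP's per-RTT window update (window math only)."""

    def __init__(self, g: float = 1.0 / 16):
        self.g = g          # EWMA gain for the marked fraction
        self.alpha = 0.0    # running estimate of the extent of congestion
        self.cwnd = 10.0    # congestion window, in segments

    def on_rtt_end(self, marked_acks: int, total_acks: int) -> None:
        # F: fraction of ACKs carrying ECN marks in the last RTT
        f = marked_acks / total_acks if total_acks else 0.0
        # alpha <- (1 - g) * alpha + g * F
        self.alpha = (1 - self.g) * self.alpha + self.g * f
        # Adaptive decrease: W <- (1 - alpha/2) * W, applied at most
        # once per RTT, and only if any packet was marked this RTT.
        if marked_acks > 0:
            self.cwnd = max(self.cwnd * (1 - self.alpha / 2), 1.0)
```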
DCTCP vs TCP

Experiment: 2 flows (Win 7 stack), Broadcom 1Gbps switch, ECN marking threshold = 30KB.

[Figure: queue length (packets) vs. time (seconds) for TCP (2 flows) and DCTCP (2 flows); TCP fills hundreds of packets of buffer while DCTCP holds the queue near the marking threshold]

Under DCTCP the buffer is mostly empty, which mitigates incast by creating large buffer headroom.
Why it Works

1. Low Latency
→ Small buffer occupancies → low queuing delay

2. High Throughput
→ ECN averaging → smooth rate adjustments, low variance

3. High Burst Tolerance
→ Large buffer headroom → bursts fit
→ Aggressive marking → sources react before packets are dropped
DCTCP Deployments

Discussion

What You Said
Austin: “The paper's performance comparison to RED
seems arbitrary, perhaps RED had traction at the time?
Or just convenient as the switches were capable of
implementing it?”

Evaluation

Implemented in the Windows stack.
Real hardware, 1Gbps and 10Gbps experiments
– 90-server testbed
– Broadcom Triumph: 48 1G ports, 4MB shared memory
– Cisco Cat4948: 48 1G ports, 16MB shared memory
– Broadcom Scorpion: 24 10G ports, 4MB shared memory

Numerous micro-benchmarks
– Throughput and queue length
– Fairness and convergence
– Multi-hop
– Incast
– Queue buildup
– Static vs. dynamic buffer management
– Buffer pressure

Bing cluster benchmark
Bing Benchmark (baseline)

[Figure: completion times for background flows and query flows]
Bing Benchmark (scaled 10x)

[Figure: completion time (ms) for query traffic (incast bursts) and short messages (delay-sensitive)]

Deep buffers fix incast, but increase latency.
DCTCP is good for both incast & latency.
What You Said

Amy: “I find it unsatisfying that the details of many congestion control protocols (such as these) are so complicated! ... can we create a parameter-less congestion control protocol that is similar in behavior to DCTCP or TIMELY?”

Hongzi: “Is there a general guideline to tune the parameters, like alpha, beta, delta, N, T_low, T_high, in the system?”
A Bit of Analysis

How much buffering does DCTCP need for 100% throughput?
→ Need to quantify queue size oscillations (stability).

[Figure: sawtooth of window size oscillating between (W*+1)(1−α/2) and W*+1 around the marking threshold K; packets sent in the last RTT of each period are marked]

α = (# of packets in the last RTT of a period) / (# of packets in the period)
A Bit of Analysis (continued)

How small can queues be without loss of throughput?
→ Need to quantify queue size oscillations (stability).

For DCTCP: K > (1/7) C×RTT
For TCP: K > C×RTT

What assumptions does the model make?
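Plugging in the same illustrative fabric numbers as before shows how much smaller DCTCP’s marking threshold can be than a full bandwidth-delay product (my own arithmetic, not from the slides):

```python
C = 10e9 / 8    # 10 Gb/s in bytes/second
RTT = 100e-6    # 100 microseconds

k_tcp = C * RTT          # TCP needs K > C x RTT        = 125 KB
k_dctcp = C * RTT / 7    # DCTCP needs K > (1/7) C x RTT ~ 18 KB
print(f"TCP: {k_tcp / 1e3:.0f} KB, DCTCP: {k_dctcp / 1e3:.0f} KB")
```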
What You Said

Anurag: “In both the papers, one of the differences I saw from TCP was that these protocols don’t have the ‘slow start’ phase, where the rate grows exponentially starting from 1 packet/RTT.”
Convergence Time

DCTCP takes at most ~40% more RTTs than TCP
– “Analysis of DCTCP: Stability, Convergence, and Fairness,” SIGMETRICS 2011

Intuition: DCTCP makes smaller adjustments than TCP, but makes them much more frequently.

[Figure: window size vs. time for competing flows converging under TCP and under DCTCP]
TIMELY

→ Slides by Radhika Mittal (Berkeley)


Qualities of RTT
• Fine-grained and informative

• Quick response time

• No switch support needed

• End-to-end metric

• Works seamlessly with QoS


RTT correlates with queuing delay
What You Said
Ravi: “The first thing that struck me while reading these
papers was how different their approaches were. DCTCP even
states that delay-based protocols are "susceptible to noise in
the very low latency environment of data centers" and that
"the accurate measurement of such small increases in
queuing delay is a daunting task". Then, I noticed that there
is a 5 year gap between these two papers… “

Arman: “They had to resort to extraordinary measures to ensure that the timestamps accurately reflect the time at which a packet was put on wire…”
Accurate RTT Measurement

Hardware-assisted RTT measurement:

Hardware timestamps
– mitigate noise in measurements

Hardware acknowledgements
– avoid processing overhead

Hardware vs. Software Timestamps

Kernel timestamps introduce significant noise in RTT measurements compared to HW timestamps.
Impact of RTT Noise

Throughput degrades with increasing noise in RTT.
Precise RTT measurement is crucial.
TIMELY Framework: Overview

[Figure: pipeline — Data → RTT Measurement Engine (timestamps) → RTT → Rate Computation Engine → Rate → Pacing Engine → Paced Data]
RTT Measurement Engine

[Figure: the SENDER emits a segment at t_send (serialization delay), it experiences propagation & queuing delay in the network, and the RECEIVER’s HW ack completes at t_completion]

RTT = t_completion – t_send – Serialization Delay
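A sketch of that computation (the 64 KB segment and 10 Gb/s line rate are assumed values for illustration, not from the slides):

```python
def timely_rtt(t_send: float, t_completion: float,
               seg_bytes: int, line_rate_bps: float) -> float:
    """RTT = t_completion - t_send - serialization delay.

    Subtracting the serialization delay (the time to put the segment's
    bits on the wire) isolates propagation and queuing delay.
    """
    serialization = seg_bytes * 8 / line_rate_bps
    return (t_completion - t_send) - serialization

# A 64 KB segment on a 10 Gb/s link takes ~52 us to serialize, so a
# 120 us completion time corresponds to ~68 us of propagation + queuing.
rtt = timely_rtt(t_send=0.0, t_completion=120e-6,
                 seg_bytes=64 * 1024, line_rate_bps=10e9)
```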


Algorithm Overview

Gradient-based Increase / Decrease

[Figure: RTT vs. time under gradient = 0 (flat), gradient > 0 (rising), and gradient < 0 (falling)]

The RTT gradient is used to navigate the throughput-latency tradeoff and ensure stability.
Why Does Gradient Help Stability?

A source that feeds back only the error e(t) = RTT(t) − RTT₀ reacts to the current state; a source that feeds back e(t) + k·e′(t) also reacts to how the error is changing.

Feeding back higher-order derivatives means observing not only the error, but the change in error – “anticipating” the future state.
What You Said
Arman: “I also think that deducing the queue length
from the gradient model could lead to miscalculations.
For example, consider an Incast scenario, where many
senders transmit simultaneously through the same
path. Noting that every packet will see a long, yet
steady, RTT, they will compute a near-zero gradient and
hence the congestion will continue.”

Algorithm Overview

RTT < Tlow: Additive Increase – for better burst tolerance.
Tlow ≤ RTT ≤ Thigh: Gradient-based Increase / Decrease – to navigate the throughput-latency tradeoff and ensure stability.
RTT > Thigh: Multiplicative Decrease – to keep tail latency within acceptable limits.
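A condensed sketch of the three regimes (the structure follows the TIMELY paper, but the specific constants here are placeholders, not the paper’s tuned values; the HAI mode and gradient filtering are omitted):

```python
class TimelyRateController:
    """Sketch of TIMELY's RTT-gradient rate computation (simplified)."""

    def __init__(self, delta=10e6, beta=0.8,
                 t_low=50e-6, t_high=500e-6, min_rtt=20e-6):
        self.delta = delta        # additive increment, bits/sec
        self.beta = beta          # multiplicative decrease factor
        self.t_low = t_low        # below this RTT: additive increase
        self.t_high = t_high      # above this RTT: multiplicative decrease
        self.min_rtt = min_rtt    # normalizes the RTT gradient
        self.prev_rtt = None
        self.rate = 1e9           # current sending rate, bits/sec

    def update(self, rtt: float) -> float:
        if self.prev_rtt is None:
            self.prev_rtt = rtt
            return self.rate
        # normalized RTT gradient: change in RTT per min-RTT of time
        gradient = (rtt - self.prev_rtt) / self.min_rtt
        self.prev_rtt = rtt
        if rtt < self.t_low:                    # burst tolerance
            self.rate += self.delta
        elif rtt > self.t_high:                 # bound tail latency
            self.rate *= 1 - self.beta * (1 - self.t_high / rtt)
        elif gradient <= 0:                     # queues draining: probe up
            self.rate += self.delta
        else:                                   # queues building: back off
            self.rate *= 1 - self.beta * gradient
        return self.rate
```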
Discussion

Implementation Set-up

TIMELY is implemented in the context of RDMA.
– RDMA write and read primitives are used to invoke NIC services.

Priority Flow Control (PFC) is enabled in the network fabric.
– The RDMA transport in the NIC is sensitive to packet drops.
– PFC sends out pause frames to ensure a lossless network.
“Congestion Spreading” in Lossless Networks

[Figure: PFC PAUSE frames propagate hop-by-hop from a congested switch back toward the sources, spreading congestion across the fabric]
TIMELY vs PFC

[Figures: performance comparison of TIMELY vs. PFC]
What You Said
Amy: “I was surprised to see that TIMELY performed so
much better than DCTCP. Did the lack of an OS-bypass
for DCTCP impact performance? I wish that the authors
had offered an explanation for this result.”

Next time: Load Balancing

