Chap3 Fall10
Our goals:
understand principles
behind transport
layer services:
multiplexing/
demultiplexing
reliable data transfer
flow control
congestion control
UDP: connectionless
transport
TCP: connection-oriented
transport
TCP congestion control
Transport Layer
3-1
logical communication
[Figure: logical end-end transport between two end systems, each with application, transport, network, data link, and physical layers, across routers that have only network, data link, and physical layers]
Transport Layer
3-2
Household analogy:
communication
between hosts
communication
between processes
processes = kids
app messages = letters
in envelopes
hosts = houses
transport protocol =
Aye and Blent
network-layer protocol
= postal service
Transport Layer
3-3
reliable, in-order
delivery (TCP)
congestion control
flow control
connection setup
unreliable, unordered
delivery: UDP
no-frills extension of
best-effort IP
services not available:
delay guarantees
bandwidth guarantees
[Figure: logical end-end transport between two end systems across a chain of routers]
Transport Layer
3-4
Multiplexing/demultiplexing
Multiplexing at send host:
gathering data from multiple
sockets, enveloping data with
header (later used for
demultiplexing)
[Figure: processes P1 to P4 spread over host 1, host 2, and host 3; each host has application, transport, network, link, and physical layers]
Transport Layer
3-5
32 bits
source port #
dest port #
application
data
(message)
TCP/UDP segment format
Transport Layer
3-6
Connectionless demultiplexing
IP datagrams with
different source IP
addresses and/or source
port numbers directed
to same socket
Transport Layer
3-7
Connection-oriented demux
source IP address
source port number
dest IP address
dest port number
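The two demultiplexing rules can be sketched as lookup tables. This is a minimal illustration, not how any real stack stores its sockets: the socket names and the UDP port 53 are hypothetical, while hosts A, B, C and ports 9157, 5775, 80 are taken from the figure on the following slides.

```python
from typing import Dict, Optional, Tuple

# Connectionless (UDP-style): a socket is identified by (dest IP, dest port) only.
# Port 53 and the socket names are assumptions made for this sketch.
udp_sockets: Dict[Tuple[str, int], str] = {("C", 53): "udp_socket_1"}

# Connection-oriented (TCP-style): a socket is identified by the full 4-tuple
# (source IP, source port, dest IP, dest port).
tcp_sockets: Dict[Tuple[str, int, str, int], str] = {
    ("A", 9157, "C", 80): "conn_socket_1",
    ("B", 9157, "C", 80): "conn_socket_2",
    ("B", 5775, "C", 80): "conn_socket_3",
}


def demux_udp(src_ip: str, src_port: int, dst_ip: str, dst_port: int) -> Optional[str]:
    """Source fields are ignored: different senders reach the same socket."""
    return udp_sockets.get((dst_ip, dst_port))


def demux_tcp(src_ip: str, src_port: int, dst_ip: str, dst_port: int) -> Optional[str]:
    """All four fields are used: each connection gets its own socket."""
    return tcp_sockets.get((src_ip, src_port, dst_ip, dst_port))
```

With these tables, two UDP segments from different sources land in one socket, while two TCP segments that differ only in source IP land in different sockets.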
Transport Layer
3-8
Connection-oriented demux
(cont)
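[Figure: server at IP: C receives three segments, each demultiplexed to a different socket]
from client IP: A: S-IP: A, SP: 9157, D-IP: C, DP: 80
from client IP: B: S-IP: B, SP: 9157, D-IP: C, DP: 80
from client IP: B: S-IP: B, SP: 5775, D-IP: C, DP: 80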
Transport Layer
3-9
Connection-oriented demux
Threaded Web Server
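[Figure: threaded web server at IP: C; one server process P4 handles all three connections]
from client IP: A: S-IP: A, SP: 9157, D-IP: C, DP: 80
from client IP: B: S-IP: B, SP: 9157, D-IP: C, DP: 80
from client IP: B: S-IP: B, SP: 5775, D-IP: C, DP: 80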
Transport Layer
3-10
Internet transport
protocol
best effort service, UDP
segments may be:
lost
delivered out of order
to app
connectionless:
no handshaking between
UDP sender, receiver
each UDP segment
handled independently
of others
Transport Layer
3-11
UDP: more
often used for streaming
multimedia apps
loss tolerant
rate sensitive
other UDP uses:
DNS
SNMP
reliable transfer over UDP:
add reliability at
application layer
application-specific
error recovery!
UDP segment format (32 bits wide):
source port # | dest port #
length | checksum
application data (message)
length: length, in bytes, of UDP segment, including header
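The segment layout above can be sketched with Python's `struct` module. The helper names `build_udp_segment` and `parse_udp_segment` are made up for this illustration, and the checksum is simply passed in rather than computed.

```python
import struct

# Four 16-bit fields in network byte order: source port, dest port, length, checksum.
UDP_HEADER = struct.Struct("!HHHH")


def build_udp_segment(src_port: int, dst_port: int, payload: bytes, checksum: int = 0) -> bytes:
    """Prepend an 8-byte UDP header; length covers header plus payload."""
    length = UDP_HEADER.size + len(payload)
    return UDP_HEADER.pack(src_port, dst_port, length, checksum) + payload


def parse_udp_segment(segment: bytes):
    """Split a segment back into its header fields and application data."""
    src, dst, length, checksum = UDP_HEADER.unpack(segment[:UDP_HEADER.size])
    return src, dst, length, checksum, segment[UDP_HEADER.size:length]
```

For example, a 5-byte message produces a 13-byte segment: 8 bytes of header plus the data.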
Transport Layer
3-12
UDP checksum
Goal: detect errors (e.g., flipped bits) in transmitted
segment
Sender:
treat segment contents
as sequence of 16-bit
integers
checksum: addition (1's
complement sum) of
segment contents with
wraparound of carry out
bit
sender puts checksum
value into UDP checksum
field
Receiver:
compute checksum of
received segment
check if computed checksum
equals checksum field value:
NO - error detected
YES - no error detected.
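A minimal sketch of the checksum computation described above. It assumes the usual Internet convention that the value stored in the checksum field is the 1's complement of the wrapped sum, and it checksums only the bytes it is given (real UDP also covers a pseudo-header, which is omitted here). The function names are illustrative.

```python
def ones_complement_sum16(data: bytes) -> int:
    """Add 16-bit words, wrapping any carry out of bit 15 back into bit 0."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # wraparound of the carry
    return total


def checksum16(data: bytes) -> int:
    """Sender side: 1's complement of the wrapped sum."""
    return ~ones_complement_sum16(data) & 0xFFFF


def no_error_detected(data: bytes, checksum_field: int) -> bool:
    """Receiver side: recompute and compare with the checksum field."""
    return checksum16(data) == checksum_field
```

A match only means no error was detected; some multi-bit errors can still cancel out in the sum.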
Transport Layer
3-13
network
layer
Transport Layer
3-14
send
side
deliver_data(): called by
rdt to deliver data to upper layer
receive
side
3-15
[Figure: FSM notation: the transition from state 1 to state 2 is labeled with the event causing it and the actions taken]
Transport Layer
3-16
Rdt1.0:
no bit errors
no loss of packets
sender, state "Wait for call from above":
  rdt_send(data) -> packet = make_pkt(data); udt_send(packet)
receiver, state "Wait for call from below":
  rdt_rcv(packet) -> extract(packet,data); deliver_data(data)
Transport Layer
3-17
error detection
receiver feedback: control msgs (ACK,NAK) rcvr->sender
Transport Layer
3-18
[Figure: rdt2.0 sender and receiver FSMs]
receiver, state "Wait for call from below":
  rdt_rcv(rcvpkt) && corrupt(rcvpkt) -> udt_send(NAK)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) -> extract(rcvpkt,data); deliver_data(data); udt_send(ACK)
Transport Layer
3-19
Transport Layer
3-20
Transport Layer
3-21
What happens if ACK/NAK is corrupted?
sender doesn't know what
happened at receiver!
can't just retransmit:
possible duplicate
What to do?
sender ACKs/NAKs
Handling duplicates:
sender adds sequence
number to each pkt
3-22
[Figure: rdt2.1 sender FSM]
state "Wait for call 1 from above":
  rdt_send(data) -> sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt)
states "Wait for ACK or NAK 0" and "Wait for ACK or NAK 1":
  rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isNAK(rcvpkt) ) -> udt_send(sndpkt)
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt) -> move on to the next "Wait for call" state
Transport Layer
3-23
[Figure: rdt2.1 receiver FSM]
state "Wait for 1 from below":
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && has_seq0(rcvpkt) -> sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
on the expected packet, in either state:
  extract(rcvpkt,data); deliver_data(data); sndpkt = make_pkt(ACK, chksum); udt_send(sndpkt)
Transport Layer
3-24
rdt2.1: discussion
Sender:
seq # added to pkt
two seq. #s (0,1) will
suffice. Why?
must check if received
ACK/NAK corrupted
twice as many states
Receiver:
must check if received
packet is duplicate
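The receiver's duplicate check can be sketched as follows. This is an illustrative model, not the slide's FSM verbatim: the class name is made up, corruption is passed in as a flag instead of being detected from a checksum, and the method returns the feedback message instead of sending it.

```python
class Rdt21Receiver:
    """Alternating-bit receiver: delivers each packet once, re-ACKs duplicates."""

    def __init__(self) -> None:
        self.expected_seq = 0   # only two sequence numbers, 0 and 1, are needed
        self.delivered = []     # stands in for deliver_data()

    def rdt_rcv(self, seq: int, data: str, corrupt: bool = False) -> str:
        """Return the feedback ("ACK" or "NAK") that goes back to the sender."""
        if corrupt:
            return "NAK"
        if seq == self.expected_seq:
            self.delivered.append(data)
            self.expected_seq ^= 1  # flip between 0 and 1
        # A duplicate (seq != expected_seq) is not delivered again, but it is
        # still ACKed so that the sender can move on.
        return "ACK"
```

Two sequence numbers suffice because a stop-and-wait sender only ever has one unacknowledged packet outstanding, so the receiver only has to tell "new" from "retransmission of the previous one".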
Transport Layer
3-25
Transport Layer
3-26
sender FSM fragment:
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0)
receiver FSM fragment, state "Wait for 0 from below":
  rdt_rcv(rcvpkt) && (corrupt(rcvpkt) || has_seq1(rcvpkt)) -> udt_send(sndpkt)
Transport Layer
3-27
3-28
rdt3.0 sender
state "Wait for call 0 from above":
  rdt_send(data) -> sndpkt = make_pkt(0, data, checksum); udt_send(sndpkt); start_timer
  rdt_rcv(rcvpkt) -> (no action)
state "Wait for ACK0":
  rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isACK(rcvpkt,1) ) -> (no action)
  timeout -> udt_send(sndpkt); start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,0) -> stop_timer
state "Wait for call 1 from above":
  rdt_send(data) -> sndpkt = make_pkt(1, data, checksum); udt_send(sndpkt); start_timer
  rdt_rcv(rcvpkt) -> (no action)
state "Wait for ACK1":
  rdt_rcv(rcvpkt) && ( corrupt(rcvpkt) || isACK(rcvpkt,0) ) -> (no action)
  timeout -> udt_send(sndpkt); start_timer
  rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && isACK(rcvpkt,1) -> stop_timer
Transport Layer
3-29
rdt3.0 in action
Transport Layer
3-30
rdt3.0 in action
Transport Layer
3-31
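Performance of rdt3.0
rdt3.0 works, but performance stinks
example: 1 Gbps link, 15 ms e-e prop. delay, 1KB packet:
$T_{transmit} = \frac{L}{R} = \frac{8000\ \text{bits}}{10^9\ \text{bits/sec}} = 8\ \mu\text{s} = .008\ \text{ms}$
$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{.008}{30.008} = 0.00027$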
3-32
[Figure: stop-and-wait timeline between sender and receiver, one packet per RTT]
$U_{sender} = \frac{L/R}{RTT + L/R} = \frac{.008}{30.008} = 0.00027$
Transport Layer
3-33
Pipelined protocols
Pipelining: sender allows multiple, in-flight, yet-to-be-acknowledged pkts
two generic forms of pipelined protocols: Go-Back-N, selective repeat
Transport Layer
3-34
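[Figure: pipelined timeline between sender and receiver, three packets in flight per RTT]
Increase utilization
by a factor of 3!
$U_{sender} = \frac{3 \cdot L/R}{RTT + L/R} = \frac{.024}{30.008} = 0.0008$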
Transport Layer
3-35
Utilization=N(L/R)/(RTT+L/R) if NL/R<RTT+L/R
and the sender pauses after it transmits a window
of packets until it receives first ACK
Utilization=1 if
NL/R>RTT+L/R and the
sender does not pause
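The two utilization cases can be folded into one function. This is a small sketch with an illustrative function name; the parameters mirror the slide's N, L, R, and RTT.

```python
def sender_utilization(n_pkts: int, pkt_bits: float, rate_bps: float, rtt_s: float) -> float:
    """Fraction of time the sender is busy with a window of n_pkts packets."""
    t_transmit = pkt_bits / rate_bps                       # L/R
    # N(L/R)/(RTT+L/R) while the sender has to pause for the first ACK;
    # capped at 1 once the window keeps the link busy for a full RTT+L/R.
    return min(1.0, n_pkts * t_transmit / (rtt_s + t_transmit))
```

With the numbers used earlier (1 Gbps link, 8000-bit packet, 30 ms RTT), N=1 gives about 0.00027 and N=3 about 0.0008.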
Transport Layer
3-36
Go-Back-N
Sender:
k-bit seq # in pkt header
window of up to N, consecutive unacked pkts allowed
Transport Layer
3-37
[Figure: GBN sender extended FSM, single state "Wait"]
initially: base=1; nextseqnum=1
rdt_rcv(rcvpkt) && corrupt(rcvpkt) -> (no action)
timeout ->
  start_timer
  udt_send(sndpkt[base])
  udt_send(sndpkt[base+1])
  ...
  udt_send(sndpkt[nextseqnum-1])
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) ->
  base = getacknum(rcvpkt)+1
  If (base == nextseqnum)
    stop_timer
  else
    start_timer
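The sender events can be sketched as a small class. This is an illustration under assumptions: the class and its `sent_log` are made up, the timer is a boolean rather than a real clock, the `rdt_send` handler follows the usual textbook behavior (refuse data when the window is full), and ACK packets are assumed uncorrupted. Sequence numbers are unbounded integers here rather than k-bit values.

```python
class GbnSender:
    """Go-Back-N sender with a window of up to N unACKed packets."""

    def __init__(self, window_size: int) -> None:
        self.N = window_size
        self.base = 1             # oldest unACKed sequence number
        self.nextseqnum = 1       # next sequence number to use
        self.sndpkt = {}          # seq -> data for packets not yet ACKed
        self.timer_running = False
        self.sent_log = []        # stands in for udt_send()

    def udt_send(self, seq: int) -> None:
        self.sent_log.append(seq)

    def rdt_send(self, data: str) -> bool:
        """Send if the window has room; otherwise refuse the data."""
        if self.nextseqnum >= self.base + self.N:
            return False
        self.sndpkt[self.nextseqnum] = data
        self.udt_send(self.nextseqnum)
        if self.base == self.nextseqnum:
            self.timer_running = True   # start_timer for the oldest packet
        self.nextseqnum += 1
        return True

    def timeout(self) -> None:
        """Restart the timer and resend every packet in [base, nextseqnum-1]."""
        self.timer_running = True
        for seq in range(self.base, self.nextseqnum):
            self.udt_send(seq)

    def rdt_rcv_ack(self, acknum: int) -> None:
        """Cumulative ACK: everything up to acknum is acknowledged."""
        for seq in range(self.base, acknum + 1):
            self.sndpkt.pop(seq, None)
        # max() keeps a stale ACK from moving base backwards (a guard added here).
        self.base = max(self.base, acknum + 1)
        # stop_timer if nothing is outstanding, else (re)start it.
        self.timer_running = self.base != self.nextseqnum
```

A timeout after packet 1 has been ACKed resends packets 2 and 3, which is the "go back N" behavior the name refers to.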
Transport Layer
3-38
[Figure: GBN receiver extended FSM, single state "Wait"]
initially: expectedseqnum=1; sndpkt = make_pkt(0,ACK,chksum)
rdt_rcv(rcvpkt) && notcorrupt(rcvpkt) && hasseqnum(rcvpkt,expectedseqnum) ->
  extract(rcvpkt,data)
  deliver_data(data)
  sndpkt = make_pkt(expectedseqnum,ACK,chksum)
  udt_send(sndpkt)
  expectedseqnum++
out-of-order pkt:
3-39
GBN in action
Transport Layer
3-40
Selective Repeat
sender window
N consecutive seq #s
again limits seq #s of sent, unACKed pkts
Transport Layer
3-41
Transport Layer
3-42
Selective repeat
sender
data from above :
if next available seq # in window, send pkt
timeout(n):
resend pkt n, restart timer
ACK(n) in [sendbase,sendbase+N-1]:
mark pkt n as received
if n smallest unACKed pkt,
advance window base to next unACKed seq #
receiver
pkt n in [rcvbase, rcvbase+N-1]
send ACK(n)
out-of-order: buffer
pkt n in [rcvbase-N,rcvbase-1]
ACK(n)
otherwise:
ignore
Transport Layer
3-43
Transport Layer
3-44
Selective repeat:
dilemma
Example:
seq #s: 0, 1, 2, 3
window size=3
receiver sees no
difference in two
scenarios!
incorrectly passes
duplicate data as new
in (a)
Q: what relationship
between seq # size
and window size?
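The dilemma can be checked mechanically. The sketch below models only the worst case: the receiver has accepted a full window and every ACK was lost, so the sender may resend the old window while the receiver already expects the next one. The function name is made up for illustration.

```python
def sr_receiver_can_confuse(seq_space: int, window: int) -> bool:
    """True if a retransmitted old packet can fall inside the receiver's new window."""
    # Sequence numbers the sender may still retransmit (the old window).
    old = {n % seq_space for n in range(0, window)}
    # Sequence numbers the receiver now accepts as new data (the next window).
    new = {n % seq_space for n in range(window, 2 * window)}
    return bool(old & new)
```

For the slide's example (seq #s 0, 1, 2, 3 and window size 3) the two sets overlap, so duplicates can be taken for new data; with window size 2 they do not. In this model the overlap disappears exactly when the sequence space is at least twice the window size.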
Transport Layer
3-45
[Figure: GBN, sender and receiver in step: sender's window runs from expectedSN to expectedSN+N-1; receiver expects expectedSN]
Transport Layer
3-46
[Figure: GBN, sender lagging: sender's window ends at expectedSN-1; receiver expects expectedSN]
Transport Layer
3-47
$2^k \ge N+1$
[Figure: GBN, sender's window can start as low as snd_base=expectedSN-N and reach as high as expectedSN+N-1, while the receiver expects expectedSN]
Transport Layer
3-48
[Figure: SR, sender and receiver in step: sender's window and receiver's window both run from rcv_base to rcv_base+N-1]
Transport Layer
3-49
[Figure: SR, sender lagging: sender's window ends at rcv_base-1; receiver's window runs from rcv_base to rcv_base+N-1]
Transport Layer
3-50
$2^k \ge 2N$
[Figure: SR, sender's window can start as low as snd_base=rcv_base-N, while the receiver's window runs from rcv_base to rcv_base+N-1]
Transport Layer
3-51
TCP: Overview
point-to-point:
stream:
no message boundaries
pipelined:
connection-oriented:
handshaking (exchange
of control msgs) inits
sender, receiver state
before data exchange
flow controlled:
[Figure: application writes data through the socket door into the TCP send buffer; a segment carries it to the TCP receive buffer, where the application reads data through its socket door]
Transport Layer
3-52
TCP segment structure (32 bits wide):
source port # | dest port #
sequence number
acknowledgement number
head len | not used | U A P R S F | Receive window
checksum | Urg data pnter
application data (variable length)
sequence and acknowledgement numbers: counting by bytes of data (not segments!)
Receive window: # bytes rcvr willing to accept
Transport Layer
3-53
Transport Layer
3-54
too short: premature timeout
unnecessary
retransmissions
too long: slow reaction
to segment loss
Transport Layer
3-55
Transport Layer
3-56
[Figure: SampleRTT and Estimated RTT plotted as RTT (milliseconds), 100 to 300, against time (seconds), 1 to 106]
Transport Layer
3-57
EstimatedRTT:
DevRTT = (1-β)*DevRTT +
β*|SampleRTT-EstimatedRTT|
(typically, β = 0.25)
Then set timeout interval:
TimeoutInterval = EstimatedRTT + 4*DevRTT
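The timeout computation can be sketched as a running estimator. Several things here are assumptions not stated on the slide: the EstimatedRTT update `(1-alpha)*EstimatedRTT + alpha*SampleRTT` with `alpha = 0.125`, the initial values (first sample as the estimate, zero deviation), and updating DevRTT before EstimatedRTT. The class name is illustrative.

```python
class RttEstimator:
    """Exponentially weighted RTT estimate plus a deviation-based safety margin."""

    def __init__(self, first_sample: float, alpha: float = 0.125, beta: float = 0.25) -> None:
        self.alpha = alpha              # assumed weight for EstimatedRTT
        self.beta = beta                # weight for DevRTT (typically 0.25)
        self.estimated_rtt = first_sample
        self.dev_rtt = 0.0              # assumed starting deviation

    def update(self, sample_rtt: float) -> float:
        """Fold in one SampleRTT and return the new TimeoutInterval."""
        # DevRTT = (1-beta)*DevRTT + beta*|SampleRTT-EstimatedRTT|
        self.dev_rtt = (1 - self.beta) * self.dev_rtt + self.beta * abs(sample_rtt - self.estimated_rtt)
        # EstimatedRTT is updated after the deviation, using the same sample.
        self.estimated_rtt = (1 - self.alpha) * self.estimated_rtt + self.alpha * sample_rtt
        return self.timeout_interval()

    def timeout_interval(self) -> float:
        # TimeoutInterval = EstimatedRTT + 4*DevRTT
        return self.estimated_rtt + 4 * self.dev_rtt
```

Starting from a 100 ms estimate, a single 120 ms sample moves the estimate to 102.5 ms and the deviation to 5 ms, giving a 122.5 ms timeout.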
Transport Layer
3-58
Retransmissions are
triggered by:
timeout events
duplicate acks
Initially consider
simplified TCP sender:
Transport Layer
3-59
timeout:
retransmit segment that
caused timeout (first
segment in the window)
restart timer
Ack rcvd:
If acknowledges previously
unacked segments
Transport Layer
3-60
NextSeqNum = InitialSeqNum
SendBase = InitialSeqNum
loop (forever) {
    switch(event)
        event: data received from application above
            create TCP segment with sequence number NextSeqNum
            if (timer currently not running)
                start timer
            pass segment to IP
            NextSeqNum = NextSeqNum + length(data)
        event: timer timeout
            retransmit not-yet-acknowledged segment with
                smallest sequence number
            start timer
        event: ACK received, with ACK field value of y
            if (y > SendBase) {
                SendBase = y
                if (there are currently not-yet-acknowledged segments)
                    start timer
            }
} /* end of loop forever */
TCP
sender
(simplified)
Comment:
SendBase-1: last
cumulatively
acked byte
Example:
SendBase-1 = 71;
y= 73, so the rcvr
wants 73+ ;
y > SendBase, so
that new data is
acked
Transport Layer
3-61
[Figure: two retransmission timelines between Host A and Host B]
loss of ACK: A sends Seq=92, 8 bytes data; B's ACK=100 is lost; the Seq=92 timeout fires and A resends Seq=92, 8 bytes data; B sends ACK=100 again; SendBase = 100
premature timeout: A sends Seq=92, 8 bytes data and Seq=100, 20 bytes data; B sends ACK=100 and ACK=120; the Seq=92 timeout fires before the ACKs arrive and A resends Seq=92, 8 bytes data; B answers ACK=120; SendBase = 120
Transport Layer
3-62
Cumulative ACK scenario
[Figure: Host A sends Seq=92, 8 bytes data and Seq=100, 20 bytes data; Host B's ACK=100 is lost, but ACK=120 arrives before the timeout; SendBase = 120]
Transport Layer
3-63
Event at Receiver
3-64
Fast Retransmit
If sender receives 3
ACKs for the same
data, it supposes that
segment after ACKed
data was lost:
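The duplicate-ACK rule can be sketched as a counter. This illustration counts duplicates after the first ACK for a given byte, so the resend fires on the third duplicate (the "triple duplicate ACKs" case); the class name is made up, and the retransmission itself is left to the caller.

```python
from typing import Optional


class FastRetransmitDetector:
    """Counts duplicate ACKs and signals when to resend without waiting for the timer."""

    def __init__(self, send_base: int) -> None:
        self.send_base = send_base   # oldest unACKed byte
        self.dup_acks = 0

    def on_ack(self, y: int) -> Optional[int]:
        """Return the sequence number to resend now, or None."""
        if y > self.send_base:
            # New data acknowledged: slide forward and reset the counter.
            self.send_base = y
            self.dup_acks = 0
            return None
        # ACK for data that was already acknowledged: a duplicate.
        self.dup_acks += 1
        if self.dup_acks == 3:
            return y   # the segment starting at byte y is presumed lost
        return None
```

Only the third duplicate triggers the resend; later duplicates for the same byte do not trigger it again in this sketch.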
Transport Layer
3-65
Fast Retransmit
Resend a segment
after 3 duplicate
ACKs since a
duplicate ACK
means that an out-of-sequence
segment was
received
duplicate ACKs due
to packet
reordering!
if window is small
don't get duplicate
ACKs!
[Figure: Host A sends seq # x1 through seq # x5; Host B returns ACK x1 four times (triple duplicate ACKs); Host A resends seq X before the timeout]
Transport Layer
3-66
fast retransmit
Transport Layer
3-67
flow control
speed-matching
service: matching the
send rate to the
receiving app's drain
rate
Transport Layer
3-68
guarantees receive
buffer doesn't overflow
RcvWin = RcvBuffer - [LastByteRcvd - LastByteRead]
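The window advertisement is a one-line calculation. The sketch below adds one assumption beyond the slide: the sender-side rule that unACKed data in flight (`LastByteSent - LastByteAcked`) must not exceed RcvWin. Function names are illustrative.

```python
def rcv_win(rcv_buffer: int, last_byte_rcvd: int, last_byte_read: int) -> int:
    """Spare room the receiver advertises: RcvBuffer - [LastByteRcvd - LastByteRead]."""
    return rcv_buffer - (last_byte_rcvd - last_byte_read)


def sender_may_send(last_byte_sent: int, last_byte_acked: int, advertised_win: int) -> int:
    """Bytes the sender may still put in flight (assumed sender-side rule)."""
    return max(0, advertised_win - (last_byte_sent - last_byte_acked))
```

With a 4096-byte buffer and nothing read by the application, 2048 received bytes leave a window of 2048, and 4096 received bytes leave a window of 0, which blocks the sender.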
Transport Layer
3-69
[Figure: flow control timeline with a 4K receive buffer]
Sender sends 2K of data: 2K SeqNo=0; receiver answers AckNo=2048 RcvWin=2048
Sender sends 2K SeqNo=2048; receiver answers AckNo=4096 RcvWin=0
Sender blocked
receiver later answers AckNo=4096 RcvWin=1024
Transport Layer
3-70
Transport Layer
3-71
Host A
Host B
λ_out
λ_in : original data
unlimited shared
output link buffers
large delays
when congested
maximum
achievable
throughput
Transport Layer
3-72
Host A
λ_in : original data
λ_out
Host B
Transport Layer
3-73
λ'_in > λ_out
retransmission of delayed (not lost) packet makes λ'_in larger
(than perfect case) for same λ_out
[Figure: three plots of λ_out against λ_in, with λ_out topping out at a. R/2, b. R/3, c. R/4]
costs of congestion:
more work (retrans) for given goodput
unneeded retransmissions: link carries multiple copies of pkt
Transport Layer
3-74
Q: what happens as λ_in
and λ'_in increase?
multihop paths
timeout/retransmit
[Figure: Host A and Host B; λ_in : original data; λ_out]
Transport Layer
3-75
3-76
End-end congestion control:
no explicit feedback from
network
congestion inferred from
end-system observed loss,
delay
approach taken by TCP
Network-assisted
congestion control:
routers provide feedback
to end systems
single bit indicating
congestion (SNA,
DECbit, TCP/IP ECN,
ATM)
explicit rate sender
should send at
Transport Layer
3-77
LastByteSent-LastByteAcked ≤ CongWin
3-78
sending rate
TCPs
sawtooth
behavior
time
details to follow
Transport Layer
3-79
roughly, $\text{rate} = \frac{\text{cwnd}}{RTT}$ bytes/sec
[Figure: sender transmits cwnd bytes, then waits one RTT for the ACK(s)]
Transport Layer
3-80
Transport Layer
3-81
RTT
MSS
example: MSS = 500 bytes &
RTT = 200 msec
initial rate = 20 kbps
available bandwidth may be >>
MSS/RTT
desirable to quickly ramp up to
respectable rate
increase rate exponentially until
first loss event or when threshold
reached
double cwnd every RTT
done by incrementing cwnd by 1
for every ACK received
[Figure: slow start between Host A and Host B over time: one segment, then two segments, then four segments in successive RTTs]
Transport Layer
3-82
congestion
window size grows
very rapidly
TCP slows the increase once CongWin reaches ssthresh
[Figure: slow start trace: cwnd = 1 when segment 1 is sent; cwnd = 2 after the ACK for segment 1; segments 2 and 3 follow, then segments 4 through 7; cwnd rises through 5, 6, 7 to cwnd = 8 as the ACKs for segments 4 through 7 arrive]
Transport Layer
3-83
AIMD
ACKs: increase cwnd
by 1 MSS per RTT:
additive increase
loss: cut cwnd in half
(non-timeout-detected
loss ): multiplicative
decrease
3-84
Congestion Avoidance
Transport Layer
3-85
Assume that ssthresh = 8
[Figure: cwnd (0 to 14) plotted against roundtrip times (t=0, t=2, t=4, t=6) with the ssthresh line at 8; labeled window values run from cwnd = 3 up to cwnd = 10]
Transport Layer
3-86
CA
ssthresh
SS
Transport Layer
3-87
Responses to Congestion
TCP assumes there is congestion if it detects a packet
loss
A TCP sender can detect lost packets via loss events:
Timeout of a retransmission timer
Receipt of 3 duplicate ACKs (fast retransmit)
TCP interprets a Timeout as a binary congestion signal.
When a timeout occurs, the sender performs:
ssthresh = CongWin / 2
CongWin = 1
3-88
Philosophy:
Transport Layer
3-89
Slow Start
(exponential
increase phase) is
continued until
CongWin reaches
half of the level
where the loss
event occurred
last time.
CongWin is
increased slowly
after (linear
increase in
Congestion
Avoidance phase).
3-90
ssthresh
ssthresh
TCP Tahoe
Transmission round
Transport Layer
3-91
When CongWin is below Threshold, sender is in slow-start phase, window grows exponentially.
When CongWin is above Threshold, sender is in
congestion-avoidance phase, window grows linearly.
When a triple duplicate ACK occurs, Threshold set
to CongWin/2 and CongWin set to Threshold.
When timeout occurs, Threshold set to CongWin/2
and CongWin is set to 1 MSS.
The actual sender window size is determined based
on the congestion and flow control algorithms
SenderWin=min(RcvWin,CongWin)
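The rules above can be collected into a small model. This is a sketch under assumptions: the window is measured in units of MSS, slow start adds one MSS per new ACK (which doubles the window per RTT), CongWin exactly equal to Threshold is treated as congestion avoidance, and the class name is made up.

```python
class CongestionControl:
    """Window evolution following the summary rules: slow start, CA, loss reactions."""

    def __init__(self, threshold: float, mss: float = 1.0) -> None:
        self.mss = mss
        self.cong_win = mss          # start with one segment
        self.threshold = threshold

    def on_new_ack(self) -> None:
        if self.cong_win < self.threshold:
            # Slow start: +1 MSS per ACK, i.e. exponential growth per RTT.
            self.cong_win += self.mss
        else:
            # Congestion avoidance: about +1 MSS per RTT, i.e. linear growth.
            self.cong_win += self.mss * (self.mss / self.cong_win)

    def on_triple_dup_ack(self) -> None:
        self.threshold = self.cong_win / 2
        # CongWin follows Threshold but never drops below 1 MSS.
        self.cong_win = max(self.threshold, self.mss)

    def on_timeout(self) -> None:
        self.threshold = self.cong_win / 2
        self.cong_win = self.mss     # back to slow start

    def sender_win(self, rcv_win: float) -> float:
        """SenderWin=min(RcvWin,CongWin)."""
        return min(rcv_win, self.cong_win)
```

Starting with a threshold of 8, seven new ACKs take the window from 1 to 8 in slow start; the next ACK adds only 1/8 of a segment.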
Transport Layer
3-92
Event | State | TCP Sender Action | Commentary
ACK receipt for previously unacked data | Slow Start (SS) | | Resulting in a doubling of CongWin every RTT
ACK receipt for previously unacked data | Congestion Avoidance (CA) | CongWin = CongWin+MSS * (MSS/CongWin) |
Loss event detected by triple duplicate ACK | SS or CA | Threshold = CongWin/2, CongWin = Threshold, Set state to Congestion Avoidance | Fast recovery, implementing multiplicative decrease. CongWin will not drop below 1 MSS.
Timeout | SS or CA | Threshold = CongWin/2, CongWin = 1 MSS, Set state to Slow Start |
Duplicate ACK | SS or CA | |
Transport Layer
3-93
TCP throughput
Q: what's average throughput of TCP as
a function of window size and RTT?
ignore slow start
Transport Layer
3-94
$\text{throughput} = \frac{1.22 \cdot MSS}{RTT \sqrt{L}}$
$\Rightarrow L = 2 \cdot 10^{-10}$ Wow
new versions of TCP for high-speed
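The relation between throughput and loss rate can be evaluated directly. The scenario numbers used in the usage note (1500-byte segments, 100 ms RTT, 10 Gbps target) are assumptions borrowed from the common textbook example, not values that survive on this slide; the function names are illustrative.

```python
import math


def tcp_throughput_bps(mss_bits: float, rtt_s: float, loss: float) -> float:
    """Throughput = 1.22 * MSS / (RTT * sqrt(L))."""
    return 1.22 * mss_bits / (rtt_s * math.sqrt(loss))


def loss_for_throughput(mss_bits: float, rtt_s: float, target_bps: float) -> float:
    """Invert the formula: the loss rate L that still allows target_bps."""
    return (1.22 * mss_bits / (rtt_s * target_bps)) ** 2
```

Under those assumed numbers, `loss_for_throughput(1500 * 8, 0.1, 10e9)` comes out at roughly 2e-10, which is the order of magnitude the slide reacts to.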
Transport Layer
3-95
TCP Fairness
Fairness goal: if K TCP sessions share same
bottleneck link of bandwidth R, each should have
average rate of R/K
TCP connection 1
TCP
connection 2
bottleneck
router
capacity R
Transport Layer
3-96
[Figure: Connection 2 throughput plotted against Connection 1 throughput, each axis running up to R]
Transport Layer
3-97
Fairness (more)
Fairness and UDP
Multimedia apps often
do not use TCP
Transport Layer
3-98
Transport Layer
3-99
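[Figure: closing a connection: client closes and sends FIN; server answers ACK, closes, and sends FIN; client answers ACK and enters timed wait; connection closed]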
Transport Layer 3-100
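[Figure: closing a connection (cont.): client closing sends FIN; server answers ACK, then server closing sends FIN; client enters timed wait; client closed, server closed]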
Transport Layer 3-101
TCP server
lifecycle
TCP client
lifecycle
TCP/IP parameters
tcp_synack_retries
tcp_window_scaling
Maximum window size of 65535 bytes is not enough for really
fast networks. The window scaling option allows for almost
gigabyte windows, which is good for connections with a large
delay-bandwidth product.
tcp_max_syn_backlog
tcp_fin_timeout
tcp_rmem
tcp_wmem
3-105