Computer Networks
Release 1.0
Peter L Dordal
CONTENTS

0 Preface
   0.1 Classroom Use
   0.2 Progress Notes
   0.3 Technical considerations

1 An Overview of Networks
   1.1 Layers
   1.2 Bandwidth and Throughput
   1.3 Packets
   1.4 Datagram Forwarding
   1.5 Topology
   1.6 Routing Loops
   1.7 Congestion
   1.8 Packets Again
   1.9 LANs and Ethernet
   1.10 IP - Internet Protocol
   1.11 DNS
   1.12 Transport
   1.13 Firewalls
   1.14 Network Address Translation
   1.15 IETF and OSI
   1.16 Berkeley Unix
   1.17 Epilog
   1.18 Exercises

2 Ethernet
   2.1 10-Mbps classic Ethernet
   2.2 100 Mbps (Fast) Ethernet
   2.3 Gigabit Ethernet
   2.4 Ethernet Switches
   2.5 Spanning Tree Algorithm
   2.6 Virtual LAN (VLAN)
   2.7 Epilog
   2.8 Exercises

3 Other LANs

4 Links
   4.1 Encoding and Framing
   4.2 Time-Division Multiplexing
   4.3 Epilog
   4.4 Exercises

5 Packets
   5.1 Packet Delay
   5.2 Packet Delay Variability
   5.3 Packet Size
   5.4 Error Detection
   5.5 Epilog
   5.6 Exercises

6 Abstract Sliding Windows

7 IP version 4
   7.1 The IPv4 Header
   7.2 Interfaces
   7.3 Special Addresses
   7.4 Fragmentation
   7.5 The Classless IP Delivery Algorithm
   7.6 IP Subnets
   7.7 Address Resolution Protocol: ARP
   7.8 Dynamic Host Configuration Protocol (DHCP)
   7.9 Internet Control Message Protocol
   7.10 Unnumbered Interfaces
   7.11 Mobile IP
   7.12 Epilog
   7.13 Exercises

8 IP version 6

9 Routing-Update Algorithms
   9.1 Distance-Vector Routing-Update Algorithm
   9.2 Distance-Vector Slow-Convergence Problem
   9.3 Observations on Minimizing Route Cost
   9.4 Loop-Free Distance Vector Algorithms
   9.5 Link-State Routing-Update Algorithm
   9.6 Routing on Other Attributes
   9.7 Epilog
   9.8 Exercises

10 Large-Scale IP Routing
   10.1 Classless Internet Domain Routing: CIDR
   10.2 Hierarchical Routing
   10.3 Legacy Routing
   10.4 Provider-Based Routing
   10.5 Geographical Routing
   10.6 Border Gateway Protocol, BGP
   10.7 Epilog
   10.8 Exercises

11 UDP Transport
   11.1 User Datagram Protocol UDP
   11.2 Fundamental Transport Issues
   11.3 Trivial File Transport Protocol, TFTP
   11.4 Remote Procedure Call (RPC)
   11.5 Epilog
   11.6 Exercises

12 TCP Transport
   12.1 The End-to-End Principle
   12.2 TCP Header
   12.3 TCP Connection Establishment
   12.4 TCP and WireShark
   12.5 TCP simplex-talk

13 TCP Reno and Congestion Management

15 Newer TCP Implementations
   15.2 RTTs
   15.3 Highspeed TCP
   15.4 TCP Vegas
   15.5 FAST TCP
   15.6 TCP Westwood
   15.7 TCP Veno
   15.8 TCP Hybla
   15.9 TCP Illinois
   15.10 H-TCP
   15.11 TCP CUBIC
   15.12 Epilog
   15.13 Exercises

19 Quality of Service
   19.1 Net Neutrality

Bibliography

Index
Peter L Dordal
Department of Computer Science
Loyola University Chicago
0 PREFACE
No man but a blockhead ever wrote, except for money. - Samuel Johnson
The textbook world is changing. On the one hand, open source software and creative-commons licensing
have been great successes; on the other hand, unauthorized PDFs of popular textbooks are widely available,
and it is time to consider flowing with rather than fighting the tide. Hence this open textbook, released for
free under the Creative Commons license described below. Mene, mene, tekel pharsin.
Perhaps the last straw, for me, was patent 8195571 for a roundabout method to force students to purchase
textbooks. (A simpler strategy might be to include the price of the book in the course.) At some point,
faculty have to be advocates for their students rather than, well, Hirudinea.
This is not to say that I have anything against for-profit publishing. It is just that this particular book does not
and will not belong to that category. In this it is in good company: there is Wikipedia, there is Gnu/Linux,
and there is an increasing number of other free online textbooks out there. The market inefficiencies of
traditional publishing are sobering: the return to authors of advanced textbooks is at best modest, and costs
to users are quite high.
This text is released under the Creative Commons license Attribution-NonCommercial-NoDerivs; this text
is like a conventional book, in other words, except that it is free. You may copy the work and distribute it
to others, but reuse requires attribution. Creation of derivative works (eg modifying chapters or creating
additional chapters) and distributing them as part of this work also requires permission.
The work may not be used for commercial purposes without permission. Permission is likely to be granted
for use and distribution of all or part of the work in for-profit and commercial training programs, provided
there is no direct charge to recipients for the work and provided the free nature of the work is made clear to
recipients (eg by including this preface). However, such permission must always be requested. Alternatively,
participants in commercial programs may be instructed to download the work individually.
The official book website (potentially subject to change) is intronetworks.cs.luc.edu. The book is available
there as online html, as a zipped archive of html files, in .pdf format, and in other formats as may prove
useful.
The book can also be used as a networks supplement or companion to other resources for a variety of
other courses that overlap to some greater or lesser degree with networking. At Loyola, earlier versions of
this material have been used, coupled with a second textbook, in courses in computer security, network
management, telecommunications, and even introduction-to-computing courses for non-majors. Another
possibility is an alternative or nontraditional presentation of networking itself. It is when used in concert
with other works, in particular, that this book's being free is of marked advantage.
Finally, I hope the book may also be useful as a reference work. To this end, I have attempted to ensure that
the indexing and cross-referencing is sufficient to support the drop-in reader. Similarly, obscure notation is
kept to a minimum.
Much is sometimes made, in the world of networking textbooks, about top-down versus bottom-up sequencing. This book is not really either, although the chapters are mostly numbered in bottom-up fashion.
Instead, the first chapter provides a relatively complete overview of the LAN, IP and transport network layers
(along with a few other things), allowing subsequent chapters to refer to all network layers without forward
reference, and, more importantly, allowing the chapters to be covered in a variety of different orders. As a
practical matter, when I use this text to teach Loyola's Introduction to Computer Networks course, I cover
the IP and TCP material more or less in parallel.
A distinctive feature of the book is the extensive coverage of TCP: TCP dynamics, newer versions of TCP
such as TCP Cubic, and a chapter on using the ns-2 simulator to explore actual TCP behavior. This has
its roots in a longstanding goal to find better ways to present competition and congestion in the classroom.
Another feature is the detailed chapter on queuing disciplines.
One thing this book makes little attempt to cover in detail is the application layer; the token example included is SNMP. While SNMP actually makes a pretty good example of a self-contained application, my
recommendation to instructors who wish to cover more familiar examples is to combine this text with the
appropriate application documentation.
For those interested in using the book for a traditional networks course, I, with some trepidation, offer the
following set of core material. In solidarity with those who prefer alternatives to a bottom-up ordering, I
emphasize that this represents a set and not a sequence.
• 1 An Overview of Networks
• Selected sections from 2 Ethernet, particularly switched Ethernet
• Selected sections from 3.3 Wi-Fi
• Selected sections from 5 Packets
• 6 Abstract Sliding Windows
• 7 IP version 4 and/or 8 IP version 6
• Selected sections from 9 Routing-Update Algorithms and 10 Large-Scale IP Routing
• 11 UDP Transport
• 12 TCP Transport
• 13 TCP Reno and Congestion Management
With some care in the topic-selection details, the above can be covered in one semester along with a survey
of selected important network applications, or the basics of network programming, or the introductory
configuration of switches and routers, or coverage of additional material from this book, or some other set of
additional topics. Of course, non-traditional networks courses may focus on a quite different set of topics.
Peter Dordal
Shabbona, Illinois
The text uses a few special characters, for example the angle brackets ⟨ and ⟩ used in notation such as ⟨host,next_hop⟩.
If no available browser displays these properly, I recommend the pdf or epub formats. Generally Firefox
handles the necessary characters out of the box, as does Internet Explorer, but Chrome does not.
The diagrams in the body of the text are now all in bitmap .png format, although a few diagrams rendered
with line-drawing characters still appear in the exercises. I would prefer to use the vector-graphics .svg
format, but as of January 2014 most browsers do not appear to support zooming in on .svg images, which is
really the whole point.
1 AN OVERVIEW OF NETWORKS
Somewhere there might be a field of interest in which the order of presentation of topics is well agreed upon.
Computer networking is not it.
There are many interconnections in the field of networking, as in most technical fields, and it is difficult
to find an order of presentation that does not involve endless forward references to future chapters; this
is true even if, as is done here, a largely bottom-up ordering is followed. I have therefore taken here a
different approach: this first chapter is a summary of the essentials (LANs, IP and TCP) across the board,
and later chapters expand on the material here.
Local Area Networks, or LANs, are the physical networks that provide the connection between machines
within, say, a home, school or corporation. LANs are, as the name says, local; it is the IP, or Internet
Protocol, layer that provides an abstraction for connecting multiple LANs into, well, the Internet. Finally,
TCP deals with transport and connections and actually sending user data.
This chapter also contains some important other material. The section on datagram forwarding, central
to packet-based switching and routing, is essential. This chapter also discusses packets generally, congestion, and sliding windows, but those topics are revisited in later chapters. Firewalls and network address
translation are also covered here and not elsewhere.
1.1 Layers
These three topics (LANs, IP and TCP) are often called layers; they constitute the Link layer, the Internetwork layer, and the Transport layer respectively. Together with the Application layer (the software you use),
these form the four-layer model for networks. A layer, in this context, corresponds strongly to the idea
of a programming interface or library (though some of the layers are not accessible to ordinary users): an
application hands off a chunk of data to the TCP library, which in turn makes calls to the IP library, which
in turn calls the LAN layer for actual delivery.
The LAN layer is in charge of actual delivery of packets, using LAN-layer-supplied addresses. It is often
conceptually subdivided into the physical layer dealing with, eg, the analog electrical, optical or radio
signaling mechanisms involved, and above that an abstracted logical LAN layer that describes all the
digital (that is, non-analog) operations on packets; see 2.1.2 The LAN Layer. The physical layer is
generally of direct concern only to those designing LAN hardware; the kernel software interface to the LAN
corresponds to the logical LAN layer. This LAN physical/logical division gives us the Internet five-layer
model. This is less a formal hierarchy than an ad hoc classification method. We will return to this below in
1.15 IETF and OSI.
1.2 Bandwidth and Throughput

Any one network connection has a data rate: the rate at which bits are transmitted. Throughput refers to the
overall effective transmission rate, taking into account things like transmission overhead, protocol inefficiencies and perhaps
even competing traffic. It is generally measured at a higher network layer than the data rate.
The term bandwidth can be used to refer to either of these, though we here try to use it mostly as a synonym
for data rate. The term comes from radio transmission, where the width of the frequency band available is
proportional, all else being equal, to the data rate that can be achieved.
In discussions about TCP, the term goodput is sometimes used to refer to what might also be called
application-layer throughput: the amount of usable data delivered to the receiving application. Specifically, retransmitted data is counted only once when calculating goodput but might be counted twice under
some interpretations of throughput.
Data rates are generally measured in kilobits per second (Kbps) or megabits per second (Mbps); in the
context of data rates, a kilobit is 10³ bits (not 2¹⁰) and a megabit is 10⁶ bits. The use of the lowercase b
means bits; data rates expressed in terms of bytes often use an upper-case B.
1.3 Packets
Packets are modest-sized buffers of data, transmitted as a unit through some shared set of links. Of necessity,
packets need to be prefixed with a header containing delivery information. In the common case known as
datagram forwarding, the header contains a destination address; headers in networks using so-called
virtual-circuit forwarding contain instead an identifier for the connection. Almost all networking today
(and for the past 50 years) is packet-based, although we will later look briefly at some circuit-switched
options for voice telephony.
At the LAN layer, packets can be viewed as the imposition of a buffer (and addressing) structure on top
of low-level serial lines; additional layers then impose additional structure. Informally, packets are often
referred to as frames at the LAN layer, and as segments at the Transport layer.
The maximum packet size supported by a given LAN (eg Ethernet, Token Ring or ATM) is an intrinsic
attribute of that LAN. Ethernet allows a maximum of 1500 bytes of data. By comparison, TCP/IP packets
originally often held only 512 bytes of data, while early Token Ring packets could contain up to 4KB of
data. While there are proponents of very large packet sizes, larger even than 64KB, at the other extreme the
ATM (Asynchronous Transfer Mode) protocol uses 48 bytes of data per packet, and there are good reasons
for believing in modest packet sizes.
One potential issue is how to forward packets from a large-packet LAN to (or through) a small-packet LAN;
in later chapters we will look at how the IP (or Internet Protocol) layer addresses this.
Generally each layer adds its own header. Ethernet headers are typically 14 bytes, IP headers 20 bytes, and
TCP headers 20 bytes. If a TCP connection sends 512 bytes of data per packet, then the headers amount to
10% of the total, a not-unreasonable overhead. For one common Voice-over-IP option, packets contain 160
bytes of data and 54 bytes of headers, making the header about 25% of the total. Compressing the 160 bytes
of audio, however, may bring the data portion down to 20 bytes, meaning that the headers are now 73% of
the total; see 19.11.4 RTP and VoIP.
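To make the overhead arithmetic concrete, here is a minimal Python sketch (not from the text itself) that recomputes the percentages quoted above; the 14/20/20-byte header sizes are the typical values given here, not constants of every packet.

    # Header overhead for a few data sizes, using the typical header sizes above
    ETH, IP, TCP = 14, 20, 20            # bytes of header per layer
    headers = ETH + IP + TCP             # 54 bytes total

    for data in (512, 160, 20):          # bulk TCP, VoIP audio, compressed VoIP audio
        total = data + headers
        print(f"{data:4d} data bytes: headers are {headers/total:.0%} of the packet")

    # prints roughly 10%, 25% and 73%, matching the figures in the text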
In datagram-forwarding networks the appropriate header will contain the address of the destination and
perhaps other delivery information. Internal nodes of the network called routers or switches will then make
sure that the packet is delivered to the requested destination.
The concept of packets and packet switching was first introduced by Paul Baran in 1962 ([PB62]). Barans
primary concern was with network survivability in the event of node failure; existing centrally switched
protocols were vulnerable to central failure. In 1964, Donald Davies independently developed many of the
same concepts; it was Davies who coined the term packet.
It is perhaps worth noting that packets are buffers built of 8-bit bytes, and all hardware today agrees what
a byte is (hardware agrees by convention on the order in which the bits of a byte are to be transmitted).
8-bit bytes are universal now, but it was not always so. Perhaps the last great non-byte-oriented hardware
platform, which did indeed overlap with the Internet era broadly construed, was the DEC-10, which had a
36-bit word size; a word could hold five 7-bit ASCII characters. The early Internet specifications introduced
the term octet (an 8-bit byte) and required that packets be sequences of octets; non-octet-oriented hosts had
to be able to convert. Thus was chaos averted. Note that there are still byte-oriented data issues; as one
example, binary integers can be represented as a sequence of bytes in either big-endian or little-endian byte
order. RFC 1700 specifies that Internet protocols use big-endian byte order, therefore sometimes called
network byte order.
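As a quick illustration of byte order, here is a small Python sketch (using the standard struct module; the value chosen is arbitrary) showing the same 32-bit integer serialized big-endian, as Internet protocols require, and little-endian:

    import struct

    n = 0x0A0B0C0D
    print(struct.pack(">I", n).hex())    # big-endian / network byte order: 0a0b0c0d
    print(struct.pack("<I", n).hex())    # little-endian:                   0d0c0b0a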
1.4 Datagram Forwarding

In datagram forwarding, each switch maintains a forwarding table pairing each destination with a next_hop value; using interface numbers as the next_hop values, the table for switch S1 in the example network might be as follows:

S1
destination   next_hop
A             0
C             1
B             2
D             2
E             2
The table for S2 might be as follows, where we have consolidated destinations A and C for visual simplicity.
S2
destination   next_hop
A,C           0
D             1
E             2
B             3
Alternatively, we could replace the interface information with next-node, or neighbor, information, as all
the links above are point-to-point and so each interface connects to a unique neighbor. In that case, S1's
table might be written as follows (with consolidation of the entries for B, D and E):
S1
destination   next_hop
A             A
C             C
B,D,E         S2
A central feature of datagram forwarding is that each packet is forwarded in isolation; the switches involved do not have any awareness of any higher-layer logical connections established between endpoints.
This is also called stateless forwarding, in that the forwarding tables have no per-connection state. RFC
1122 put it this way (in the context of IP-layer datagram forwarding):
To improve robustness of the communication system, gateways are designed to be stateless,
forwarding each IP datagram independently of other datagrams. As a result, redundant paths can
be exploited to provide robust service in spite of failures of intervening gateways and networks.
Datagram forwarding is sometimes allowed to use other information beyond the destination address. In
theory, IP routing can be done based on the destination address and some quality-of-service information,
allowing, for example, different routing to the same destination for high-bandwidth bulk traffic and for low-latency real-time traffic. In practice, many ISPs ignore quality-of-service information in the IP header, and
route only based on the destination.
By convention, switching devices acting at the LAN layer and forwarding packets based on the LAN address
are called switches (or, in earlier days, bridges), while such devices acting at the IP layer and forwarding
on the IP address are called routers. Datagram forwarding is used both by Ethernet switches and by IP
routers, though the destinations in Ethernet forwarding tables are individual nodes while the destinations in
IP routers are entire networks (that is, sets of nodes).
In IP routers within end-user sites it is common for a forwarding table to include a catchall default entry,
matching any IP address that is nonlocal and so needs to be routed out into the Internet at large. Unlike the
consolidated entries for B, D and E in the table above for S1, which likely would have to be implemented as
actual separate entries, a default entry is a single record representing where to forward the packet if no other
destination match is found. Here is a forwarding table for S1, above, with a default entry replacing the last
three entries:
S1
destination   next_hop
A             0
C             1
default       2
Default entries make sense only when we can tell by looking at an address that it does not represent a
nearby node. This is common in IP networks because an IP address encodes the destination network, and
routers generally know all the local networks. It is however rare in Ethernets, because there is generally
no correlation between Ethernet addresses and locality. If S1 above were an Ethernet switch, and it had
some means of knowing that interfaces 0 and 1 connected directly to individual hosts, not switches, and S1
knew the addresses of these hosts, then making interface 2 a default route would make sense. In practice,
however, Ethernet switches do not know what kind of device connects to a given interface.
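As a rough sketch of the idea (not how a real switch or router is implemented), a forwarding table with a default entry amounts to a dictionary lookup with a fallback:

    # S1's forwarding table from the example above, with a default entry
    s1_table = {"A": 0, "C": 1}          # destination -> outbound interface
    DEFAULT_INTERFACE = 2

    def forward(dest):
        """Return the interface on which to send a packet addressed to dest."""
        return s1_table.get(dest, DEFAULT_INTERFACE)

    print(forward("A"))    # 0: a specific entry exists
    print(forward("E"))    # 2: no specific entry, so the default is used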
1.5 Topology
In the network diagrammed in the previous section, there are no loops; graph theorists might describe this
by saying the network graph is acyclic, or is a tree. In a loop-free network there is a unique path between
any pair of nodes. The forwarding-table algorithm has only to make sure that every destination appears in
the forwarding tables; the issue of choosing between alternative paths does not arise.
However, if there are no loops then there is no redundancy: any broken link will result in partitioning the
network into two pieces that cannot communicate. All else being equal (which it is not, but never mind for
now), redundancy is a good thing. However, once we start including redundancy, we have to make decisions
among the multiple paths to a destination. Consider, for a moment, the following network:
D might feel that the best path to B is D-E-C-B (perhaps because it believes the A-D link is to be avoided).
If E similarly decides the best path to B is E-D-A-B, and if D and E both choose their next_hop for B
based on these best paths, then a linear routing loop is formed: D routes to B via E and E routes to B via D.
Although each of D and E has identified a usable path, that path is not in fact followed. Moral: successful
datagram routing requires cooperation and a consistent view of the network.
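The D/E inconsistency can be checked mechanically: following next_hop entries hop by hop either reaches B or revisits a node, which signals a loop. The per-node next_hop values below are hypothetical, chosen only to reproduce the situation described above.

    # Each node's next_hop toward destination B (hypothetical values)
    next_hop_to_B = {"D": "E", "E": "D", "A": "B", "C": "B"}

    def trace(start, dest="B"):
        node, visited = start, []
        while node != dest:
            if node in visited:              # already been here: routing loop
                return visited + [node, "loop!"]
            visited.append(node)
            node = next_hop_to_B[node]
        return visited + [dest]

    print(trace("D"))    # ['D', 'E', 'D', 'loop!']  -- the linear routing loop
    print(trace("C"))    # ['C', 'B']                -- a consistent path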
1.7 Congestion
Switches introduce the possibility of congestion: packets arriving faster than they can be sent out. This can
happen with just two interfaces, if the inbound interface has a higher bandwidth than the outbound interface;
another common source of congestion is traffic arriving on multiple inputs and all destined for the same
output.
Whatever the reason, if packets are arriving for a given outbound interface faster than they can be sent, a
queue will form for that interface. Once that queue is full, packets will be dropped. The most common
strategy (though not the only one) is to drop any packets that arrive when the queue is full.
The term congestion may refer either to the point where the queue is just beginning to build up, or to the
point where the queue is full and packets are lost. In their paper [CJ89], Chiu and Jain refer to the first point
as the knee; this is where the slope of the load vs. throughput graph flattens. They refer to the second point as
the cliff; this is where packet losses may lead to a precipitous decline in throughput. Other authors use the
term contention for knee-congestion.
In the Internet, most packet losses are due to congestion. This is not because congestion is especially bad
(though it can be, at times), but rather that other types of losses (eg due to packet corruption) are insignificant
by comparison.
When to Upgrade?
Deciding when a network really does have insufficient bandwidth is not a technical issue but an economic one. The number of customers may increase, the cost of bandwidth may decrease or customers
may simply be willing to pay more to have data transfers complete in less time; customers here can
be external or in-house. Monitoring of links and routers for congestion can, however, help determine
exactly what parts of the network would most benefit from upgrade.
We emphasize that the presence of congestion does not mean that a network has a shortage of bandwidth.
Bulk-traffic senders (though not real-time senders) attempt to send as fast as possible, and congestion is
simply the networks feedback that the maximum transmission rate has been reached.
Congestion is a sign of a problem in real-time networks, which we will consider in 19 Quality of Service.
In these networks losses due to congestion must generally be kept to an absolute minimum; one way to
achieve this is to limit the acceptance of new connections unless sufficient resources are available.
This forwarding delay is hard to avoid (though some switches do implement cut-through switching to begin
forwarding a packet before it has fully arrived), but if one is sending a long train of packets then by keeping
multiple packets en route at the same time one can essentially eliminate the significance of the forwarding
delay; see 5.3 Packet Size.
Total packet delay from sender to receiver is the sum of the following:

• Bandwidth delay, ie sending 1000 Bytes at 20 Bytes/millisecond will take 50 ms. This is a per-link delay.

• Propagation delay due to the speed of light. For example, if you start sending a packet right now on a 5000-km cable across the US with a propagation speed of 200 m/µsec (= 200 km/ms, about 2/3 the speed of light in vacuum), the first bit will not arrive at the destination until 25 ms later. The bandwidth delay then determines how much after that the entire packet will take to arrive.

• Store-and-forward delay, equal to the sum of the bandwidth delays out of each router along the path.

• Queuing delay, or waiting in line at busy routers. At bad moments this can exceed 1 sec, though that is rare. Generally it is less than 10 ms and often is less than 1 ms. Queuing delay is the only delay component amenable to reduction through careful engineering.
See 5.1 Packet Delay for more details.
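A back-of-the-envelope version of the numbers above, as a small Python sketch; the two intermediate routers and the uniform 20 bytes/ms link rate are assumptions made just for this example, and queuing delay is taken as zero.

    packet_bytes  = 1000
    link_rate     = 20          # bytes per millisecond on every link (assumption)
    distance_km   = 5000
    speed_km_ms   = 200         # propagation speed, about 2/3 the speed of light
    n_routers     = 2           # hypothetical number of intermediate routers

    bandwidth_delay   = packet_bytes / link_rate                  # 50 ms, first link
    propagation_delay = distance_km / speed_km_ms                 # 25 ms end to end
    store_and_forward = n_routers * (packet_bytes / link_rate)    # 100 ms
    queuing_delay     = 0                                         # ignored here

    total = bandwidth_delay + propagation_delay + store_and_forward + queuing_delay
    print(total, "ms")          # 175.0 ms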
Time and Collisions

While Ethernet collisions definitely reduce throughput, in the larger view they should perhaps be thought of as a part of a remarkably inexpensive shared-access mediation protocol.
In unswitched Ethernets every packet is received by every host and it is up to the network card in each host
to determine if the arriving packet is addressed to that host. It is almost always possible to configure the
card to forward all arriving packets to the attached host; this poses a security threat and password sniffers
that surreptitiously collected passwords via such eavesdropping used to be common.
Password Sniffing
In the fall of 1994 at Loyola University I remotely changed the root password on several CS-department
unix machines at the other end of campus, using telnet. I told no one. Within two hours, someone else
logged into one of these machines, using the new password, from a host in Europe. Password sniffing
was the likely culprit.
Two months later was the so-called Christmas Day Attack (12.9.1 ISNs and spoofing). One of the
hosts used to launch this attack was Loyola's hacked apollo.it.luc.edu. It is unclear the degree to which
password sniffing played a role in that exploit.
Due to both privacy and efficiency concerns, almost all Ethernets today are fully switched; this ensures that
each packet is delivered only to the host to which it is addressed. One advantage of switching is that it
effectively eliminates most Ethernet collisions; while in principle it replaces them with a queuing issue, in
practice Ethernet switch queues so seldom fill up that they are almost invisible even to network managers
(unlike IP router queues). Switching also prevents host-based eavesdropping, though arguably a better
solution to this problem is encryption. Perhaps the more significant tradeoff with switches, historically, was
that Once Upon A Time they were expensive and unreliable; tapping directly into a common cable was dirt
cheap.
Ethernet addresses are six bytes long. Each Ethernet card (or network interface) is assigned a (supposedly)
unique address at the time of manufacture; this address is burned into the card's ROM and is called the card's
physical address or hardware address or MAC (Media Access Control) address. The first three bytes of
the physical address have been assigned to the manufacturer; the subsequent three bytes are a serial number
assigned by that manufacturer.
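Splitting an address into its manufacturer prefix and serial-number halves is then just a matter of taking the first three and last three bytes; the address used below is made up for illustration.

    mac = "00:1a:2b:3c:4d:5e"                         # hypothetical Ethernet address
    octets = mac.split(":")
    oui, serial = octets[:3], octets[3:]
    print("manufacturer prefix:", ":".join(oui))      # 00:1a:2b
    print("serial number:      ", ":".join(serial))   # 3c:4d:5e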
By comparison, IP addresses are assigned administratively by the local site. The basic advantage of having
addresses in hardware is that hosts automatically know their own addresses on startup; no manual configuration or server query is necessary. It is not unusual for a site to have a large number of identically configured
workstations, for which all network differences derive ultimately from each workstation's unique Ethernet
address.
The network interface continually monitors all arriving packets; if it sees any packet containing a destination
address that matches its own physical address, it grabs the packet and forwards it to the attached CPU (via a
CPU interrupt).
Ethernet also has a designated broadcast address. A host sending to the broadcast address has its packet
received by every other host on the network; if a switch receives a broadcast packet on one port, it forwards
the packet out every other port. This broadcast mechanism allows host A to contact host B when A does
not yet know B's physical address; typical broadcast queries have forms such as "Will the designated server
please answer" or (from the ARP protocol) "will the host with the given IP address please tell me your
physical address".
Traffic addressed to a particular host (that is, not broadcast) is said to be unicast.
Because Ethernet addresses are assigned by the hardware, knowing an address does not provide any direct
indication of where that address is located on the network. In switched Ethernet, the switches must thus have
a forwarding-table record for each individual Ethernet address on the network; for extremely large networks
this ultimately becomes unwieldy. Consider the analogous situation with postal addresses: Ethernet is
somewhat like attempting to deliver mail using social-security numbers as addresses, where each postal
worker is provided with a large catalog listing each person's SSN together with their physical location. Real
postal mail is, of course, addressed hierarchically using ever-more-precise specifiers: state, city, zipcode,
street address, and name / room#. Ethernet, in other words, does not scale well to large sizes.
Switched Ethernet works quite well, however, for networks with up to 10,000-100,000 nodes. Forwarding
tables with size in that range are straightforward to manage.
To forward packets correctly, switches must know where all active destination addresses in the LAN are
located; Ethernet switches do this by a passive learning algorithm. (IP routers, by comparison, use active
protocols.) Typically a host physical address is entered into a switch's forwarding table when a packet from
that host is first received; the switch notes the packet's arrival interface and source address and assumes
that the same interface is to be used to deliver packets back to that sender. If a given destination address
has not yet been seen, and thus is not in the forwarding table, Ethernet switches still have the backup
delivery option of forwarding to everyone, by treating the destination address like the broadcast address,
and allowing the host Ethernet cards to sort it out. Since this broadcast-like process is not generally used
for more than one packet (after that, the switches will have learned the correct forwarding-table entries), the
risk of eavesdropping is minimal.
The ⟨host,interface⟩ forwarding table is often easier to think of as ⟨host,next_hop⟩, where the next_hop node
is whatever switch or host is at the immediate other end of the link connecting to the given interface. In a
fully switched network where each link connects only two interfaces, the two perspectives are equivalent.
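A minimal sketch of the passive-learning idea (ignoring table timeouts, VLANs and everything else a real switch must handle): record each source address against its arrival port, and fall back to flooding when the destination is not yet known.

    class LearningSwitch:
        def __init__(self, n_ports):
            self.ports = range(n_ports)
            self.table = {}                       # source address -> port learned

        def handle(self, src, dst, in_port):
            self.table[src] = in_port             # learn/refresh the sender's location
            if dst in self.table:
                return [self.table[dst]]          # forward out the one known port
            return [p for p in self.ports if p != in_port]   # flood, broadcast-style

    sw = LearningSwitch(4)
    print(sw.handle("A", "B", 0))    # B unknown: flood out ports 1, 2, 3
    print(sw.handle("B", "A", 2))    # A was learned on port 0: forward there -> [0]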
1.10 IP - Internet Protocol

An IPv4 address is four bytes (32 bits) long, divided into a network part and a host part; under the original classful allocation scheme the division point was determined by the first byte of the address, as follows:

first byte   network bits   host bits   name      application
0-127        8              24          class A   a few very large networks
128-191      16             16          class B   institution-sized networks
192-223      24             8           class C   sized for smaller entities
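Under the classful rules the class, and hence the network/host split, can be read directly off the first byte; a small sketch (the second address is simply an illustrative host within the Loyola block mentioned below):

    def classify(addr):
        """Return (class, network bits) for a dotted-quad IPv4 address, classful rules."""
        first = int(addr.split(".")[0])
        if first <= 127: return "A", 8
        if first <= 191: return "B", 16
        if first <= 223: return "C", 24
        return "D/E", None                    # multicast and reserved ranges

    print(classify("12.0.0.1"))       # ('A', 8)
    print(classify("147.126.2.3"))    # ('B', 16)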
For example, the original IP address allocation for Loyola University Chicago was 147.126.0.0, a class B.
In binary, 147 is 10010011. The network/host division point is not carried within the IP header; in fact,
nowadays the division into network and host is dynamic, and can be made at different positions in the
address at different levels of the network.
IP addresses, unlike Ethernet addresses, are administratively assigned. Once upon a time, you would get
your Class B network prefix from the Internet Assigned Numbers Authority, or IANA (they now delegate
this task), and then you would in turn assign the host portion in a way that was appropriate for your local
site. As a result of this administrative assignment, an IP address usually serves not just as an endpoint
identifier but also as a locator, containing embedded location information.
The Class A/B/C definition above was spelled out in 1981 in RFC 791, which introduced IP. Class D was
added in 1986 by RFC 988; class D addresses must begin with the bits 1110. These addresses are for
multicast, that is, sending an IP packet to every member of a set of recipients (ideally without actually
transmitting it more than once on any one link).
The network portion of an IP address is sometimes called the network number or network address or
network prefix; as we shall see below, most forwarding decisions are made using only the network portion.
It is commonly denoted by setting the host bits to zero and ending the resultant address with a slash followed
by the number of network bits in the address: eg 12.0.0.0/8 or 147.126.0.0/16. Note that 12.0.0.0/8 and
12.0.0.0/9 represent different things; in the latter, the second byte of any host address extending the network
address is constrained to begin with a 0-bit. An anonymous block of IP addresses might be referred to only
by the slash and following digit, eg "we need a /22 block to accommodate all our customers".
All hosts with the same network address (same network bits) must be located together on the same LAN; as
we shall see below, if two hosts share the same network address then they will assume they can reach each
other directly via the underlying LAN, and if they cannot then connectivity fails. A consequence of this rule
is that outside of the site only the network bits need to be looked at to route a packet to the site.
Each individual LAN technology has a maximum packet size it supports; for example, Ethernet has a maximum packet size of about 1500 bytes but the once-competing Token Ring had a maximum of 4 KB. Today
the world has largely standardized on Ethernet and almost entirely standardized on Ethernet packet-size limits, but this was not the case when IP was introduced and there was real concern that two hosts on separate
large-packet networks might try to exchange packets too large for some small-packet intermediate network
to carry.
Therefore, in addition to routing and addressing, the decision was made that IP must also support fragmentation: the division of large packets into multiple smaller ones (in other contexts this may also be called
segmentation). The IP approach is not very efficient, and IP hosts go to considerable lengths to avoid fragmentation. IP does require that packets of up to 576 bytes be supported, and so a common legacy strategy
was for a host to limit a packet to at most 512 user-data bytes whenever the packet was to be sent via a
router; packets addressed to another host on the same LAN could of course use a larger packet size. Despite
its limited use, however, fragmentation is essential conceptually, in order for IP to be able to support large
packets without knowing anything about the intervening networks.
IP is a "best effort" system; there are no IP-layer acknowledgments or retransmissions. We ship the packet
off, and hope it gets there. Most of the time, it does.
Architecturally, this best-effort model represents what is known as connectionless networking: the IP layer
does not maintain information about endpoint-to-endpoint connections, and simply forwards packets like a
giant LAN. Responsibility for creating and maintaining connections is left for the next layer up, the TCP
layer. Connectionless networking is not the only way to do things: the alternative could have been some
form of connection-oriented internetworking, in which routers do maintain state information about individual
connections. Later, in 3.7 Virtual Circuits, we will examine how virtual-circuit networking can be used to
implement a connection-oriented approach; virtual-circuit switching is the primary alternative to datagram
switching.
Connectionless (IP-style) and connection-oriented networking each have advantages. Connectionless networking is conceptually more reliable: if routers do not hold connection state, then they cannot lose connection state. The path taken by the packets in some higher-level connection can easily be dynamically rerouted.
Finally, connectionless networking makes it hard for providers to bill by the connection; once upon a time
(in the era of dollar-a-minute phone calls) this was a source of mild astonishment to many new users. The
primary advantage of connection-oriented networking, however, is that the routers are then much better positioned to accept reservations and to make quality-of-service guarantees. This remains something of a
sore point in the current Internet: if you want to use Voice-over-IP, or VoIP, telephones, or if you want to
engage in video conferencing, your packets will be treated by the Internet core just the same as if they were
low-priority file transfers. There is no priority service option.
Perhaps the most common form of IP packet loss is router queue overflows, representing network congestion.
Packet losses due to packet corruption are rare (eg less than one in 10⁴; perhaps much less). But in a
connectionless world a large number of hosts can simultaneously decide to send traffic through one router,
in which case queue overflows are hard to avoid.
1.10.1 IP Forwarding
IP routers use datagram forwarding, described in 1.4 Datagram Forwarding above, to deliver packets, but
the destination values listed in the forwarding tables are network prefixes, representing entire LANs,
instead of individual hosts. The goal of IP forwarding, then, becomes delivery to the correct LAN; a separate
process is used to deliver to the final host once the final LAN has been reached.
The entire point, in fact, of having a network/host division within IP addresses is so that routers need to list
only the network prefixes of the destination addresses in their IP forwarding tables. This strategy is the key
to IP scalability: it saves large amounts of forwarding-table space, it saves time as smaller tables allow faster
lookup, and it saves the bandwidth that would be needed for routers to keep track of individual addresses.
To get an idea of the forwarding-table space savings, there are currently (2013) around a billion hosts on the
Internet, but only 300,000 or so networks listed in top-level forwarding tables. When network prefixes are
used as forwarding-table destinations, matching an actual packet address to a forwarding-table entry is no
longer a matter of simple equality comparison; routers must compare appropriate prefixes.
IP forwarding tables are sometimes also referred to as routing tables; in this book, however, we make
at least a token effort to use forwarding to refer to the packet forwarding process, and routing to refer
to mechanisms by which the forwarding tables are maintained and updated. (If we were to be completely
consistent here, we would use the term forwarding loop rather than routing loop.)
Now let us look at a simple example of how IP forwarding (or routing) works. We will assume that all
network nodes are either hosts (user machines, with a single network connection) or routers, which do
packet-forwarding only. Routers are not directly visible to users, and always have at least two different
network interfaces representing different networks that the router is connecting. (Machines can be both
hosts and routers, but this introduces complications.)
Suppose A is the sending host, sending a packet to a destination host D. The IP header of the packet will
contain D's IP address in the destination address field (it will also contain A's own address as the source
address). The first step is for A to determine whether D is on the same LAN as itself or not; that is, whether
D is local. This is done by looking at the network part of the destination address, which we will denote by
Dnet. If this net address is the same as A's (that is, if it is equal numerically to Anet), then A figures D is on
the same LAN as itself, and can use direct LAN delivery. It looks up the appropriate physical address for D
(probably with the ARP protocol, 7.7 Address Resolution Protocol: ARP), attaches a LAN header to the
packet in front of the IP header, and sends the packet straight to D via the LAN.
If, however, Anet and Dnet do not match (D is non-local), then A looks up a router to use. Most ordinary
hosts use only one router for all non-local packet deliveries, making this choice very simple. A then forwards
the packet to the router, again using direct delivery over the LAN. The IP destination address in the packet
remains D in this case, although the LAN destination address will be that of the router.
When the router receives the packet, it strips off the LAN header but leaves the IP header with the IP
destination address. It extracts the destination D, and then looks at Dnet . The router first checks to see
if any of its network interfaces are on the same LAN as D; recall that the router connects to at least one
additional network besides the one for A. If the answer is yes, then the router uses direct LAN delivery to the
destination, as above. If, on the other hand, Dnet is not a LAN to which the router is connected directly, then
the router consults its internal forwarding table. This consists of a list of networks each with an associated
next_hop address. These ⟨net,next_hop⟩ tables compare with switched Ethernet's ⟨host,next_hop⟩ tables;
the former type will be smaller because there are many fewer nets than hosts. The next_hop addresses in the
table are chosen so that the router can always reach them via direct LAN delivery via one of its interfaces;
generally they are other routers. The router looks up Dnet in the table, finds the next_hop address, and uses
direct LAN delivery to get the packet to that next_hop machine. The packet's IP header remains essentially
unchanged, although the router most likely attaches an entirely new LAN header.
The packet continues being forwarded like this, from router to router, until it finally arrives at a router that
is connected to Dnet ; it is then delivered by that final router directly to D, using the LAN.
To make this concrete, consider the following diagram:
With Ethernet-style forwarding, R2 would have to maintain entries for each of A,B,C,D,E,F. With IP forwarding, R2 has just two entries to maintain in its forwarding table: 200.0.0/24 and 200.0.1/24. If A sends
to D, at 200.0.1.37, it puts this address into the IP header, notes that 200.0.0 ≠ 200.0.1, and thus concludes
D is not a local delivery. A therefore sends the packet to its router R1, using LAN delivery. R1 looks up the
destination network 200.0.1 in its forwarding table and forwards the packet to R2, which in turn forwards it
to R3. R3 now sees that it is connected directly to the destination network 200.0.1, and delivers the packet
via the LAN to D, by looking up D's physical address.
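Here is a small Python sketch of that forwarding sequence; the two network prefixes and the R1→R2→R3 path come from the example above, while the reverse-direction entries are assumptions added only to make the tables complete.

    # Each router's forwarding table as a (net, next_hop) dict ('direct' = deliver via the LAN)
    tables = {
        'R1': {'200.0.0': 'direct', '200.0.1': 'R2'},
        'R2': {'200.0.0': 'R1',     '200.0.1': 'R3'},
        'R3': {'200.0.0': 'R2',     '200.0.1': 'direct'},
    }

    def net_of(ip):                      # network part of a /24 address
        return ip.rsplit('.', 1)[0]

    def trace(router, dest_ip):
        path = [router]
        while tables[router][net_of(dest_ip)] != 'direct':
            router = tables[router][net_of(dest_ip)]
            path.append(router)
        return path                      # the last router delivers directly via the LAN

    print(trace('R1', '200.0.1.37'))     # ['R1', 'R2', 'R3']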
In this diagram, IP addresses for the ends of the R1–R2 and R2–R3 links are not shown. They could be
assigned global IP addresses, but they could also use private IP addresses. Assuming these links are
point-to-point links, they might not actually need IP addresses at all; we return to this in 7.10 Unnumbered
Interfaces.
One can think of the network-prefix bits as analogous to the zip code on postal mail, and the host bits
as analogous to the street address. The internal parts of the post office get a letter to the right zip code,
and then an individual letter carrier gets it to the right address. Alternatively, one can think of the network
bits as like the area code of a phone number, and the host bits as like the rest of the digits. Newer protocols that support different net/host division points at different places in the network (sometimes called hierarchical routing) allow support for addressing schemes that correspond to, say, zip/street/user, or areacode/exchange/subscriber.
The Invertebrate Internet
Once upon a time, each leaf node connected through its provider to the backbone, and traffic between
any two nodes (or at least any two nodes not sharing a provider) passed through the backbone. The
backbone still carries a lot of traffic, but it is now also common for large providers such as Google
to connect (or peer) directly with large residential ISPs such as Comcast. See, for example, peering.google.com.
We will refer to the Internet backbone as those IP routers that specialize in large-scale routing on the
commercial Internet, and which generally have forwarding-table entries covering all public IP addresses;
note that this is essentially a business definition rather than a technical one. We can revise the table-size
claim of the previous paragraph to state that, while there are many private IP networks, there are about
300,000 visible to the backbone. A forwarding table of 300,000 entries is quite feasible; a table a hundred
times larger is not, let alone a thousand times larger.
IP routers at non-backbone sites generally know all locally assigned network prefixes, eg 200.0.0/24 and
200.0.1/24 above. If a destination does not match any locally assigned network prefix, the packet needs
to be routed out into the Internet at large; for typical non-backbone sites this almost always means
the packet is sent to the ISP that provides Internet connectivity. Generally the local routers will contain a
catchall default entry covering all nonlocal networks; this means that the router needs an explicit entry only
for locally assigned networks. This greatly reduces the forwarding-table size. The Internet backbone can be
approximately described, in fact, as those routers that do not have a default entry.
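A sketch of such a table, with a catchall default entry, might look like the following in Python; the particular next_hop names are assumptions for illustration.

    # Locally assigned networks get explicit entries; everything else matches the default.
    local_entries = {
        '200.0.0': 'direct',        # locally assigned networks
        '200.0.1': 'R2',
    }
    DEFAULT = 'ISP_router'          # catchall entry covering all nonlocal networks

    def next_hop(dest_net):
        return local_entries.get(dest_net, DEFAULT)

    print(next_hop('200.0.1'))      # locally known: R2
    print(next_hop('147.126.1'))    # nonlocal: handed to the ISP via the default entry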
For most purposes, the Internet can be seen as a combination of end-user LANs together with point-to-point
links joining these LANs to the backbone; point-to-point links also tie the backbone together. Both LANs
and point-to-point links appear in the diagram above.
Just how routers build their ⟨destnet,next_hop⟩ forwarding tables is a major topic itself, which we cover in
9 Routing-Update Algorithms. Unlike Ethernet, IP routers do not have a broadcast delivery mechanism
as a fallback, so the tables must be constructed in advance. (There is a limited form of IP broadcast, but it
is basically intended for reaching the local LAN only, and does not help at all with delivery in the event that
the network is unknown.)
Most forwarding-table-construction algorithms used on a set of routers under common management fall into
either the distance-vector or the link-state category. In the distance-vector approach, often used at smaller
sites, routers exchange information with their immediately neighboring routers; tables are built up this
way through a sequence of such periodic exchanges. In the link-state approach, routers rapidly propagate
information about the state of each link; all routers in the organization receive this link-state information and
each one uses it to build and maintain a map of the entire network. The forwarding table is then constructed from this map.
1.11 DNS
IP addresses are hard to remember (nearly impossible in IPv6). The domain name system, or DNS, comes
to the rescue by creating a way to convert hierarchical text names to IP addresses. Thus, for example, one can
type www.luc.edu instead of 147.126.1.230. Virtually all Internet software uses the same basic library
calls to convert DNS names to actual addresses.
One thing DNS makes possible is changing a website's IP address while leaving the name alone. This
allows moving a site to a new provider, for example, without requiring users to learn anything new. It
is also possible to have several different DNS names resolve to the same IP address, and through some
modest trickery have the http (web) server at that IP address handle the different DNS names as completely
different websites.
DNS is hierarchical and distributed; indeed, it is the classic example of a widely distributed database.
In looking up www.cs.luc.edu three different DNS servers may be queried: for cs.luc.edu, for
luc.edu, and for .edu. Searching a hierarchy can be cumbersome, so DNS search results are normally
cached locally. If a name is not found in the cache, the lookup may take a couple seconds. The DNS
hierarchy need have nothing to do with the IP-address hierarchy.
Besides address lookups, DNS also supports a few other kinds of searches. The best known is probably
reverse DNS, which takes an IP address and returns a name. This is slightly complicated by the fact that
one IP address may be associated with multiple DNS names, so DNS must either return a list, or return one
name that has been designated the canonical name.
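The library calls involved are quite simple; here is a sketch in Python of a forward and a reverse lookup (actual results depend on the local resolver and its cache).

    import socket

    # Forward lookup: name to IP address (the call virtually all software uses, directly or indirectly)
    addr = socket.gethostbyname('www.luc.edu')
    print(addr)                            # eg 147.126.1.230

    # Reverse lookup: IP address back to a name; returns the canonical name plus any aliases
    canonical, aliases, addresses = socket.gethostbyaddr(addr)
    print(canonical)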
1.12 Transport
Think about what types of communications one might want over the Internet:
• Interactive communications such as via ssh or telnet, with long idle times between short bursts
• Bulk file transfers
• Request/reply operations, eg to query a database or to make DNS requests
• Real-time voice traffic, at (without compression) 8KB/sec, with constraints on the variation in delivery time (known as jitter; see 19.11.3 RTP Control Protocol for a specific numeric interpretation)
• Real-time video traffic. Even with substantial compression, video generally requires much more bandwidth than voice
While separate protocols might be used for each of these, the Internet has standardized on the Transmission
Control Protocol, or TCP, for the first three (though there are periodic calls for a new protocol addressing
the third item above), and TCP is sometimes pressed into service for the last two. TCP is thus the most
common transport layer for application data.
The IP layer is not well-suited to transport. IP routing is a best-effort mechanism, which means packets
can and do get lost sometimes. Data that does arrive can arrive out of order. The sender has to manage
division into packets; that is, buffering. Finally, IP only supports sending to a specific host; normally, one
wants to send to a given application running on that host. Email and web traffic, or two different web
sessions, should not be commingled!
TCP extends IP with the following features:
• reliability: TCP numbers each packet, keeps track of which are lost and retransmits them after a timeout, and holds early-arriving out-of-order packets for delivery at the correct time. Every arriving data packet is acknowledged by the receiver; timeout and retransmission occurs when an acknowledgment isn't received by the sender within a given time.
• connection-orientation: Once a TCP connection is made, an application sends data simply by writing to that connection. No further application-level addressing is needed.
• stream-orientation: The application can write 1 byte at a time, or 100KB at a time; TCP will buffer and/or divide up the data into appropriately sized packets.
• port numbers: these provide a way to specify the receiving application for the data, and also to identify the sending application.
• throughput management: TCP attempts to maximize throughput, while at the same time not contributing unnecessarily to network congestion.
TCP endpoints are of the form ⟨host,port⟩; these pairs are known as socket addresses, or sometimes as just
sockets though the latter refers more properly to the operating-system objects that receive the data sent to
the socket addresses. Servers (or, more precisely, server applications) listen for connections to sockets they
have opened; the client is then any endpoint that initiates a connection to a server.
When you enter a host name in a web browser, it opens a TCP connection to the server's port 80 (the standard
web-traffic port), that is, to the server socket with socket-address ⟨server,80⟩. If you have several browser
tabs open, each might connect to the same server socket, but the connections are distinguishable by virtue
of using separate ports (and thus having separate socket addresses) on the client end (that is, your end).
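Here, as a sketch, is what such a client-side connection looks like in Python; the request string is a minimal HTTP example, and the client-side port is chosen by the operating system, just as described above.

    import socket

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect(('www.luc.edu', 80))       # the server socket address (server, port 80)
    print(s.getsockname())               # our own (host, port) -- the client end
    s.sendall(b'GET / HTTP/1.0\r\nHost: www.luc.edu\r\n\r\n')
    print(s.recv(200))                   # the first bytes of the reply
    s.close()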
A busy server may have thousands of connections to its port 80 (the web port) and hundreds of connections
to port 25 (the email port). Web and email traffic are kept separate by virtue of the different ports used. All
those clients to the same port, though, are kept separate because each comes from a unique ⟨host,port⟩ pair. A
TCP connection is determined by the ⟨host,port⟩ socket address at each end; traffic on different connections
does not intermingle. That is, there may be multiple independent connections to ⟨www.luc.edu,80⟩. This is
somewhat analogous to certain business telephone numbers of the “operators are standing by” type, which
support multiple callers at the same time to the same number. Each call is answered by a different operator
(corresponding to a different cpu process), and different calls do not overhear each other.
TCP uses the sliding-windows algorithm, 6 Abstract Sliding Windows, to keep multiple packets en route
at any one time. The window size represents the number of packets simultaneously en route; if the window
size is 10, for example, then at any one time 10 packets are out there (perhaps 5 data packets and 5 returning
acknowledgments). As each acknowledgment arrives, the window slides forward and the data packet 10
packets ahead is sent. For example, consider the moment when the ten packets 20-29 are in transit. When
ACK[20] is received, Data[30] is sent, and so now packets 21-30 are in transit. When ACK[21] is received,
Data[31] is sent, so packets 22-31 are in transit.
Sliding windows minimizes the effect of store-and-forward delays, and propagation delays, as these then
only count once for the entire windowful and not once per packet. Sliding windows also provides an automatic, if partial, brake on congestion: the queue at any switch or router along the way cannot exceed the
window size. In this it compares favorably with constant-rate transmission, which, if the available bandwidth falls below the transmission rate, always leads to a significant percentage of dropped packets. Of
course, if the window size is too large, a sliding-windows sender may also experience dropped packets.
The ideal window size, at least from a throughput perspective, is such that it takes one round-trip time to send
an entire window, so that the next ACK will always be arriving just as the sender has finished transmitting the
window. Determining this ideal size, however, is difficult; for one thing, the ideal size varies with network
load. As a result, TCP approximates the ideal size. The most common TCP strategy (that of so-called TCP
Reno) is that the window size is slowly raised until packet loss occurs, which TCP takes as a sign that it
has reached the limit of available network resources. At that point the window size is reduced to half its
previous value, and the slow climb resumes. The effect is a sawtooth graph of window size with time,
which oscillates (more or less) around the optimal window size. For an idealized sawtooth graph, see
13.1.1 The Steady State; for some real (simulation-created) sawtooth graphs see 16.4.1 Some TCP Reno
cwnd graphs.
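As a worked example of the ideal window size, suppose (these numbers are assumptions for illustration) the bottleneck bandwidth is 10 Mbps, the round-trip time is 50 ms, and packets are 1000 bytes:

    bandwidth = 10_000_000 / 8     # 10 Mbps, in bytes/sec
    rtt = 0.050                    # 50 ms round-trip time
    packet_size = 1000             # bytes per packet

    window = bandwidth * rtt / packet_size
    print(window)                  # 62.5 packets in flight keeps the path exactly full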
While this window-size-optimization strategy has its roots in attempting to maximize the available bandwidth, it also has the effect of greatly limiting the number of packet-loss events. As a result, TCP has come
to be the Internet protocol charged with reducing (or at least managing) congestion on the Internet, and
relatedly with ensuring fairness of bandwidth allocations to competing connections. Core Internet routers
(at least in the classical case) essentially have no role in enforcing congestion or fairness restrictions at all.
The Internet, in other words, places responsibility for congestion avoidance cooperatively into the hands of
end users. While cheating is possible, this cooperative approach has worked remarkably well.
While TCP is ubiquitous, the real-time performance of TCP is not always consistent: if a packet is lost,
the receiving TCP host will not turn over anything further to the receiving application until the lost packet
has been retransmitted successfully; this is often called head-of-line blocking. This is a serious problem
for sound and video applications, which can discretely handle modest losses but which have much more
difficulty with sudden large delays. A few lost packets ideally should mean just a few brief voice dropouts
(pretty common on cell phones) or flicker/snow on the video screen (or just reuse of the previous frame);
both of these are better than pausing completely.
The basic alternative to TCP is known as UDP, for User Datagram Protocol. UDP, like TCP, provides port
numbers to support delivery to multiple endpoints within the receiving host, in effect to a specific process on
the host. As with TCP, a UDP socket consists of a ⟨host,port⟩ pair. UDP also includes, like TCP, a checksum
over the data. However, UDP omits the other TCP features: there is no connection setup, no lost-packet
detection, no automatic timeout/retransmission, and the application must manage its own packetization.
The Real-time Transport Protocol, or RTP, sits above UDP and adds some additional support for voice and
video applications.
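A minimal UDP sketch in Python looks like the following; the server address and port here are placeholders, and note that the timeout logic is the application's responsibility, exactly as described above.

    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(b'hello', ('203.0.113.5', 5005))   # one datagram; no connection setup
    sock.settimeout(2.0)                           # no automatic retransmission, so we
    try:                                           # must detect loss ourselves
        data, addr = sock.recvfrom(1500)
    except socket.timeout:
        data = None                                # request or reply was lost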
1.13 Firewalls
One problem with having a program on your machine listening on an open TCP port is that someone may
connect and then, using some flaw in the software on your end, do something malicious to your machine.
Damage can range from the unintended downloading of personal data to compromise and takeover of your
entire machine, making it a distributor of viruses and worms or a steppingstone in later break-ins of other
machines.
A strategy known as buffer overflow has been the basis for a great many total-compromise attacks. The idea
is to identify a point in a server program where it fills a memory buffer with network-supplied data without
careful length checking; almost any call to the C library function gets(buf) will suffice. The attacker
then crafts an oversized input string which, when read by the server and stored in memory, overflows the
buffer and overwrites subsequent portions of memory, typically containing the stack-frame pointers. The
usual goal is to arrange things so that when the server reaches the end of the currently executing function,
control is returned not to the calling function but instead to the attackers own payload code located within
the string.
A firewall is a program to block connections deemed potentially risky, eg those originating from outside
the site. Generally ordinary workstations do not ever need to accept connections from the Internet; client
machines instead initiate connections to (better-protected) servers. So blocking incoming connections works
pretty well; when necessary (eg for games) certain ports can be selectively unblocked.
The original firewalls were routers. Incoming traffic to servers was often blocked unless it was sent to one
of a modest number of open ports; for non-servers, typically all inbound connections were blocked. This
allowed internal machines to operate reasonably safely, though being unable to accept incoming connections
is sometimes inconvenient. Nowadays per-machine firewalls (in addition to router-based firewalls) are
common: you can configure your machine not to accept inbound connections to most (or all) ports regardless
of whether software on your machine requests such a connection. Outbound connections can, in many cases,
also be prevented.
remote host    remote port    inside host    inside port
C              80             A              3000
D              80             B              3000
A packet to C from ⟨A,3000⟩ would be rewritten by NR so that the source was ⟨NR,3000⟩. A packet from
⟨C,80⟩ addressed to ⟨NR,3000⟩ would be rewritten and forwarded to ⟨A,3000⟩. Similarly, a packet from
⟨D,80⟩ addressed to ⟨NR,3000⟩ would be rewritten and forwarded to ⟨B,3000⟩; the NAT table takes into
account the sending socket address as well as the destination.
Now suppose B opens a connection to ⟨C,80⟩, also from inside port 3000. This time NR must remap the
port number, because that is the only way to distinguish between packets from ⟨C,80⟩ to A and to B. The
new table is
remote host    remote port    inside host    inside port
C              80             A              3000
D              80             B              3000
C              80             B              3000
Typically NR would not create TCP connections between itself and ⟨C,80⟩ and ⟨D,80⟩; the NAT table does
forwarding, but the endpoints of the connection are still at the inside hosts. However, NR might very well
monitor the TCP connections to know when they have closed, so that the corresponding entries can be removed from the table.
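The table logic sketched above can be written out as follows; the port-remapping policy shown (incrementing until the outside port is unused) is an assumption for illustration, not a detail given in the text.

    # NAT table: (remote host, remote port, outside port) -> (inside host, inside port)
    nat_table = {}

    def outbound(inside_host, inside_port, remote_host, remote_port):
        outside_port = inside_port
        while (remote_host, remote_port, outside_port) in nat_table:
            outside_port += 1                      # remap, eg B's second 3000 becomes 3001
        nat_table[(remote_host, remote_port, outside_port)] = (inside_host, inside_port)
        return outside_port                        # the source port NR uses on the outside

    def inbound(remote_host, remote_port, outside_port):
        return nat_table.get((remote_host, remote_port, outside_port))   # None means drop

    outbound('A', 3000, 'C', 80)       # outside port 3000
    outbound('B', 3000, 'D', 80)       # also 3000; different remote host, no clash
    outbound('B', 3000, 'C', 80)       # clashes with A's entry, so remapped
    print(inbound('C', 80, 3001))      # ('B', 3000)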
It is common for Voice-over-IP (VoIP) telephony using the SIP protocol (RFC 3261) to prefer to use UDP
port 5060 at both ends. If a VoIP server is outside the NAT router (which must be the case as the server
must generally be publicly visible) and a telephone is inside, likely port 5060 will pass through without
remapping, though the telephone will have to initiate the connection. But if there are two phones inside, one
of them will appear to be connecting to the server from an alternative port.
VoIP systems run into a much more serious problem with NAT, however. A call ultimately between two
phones is typically first negotiated between the phones' respective VoIP servers. Once the call is set up, the
servers would prefer to step out of the loop, and have the phones exchange voice packets directly. The SIP
protocol was designed to handle this by having each phone report to its respective server the UDP socket
(⟨IP address,port⟩ pair) it intends to use for the voice exchange; the servers then report these phone sockets
to each other, and from there to the opposite phones. This socket information is rendered incorrect by NAT,
however: certainly the IP address, and quite likely the port as well. If only one of the phones is behind a
NAT firewall, it can initiate the voice connection to the other phone, but the other phone will see the voice
packets arriving from a different socket than promised and will likely not recognize them as part of the call.
If both phones are behind NAT firewalls, they will not be able to connect to one another at all. The common
solution is for the VoIP server of a phone behind a NAT firewall to remain in the communications path,
forwarding packets to its hidden partner.
If a site wants to allow connections to hosts behind a NAT router or other firewall, one
option is tunneling: the creation of a virtual LAN link that runs on top of a TCP connection
between the end user and one of the site's servers; the end user can thus appear to be on one of the organization's internal LANs; see 3.1 Virtual Private Network. Another option is to open up a specific port: in
essence, a static NAT-table entry is made connecting a specific port on the NAT router to a specific internal
host and port (usually the same port). For example, all UDP packets to port 5060 on the NAT router might
be forwarded to port 5060 on internal host A, even in the absence of any prior packet exchange.
NAT routers work very well when the communications model is of client-side TCP connections, originating
from the inside and with public outside servers as destination. The NAT model works less well for peer-to-peer networking, where your computer and a friend's, each behind a different NAT router, wish to
establish a connection. NAT routers also often have trouble with UDP protocols, due to the tendency for
such protocols to have the public server reply from a different port than the one originally contacted. For
example, if host A behind a NAT router attempts to use TFTP (11.3 Trivial File Transport Protocol, TFTP),
and sends a packet to port 69 of public server C, then C is likely to reply from some new port, say 3000,
and this reply is likely to be dropped by the NAT router as there will be no entry there yet for traffic from
⟨C,3000⟩.
It seems clear that the primary reasons the OSI protocols failed in the marketplace were their ponderous
bureaucracy for protocol management, their principle that protocols be completed before implementation
began, and their insistence on rigid adherence to the specifications to the point of non-interoperability. In
contrast, the IETF had (and still has) a “two working implementations” rule for a protocol to become a
Draft Standard. From RFC 2026:
A specification from which at least two independent and interoperable implementations from different
code bases have been developed, and for which sufficient successful operational experience has been
obtained, may be elevated to the Draft Standard level. [emphasis added]
This rule has often facilitated the discovery of protocol design weaknesses early enough that the problems
could be fixed. The OSI approach is a striking failure for the waterfall design model, when competing
with the IETF's cyclic prototyping model. However, it is worth noting that the IETF has similarly been
unable to keep up with rapid changes in html, particularly at the browser end; the OSI mistakes were mostly
evident only in retrospect.
Trying to fit protocols into specific layers is often both futile and irrelevant. By one perspective, the Real-Time Protocol RTP lives at the Transport layer, but just above the UDP layer; others have put RTP into the
Application layer. Parts of the RTP protocol resemble the Session and Presentation layers. A key component
of the IP protocol is the set of various router-update protocols; some of these freely use higher-level layers.
Similarly, tunneling might be considered to be a Link-layer protocol, but tunnels are often created and
maintained at the Application layer.
A sometimes-more-successful approach to understanding layers is to view them instead as parts of a
protocol graph. Thus, in the following diagram we have two protocols at the transport layer (UDP and
RTP), and one protocol (ARP) not easily assigned to a layer.
1.17 Epilog
This completes our tour of the basics. In the remaining chapters we will expand on the material here.
1.18 Exercises
1. Give forwarding tables for each of the switches S1-S4 in the following network with destinations A, B,
C, D. For the next_hop column, give the neighbor on the appropriate link rather than the interface number.
[Diagram omitted: destination A and switches S1, S2, S3, S4]
2. Give forwarding tables for each of the switches S1-S4 in the following network with destinations A, B,
C, D. Again, use the neighbor form of next_hop rather than the interface form. Try to keep the route to
each destination as short as possible. What decision has to be made in this exercise that did not arise in the
preceding exercise?
[Diagram omitted: destination A and switches S1, S2, S4, S3]
3. Consider the following arrangement of switches and destinations. Give forwarding tables (in neighbor
form) for S1-S4 that include default forwarding entries; the default entries should point toward S5. Eliminate all table entries that are implied by the default entry (that is, if the default entry is to S3, eliminate all
other entries for which the next hop is S3).
[Diagram omitted: destinations A and D with switches S1 through S5]
4. Four switches are arranged as below. The destinations are S1 through S4 themselves.
[Diagram omitted: switches S1, S2, S3, S4 arranged in a square]
(a). Give the forwarding tables for S1 through S4 assuming packets to adjacent nodes are sent along the
connecting link, and packets to diagonally opposite nodes are sent clockwise.
(b). Give the forwarding tables for S1 through S4 assuming the S1–S4 link is not used at all, not even for
S1–S4 traffic.
5. Suppose we have switches S1 through S4; the forwarding-table destinations are the switches themselves. The tables for S2 and S3, with entries in ⟨destination,next_hop⟩ form, are:
S2: ⟨S1,S1⟩ ⟨S3,S3⟩ ⟨S4,S3⟩
S3: ⟨S1,S2⟩ ⟨S2,S2⟩ ⟨S4,S4⟩
From the above we can conclude that S2 must be directly connected to both S1 and S3 as its table lists them
as next_hops; similarly, S3 must be directly connected to S2 and S4.
(a). Must S1 and S4 be directly connected? If so, explain; if not, give a network in which there is no direct
link between them, consistent with the tables above.
(b). Now suppose S3s table is changed to the following. In this case must S1 and S4 be directly
connected? Why or why not?
S3: ⟨S1,S4⟩ ⟨S2,S2⟩ ⟨S4,S4⟩
While the table for S4 is not given, you may assume that forwarding does work correctly. However, you
should not assume that paths are the shortest possible; in particular, you should not assume that each switch
will always reach its directly connected neighbors by using the direct connection.
6. (a) Suppose a network is as follows, with the only path from A to C passing through B:
...
...
[Diagram omitted: switches S1 through S6, with edge switches S10, S11, S12]
7. Suppose S1-S6 have the forwarding tables below. For each destination A,B,C,D,E,F, suppose a packet is
sent to the destination from S1. Give the switches it passes through, including the initial switch S1, up until
the final switch S10-S12.
S1: (A,S4), (B,S2), (C,S4), (D,S2), (E,S2), (F,S4)
S2: (A,S5), (B,S5), (D,S5), (E,S3), (F,S3)
S3: (B,S6), (C,S2), (E,S6), (F,S6)
S4: (A,S10), (C,S5), (E,S10), (F,S5)
S5: (A,S6), (B,S11), (C,S6), (D,S6), (E,S4), (F,S2)
S6: (A,S3), (B,S12), (C,S12), (D,S12), (E,S5), (F,S12)
8. In the previous exercise, the routes taken by packets addressed to A-D are reasonably direct, but the routes for E and F
are rather circuitous.
Some routing applications assign weights to different links, and attempt to choose a path with the lowest
total link weight.
(a). Assign weights to the seven links S1–S2, S2–S3, S1–S4, S2–S5, S3–S6, S4–S5 and S5–S6 so that
destination E's route in the previous exercise becomes the optimum (lowest total link weight) path.
(b). Assign (different!) weights to the seven links that make destination F's route in the previous exercise
optimal.
Hint: you can do this by assigning a weight of 1 to all links except to one or two bad links; the bad links
get a weight of 10. In each of (a) and (b) above, the route taken will be the route that avoids all the bad
links. You must treat (a) entirely differently from (b); there is no assignment of weights that can account for
both routes.
9. Suppose we have the following three Class C IP networks, joined by routers R1–R4. Give the forwarding
table for each router. For networks directly connected to a router (eg 200.0.1/24 and R1), include the network
in the table but list the next hop as direct.
[Diagram omitted: routers R1, R2, R3, R4 joining networks 200.0.1/24, 200.0.2/24, 200.0.3/24]
2 ETHERNET
We now turn to a deeper analysis of the ubiquitous Ethernet LAN protocol. User-level Ethernet today
(2013) is usually 100 Mbps, with Gigabit Ethernet standard in server rooms and backbones, but because
Ethernet speed scales in odd ways, we will start with the 10 Mbps formulation. While the 10 Mbps speed is
obsolete, and while even the Ethernet collision mechanism is largely obsolete, collision management itself
continues to play a significant role in wireless networks.
Classic Ethernet came in version 1 [1980, DEC-Intel-Xerox], version 2 [1982, DIX], and IEEE 802.3. There
are some minor electrical differences between these, and one rather substantial packet-format difference. In
addition to these, the Berkeley Unix trailing-headers packet format was used for a while.
There were three physical formats for 10 Mbps Ethernet cable: thick coax (10BASE-5), thin coax (10BASE-2), and, last to arrive, twisted pair (10BASE-T). Thick coax was the original; economics drove the successive
development of the later two. The cheaper twisted-pair cabling eventually almost entirely displaced coax, at
least for host connections.
The original specification included support for repeaters, which were in effect signal amplifiers although
they might attempt to clean up a noisy signal. Repeaters processed each bit individually and did no buffering.
In the telecom world, a repeater might be called a digital regenerator. A repeater with more than two ports
was commonly called a hub; hubs allowed branching and thus much more complex topologies.
Bridges (later known as switches) came along a short time later. While repeaters act at the bit layer,
a switch reads in and forwards an entire packet as a unit, and the destination address is likely consulted
to determine to where the packet is forwarded. Originally, switches were seen as providing interconnection (bridging) between separate Ethernets, but later a switched Ethernet was seen as one large virtual
Ethernet. We return to switching below in 2.4 Ethernet Switches.
Hubs propagate collisions; switches do not. If the signal representing a collision were to arrive at one port of
a hub, it would, like any other signal, be retransmitted out all other ports. If a switch were to detect a collision
on one port, no other ports would be involved; only packets received successfully are ever retransmitted
out other ports.
In coaxial-cable installations, one long run of coax snaked around the computer room or suite of offices;
each computer connected somewhere along the cable. Thin coax allowed the use of T-connectors to attach
hosts; connections were made to thick coax via taps, often literally drilled into the coax central conductor.
In a standalone installation one run of coax might be the entire Ethernet; otherwise, somewhere a repeater
would be attached to allow connection to somewhere else.
Twisted-pair does not allow mid-cable attachment; it is only used for point-to-point links between hosts,
switches and hubs. In a twisted-pair installation, each cable runs between the computer location and a
central wiring closet (generally much more convenient than trying to snake coax all around the building).
Originally each cable in the wiring closet plugged into a hub; nowadays the hub has likely been replaced by
a switch.
There is still a role for hubs today when one wants to monitor the Ethernet signal from A to B (eg for
intrusion detection analysis), although some switches now also support a form of monitoring.
All three cable formats could interconnect, although only through repeaters and hubs, and all used the same
10 Mbps transmission speed. While twisted-pair cable is still used by 100 Mbps Ethernet, it generally needs
to be a higher-performance version known as Category 5, versus the 10 Mbps Category 3.
Here is the format of a typical Ethernet packet (DIX specification):
The destination and source addresses are 48-bit quantities; the type is 16 bits, the data length is variable up
to a maximum of 1500 bytes, and the final CRC checksum is 32 bits. The checksum is added by the Ethernet
hardware, never by the host software. There is also a preamble, not shown: a block of 1 bits followed by a
0, in the front of the packet, for synchronization. The type field identifies the next higher protocol layer; a
few common type values are 0x0800 = IP, 0x8137 = IPX, 0x0806 = ARP.
The IEEE 802.3 specification replaced the type field by the length field, though this change never caught on.
The two formats can be distinguished as long as the type values used are larger than the maximum Ethernet
length of 1500 (or 0x05dc); the type values given in the previous paragraph all meet this condition.
Each Ethernet card has a (hopefully unique) physical address in ROM; by default any packet sent to this
address will be received by the board and passed up to the host system. Packets addressed to other physical
addresses will be seen by the card, but ignored (by default). All Ethernet devices also agree on a broadcast
address of all 1s: a packet sent to the broadcast address will be delivered to all attached hosts.
It is sometimes possible to change the physical address of a given card in software. It is almost universally
possible to put a given card into promiscuous mode, meaning that all packets on the network, no matter
what the destination address, are delivered to the attached host. This mode was originally intended for
diagnostic purposes but became best known for the security breach it opens: it was once not unusual to find
a host with network board in promiscuous mode and with a process collecting the first 100 bytes (presumably
including userid and password) of every telnet connection.
35
As long as the manufacturer involved is diligent in assigning the second three bytes, every manufacturer-provided Ethernet address should be globally unique. Lapses, however, are not unheard of.
If we need to send less than 46 bytes of data (for example, a 40-byte TCP ACK packet), the Ethernet packet
must be padded out to the minimum length. As a result, all protocols running on top of Ethernet need to
provide some way to specify the actual data length, as it cannot be inferred from the received packet size.
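A sketch of the padding rule, with a 40-byte TCP ACK as in the example above (the payload bytes here are just placeholders):

    MIN_DATA = 46

    def pad_for_ethernet(data):
        # pad the Ethernet data field out to the 46-byte minimum
        return data + b'\x00' * max(0, MIN_DATA - len(data))

    tcp_ack = b'\x00' * 40               # a 40-byte TCP ACK, say
    frame_data = pad_for_ethernet(tcp_ack)
    print(len(frame_data))               # 46; the receiver must rely on the IP header's
                                         # own length field to recover the real 40 bytes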
As a specific example of a collision occurring as late as possible, consider the diagram below. A and B are
5 units apart, and the bandwidth is 1 byte/unit. A begins sending helloworld at T=0; B starts sending just
as A's message arrives, at T=5. B has listened before transmitting, but A's signal was not yet evident. A
doesn't discover the collision until 10 units have elapsed, which is twice the A–B distance.
Here are typical maximum values for the delay in 10 Mbps Ethernet due to various components. These
are taken from the Digital-Intel-Xerox (DIX) standard of 1982, except that point-to-point link cable is
replaced by standard cable. The DIX specification allows 1500m of coax with two repeaters and 1000m
of point-to-point cable; the table below shows 2500m of coax and four repeaters, following the later IEEE
802.3 Ethernet specification. Some of the more obscure delays have been eliminated. Entries are one-way
delay times, in bits. The maximum path may have four repeaters, and ten transceivers (simple electronic
devices between the coax cable and the NI cards), each with its drop cable (two transceivers per repeater,
plus one at each endpoint).
Ethernet delay budget

    item                  length    delay, in bits
    coax                  2500M     110 bits
    transceiver cables    500M      25 bits
    transceivers                    40 bits, max 10 units
    repeaters                       25 bits, max 4 units
    encoders                        20 bits, max 10 units
The total here is 220 bits; in a full accounting it would be 232. Some of the numbers shown are a little high,
but there are also signal rise time delays, sense delays, and timer delays that have been omitted. It works out
fairly closely.
Implicit in the delay budget table above is the length of a bit. The speed of propagation in copper is about 0.77×c, where c = 3×10⁸ m/sec = 300 m/µsec is the speed of light in vacuum. So, in 0.1 microseconds (the time to send one bit at 10 Mbps), the signal propagates approximately 0.77×c×10⁻⁷ = 23 meters.
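As a quick check of that arithmetic in Python (using the numbers from the text):

    c = 3.0e8                    # speed of light in vacuum, m/sec
    bit_time = 1.0e-7            # 0.1 microseconds: one bit time at 10 Mbps
    print(0.77 * c * bit_time)   # about 23 meters per bit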
Ethernet packets also have a maximum packet size, of 1500 bytes. This limit is primarily for the sake
of fairness, so one station cannot unduly monopolize the cable (and also so stations can reserve buffers
guaranteed to hold an entire packet). At one time hardware vendors often marketed their own incompatible
extensions to Ethernet which enlarged the maximum packet size to as much as 4KB. There is no technical
reason, actually, not to do this, except compatibility.
The signal loss in any single segment of cable is limited to 8.5 db, or about 14% of original strength.
Repeaters will restore the signal to its original strength. The reason for the per-segment length restriction
is that Ethernet collision detection requires a strict limit on how much the remote signal can be allowed to
lose strength. It is possible for a station to detect and reliably read very weak remote signals, but not at the
same time that it is transmitting locally. This is exactly what must be done, though, for collision detection
to work: remote signals must arrive with sufficient strength to be heard even while the receiving station is
itself transmitting. The per-segment limit, then, has nothing to do with the overall length limit; the latter is
set only to ensure that a sender is guaranteed of detecting a collision, even if it sends the minimum-sized
packet.
Assume that collision detection always takes one slot time (it will take much less for nodes closer together)
and that the slot start-times for each station are synchronized; this allows us to measure time in slots. A solid
arrow at the start of a slot means that sender began transmission in that slot; a red X signifies a collision. If
a collision occurs, the backoff value k is shown underneath. A dashed line shows the station waiting k slots
for its next attempt.
At T=0 we assume the transmitting station finishes, and all the Ai transmit and collide. At T=1, then, each
of the Ai has discovered the collision; each chooses a random k<2. Let us assume that A1 chooses k=1, A2
chooses k=1, A3 chooses k=0, A4 chooses k=0, and A5 chooses k=1.
Those stations choosing k=0 will retransmit immediately, at T=1. This means A3 and A4 collide again, and
at T=2 they now choose random k<4. We will assume A3 chooses k=3 and A4 chooses k=0; A3 will try
again at T=2+3=5 while A4 will try again at T=2, that is, now.
At T=2, we now have the original A1, A2, and A5 transmitting for the second time, while A4 is trying again
for the third time. They collide. Let us suppose A1 chooses k=2, A2 chooses k=1, A5 chooses k=3, and A4
chooses k=6 (A4 is choosing k<8 at random). Their scheduled transmission attempt times are now A1 at
T=3+2=5, A2 at T=4, A5 at T=6, and A4 at T=9.
At T=3, nobody attempts to transmit. But at T=4, A2 is the only station to transmit, and so successfully
seizes the channel. By the time T=5 rolls around, A1 and A3 will check the channel, that is, listen first, and
wait for A2 to finish. At T=9, A4 will check the channel again, and also begin waiting for A2 to finish.
A maximum of 1024 hosts is allowed on an Ethernet. This number apparently comes from the maximum
range for the backoff time as 0 ≤ k < 1024. If there are 1024 hosts simultaneously trying to send, then,
once the backoff range has reached k<1024 (N=10), we have a good chance that one station will succeed in
seizing the channel; that is, the minimum value of all the random k's chosen will be unique.
This backoff algorithm is not fair, in the sense that the longer a station has been waiting to send, the lower
its priority sinks. Newly transmitting stations with N=0 need not delay at all. The Ethernet capture effect,
below, illustrates this unfairness.
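The contention process above lends itself to a small simulation; here is a rough Python sketch under the simplifying assumptions already made in the text (synchronized slots, collision detection taking one slot).

    import random

    def contention(num_stations, max_exp=10):
        next_attempt = [0] * num_stations     # slot of each station's next attempt
        collisions = [0] * num_stations       # per-station collision count N
        slot = 0
        while True:
            ready = [i for i in range(num_stations) if next_attempt[i] == slot]
            if len(ready) == 1:
                return slot                   # exactly one sender: channel seized
            for i in ready:                   # two or more senders: they collide
                collisions[i] += 1
                k = random.randrange(2 ** min(collisions[i], max_exp))
                next_attempt[i] = slot + 1 + k
            slot += 1

    print(contention(5))                      # typically only a handful of slots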
2.1.7 Errors
Packets can have bits flipped or garbled by electrical noise on the cable; estimates of the frequency with
which this occurs range from 1 in 104 to 1 in 106 . Bit errors are not uniformly likely; when they occur,
they are likely to occur in bursts. Packets can also be lost in hubs, although this appears less likely. Packets
can be lost due to collisions only if the sending host makes 16 unsuccessful transmission attempts and gives
up. Ethernet packets contain a 32-bit CRC error-detecting code (see 5.4.1 Cyclical Redundancy Check:
CRC) to detect bit errors. Packets can also be misaddressed by the sending host, or, most likely of all, they
can arrive at the receiving host at a point when the receiver has no free buffers and thus be dropped by a
higher-layer protocol.
As a first look at contention intervals, assume that there are N stations waiting to transmit at the start of the
interval. It turns out that, if all follow the exponential backoff algorithm, we can expect O(N) slot times
before one station successfully acquires the channel; thus, Ethernets are happiest when N is small and there
are only a few stations simultaneously transmitting. However, multiple stations are not necessarily a severe
problem. Often the number of slot times needed turns out to be about N/2, and slot times are short. If N=20,
then N/2 is 10 slot times, or 640 bytes. However, one packet time might be 1500 bytes. If packet intervals
are 1500 bytes and contention intervals are 640 bytes, this gives an overall throughput of 1500/(640+1500)
= 70% of capacity. In practice, this seems to be a reasonable upper limit for the throughput of classic
shared-media Ethernet.
separately but not to the aggregated whole. In a fully switched (that is, no hubs) 100BASE-TX LAN, each
collision domain is simply a single twisted-pair link, subject to the 100-meter maximum length.
Fast Ethernet also introduced the concept of full-duplex Ethernet: two twisted pairs could be used, one
for each direction. Full-duplex Ethernet is limited to paths not involving hubs, that is, to single station-to-station links, where a station is either a host or a switch. Because such a link has only two potential senders,
and each sender has its own transmit line, full-duplex Ethernet is collision-free.
Fast Ethernet uses 4B/5B encoding, covered in 4.1.4 4B/5B.
Fast Ethernet 100BASE-TX does not particularly support links between buildings, due to the network-diameter limitation. However, fiber-optic point-to-point links are quite effective here, provided full-duplex
is used to avoid collisions. We mentioned above that the fiber-based 100BASE-FX standard allowed a
maximum half-duplex run of 400 meters, but 100BASE-FX is much more likely to use full duplex, where
the maximum cable length rises to 2,000 meters.
In developing faster Ethernet speeds, economics plays at least as important a role as technology. As new
speeds reach the market, the earliest adopters often must take pains to buy cards, switches and cable known
to work together; this in effect amounts to installing a proprietary LAN. The real benefit of Ethernet,
however, is arguably that it is standardized, at least eventually, and thus a site can mix and match its cards
and devices. Having a given Ethernet standard support existing cable is even more important economically;
the costs of replacing cable often dwarf the costs of the electronics.
If the destination address D is the broadcast address, or, for many switches, a multicast address, broadcast
is required.
In the diagram above, each switch's tables are indicated by listing near each interface the destinations known
to be reachable by that interface. The entries shown are the result of the following packets:
• A sends to B; all switches learn where A is
• B sends to A; this packet goes directly to A; only S3, S2 and S1 learn where B is
• C sends to B; S4 does not know where B is so this packet goes to S5; S2 does know where B is so the packet does not go to S1.
Switches do not automatically discover directly connected neighbors; S1 does not learn about A until A
transmits a packet.
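The learning rule itself is simple enough to sketch in a few lines of Python; the port numbers and host names here are placeholders.

    class LearningSwitch:
        def __init__(self, ports):
            self.ports = ports                # eg [1, 2, 3]
            self.table = {}                   # host address -> port

        def handle(self, src, dst, arrival_port):
            self.table[src] = arrival_port    # learn (or refresh) where src lives
            if dst in self.table:
                return [self.table[dst]]      # forward out the one known port
            return [p for p in self.ports if p != arrival_port]   # else flood

    s = LearningSwitch([1, 2, 3])
    print(s.handle('A', 'B', 1))   # B unknown: flood out ports 2 and 3
    print(s.handle('B', 'A', 2))   # A already learned: forward out port 1 only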
Once all the switches have learned where all (or most of) the hosts are, packet routing becomes optimal. At
this point packets are never sent on links unnecessarily; a packet from A to B only travels those links that
lie along the (unique) path from A to B. (Paths must be unique because switched Ethernet networks cannot
have loops, at least not active ones. If a loop existed, then a packet sent to an unknown destination would be
forwarded around the loop endlessly.)
Switches have an additional advantage in that traffic that does not flow where it does not need to flow is
much harder to eavesdrop on. On an unswitched Ethernet, one host configured to receive all packets can
eavesdrop on all traffic. Early Ethernets were notorious for allowing one unscrupulous station to capture,
for instance, all passwords in use on the network. On a fully switched Ethernet, a host physically only sees
the traffic actually addressed to it; other traffic remains inaccessible.
Typical switches have room for a table with 10⁴–10⁶ entries, though maxing out at 10⁵ entries may be more
common; this is usually enough to learn about all hosts in even a relatively large organization. A switched
Ethernet can fail when total traffic becomes excessive, but excessive total traffic would drown any network
(although other network mechanisms might support higher bandwidth). The main limitations specific to
switching are the requirement that the topology must be loop-free (thus disallowing duplicate paths which
might otherwise provide redundancy), and that all broadcast traffic must always be forwarded everywhere.
As a switched Ethernet grows, broadcast traffic comprises a larger and larger percentage of the total traffic,
and the organization must at some point move to a routing architecture (eg as in 7.6 IP Subnets).
One of the differences between an inexpensive Ethernet switch and a pricier one is the degree of internal
parallelism it can support. If three packets arrive simultaneously on ports 1, 2 and 3, and are destined for
respective ports 4, 5 and 6, can the switch actually transmit the packets simultaneously? A simple switch
likely has a single CPU and a single memory bus, both of which can introduce transmission bottlenecks.
For commodity five-port switches, at most two simultaneous transmissions can occur; such switches can
generally handle that degree of parallelism. It becomes harder as the number of ports increases, but at some
point the need to support full parallel operation can be questioned; in many settings the majority of traffic
involves one or two server or router ports. If a high degree of parallelism is in fact required, there are various
architectures known as switch fabrics that can be used; these typically involve multiple simple processor
elements.
When a switch sees a new root candidate, it sends BPDUs on all interfaces, indicating the distance. The
switch includes the interface leading towards the root.
Once this process is complete, each switch knows
• its own path to the root
• which of its ports any further-out switches will be using to reach the root
• for each port, its directly connected neighboring switches
Now the switch can prune some (or all!) of its interfaces. It disables all interfaces that are not enabled by
the following rules:
1. It enables the port via which it reaches the root
2. It enables any of its ports that further-out switches use to reach the root
3. If a remaining port connects to a segment to which other segment-neighbor switches connect as well,
the port is enabled if the switch has the minimum cost to the root among those segment-neighbors, or,
if a tie, the smallest ID among those neighbors, or, if two ports are tied, the port with the smaller ID.
4. If a port has no directly connected switch-neighbors, it presumably connects to a host or segment, and
the port is enabled.
Rules 1 and 2 construct the spanning tree; if S3 reaches the root via S2, then Rule 1 makes sure S3's port
towards S2 is open, and Rule 2 makes sure S2's corresponding port towards S3 is open. Rule 3 ensures that
each network segment that connects to multiple switches gets a unique path to the root: if S2 and S3 are
segment-neighbors each connected to segment N, then S2 enables its port to N and S3 does not (because
2<3). The primary concern here is to create a path for any host nodes on segment N; S2 and S3 will create
their own paths via Rules 1 and 2. Rule 4 ensures that any stub segments retain connectivity; these would
include all hosts directly connected to switch ports.
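The Rule 3 tie-breaking can be expressed compactly; in the sketch below each segment-neighbor switch is represented as a (cost-to-root, switch ID, port ID) tuple, an encoding assumed here for illustration.

    def designated(segment_neighbors):
        # minimum cost to the root, then smallest switch ID, then smallest port ID
        return min(segment_neighbors)           # Python compares tuples element by element

    # S2 and S3 both attach to segment N, each one hop from the root:
    print(designated([(1, 2, 4), (1, 3, 2)]))   # (1, 2, 4): S2 wins, since 2 < 3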
S1 has the lowest ID, and so becomes the root. S2 and S4 are directly connected, so they will enable the
interfaces by which they reach S1 (Rule 1) while S1 will enable its interfaces by which S2 and S4 reach it
(Rule 2).
S3 has a unique lowest-cost route to S1, and so again by Rule 1 it will enable its interface to S2, while by
Rule 2 S2 will enable its interface to S3.
S5 has two choices; it hears of equal-cost paths to the root from both S2 and S4. It picks the lower-numbered
neighbor S2; the interface to S4 will never be enabled. Similarly, S4 will never enable its interface to S5.
Similarly, S6 has two choices; it selects S3.
After these links are enabled (strictly speaking it is interfaces that are enabled, not links, but in all cases here
either both interfaces of a link will be enabled or neither), the network in effect becomes:
Eventually, all switches discover S1 is the root (because 1 is the smallest of {1,2,3,4,5,6,7}). S2, S3 and S4
are one (unique) hop away; S5, S6 and S7 are two hops away.
49
Algorhyme
I think that I shall never see
a graph more lovely than a tree.
A tree whose crucial property
is loop-free connectivity.
A tree that must be sure to span
so packets can reach every LAN.
First, the root must be selected.
By ID, it is elected.
Least-cost paths from root are traced.
In the tree, these paths are placed.
A mesh is made by folks like me,
then bridges find a spanning tree.
Radia Perlman
For the switches one hop from the root, Rule 1 enables S2's port 1, S3's port 1, and S4's port 1. Rule 2
enables the corresponding ports on S1: ports 1, 5 and 4 respectively. Without the spanning-tree algorithm
S2 could reach S1 via port 2 as well as port 1, but port 1 has a smaller number.
S5 has two equal-cost paths to the root: S5–S4–S1 and S5–S3–S1. S3 is the switch with the
lower ID; its port 2 is enabled and S5's port 2 is enabled.
S6 and S7 reach the root through S2 and S3 respectively; we enable S6 port 1, S2 port 3, S7 port 2 and S3
port 3.
The ports still disabled at this point are S1 ports 2 and 3, S2 port 2, S4 ports 2 and 3, S5 port 1, S6 port 2
and S7 port 1.
Now we get to Rule 3, dealing with how segments (and thus their hosts) connect to the root. Applying Rule
3,
• We do not enable S2 port 2, because the network (B) has a direct connection to the root, S1
• We do enable S4 port 3, because S4 and S5 connect that way and S4 is closer to the root. This enables connectivity of network D. We do not enable S5 port 1.
• S6 and S7 are tied for the path-length to the root. But S6 has the smaller ID, so it enables port 2. S7's port 1 is not enabled.
Finally, Rule 4 enables S4 port 2, and thus connectivity for host J. It also enables S1 port 2; network F has
two connections to S1 and port 2 is the lower-numbered connection.
All this port-enabling is done using only the data collected during the root-discovery phase; there is no
additional negotiation. The BPDU exchanges continue, however, so as to detect any changes in the topology.
If a link is disabled, it is not used even in cases where it would be more efficient to use it. That is, traffic
from F to B is sent via B1, D, and B5; it never goes through B7. IP routing, on the other hand, uses the
shortest path. To put it another way, all spanning-tree Ethernet traffic goes through the root node, or along
a path to or from the root node.
The traditional (IEEE 802.1D) spanning-tree protocol is relatively slow; the need to go through the tree-building phase means that after switches are first turned on no normal traffic can be forwarded for ~30
seconds. Faster, revised protocols have been proposed to reduce this problem.
Another issue with the spanning-tree algorithm is that a rogue switch can announce an ID of 0, thus likely
becoming the new root; this leaves that switch well-positioned to eavesdrop on a considerable fraction of
the traffic. One of the goals of the Cisco Root Guard feature is to prevent this; another goal of this and
related features is to put the spanning-tree topology under some degree of administrative control. One likely
wants the root switch, for example, to be geographically at least somewhat centered.
In the diagram above, S1 and S3 each have both red and blue ports. The switch network S1-S4 will deliver
traffic only when the source and destination ports are the same color. Red packets can be forwarded to the
blue VLAN only by passing through the router R, entering Rs red port and leaving its blue port. R may
apply firewall rules to restrict red→blue traffic.
When the source and destination ports are on the same switch, nothing needs to be added to the packet; the
switch can keep track of the color of each of its ports. However, switch-to-switch traffic must be additionally
tagged to indicate the source. Consider, for example, switch S1 above sending packets to S3 which has nodes
R3 (red) and B3 (blue). Traffic between S1 and S3 must be tagged with the color, so that S3 will know to
what ports it may be delivered. The IEEE 802.1Q protocol is typically used for this packet-tagging; a 32-bit
color tag is inserted into the Ethernet header after the source address and before the type field. The first
16 bits of this field are 0x8100, which becomes the new Ethernet type field and which identifies the frame as
tagged.
Double-tagging is possible; this would allow an ISP to have one level of tagging and its customers to have
another level.
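Here is a sketch of the tag insertion in Python; placing the VLAN ID in the low 12 bits of the tag's second 16-bit half is standard 802.1Q usage, though the text above specifies only the leading 0x8100.

    import struct

    def add_vlan_tag(frame, vlan_id, priority=0):
        # 4-byte tag: 0x8100, then priority (3 bits) and VLAN ID (low 12 bits)
        tag = struct.pack('!HH', 0x8100, (priority << 13) | (vlan_id & 0x0FFF))
        return frame[:12] + tag + frame[12:]    # 12 = 6-byte destination + 6-byte source

    untagged = b'\xff' * 6 + b'\xaa' * 6 + b'\x08\x00' + b'payload'
    tagged = add_vlan_tag(untagged, vlan_id=42)
    print(tagged[12:14].hex())                  # '8100' -- the tag now follows the source address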
2.7 Epilog
Ethernet dominates the LAN layer, but is not one single LAN protocol: it comes in a variety of speeds
and flavors. Higher-speed Ethernet seems to be moving towards fragmenting into a range of physical-layer
options for different types of cable, but all based on switches and point-to-point linking; different Ethernet
types can be interconnected only with switches. Once Ethernet finally abandons physical links that are
bi-directional (half-duplex links), it will be collision-free and thus will no longer need a minimum packet
size.
Other wired networks have largely disappeared (or have been renamed Ethernet). Wireless networks,
however, are here to stay, and for the time being at least have inherited the original Ethernet's collision-management concerns.
2.8 Exercises
1. Simulate the contention period of five Ethernet stations that all attempt to transmit at T=0 (presumably
when some sixth station has finished transmitting), in the style of the diagram in 2.1.4 Exponential Backoff
Algorithm. Assume that time is measured in slot times, and that exactly one slot time is needed to detect a
collision (so that if two stations transmit at T=1 and collide, and one of them chooses a backoff time k=0,
then that station will transmit again at T=2). Use coin flips or some other source of randomness.
2. Suppose we have Ethernet switches S1 through S3 arranged as below. All forwarding tables are initially
empty.
[Diagram omitted: switches S1, S2, S3]
3. Suppose we have the Ethernet switches S1 through S4 arranged as below. All forwarding tables are
empty; each switch uses the learning algorithm of 2.4 Ethernet Switches.
[diagram: hosts A and B and switches S1, S2, S3, S4]
Hint: Destination D must be in S3's forwarding table, but must not be in S2's.
5. Given the Ethernet network with learning switches below, with (disjoint) unspecified parts represented by ?, explain why it is impossible for a packet sent from A to B to be forwarded by S1 only to S2, but to be forwarded by S2 out all of S2's other ports.
[diagram: switches S1 and S2, each with an unspecified network ? attached]
6. In the diagram of 2.4 Ethernet Switches, suppose node D is connected to S5, and, with the tables as
shown below the diagram, D sends to B.
(a). Which switches will see this packet, and thus learn about D?
(b). Which of the switches in part (a) do not already know where B is and will use fallback-to-broadcast (ie, will forward the packet out all non-arrival interfaces)?
7. Suppose two Ethernet switches are connected in a loop as follows; S1 and S2 have their interfaces 1 and
2 labeled. These switches do not use the spanning-tree algorithm.
Suppose A attempts to send a packet to destination B, which is unknown. S1 will therefore forward the packet out interfaces 1 and 2. What happens then? How long will A's packet circulate?
8. The following network is like that of 2.5.1 Example 1: Switches Only, except that the switches are
numbered differently. Again, the ID of switch Sn is n, so S1 will be the root. Which links end up pruned
by the spanning-tree algorithm, and why?
[diagram: switches S1 through S6]
9. Suppose you want to develop a new protocol so that Ethernet switches participating in a VLAN all keep
track of the VLAN color associated with every destination. Assume that each switch knows which of its
ports (interfaces) connect to other switches and which may connect to hosts, and in the latter case knows the
color assigned to that port.
(a). Suggest a way by which switches might propagate this destination-color information to other switches.
(b). What happens if a port formerly reserved for connection to another switch is now used for a host?
3 OTHER LANS
In the wired era, one could get along quite well with nothing but Ethernet and the occasional long-haul point-to-point link joining different sites. However, there are important alternatives out there. Some, like token
ring, are mostly of historical importance; others, like virtual circuits, are of great conceptual importance but
so far of only modest day-to-day significance.
And then there is wireless. It would be difficult to imagine contemporary laptop networking, let alone mobile
devices, without it. In both homes and offices, Wi-Fi connectivity is the norm. A return to being tethered by
wires is almost unthinkable.
After the VPN is set up, the home host's tun0 interface appears to be locally connected to Site A, and thus the home host is allowed to connect to the private area within Site A. The home host's forwarding table will be configured so that traffic to Site A's private addresses is routed via interface tun0.
VPNs are also commonly used to connect entire remote offices to headquarters. In this case the remote-office end of the tunnel will be at that office's local router, and the tunnel will carry traffic for all the workstations in the remote office.
To improve security, it is common for the residential (or remote-office) end of the VPN connection to use
the VPN connection as the default route for all traffic except that needed to maintain the VPN itself. This
may require a so-called host-specific forwarding-table entry at the residential end to allow the packets that
carry the VPN tunnel traffic to be routed correctly via eth0. This routing strategy means that potential
intruders cannot access the residential host and thus the workplace internal network through the original
residential Internet access. A consequence is that if the home worker downloads a large file from a non-workplace site, it will travel first to the workplace, then back out to the Internet via the VPN connection, and
finally arrive at the home.
3.3 Wi-Fi
Wi-Fi is a trademark denoting any of several IEEE wireless-networking protocols in the 802.11 family,
specifically 802.11a, 802.11b, 802.11g, 802.11n, and 802.11ac. Like classic Ethernet, Wi-Fi must deal
with collisions; unlike Ethernet, however, Wi-Fi is unable to detect collisions in progress, complicating the
backoff and retransmission algorithms. Wi-Fi is designed to interoperate freely with Ethernet at the logical
LAN layer; that is, Ethernet and Wi-Fi traffic can be freely switched from the wired side to the wireless side.
Band Width
To radio engineers, band width means the frequency range used by a signal, not data rate; in keeping
with this we will in this section and 3.4 WiMAX use the term data rate instead of bandwidth. We
will use the terms channel width or width of the frequency band for the frequency range. All else
being equal, the data rate achievable with a radio signal is proportional to the channel width.
Generally, Wi-Fi uses the 2.4 GHz ISM (Industrial, Scientific and Medical) band used also by microwave
ovens, though 802.11a uses a 5 GHz band, 802.11n supports that as an option and the new 802.11ac has
returned to using 5 GHz exclusively. The 5 GHz band has reduced ability to penetrate walls, often resulting
in a lower effective range. Wi-Fi radio spectrum is usually unlicensed, meaning that no special permission
is needed to transmit but also that others may be trying to use the same frequency band simultaneously; the
availability of unlicensed channels in the 5 GHz band continues to evolve.
The table below summarizes the different Wi-Fi versions. All bit rates assume a single spatial stream;
channel widths are nominal.
IEEE name    frequency    channel width
802.11a      5 GHz        20 MHz
802.11b      2.4 GHz      20 MHz
802.11g      2.4 GHz      20 MHz
802.11n      2.4/5 GHz    20-40 MHz
802.11ac     5 GHz        20-160 MHz
The maximum bit rate is seldom achieved in practice. The effective bit rate must take into account, at a
minimum, the time spent in the collision-handling mechanism. More significantly, all the Wi-Fi variants
above use dynamic rate scaling, below; the bit rate is reduced up to tenfold (or more) in environments with
higher error rates, which can be due to distance, obstructions, competing transmissions or radio noise. All
this means that, as a practical matter, getting 150 Mbps out of 802.11n requires optimum circumstances; in
particular, no competing senders and unimpeded line-of-sight transmission. 802.11n lower-end performance
can be as little as 10 Mbps, though 40-100 Mbps (for a 40 MHz channel) may be more typical.
The 2.4 GHz ISM band is divided by international agreement into up to 14 officially designated channels,
each about 5 MHz wide, though in the United States use may be limited to the first 11 channels. The 5 GHz
band is similarly divided into 5 MHz channels. One Wi-Fi sender, however, needs several of these official
channels; the typical 2.4 GHz 802.11g transmitter uses an actual frequency range of up to 22 MHz, or up
to five channels. As a result, to avoid signal overlap Wi-Fi use in the 2.4 GHz band is often restricted to
official channels 1, 6 and 11. The end result is that unrelated Wi-Fi transmitters can and do interact with and
interfere with each other.
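To make the channel-spacing arithmetic concrete, the following sketch (an illustration, not part of the text) computes the nominal center frequencies of 2.4 GHz channels 1-13 (channel 14 is a special case and is omitted) and checks which pairs of channels overlap for a transmitter occupying 22 MHz; channels 1, 6 and 11 come out mutually non-overlapping.

# Nominal center frequencies of 2.4 GHz channels 1-13: 2412 MHz plus 5 MHz per step.
def center_mhz(channel: int) -> int:
    return 2412 + 5 * (channel - 1)

def overlap(ch1: int, ch2: int, signal_width_mhz: int = 22) -> bool:
    # Two transmissions overlap if their center frequencies are closer
    # than the width one signal actually occupies.
    return abs(center_mhz(ch1) - center_mhz(ch2)) < signal_width_mhz

print(overlap(1, 6))    # False: centers 25 MHz apart, safely separated
print(overlap(1, 4))    # True: centers only 15 MHz apart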
The United States requires users of the 5 GHz band to avoid interfering with weather and military applications in the same frequency range. Once that is implemented, however, there are more 5 MHz channels at
this frequency than in the 2.4 GHz ISM band, which is one of the reasons 802.11ac can run faster (below).
Wi-Fi designers can improve speed through a variety of techniques, including
• improved radio modulation techniques
• improved error-correcting codes
• smaller guard intervals between symbols
• increasing the channel width
• allowing multiple spatial streams via multiple antennas
The first two in this list seem by now to be largely tapped out; the third reduces the range but may increase
the data rate by 11%.
The largest speed increases are obtained by increasing the number of 5 MHz channels used. For example,
the 65 Mbps bit rate above for 802.11n is for a nominal frequency range of 20 MHz, comparable to that
of 802.11g. However, in areas with minimal competition from other signals, 802.11n supports using a 40
MHz frequency band; the bit rate then goes up to 135 Mbps (150 Mbps with a smaller guard interval). This
amounts to using two of the three available 2.4 GHz Wi-Fi bands. Similarly, the wide range in 802.11ac
bit rates reflects support for using channel widths ranging from 20 MHz up to 160 MHz (32 5-MHz official
channels).
For all the categories in the table above, additional bits are used for error-correcting codes. For 802.11g operating at 54 Mbps, for example, the actual raw bit rate is (4/3)×54 = 72 Mbps, sent in symbols consisting of six bits as a unit.
it waits for time SIFS and sends the ACK; at the instant when the end of the SIFS interval is reached, the
receiver will be the only station authorized to send. Any other stations waiting the longer IFS period will see
the ACK before the IFS time has elapsed and will thus not interfere with the ACK; similarly, any stations
with a running backoff-wait clock will continue to have that clock suspended.
3.3.2.1 Wi-Fi RTS/CTS
Wi-Fi stations optionally also use a request-to-send/clear-to-send (RTS/CTS) protocol. Usually this is
used only for larger packets; often, the RTS/CTS threshold (the size of the largest packet not sent using RTS/CTS) is set (as part of the Access Point configuration) to be the maximum packet size, effectively
disabling this feature. The idea here is that a large packet that is involved in a collision represents a significant waste of potential throughput; for large packets, we should ask first.
The RTS packet, which is small, is sent through the normal procedure outlined above; this packet includes
the identity of the destination and the size of the data packet the station desires to transmit. The destination
station then replies with CTS after the SIFS wait period, effectively preventing any other transmission after
the RTS. The CTS packet also contains the data-packet size. The original sender then waits for SIFS after
receiving the CTS, and sends the packet. If all other stations can hear both the RTS and CTS messages, then
once the RTS and CTS are sent successfully no collisions should occur during packet transmission, again
because the only idle times are of length SIFS and other stations should be waiting for time IFS.
3.3.2.2 Hidden-Node Problem
Consider the diagram below. Each station has a 100-meter range. Stations A and B are 150 meters apart and
so cannot hear one another at all; each is 75 meters from C. If A is transmitting and B senses the medium in
preparation for its own transmission, as part of collision avoidance, then B will conclude that the medium is
idle and will go ahead and send.
However, C is within range of both A and B. If A and B transmit simultaneously, then from C's perspective a collision occurs. C receives nothing usable. We will call this a hidden-node collision as the senders A and B are hidden from one another; the general scenario is known as the hidden-node problem.
Note that node D receives only A's signal, and so no collision occurs at D.
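The geometry of this example can be checked in a few lines; this small sketch (illustrative only) places the stations on a line and tests who can hear whom, given the 100-meter range assumed above. D's position is a guess consistent with the statement that D hears only A.

RANGE = 100.0   # each station's radio range, in meters

# Positions on a line: A and B are 150 m apart; C is midway, 75 m from each.
position = {"A": 0.0, "C": 75.0, "B": 150.0, "D": -50.0}

def can_hear(x: str, y: str) -> bool:
    return abs(position[x] - position[y]) <= RANGE

print(can_hear("A", "B"))                      # False: A and B are hidden from each other
print(can_hear("A", "C"), can_hear("B", "C"))  # True True: both reach C, where the collision occurs
print(can_hear("B", "D"))                      # False: D hears only A, so no collision at D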
The hidden-node problem can also occur if A and B cannot receive one another's transmissions due to a physical obstruction such as a radio-impermeable wall:
One of the rationales for the RTS/CTS protocol is the prevention of hidden-node collisions. Imagine that,
instead of transmitting its data packet, A sends an RTS packet, and C responds with CTS. B has not heard
the RTS packet from A, but does hear the CTS from C. A will begin transmitting after a SIFS interval, but B
will not hear As transmission. However, B will still wait, because the CTS packet contained the data-packet
size and thus, implicitly, the length of time all other stations should remain idle. Because RTS packets are
quite short, they are much less likely to be involved in collisions themselves than data packets.
3.3.2.3 Wi-Fi Fragmentation
Conceptually related to RTS/CTS is Wi-Fi fragmentation. If error rates or collision rates are high, a sender
can send a large packet as multiple fragments, each receiving its own link-layer ACK. As we shall see in
5.3.1 Error Rates and Packet Size, if bit-error rates are high then sending several smaller packets often
leads to fewer total transmitted bytes than sending the same data as one large packet.
Wi-Fi packet fragments are reassembled by the receiving node, which may or may not be the final destination.
As with the RTS/CTS threshold, the fragmentation threshold is often set to the size of the maximum packet.
Adjusting the values of these thresholds is seldom necessary, though might be appropriate if monitoring
revealed high collision or error rates. Unfortunately, it is essentially impossible for an individual station
to distinguish between reception errors caused by collisions and reception errors caused by other forms of
noise, and so it is hard to use reception statistics to distinguish between a need for RTS/CTS and a need for
fragmentation.
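The intuition that fragmentation helps at high bit-error rates can be checked with a quick expected-value calculation. The sketch below is illustrative only: it uses a made-up bit-error rate, ignores per-fragment header and ACK overhead, and compares one 4000-bit packet against four 1000-bit fragments, each resent independently until it arrives intact.

def expected_bits_sent(total_bits: int, fragments: int, ber: float) -> float:
    """Expected total bits transmitted when the data is split into equal
    fragments, each retransmitted independently until received error-free."""
    frag_bits = total_bits / fragments
    p_success = (1.0 - ber) ** frag_bits      # probability a fragment has no bit errors
    return fragments * frag_bits / p_success  # geometric expected number of transmissions

BER = 1e-3   # an (assumed) high bit-error rate
print(expected_bits_sent(4000, 1, BER))   # one large packet: roughly 218,000 bits expected
print(expected_bits_sent(4000, 4, BER))   # four fragments:   roughly 10,900 bits expected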
sender may fall back to the next lower bit rate. The actual bit-rate-selection algorithm lives in the particular
Wi-Fi driver in use; different nodes in a network may use different algorithms.
The earliest rate-scaling algorithm was Automatic Rate Fallback, or ARF, [KM97]. The rate decreases after
two consecutive transmission failures (that is, the link-layer ACK is not received), and increases after ten
transmission successes.
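A rough sketch of the ARF rule as just described, with the rate stepping down after two consecutive failures and up after ten successes (taken here as consecutive; implementations vary). The rate list is the 802.11g set and is illustrative.

class ARF:
    """Automatic Rate Fallback, roughly as described in the text."""
    RATES = [6, 9, 12, 18, 24, 36, 48, 54]     # 802.11g bit rates, Mbps

    def __init__(self):
        self.index = len(self.RATES) - 1       # start optimistically at 54 Mbps
        self.failures = 0
        self.successes = 0

    def report(self, ack_received: bool) -> int:
        """Record the outcome of one transmission; return the rate to use next."""
        if ack_received:
            self.successes += 1
            self.failures = 0
            if self.successes >= 10 and self.index < len(self.RATES) - 1:
                self.index += 1                # ten successes: try the next higher rate
                self.successes = 0
        else:
            self.failures += 1
            self.successes = 0
            if self.failures >= 2 and self.index > 0:
                self.index -= 1                # two consecutive failures: fall back
                self.failures = 0
        return self.RATES[self.index]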
A significant problem for rate scaling is that a packet loss may be due either to low-level random noise
(white noise, or thermal noise) or to a collision (which is also a form of noise, but less random); only in the
first case is a lower transmission rate likely to be helpful. If a larger number of collisions is experienced, the
longer packet-transmission times caused by the lower bit rate may increase the frequency of hidden-node
collisions. In fact, a higher transmission rate (leading to shorter transmission times) may help; enabling the
RTS/CTS protocol may also help.
Signal Strength
Most Wi-Fi drivers report the received signal strength. Newer drivers use the IEEE Received Channel
Power Indicator convention; the RCPI is an 8-bit integer proportional to the absolute power received
by the antenna as measured in decibel-milliwatts (dBm). Wi-Fi values range from -10 dBm to -90 dBm
and below. For comparison, the light from the star Polaris delivers about -97 dBm to one eye on a good
night; Venus typically delivers about -73 dBm. A GPS satellite might deliver -127 dBm to your phone.
(Inspired by the Wikipedia article on dBm.)
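To relate the dBm figures in the sidebar to absolute power: dBm is ten times the base-10 logarithm of the power in milliwatts. A short, illustrative conversion sketch:

import math

def dbm_to_milliwatts(dbm: float) -> float:
    return 10 ** (dbm / 10.0)

def milliwatts_to_dbm(mw: float) -> float:
    return 10.0 * math.log10(mw)

print(dbm_to_milliwatts(-10))   # 0.1 mW: a very strong Wi-Fi signal
print(dbm_to_milliwatts(-90))   # 1e-09 mW (1e-12 W): near the usable lower limit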
A variety of newer rate-scaling algorithms have been proposed; see [JB05] for a summary. One, Receiver-Based Auto Rate (RBAR, [HVB01]), attempts to incorporate the signal-to-noise ratio into the calculation of the transmission rate. This avoids the confusion introduced by collisions. Unfortunately, while the signal-to-noise ratio has a strong theoretical correlation with the transmission bit-error rate, most Wi-Fi radios will report to the host system the received signal strength. This is not the same as the signal-to-noise ratio, which is harder to measure. As a result, the RBAR approach has not been quite as effective in practice as might be hoped.
The Collision-Aware Rate Adaptation algorithm (CARA, [KKCQ06]) attempts (among other things) to infer
that a packet was lost to a collision rather than noise if, after one SIFS interval following the end of the packet
transmission, no link-layer ACK has been received and the channel is still busy. This will detect collisions,
of course, only with longer packets.
Because the actual data in a Wi-Fi packet may be sent at a rate not every participant is close enough to
receive correctly, every Wi-Fi transmission begins with a brief preamble at the minimum bit rate. Link-layer
ACKs, too, are sent at the minimum bit rate.
out by an exchange of special management packets may be restricted to stations with hardware (LAN)
addresses on a predetermined list, or to stations with valid cryptographic credentials. Stations may regularly
re-associate to their Access Point, especially if they wish to communicate some status update.
Access Points
Generally, a Wi-Fi access point has special features; Wi-Fi-enabled station devices like phones and
workstations do not act as access points. However, it may be possible for a station device to become an access point if the access-point mode is supported by the underlying radio hardware and if suitable drivers can be found. Under Linux, the hostapd package is one option.
Stations in an infrastructure network communicate directly only with their access point. If B and C share
access point A, and B wishes to send a packet to C, then B first forwards the packet to A and A then forwards
it to C. While this introduces a degree of inefficiency, it does mean that the access point and its associated
nodes automatically act as a true LAN: every node can reach every other node. In an ad hoc network, by
comparison, it is quite common for two nodes to be able to reach each other only by forwarding through an
intermediate third node; this is in fact exactly the hidden-node scenario.
Finally, Wi-Fi is by design completely interoperable with Ethernet; if station A is associated with access
point AP, and AP also connects via (cabled) Ethernet to station B, then if A wants to send a packet to B it
sends it using AP as the Wi-Fi destination but with B also included in the header as the actual destination.
Once it receives the packet by wireless, AP acts as an Ethernet switch and forwards the packet to B.
While this forwarding is transparent to senders, the Ethernet and Wi-Fi LAN header formats are entirely
different.
[diagram: an Ethernet header compared with the Wi-Fi data-packet header]
The above diagram illustrates an Ethernet header and the Wi-Fi header for a typical data packet not using
Wi-Fi quality-of-service features. The Ethernet type field usually moves to an IEEE Logical Link Control
header in the Wi-Fi region labeled data. The receiver and transmitter addresses are the MAC addresses
of the nodes receiving and transmitting the (unicast) packet; these may each be different from the ultimate
destination and source addresses. In infrastructure mode one of the receiver or transmitter addresses is the
access point; in typical situations either the receiver is the destination or the sender is the transmitter.
for pseudo-security reasons beacon packets can be suppressed). Large installations can create roaming access among multiple access points by assigning all the access points the same SSID. An individual station
will stay with the access point with which it originally associated until the signal strength falls below a certain level, at which point it will seek out other access points with the same SSID and with a stronger signal.
In this way, a large area can be carpeted with multiple Wi-Fi access points, so as to look like one large Wi-Fi
domain.
In order for this to work, traffic to wireless node B must find B's current access point AP. This is done in
much the same way as, in a wired Ethernet, traffic finds a laptop that has been unplugged, carried to a new
building, and plugged in again. The distribution network is the underlying wired network (eg Ethernet) to
which all the access points connect. If the distribution network is a switched Ethernet supporting the usual
learning mechanism (2.4 Ethernet Switches), then Wi-Fi location update is straightforward. Suppose B
is a wireless node that has been exchanging packets via the distribution network with C (perhaps a router
connecting B to the Internet). When B moves to a new access point, all it has to do is send any packet over
the LAN to C, and the Ethernet switches involved will then learn the route through the switched Ethernet
from C to B's current AP, and thus to B.
This process may leave other switches not currently communicating with B still holding in their forwarding tables the old location for B. This is not terribly serious, but can be avoided entirely if, after moving, B
sends out an Ethernet broadcast packet.
Ad hoc networks also have SSIDs; these are generated pseudorandomly at startup. Ad hoc networks have
beacon packets as well; all nodes participate in the regular transmission of these via a distributed algorithm.
To begin the association process, the supplicant contacts the authenticator using the Extensible Authentication Protocol, or EAP, with what amounts to a request to associate to that access point. EAP is a generic
message framework meant to support multiple specific types of authentication; see RFC 3748 and RFC
5247. The EAP request is forwarded to an authentication server, which may exchange (via the authenticator) several challenge/response messages with the supplicant. EAP is usually used in conjunction with
the RADIUS (Remote Authentication Dial-In User Service) protocol (RFC 2865), which is a specific (but
flexible) authentication-server protocol. WPA-Enterprise is sometimes known as 802.1X mode, EAP mode
or RADIUS mode.
One peculiarity of EAP is that EAP communication takes place before the supplicant is given an IP address
(in fact before the supplicant has completed associating itself to the access point); thus, a mechanism must
be provided to support exchange of EAP packets between supplicant and authenticator. This mechanism
is known as EAPOL, for EAP Over LAN. EAP messages between the authenticator and the authentication
server, on the other hand, can travel via IP; in fact, sites may choose to have the authentication server hosted
remotely.
Once the authentication server (eg RADIUS server) is set up, specific per-user authentication methods can be entered. This can amount to ⟨username,password⟩ pairs, or some form of security certificate, or often
both. The authentication server will generally allow different encryption protocols to be used for different
supplicants, thus allowing for the possibility that there is not a common protocol supported by all stations.
When this authentication strategy is used, the access point no longer needs to know anything about what
authentication protocol is actually used; it is simply the middleman forwarding EAP packets between the
supplicant and the authentication server. The access point allows the supplicant to associate into the network
once it receives permission to do so from the authentication server.
Stations receiving data from the Access Point send the usual ACK after a SIFS interval. A data packet from
the Access Point addressed to station B may also carry, piggybacked in the Wi-Fi header, a Poll request to
another station C; this saves a transmission. Polled stations that send data will receive an ACK from the
Access Point; this ACK may be combined in the same packet with the Poll request to the next station.
At the end of the CFP, the regular contention period or CP resumes, with the usual CSMA/CA strategy.
The time interval between the start times of consecutive CFP periods is typically 100 ms, short enough to
allow some real-time traffic to be supported.
During the CFP, all stations normally wait only the Short IFS, SIFS, between transmissions. This works
because normally there is only one station designated to respond: the Access Point or the polled station.
However, if a station is polled and has nothing to send, the Access Point waits for time interval PIFS (PCF
Inter-Frame Spacing), of length midway between SIFS and IFS above (our previous IFS should now really
be known as DIFS, for DCF IFS). At the expiration of the PIFS, any non-Access-Point station that happens
to be unaware of the CFP will continue to wait the full DIFS, and thus will not transmit. An example of
such a CFP-unaware station might be one that is part of an entirely different but overlapping Wi-Fi network.
The Access Point generally maintains a polling list of stations that wish to be polled during the CFP. Stations
request inclusion on this list by an indication when they associate or (more likely) reassociate to the Access
Point. A polled station with nothing to send simply remains quiet.
PCF mode is not supported by many lower-end Wi-Fi routers, and often goes unused even when it is available. Note that PCF mode is collision-free, so long as no other Wi-Fi access points are active and within
range. While the standard has some provisions for attempting to deal with the presence of other Wi-Fi
networks, these provisions are somewhat imperfect; at a minimum, they are not always supported by other
access points. The end result is that polling is not quite as useful as it might be.
3.3.8 MANETs
The MANET acronym stands for mobile ad hoc network; in practice, the term generally applies to ad hoc
wireless networks of sufficient complexity that some internal routing mechanism is needed to enable full
connectivity. The term mesh network is also used. While MANETs can use any wireless mechanism, we
will assume here that Wi-Fi is used.
MANET nodes communicate by radio signals with a finite range, as in the diagram below.
Each node's radio range is represented by a circle centered about that node. In general, two MANET nodes
may be able to communicate only by relaying packets through intermediate nodes, as is the case for nodes
A and G in the diagram above.
In the field, the radio range of each node may not be very circular, due to among other things signal reflection and blocking from obstructions. An additional complication arises when the nodes (or even just
obstructions) are moving in real time (hence the mobile of MANET); this means that a working route may
stop working a short time later. For this reason, and others, routing within MANETs is a good deal more
complex than routing in an Ethernet. A switched Ethernet, for example, is required to be loop-free, so there
is never a choice among multiple alternative routes.
Note that, without successful LAN-layer routing, a MANET does not have full node-to-node connectivity
and thus does not meet the definition of a LAN given in 1.9 LANs and Ethernet. With either LAN-layer or
IP-layer routing, one or more MANET nodes may serve as gateways to the Internet.
Note also that MANETs in general do not support broadcast, unless the forwarding of broadcast messages
throughout the MANET is built in to the routing mechanism. This can complicate the assignment of IP
addresses; the common IPv4 mechanism we will describe in 7.8 Dynamic Host Configuration Protocol
(DHCP) relies on broadcast and so usually needs some adaptation.
Finally, we observe that while MANETs are of great theoretical interest, their practical impact has been
modest; they are almost unknown, for example, in corporate environments. They appear most useful in
emergency situations, rural settings, and settings where the conventional infrastructure network has failed
or been disabled.
3.3.8.1 Routing in MANETs
Routing in MANETs can be done either at the LAN layer, using physical addresses, or at the IP layer with
some minor bending (below) of the rules.
Either way, nodes must find out about the existence of other nodes, and appropriate routes must then be
selected. Route selection can use any of the mechanisms we describe later in 9 Routing-Update Algorithms.
Routing at the LAN layer is much like routing by Ethernet switches; each node will construct an appropriate
forwarding table. Unlike Ethernet, however, there may be multiple paths to a destination, direct connectivity
between any particular pair of nodes may come and go, and negotiation may be required even to determine
which MANET nodes will serve as forwarders.
Routing at the IP layer involves the same issues, but at least IP-layer routing-update algorithms have always
been able to handle multiple paths. There are some minor issues, however. When we initially presented
IP forwarding in 1.10 IP - Internet Protocol, we assumed that routers made their decisions by looking
only at the network prefix of the address; if another node had the same network prefix it was assumed to be
reachable directly via the LAN. This model usually fails badly in MANETs, where direct reachability has
nothing to do with addresses. At least within the MANET, then, a modified forwarding algorithm must be
used where every address is looked up in the forwarding table. One simple way to implement this is to have
the forwarding tables contain only host-specific entries as were discussed in 3.1 Virtual Private Network.
Multiple routing algorithms have been proposed for MANETs. Performance of a given algorithm may
depend on the following factors:
• The size of the network
• Whether some nodes have agreed to serve as routers
• The degree of node mobility, especially of routing-node mobility if applicable
• Whether the nodes are under common administration, and thus may agree to defer their own transmission interests to the common good
• Per-node storage and power availability
3.4 WiMAX
WiMAX is a wireless network technology standardized by IEEE 802.16. It supports both stationary subscribers (802.16d) and mobile subscribers (802.16e). The stationary-subscriber version is often used to
provide residential Internet connectivity, in both urban and rural areas. The mobile version is sometimes referred to as a "fourth generation" or 4G networking technology; its primary competitor is known as LTE. WiMAX is used in many mobile devices, from smartphones to traditional laptops with wireless
cards installed.
As in the sidebar at the start of 3.3 Wi-Fi we will use the term data rate for what is commonly called
bandwidth to avoid confusion with the radio-specific meaning of the latter term.
WiMAX can use unlicensed frequencies, like Wi-Fi, but its primary use is over licensed radio spectrum.
WiMAX also supports a number of options for the width of its frequency band; the wider the band, the
higher the data rate. Wider bands also allow the opportunity for multiple independent frequency channels.
Downlink (base station to subscriber) data rates can be well over 100 Mbps (uplink rates are usually smaller).
Like Wi-Fi, WiMAX subscriber stations connect to a central access point, though the WiMAX standard
prefers the term base station which we will use henceforth. Stationary-subscriber WiMAX, however, operates on a much larger scale. The coverage radius of a WiMAX base station can be tens of kilometers
if larger antennas are provided, versus less (sometimes much less) than 100 meters for Wi-Fi; mobile-subscriber WiMAX might have a radius of one or two kilometers. Large-radius base stations are typically
mounted in towers. Subscriber stations are not generally expected to be able to hear other stations; they
interact only with the base station. As WiMAX distances increase, the data rate is reduced.
As with Wi-Fi, the central contention problem is how to schedule transmissions of subscriber stations so
they do not overlap; that is, collide. The base station has no difficulty broadcasting transmissions to multiple
different stations sequentially; it is the transmissions of those stations that must be coordinated. Once a
station completes the network entry process to connect to a base station (below), it is assigned regular
(though not necessarily periodic) transmission slots. These transmission slots may vary in size over time;
the base station may regularly issue new transmission schedules.
The centralized assignment of transmission intervals superficially resembles Wi-Fi PCF mode (3.3.7 Wi-Fi Polling Mode); however, assignment is not done through polling, as propagation delays are too large
(below). Instead, each WiMAX subscriber station is told in effect that it may transmit starting at an assigned
time T and for an assigned length L. The station synchronizes its clock with that of the base station as part
of the network entry process.
Because of the long distances involved, synchronization and transmission protocols must take account of
speed-of-light delays. The round-trip delay across 30 km is 200 µsec which is ten times larger than the basic
Wi-Fi SIFS interval; at 160 Mbps, this is the time needed to send 4 KB. If a station is to transmit so that its
message arrives at the base station at a certain time, it must actually begin transmission early by an amount
equal to the one-way station-to-base propagation delay; a special ranging mechanism allows stations to
figure out this delay.
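The numbers in the previous paragraph can be reproduced directly. This small, illustrative sketch computes the 30 km round-trip delay, the number of bits "in flight" at 160 Mbps, and the early-transmission offset for a station a given distance from the base station.

C = 300.0   # speed of light, roughly 300 meters per microsecond

def round_trip_usec(distance_m: float) -> float:
    return 2 * distance_m / C

def bits_in_flight(rate_mbps: float, usec: float) -> float:
    return rate_mbps * usec              # Mbps x microseconds = bits

rtt = round_trip_usec(30_000)            # 200 usec for a 30 km radius
print(rtt, bits_in_flight(160, rtt))     # 200.0, 32000.0 bits, ie about 4 KB

# A station 15 km out must begin transmitting 50 usec (its one-way delay)
# before its scheduled arrival time at the base station.
print(15_000 / C)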
A subscriber station begins the network-entry connection process to a base station by listening for the base
station's transmissions (which may be organized into multiple channels); these message streams contain
regular management messages containing, among other things, information about available data rates in
each direction.
Also included in the base station's message stream is information about start times for ranging intervals.
The station waits for one of these intervals and sends a range-request message to the base station. These
ranging intervals are open to all stations attempting network entry, and if another station transmits at the
same time there will be a collision. However, network entry is only done once (for a given base station) and
so the likelihood of a collision in any one ranging interval is small. An Ethernet/Wi-Fi-like exponential-backoff process is used if a collision does occur. Ranging intervals are the only times when collisions can
occur; afterwards, all station transmissions are scheduled by the base station.
If there is no collision, the base station responds, and the station now knows the propagation delay and thus
can determine when to transmit so that its data arrives at the base station exactly at a specified time. The
station also determines its transmission signal strength from this ranging process.
Finally, and perhaps most importantly, the station receives from the base station its first timeslot for a
scheduled transmission. These timeslot assignments are included in regular uplink-map packets broadcast
by the base station. Each station's timeslot includes both a start time and a total length; lengths are in the range of 2 to 20 ms. Future timeslots will be allocated as necessary by the base station, in future uplink-map packets. Scheduled timeslots may be periodic (as would be appropriate for voice) or may occur at varying
intervals. WiMAX stations may request any of several quality-of-service levels and the base station may
take these requests into account when determining the schedule. The base station also creates a downlink
schedule, but this does not need to be communicated to the subscriber stations; the base station simply uses
it to decide what to broadcast when to the stations. When scheduling the timeslots, the base station may also
take into account availability of multiple transmission channels and of directional antennas.
Through the uplink-map schedules and individual ranging, each station transmits so that one transmission
finishes arriving just before the next transmission begins arriving, as seen from the perspective of the base
station. Only minimal guard intervals need be included between consecutive transmissions. Two (or
more) consecutive transmissions may in fact be in the air simultaneously, as far-away stations need to
begin transmitting early so their signals will arrive at the base station at the expected time. The following
diagram illustrates this for stations separated by relatively large physical distances.
Mobile stations will need to update their ranging information regularly, but this can be done through future
scheduled transmissions. The distance to the base station is used not only for the mobile station's transmission timing, but also to determine its power level; signals from each mobile station, no matter where located,
should arrive at the base station with about the same power.
When a station has data to send, it includes in its next scheduled transmission a request for a longer transmission interval; if the request is granted, the station may send the data (or at least some of the data) in its
next scheduled transmission slot. When a station is done transmitting, its timeslot shrinks to the minimum,
and may be scheduled less frequently as well, but it does not disappear. Stations without data to send remain
connected to the base station by sending empty messages during these slots.
Trees vs Signal
Photo of the author attempting to improve his 2.4 GHz terrestrial-wireless signal via tree trimming.
Terrestrial fixed wireless was originally popularized for rural areas, where residential density is too low for
economical cable connections. However, some fixed-wireless ISPs now operate in urban areas, often using
WiMAX. One advantage of terrestrial fixed-wireless in remote areas is that the antenna covers a much
smaller geographical area than a satellite, generally meaning that there is more data bandwidth available per
user and the cost per megabyte is much lower.
Outdoor subscriber antennas often use a parabolic dish to improve reception; sizes range from 10 to 50 cm
in diameter. The size of the dish may depend on the distance to the central tower.
While there are standardized fixed-wireless systems, such as WiMAX, there are also a number of proprietary alternatives, including systems from Trango and Canopy. Fixed-wireless systems might, in fact, be
considered one of the last bastions of proprietary LAN protocols. This lack of standardization is due to a
variety of factors; two primary ones are the relatively modest overall demand for this service and the fact
that most antennas need to be professionally installed by the ISP to ensure that they are properly mounted,
aligned, grounded and protected from lightning.
Packets will be transmitted in one direction (clockwise in the ring above). Stations in effect forward most
packets around the ring, although they can also remove a packet. (It is perhaps more accurate to think of the forwarding as representing the default cable connectivity; non-forwarding represents the station's momentarily breaking that connectivity.)
When the network is idle, all stations agree to forward a special, small packet known as a token. When a
station, say A, wishes to transmit, it must first wait for the token to arrive at A. Instead of forwarding the
token, A then transmits its own packet; this travels around the network and is then removed by A. At that
point (or in some cases at the point when A finishes transmitting its data packet) A then forwards the token.
In a small ring network, the ring circumference may be a small fraction of one packet. Ring networks
become large at the point when some packets may be entirely in transit on the ring. Slightly different
solutions apply in each case. (It is also possible that the physical ring exists only within the token-ring
switch, and that stations are connected to that switch using the usual point-to-point wiring.)
If all stations have packets to send, then we will have something like the following:
A waits for the token
A sends a packet
A sends the token to B
B sends a packet
B sends the token to C
C sends a packet
C sends the token to D
...
All stations get an equal number of chances to transmit, and no bandwidth is wasted on collisions.
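The round-robin behavior in the list above is easy to model. Here is a small, illustrative simulation of the token-passing order only (it ignores ring propagation and the question of when the token is released); each station sends at most one queued packet per token visit.

from collections import deque

def token_ring(queues):
    """Simulate token passing: the token visits stations in ring order, and a
    station holding the token may send one packet before forwarding the token.
    `queues` maps station name -> deque of packets waiting to be sent."""
    order = list(queues)
    sent = []
    while any(queues[s] for s in order):
        for station in order:            # the token circulates in ring order
            if queues[station]:
                sent.append((station, queues[station].popleft()))
            # the station then forwards the token to the next station
    return sent

q = {"A": deque(["a1", "a2"]), "B": deque(["b1"]), "C": deque(["c1", "c2"])}
print(token_ring(q))
# [('A', 'a1'), ('B', 'b1'), ('C', 'c1'), ('A', 'a2'), ('C', 'c2')]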
One problem with token ring is that when stations are powered off it is essential that the packets continue
forwarding; this is usually addressed by having the default circuit configuration be to keep the loop closed.
Another issue is that some station has to watch out in case the token disappears, or in case a duplicate token
appears.
Because of fairness and the lack of collisions, IBM Token Ring was once considered to be the premium LAN
mechanism. As such, a premium price was charged (there was also the matter of licensing fees). But due
to a combination of lower hardware costs and higher bitrates (even taking collisions into account), Ethernet
eventually won out.
There was also a much earlier collision-free hybrid of 10 Mbps Ethernet and Token Ring known as Token
Bus: an Ethernet physical network (often linear) was used with a token-ring-like protocol layer above
that. Stations were physically connected to the (linear) Ethernet but were assigned identifiers that logically
arranged them in a (virtual) ring. Each station had to wait for the token and only then could transmit a
packet; after that it would send the token on to the next station in the virtual ring. As with real Token
Ring, some mechanisms need to be in place to monitor for token loss.
Token Bus Ethernet never caught on. The additional software complexity was no doubt part of the problem,
but perhaps the real issue was that it was not necessary.
taken to be bidirectional, a VCI used from S1 to S3 cannot be reused from S3 to S1 until the first connection
closes.
[diagram: hosts A through F and switches S1 through S5, showing the five connections A→F (#1), A→E, A→C, B→D and A→F (#2), with the VCI used on each link along each path]
One may verify that on any one link no two different paths use the same VCI.
We now construct the actual ⟨VCI,port⟩ tables for the switches S1-S4, from the above; the table for S5 is left as an exercise. Note that either the ⟨VCIin,portin⟩ or the ⟨VCIout,portout⟩ pair can be used as the key; we cannot have the same pair in both the in columns and the out columns. It may help to display the port numbers for each switch, as in the upper numbers in the following diagram of the above red connection from A to F (lower numbers are the VCIs):
[diagram: the A→F connection, with each switch's port numbers shown above the links and the VCIs below]
Switch S1:

VCIin   portin   VCIout   portout   connection
4       0        6        2         A→F #1
5       0        6        1         A→E
6       0        7        1         A→C
8       1        7        2         B→D
7       0        8        2         A→F #2

Switch S2:

VCIin   portin   VCIout   portout   connection
6       0        4        1         A→F #1
7       0        8        2         B→D
8       0        5        1         A→F #2

Switch S3:

VCIin   portin   VCIout   portout   connection
6       3        3        2         A→E
7       3        3        1         A→C
4       0        8        3         B→D

Switch S4:

VCIin   portin   VCIout   portout   connection
4       3        8        2         A→F #1
3       0        8        1         A→E
5       3        9        2         A→F #2
The namespace for VCIs is small, and compact (eg contiguous). Typically the VCI and port bitfields can be concatenated to produce a ⟨VCI,port⟩ composite value small enough that it is suitable for use as an array index. VCIs work best as local identifiers. IP addresses, on the other hand, need to be globally unique, and thus are often rather sparsely distributed.
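A sketch of the per-switch forwarding step just described: the switch looks up the arriving ⟨VCI,port⟩ pair, rewrites the VCI, and sends the cell out the indicated port. Packing the pair into one small integer, as suggested above, lets the table be a plain array; the field widths here are illustrative, and the entries are taken from switch S1's table in the example.

VCI_BITS = 8    # illustrative field widths: 8-bit VCI, 4-bit port number

def key(vci: int, port: int) -> int:
    # Concatenate port and VCI into one small integer, usable as an array index.
    return (port << VCI_BITS) | vci

# Switch S1's table from the example above: (VCIin, portin) -> (VCIout, portout)
s1_table = {
    key(4, 0): (6, 2),   # A->F #1
    key(5, 0): (6, 1),   # A->E
    key(6, 0): (7, 1),   # A->C
    key(8, 1): (7, 2),   # B->D
    key(7, 0): (8, 2),   # A->F #2
}

def forward(table, vci_in: int, port_in: int):
    vci_out, port_out = table[key(vci_in, port_in)]
    return vci_out, port_out     # relabel the cell and send it out port_out

print(forward(s1_table, 4, 0))   # (6, 2): the first A->F connection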
Virtual-circuit switching offers the following advantages:
• connections can get quality-of-service guarantees, because the switches are aware of connections and can reserve capacity at the time the connection is made
• headers are smaller, allowing faster throughput
• headers are small enough to allow efficient support for the very small packet sizes that are optimal for voice connections. ATM packets, for instance, have 48 bytes of data; see below.
Datagram forwarding, on the other hand, offers these advantages:
• Routers have less state information to manage.
• Router crashes and partial connection state loss are not a problem.
• If a router or link is disabled, rerouting is easy and does not affect any connection state. (As mentioned in Chapter 1, this was Paul Baran's primary concern in his 1962 paper introducing packet switching.)
• Per-connection billing is very difficult.
The last point above may once have been quite important; in the era when the ARPANET was being developed, typical daytime long-distance rates were on the order of $1/minute. It is unlikely that early TCP/IP
protocol development would have been as fertile as it was had participants needed to justify per-minute
billing costs for every project.
It is certainly possible to do virtual-circuit switching with globally unique VCIs, say the concatenation of source and destination IP addresses and port numbers. The IP-based RSVP protocol (19.6 RSVP) does exactly this. However, the fast-lookup and small-header advantages of a compact namespace are then lost.
Note that virtual-circuit switching does not suffer from the problem of idle channels still consuming resources, which is an issue with circuits using time-division multiplexing (eg shared T1 lines).
enters the network; reassembly is done at exit from the ATM path. IPv4 fragmentation, on the other hand,
applies conceptually to IP packets, and may be performed by routers within the network.
For AAL 3/4, we first define a high-level wrapper for an IP packet, called the CS-PDU (Convergence
Sublayer - Protocol Data Unit). This prefixes 32 bits on the front and another 32 bits (plus padding) on the
rear. We then chop this into as many 44-byte chunks as are needed; each chunk goes into a 48-byte ATM
payload, along with the following 32 bits worth of additional header/trailer:
• 2-bit type field:
    10: begin new CS-PDU
    00: continue CS-PDU
    01: end of CS-PDU
    11: single-segment CS-PDU
• 4-bit sequence number, 0-15, good for catching up to 15 dropped cells
• 10-bit MessageID field
• CRC-10 checksum.
We now have a total of 9 bytes of header for 44 bytes of data; this is more than 20% overhead. This did not
sit well with the IP-over-ATM community (such as it was), and so AAL 5 was developed.
AAL 5 moved the checksum to the CS-PDU and increased it to 32 bits from 10 bits. The MID field was
discarded, as no one used it, anyway (if you wanted to send several different types of messages, you simply
created several virtual circuits). A bit from the ATM header was taken over and used to indicate:
1: start of new CS-PDU
0: continuation of an existing CS-PDU
The CS-PDU is now chopped into 48-byte chunks, which are then used as the entire body of each ATM
cell. With 5 bytes of header for 48 bytes of data, overhead is down to 10%. Errors are detected by the
CS-PDU CRC-32. This also detects lost cells (impossible with a per-cell CRC!), as we no longer have any
cell sequence number.
For both AAL3/4 and AAL5, reassembly is simply a matter of stringing together consecutive cells in order
of arrival, starting a new CS-PDU whenever the appropriate bits indicate this. For AAL3/4 the receiver
has to strip off the 4-byte AAL3/4 headers; for AAL5 the receiver has to verify the CRC-32 checksum once
all cells are received. Different cells from different virtual circuits can be jumbled together on the ATM
backbone, but on any one virtual circuit the cells from one higher-level packet must be sent one right after
the other.
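A simplified sketch of AAL5-style segmentation and reassembly as described above. It is illustrative only: the real AAL5 CS-PDU trailer also carries a length field (so a receiver would not need the payload length passed separately), the padding and CRC placement details are glossed over, and zlib's CRC-32 stands in for the actual checksum computation.

import zlib

CELL_BODY = 48   # bytes of payload per ATM cell

def segment(ip_packet: bytes):
    """Build a CS-PDU (payload + 4-byte CRC-32, padded to a multiple of 48)
    and chop it into 48-byte cell bodies.  Returns (last_cell, body) pairs;
    last_cell plays the role of the 1-bit end indicator in the ATM header."""
    crc = zlib.crc32(ip_packet).to_bytes(4, "big")
    cs_pdu = ip_packet + crc
    cs_pdu += bytes((-len(cs_pdu)) % CELL_BODY)          # zero padding
    chunks = [cs_pdu[i:i + CELL_BODY] for i in range(0, len(cs_pdu), CELL_BODY)]
    return [(i == len(chunks) - 1, c) for i, c in enumerate(chunks)]

def reassemble(cells, original_length: int) -> bytes:
    """String consecutive cell bodies together, strip the padding and CRC,
    and verify the checksum over the recovered payload."""
    cs_pdu = b"".join(body for _last, body in cells)
    payload = cs_pdu[:original_length]
    crc = cs_pdu[original_length:original_length + 4]
    assert zlib.crc32(payload).to_bytes(4, "big") == crc, "CRC mismatch: lost or corrupted cell"
    return payload

pkt = bytes(range(100)) * 10                 # a 1000-byte pretend IP packet
cells = segment(pkt)
assert reassemble(cells, len(pkt)) == pkt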
A typical IP packet divides into about 20 cells. For AAL 3/4, this means a total of 200 bits devoted to CRC
codes, versus only 32 bits for AAL 5. It might seem that AAL 3/4 would be more reliable because of this,
but, paradoxically, it was not! The reason for this is that errors are rare, and so we typically have one or at
most two per CS-PDU. Suppose we have only a single error, ie a single cluster of corrupted bits small enough
that it is likely confined to a single cell. In AAL 3/4 the CRC-10 checksum will fail to detect that error (that
is, the checksum of the corrupted packet will by chance happen to equal the checksum of the original packet)
with probability 1/2^10. The AAL 5 CRC-32 checksum, however, will fail to detect the error with probability 1/2^32. Even if there are enough errors that two cells are corrupted, the two CRC-10s together will fail to
detect the error with probability 1/2^20; the CRC-32 is better. AAL 3/4 is more reliable only when we have
errors in at least four cells, at which point we might do better to switch to an error-correcting code.
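The probabilities just cited can be tabulated directly; a quick, illustrative calculation using the same 1/2^k estimate for a k-bit CRC:

# Probability that a checksum fails to detect a corrupted unit, taken as 1/2^k.
one_error_aal34  = 2 ** -10        # one corrupted cell, caught only by its CRC-10
one_error_aal5   = 2 ** -32        # one corrupted cell, caught by the CS-PDU CRC-32
two_errors_aal34 = 2 ** -20        # two corrupted cells, two independent CRC-10s

print(one_error_aal34, one_error_aal5, two_errors_aal34)
# ~9.8e-04 vs ~2.3e-10 vs ~9.5e-07: the single CRC-32 wins in both cases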
Moral: one checksum over the entire message is often better than multiple shorter checksums over parts of
the message.
3.9 Epilog
Along with a few niche protocols, we have focused primarily here on wireless and on virtual circuits. Wireless, of course, is enormously important: it is the enabler for mobile devices, and has largely replaced
traditional Ethernet for home and office workstations.
While it is sometimes tempting (in the IP world at least) to write off ATM as a niche technology, virtual
circuits are a serious conceptual alternative to datagram forwarding. As we shall see in 19 Quality of
Service, IP has problems handling real-time traffic, and virtual circuits offer a solution. The Internet has
so far embraced only small steps towards virtual circuits (such as MPLS, 19.12 Multi-Protocol Label
Switching (MPLS)), but they remain a tantalizing strategy.
3.10 Exercises
1. Suppose remote host A uses a VPN connection to connect to host B, with IP address 200.0.0.7. A's normal Internet connection is via device eth0 with IP address 12.1.2.3; A's VPN connection is via device
ppp0 with IP address 10.0.0.44. Whenever A wants to send a packet via ppp0, it is encapsulated and
forwarded over the connection to B at 200.0.0.7.
(a). Suppose A's IP forwarding table is set up so that all traffic to 200.0.0.7 uses eth0 and all traffic to
anywhere else uses ppp0. What happens if an intruder M attempts to open a connection to A at 12.1.2.3?
What route will packets from A to M take?
(b). Suppose A's IP forwarding table is (mis)configured so that all outbound traffic uses ppp0. Describe
what will happen when A tries to send a packet.
2. Suppose remote host A wishes to use a TCP-based VPN connection to connect to host B, with IP address
200.0.0.7. However, the VPN software is not available for host A. Host A is, however, able to run that
software on a virtual machine V hosted by A; A and V have respective IP addresses 10.0.0.1 and 10.0.0.2
on the virtual network connecting them. V reaches the outside world through network address translation
(1.14 Network Address Translation), with A acting as V's NAT router. When V runs the VPN software,
it forwards packets addressed to B the usual way, through A using NAT. Traffic to any other destination it
encapsulates over the VPN.
Can A configure its IP forwarding table so that it can make use of the VPN? If not, why not? If so, how? (If
you prefer, you may assume V is a physical host connecting to a second interface on A; A still acts as V's
NAT router.)
3. Token Bus was a proprietary Ethernet-based network. It worked like Token Ring in that a small token
packet was sent from one station to the next in agreed-upon order, and a station could transmit only when it
had just received the token.
(a). If the data rate is 10 Mbps and the token is 64 bytes long (the 10-Mbps Ethernet minimum packet size),
how long does it take on average to send a packet on an idle network with 40 stations? Ignore the
propagation delay and the gap Ethernet requires between packets.
(b). Repeat part (a) assuming the tokens are only 16 bytes long.
(c). Sketch a protocol by which stations can sort themselves out to decide the order of token transmission;
that is, an order of the stations S0 ... Sn-1 where station Si sends the token to station S(i+1) mod n .
4. The IEEE 802.11 standard states "transmission of the ACK frame shall commence after a SIFS period, without regard to the busy/idle state of the medium"; that is, the ACK sender does not listen first for an idle
network. Give a scenario in which the Wi-Fi ACK frame would fail to be delivered in the absence of this
rule, but succeed with it. Hint: this is another example of the hidden-node problem, 3.3.2.2 Hidden-Node
Problem.
5. Suppose the average contention interval in a Wi-Fi network (802.11g) is 64 SlotTimes. The average packet size is 1 KB, and the data rate is 54 Mbps. At that data rate, it takes about (8×1000)/54 ≈ 148 µsec to transmit a packet.
6. WiMAX subscriber stations are not expected to hear one another at all. For Wi-Fi non-access-point
stations in an infrastructure (access-point) setting, on the other hand, listening to other non-access-point
transmissions is encouraged.
(a). List some ways in which Wi-Fi non-access-point stations in an infrastructure (access-point) network do
sometimes respond to packets sent by other non-access-point stations. The responses need not be in the
form of transmissions.
(b). Explain why Wi-Fi stations cannot be required to respond as in part (a).
7. Suppose WiMAX subscriber stations can be moving, at speeds of up to 33 meters/sec (the maximum
allowed under 802.16e).
(a). How much earlier (or later) can one subscriber packet arrive? Assume that the ranging process updates
the station's propagation delay once a minute. The speed of light is about 300 meters/µsec.
(b). With 5000 senders per second, how much time out of each second must be spent on guard intervals
accommodating the early/late arrivals above? You will need to double the time from part (a), as the base
station cannot tell whether the signal from a moving subscriber will arrive earlier or later.
8. [SM90] contained a proposal for sending IP packets over ATM as N cells as in AAL-5, followed by one
cell containing the XOR of all the previous cells. This way, the receiver can recover from the loss of any
one cell. Suppose N=20 here; with the SM90 mechanism, each packet would require 21 cells to transmit;
that is, we always send 5% more. Suppose the cell loss-rate is p (presumably very small). If we send 20
cells without the SM90 mechanism, we have a probability of about 20p that any one cell will be lost, and we
will have to retransmit the entire 20 again. This gives an average retransmission amount of about 20p extra
packets. For what value of p do the with-SM90 and the without-SM90 approaches involve about the same
total number of cell transmissions?
9. In the example in 3.7 Virtual Circuits, give the VCI table for switch S5.
10. Suppose we have the following network:
[diagram: host A and switches S1, S2, S3, S4]
The virtual-circuit switching tables are below. Ports are identified by the node at the other end. Identify all
the connections. Give the path for each connection and the VCI on each link of the path.
Switch S1:

VCIin   portin   VCIout   portout
1       A        2        S3
2       A        2        S2
3       A        3        S2

Switch S2:

VCIin   portin   VCIout   portout
2       S4       1        B
2       S1       3        S4
3       S1       4        S4

Switch S3:

VCIin   portin   VCIout   portout
2       S1       2        S4
3       S4       2        C

Switch S4:

VCIin   portin   VCIout   portout
2       S3       2        S2
3       S2       3        S3
4       S2       1        D
11. [diagram: switches S1, S2, S3, S4]
Give virtual-circuit switching tables for the following connections. Route via a shortest path.
(a). A→D
(b). C→B, via S4
(c). B→D
(d). A→D, via whichever of S2 or S3 was not used in part (a)
12. Below is a set of switches S1 through S4. Define VCI-table entries so the virtual circuit from A to B
follows the path
A→S1→S2→S4→S3→S1→S2→S4→S3→B
That is, each switch is visited twice.
[diagram: host A and switches S1, S2, S3, S4]
4 LINKS
At the lowest (logical) level, network links look like serial lines. In this chapter we address how packet
structures are built on top of serial lines, via encoding and framing. Encoding determines how bits and
bytes are represented on a serial line; framing allows the receiver to identify the beginnings and endings of
packets.
We then conclude with the high-speed serial lines offered by the telecommunications industry, T-carrier and
SONET, upon which almost all long-haul point-to-point links that tie the Internet together are based.
4.1 Encoding and Framing
4.1.1 NRZ
NRZ (Non-Return to Zero) is perhaps the simplest encoding; it corresponds to direct bit-by-bit transmission
of the 0s and 1s in the data. We have two signal levels, lo and hi; we set the signal to one or the other
of these depending on whether the data bit is 0 or 1, as in the diagram below. Note that in the diagram the
signal bits have been aligned with the start of the pulse representing that signal value.
NRZ replaces an earlier RZ (Return to Zero) encoding, in which hi and lo corresponded to +1 and -1, and
between each pair of pulses corresponding to consecutive bits there was a brief return to the 0 level.
One drawback to NRZ is that we cannot distinguish between 0-bits and a signal that is simply idle. However, the more serious problem is the lack of synchronization: during long runs of 0s or long runs of 1s, the receiver can lose count, eg if the receiver's clock is running a little fast or slow. The receiver's clock can and does resynchronize whenever there is a transition from one level to the other. However, suppose bits are sent at one per µs, the sender sends 5 1-bits in a row, and the receiver's clock is running 10% fast. The signal sent is a 5-µs hi pulse, but when the pulse ends the receiver's clock reads 5.5 µs due to the clock speedup. Should this represent 5 1-bits or 6 1-bits?
4.1.2 NRZI
An alternative that helps here (though not obviously at first) is NRZI, or NRZ Inverted. In this encoding,
we represent a 0-bit as no change, and a 1-bit as a transition from lo to hi or hi to lo:
Now there is a signal transition aligned above every 1-bit; a 0-bit is represented by the lack of a transition.
This solves the synchronization problem for runs of 1-bits, but does nothing to address runs of 0-bits.
However, NRZI can be combined with techniques to minimize runs of 0-bits, such as 4B/5B (below).
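The NRZI rule is simple enough to express in a few lines; this sketch (illustrative) converts a bit string to the sequence of signal levels, starting from an assumed initial lo level.

def nrzi_encode(bits: str, start_level: str = "lo") -> list[str]:
    """NRZI: a 1-bit is a transition, a 0-bit means the level stays the same."""
    level, out = start_level, []
    for b in bits:
        if b == "1":
            level = "hi" if level == "lo" else "lo"
        out.append(level)
    return out

print(nrzi_encode("1101"))   # ['hi', 'lo', 'lo', 'hi']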
4.1.3 Manchester
Manchester encoding sends the data stream using NRZI, with the addition of a clock transition between
each pair of consecutive data bits. This means that the signaling rate is now double the data rate, eg 20
MHz for 10Mbps Ethernet (which does use Manchester encoding). The signaling is as if we doubled the
bandwidth and inserted a 1-bit between each pair of consecutive data bits, removing this extra bit at the
receiver:
All these transitions mean that the longest the clock has to count is 1 bit-time; clock synchronization is
essentially solved, at the expense of the doubled signaling rate.
4.1.4 4B/5B
In 4B/5B encoding, for each 4-bit nybble of data we actually transmit a designated 5-bit symbol, or code,
selected to have enough 1-bits. A symbol in this sense is a digital or analog transmission unit that decodes
to a set of data bits; the data bits are not transmitted individually.
Specifically, every 5-bit symbol used by 4B/5B has at most one leading 0-bit and at most two trailing 0-bits.
The 5-bit symbols corresponding to the data are then sent with NRZI, where runs of 1s are safe. Note that
the worst-case run of 0-bits has length three. Note also that the signaling rate here is 1.25 times the data
rate. 4B/5B is used in 100-Mbps Ethernet (2.2 100 Mbps (Fast) Ethernet). The mapping between 4-bit data
values and 5-bit symbols is fixed by the 4B/5B standard:
data    symbol      data    symbol
0000    11110       1011    10111
0001    01001       1100    11010
0010    10100       1101    11011
0011    10101       1110    11100
0100    01010       1111    11101
0101    01011       IDLE    11111
0110    01110       HALT    00100
0111    01111       START   10001
1000    10010       END     01101
1001    10011       RESET   00111
1010    10110       DEAD    00000
There are more than sixteen possible symbols; this allows for some symbols to be used for signaling rather
than data. IDLE, HALT, START, END and RESET are shown above, though there are others. These can be
used to include control and status information without fear of confusion with the data. Some combinations
of control symbols do lead to up to four 0-bits in sequence; HALT and RESET have two leading 0-bits.
10-Mbps and 100-Mbps Ethernet pad short packets up to the minimum packet size with 0-bytes, meaning
that the next protocol layer has to be able to distinguish between padding and actual 0-byte data. Although
100-Mbps Ethernet uses 4B/5B encoding, it does not make use of special non-data symbols for packet
padding. Gigabit Ethernet uses PAM-5 encoding (2.3 Gigabit Ethernet), and does use special non-data
symbols (inserted by the hardware) to pad packets; there is thus no ambiguity at the receiving end as to
where the data bytes ended.
The choice of 5-bit symbols for 4B/5B is in principle arbitrary; note however that for data from 0100 to
1101 we simply insert a 1 in the fourth position, and in the last two we insert a 0 in the fourth position. The
first four symbols (those with the most zeroes) follow no obvious pattern, though.
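The table-driven nature of 4B/5B is easy to see in code. The following is a minimal Python sketch using the data-to-symbol mapping above; the function name is an illustrative assumption, and real hardware of course works bit-serially rather than on Python strings:

    # Illustrative sketch: 4B/5B encoding of a byte string using the table above.
    # Each 4-bit nybble becomes a 5-bit symbol; the symbol stream would then be sent via NRZI.

    FOUR_B_FIVE_B = {
        '0000': '11110', '0001': '01001', '0010': '10100', '0011': '10101',
        '0100': '01010', '0101': '01011', '0110': '01110', '0111': '01111',
        '1000': '10010', '1001': '10011', '1010': '10110', '1011': '10111',
        '1100': '11010', '1101': '11011', '1110': '11100', '1111': '11101',
    }

    def encode_4b5b(data):
        """Return the 5-bit symbol stream for a byte string, high nybble first."""
        out = []
        for byte in data:
            nybbles = format(byte, '08b')
            out.append(FOUR_B_FIVE_B[nybbles[:4]])
            out.append(FOUR_B_FIVE_B[nybbles[4:]])
        return ''.join(out)

    print(encode_4b5b(b'A'))                    # 0x41 -> nybbles 0100, 0001 -> 0101001001
    # No symbol has more than one leading 0 or two trailing 0s,
    # so the worst-case run of 0-bits in any encoded data stream is three:
    assert '0000' not in encode_4b5b(bytes(range(256)))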
4.1.5 Framing
How does a receiver tell when one packet stops and the next one begins, to keep them from running together?
We have already seen the following techniques for addressing this framing problem of determining where
packets end:

• Interpacket gaps (as in Ethernet)
• 4B/5B and special bit patterns
Putting a length field in the header would also work, in principle, but seems not to be widely used. One
problem with this technique is that restoring order after desynchronization can be difficult.
There is considerable overlap of framing with encoding; for example, the existence of non-data bit patterns
in 4B/5B is due to an attempt to solve the encoding problem; these special patterns can also be used as
unambiguous frame delimiters.
4.1.5.1 HDLC
HDLC (High-level Data Link Control) is a general link-level packet format used for a number of applications, including Point-to-Point Protocol (PPP) (which in turn is used for PPPoE, PPP over Ethernet, which
is how a great many Internet subscribers connect to their ISP), and Frame Relay, still used as the low-level
protocol for delivering IP packets to many sites via telecommunications lines. HDLC supports the following
two methods for frame separation:
• HDLC over asynchronous links: byte stuffing
• HDLC over synchronous links: bit stuffing
The basic encapsulation format for HDLC packets is to begin and end each frame with the byte 0x7E, or, in
binary, 0111 1110. The problem is that this byte may occur in the data as well; we must make sure we don't
misinterpret such a data byte as the end of the frame.
Asynchronous serial lines are those with some sort of start/stop indication, typically between bytes; such
lines tend to be slower. Over this kind of line, HDLC uses the byte 0x7D as an escape character. Any
data bytes of 0x7D and 0x7E are escaped by preceding them with an additional 0x7D. (Actually, they are
transmitted as 0x7D followed by (original_byte xor 0x20).) This strategy is fundamentally the same as that
used by C-programming-language character strings: the string delimiter is " and the escape character is \.
Any occurrences of " or \ within the string are escaped by preceding them with \.
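The byte-stuffing procedure itself is simple. Here is a minimal Python sketch of the idea, with illustrative function names; a real HDLC implementation would also handle the address, control and checksum fields:

    # Illustrative sketch: HDLC-style byte stuffing over an asynchronous link.
    FLAG, ESC = 0x7E, 0x7D

    def byte_stuff(payload):
        """Escape 0x7E and 0x7D, then wrap the frame in flag bytes."""
        out = bytearray([FLAG])
        for b in payload:
            if b in (FLAG, ESC):
                out += bytes([ESC, b ^ 0x20])   # sent as 0x7D, then the original byte xor 0x20
            else:
                out.append(b)
        out.append(FLAG)
        return bytes(out)

    def byte_unstuff(frame):
        """Reverse the stuffing; assumes the frame starts and ends with a flag byte."""
        out, i = bytearray(), 1                 # skip the leading flag
        while frame[i] != FLAG:
            if frame[i] == ESC:
                i += 1
                out.append(frame[i] ^ 0x20)
            else:
                out.append(frame[i])
            i += 1
        return bytes(out)

    msg = bytes([0x01, 0x7E, 0x02, 0x7D, 0x03])
    assert byte_unstuff(byte_stuff(msg)) == msg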
Over synchronous serial lines (typically faster than asynchronous), HDLC generally uses bit stuffing. The
underlying bit encoding involves, say, the reverse of NRZI, in which transitions denote 0-bits and lack of
transitions denote 1-bits. This means that long runs of 1s are now the problem and runs of 0s are safe.
Whenever five consecutive 1-bits appear in the data, eg 011111, a 0-bit is then inserted, or stuffed, by the
transmitting hardware (regardless of whether or not the next data bit is also a 1). The HDLC frame byte of
0x7E = 0111 1110 thus can never appear as encoded data, because it contains six 1-bits in a row. If we had
0x7E in the data, it would be transmitted as 0111 11010.
The HDLC receiver knows that any sequence of five 1-bits in the incoming data must be followed by a stuffed 0-bit, which it removes; if the bit following the five 1-bits is instead a 1, then the receiver has encountered the frame-ending flag 0x7E rather than data.
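A minimal Python sketch of bit stuffing, operating on a string of '0'/'1' characters, follows; the function names are illustrative, and real implementations do this in hardware on the serial bit stream:

    # Illustrative sketch: HDLC-style bit stuffing on a string of '0'/'1' characters.

    def bit_stuff(bits):
        """After five consecutive 1-bits, insert a 0-bit regardless of what follows."""
        out, run = [], 0
        for b in bits:
            out.append(b)
            run = run + 1 if b == '1' else 0
            if run == 5:
                out.append('0')     # the stuffed bit
                run = 0
        return ''.join(out)

    def bit_unstuff(bits):
        """Remove the 0-bit following any five consecutive 1-bits."""
        out, run, i = [], 0, 0
        while i < len(bits):
            out.append(bits[i])
            run = run + 1 if bits[i] == '1' else 0
            if run == 5:
                i += 1              # skip the stuffed 0; in a real receiver a 1 here marks the 0x7E flag
                run = 0
            i += 1
        return ''.join(out)

    data = '01111110'                        # the 0x7E bit pattern occurring as data
    print(bit_stuff(data))                   # 011111010, which cannot be mistaken for the flag
    assert bit_unstuff(bit_stuff(data)) == data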
A related technique, B8ZS (bipolar with 8-zero substitution), is used on T-carrier lines: whenever eight consecutive 0-bits would otherwise be transmitted, they are replaced by a special pattern containing two intentional bipolar violations. This double-violation is the clue to the receiver that the special pattern is to be removed and replaced with the original eight 0-bits.
4.2 Time-Division Multiplexing
Once upon a time it was not uncommon to link computers with serial lines, rather than packet networks.
This was most often done for file transfers, but telnet logins were also done this way. The problem with this
approach is that the line had to be dedicated to one application (or one user) at a time.
Packet switching naturally implements multiplexing (sharing) on links; the demultiplexer is the destination
address. Port numbers allow demultiplexing of multiple streams to the same destination host.
There are other ways for multiple channels to share a single wire. One approach is frequency-division
multiplexing, or putting each channel on a different carrier frequency. Analog cable television did this.
Some fiber-optic protocols also do this, calling it wavelength-division multiplexing.
But perhaps the most pervasive alternative to packets is the voice telephone system's time division multiplexing, or TDM, sometimes prefixed with the adjective synchronous. The idea is that we decide on a
number of channels, N, and the length of a timeslice, T, and allow each sender to send over the channel for
time T, with the senders taking turns in round-robin style. Each sender gets to send for time T at regular
intervals of NT, thus receiving 1/N of the total bandwidth. The timeslices consume no bandwidth on headers
or addresses, although sometimes there is a small amount of space dedicated to maintaining synchronization
between the two endpoints. Here is a diagram of sending with N=8:
Note, however, that if a sender has nothing to send, its timeslice cannot be used by another sender. Because
so much data traffic is bursty, involving considerable idle periods, TDM has traditionally been rejected for
data networks.
4.2.1 T-Carrier
The next most common T-carrier / Digital Signal line is perhaps T3/DS3; this represents the TDM multiplexing of 28 DS1 signals. The problem is that some individual DS1s may run a little slow, so an elaborate pulse
stuffing protocol has been developed. This allows extra bits to be inserted at specific points, if necessary, in
such a way that the original component T1s can be exactly recovered even if there are clock irregularities.
The pulse-stuffing solution did not scale well, and so T-carrier levels past T3 were very rarely used.
While T-carrier was originally intended as a way of bundling together multiple DS0 channels on a single
high-speed line, it also allows providers to offer leased digital point-to-point links with data rates in almost
any multiple of the DS0 rate.
4.2.2 SONET
SONET stands for Synchronous Optical NETwork; it is the telecommunications industry's standard mechanism for very-high-speed TDM over optical fiber. While there is now flexibility regarding the optical
part, the synchronous part is taken quite seriously indeed, and SONET senders and receivers all use very
precisely synchronized clocks (often atomic). The actual bit encoding is NRZI.
Due to the frame structure, below, the longest possible run of 0-bits is ~250 bits (~30 bytes), but is usually
much less. Accurate reception of 250 0-bits requires a clock accurate to within (at a minimum) one part in
500, which is generally within reach. This mechanism solves most of the clock-synchronization problem,
though SONET also has a resynchronization protocol in case the receiver gets lost.
The primary reason for SONETs accurate clocking, however, is not the clock-synchronization problem as
we have been using the term, but rather the problem of demultiplexing and remultiplexing multiple component bitstreams in a setting in which some of the streams may run slow. One of the primary design goals for
SONET was to allow such multiplexing without the need for pulse stuffing, as is used in the Digital Signal
hierarchy. SONET tributary streams are in effect not allowed to run slow (although SONET does provide
for occasional very small byte slips, below). Furthermore, as multiple SONET streams are demultiplexed
at a switching center and then remultiplexed into new SONET streams, synchronization means that none of
the streams falls behind or gets ahead.
The basic SONET format is known as STS-1. Data is organized as a 9x90 byte grid. The first 3 bytes of
each row (that is, the first three columns) form the frame header. Frames are not addressed; SONET is a
point-to-point protocol and a node sends a continuous sequence of frames to each of its neighbors. When the
frames reach their destination, in principle they need to be fully demultiplexed for the data to be forwarded
on. In practice, there are some shortcuts to full demultiplexing.
The actual bytes sent are scrambled: the data is XORed with a standard, fixed pseudorandom pattern before
transmission. This introduces many 1-bits, on which clock resynchronization can occur, with a high degree
of probability.
There are two other special columns in a frame, each guaranteed to contain at least one 1-bit, so the maximum run of data bytes is limited to ~30; this is thus the longest possible run of 0s.
The first two bytes of each frame are 0xF628. SONETs frame-synchronization check is based on verifying
these byte values at the start of each frame. If the receiver is ever desynchronized, it begins a frame resynchronization procedure: the receiver searches for those 0xF628 bytes at regular 810-byte (6480-bit)
spacing. After a few frames with 0xF628 in the right place, the receiver is very sure it is looking at the
synchronization bytes and not at a data-byte position. Note that there is no evident byte boundary to a
SONET frame, so the receiver must check for 0xF628 beginning at every bit position.
SONET frames are transmitted at a rate of 8,000 frames/second. This is the canonical byte sampling rate
for standard voice-grade (DS0, or 64 Kbps) lines. Indeed, the classic application of SONET is to transmit
multiple DS0 voice calls using TDM: within a frame, each data byte position is given over to one voice
channel. The same byte position in consecutive frames constitutes one byte every 1/8000 seconds. The
basic STS-1 data rate of 51.84 Mbps is exactly 810 bytes/frame × 8 bits/byte × 8000 frames/sec.
To a customer who has leased a SONET-based channel to transmit data, a SONET link looks like a very fast
bitstream. There are several standard ways of encoding data packets over SONET. One is to encapsulate the
data as ATM cells, and then embed the cells contiguously in the bitstream. Another is to send IP packets
encoded in the bitstream using HDLC-like bit stuffing, which means that the SONET bytes and the IP bytes
may no longer correspond. The advantage of HDLC encoding is that it makes SONET re-synchronization
vanishingly infrequent. Most IP backbone traffic today travels over SONET links.
Within the 9×90-byte STS-1 frame, the payload envelope is the 9×87 region nominally following the three
header columns; this payload region has its own three reserved columns, meaning that there are 84 columns
(9×84 bytes) available for data. This 9×87-byte payload envelope can float within the physical 9×90-byte frame; that is, if the input frames are running slow then the output physical frames can be transmitted
at the correct rate by letting the payload frames slip backwards, one byte at a time. Similarly, if the input
frames are arriving slightly too fast, they can slip forwards by up to one byte at a time; the extra byte is
stored in a reserved location in the three header columns of the 9×90 physical frame.
Faster SONET streams are made by multiplexing slower ones. The next step up is STS-3; an STS-3 frame
is three STS-1 frames, for 9×270 bytes. STS-3 (or, more properly, the physical layer for STS-3) is also
called OC-3, for Optical Carrier. Beyond STS-3, faster lines are multiplexed combinations of four of the
next-slowest lines. Here are some of the higher levels:
STS        STM        bandwidth
STS-1      STM-0      51.84 Mbps
STS-3      STM-1      155.52 Mbps
STS-12     STM-4      622.08 Mbps (=12×51.84, exactly)
STS-48     STM-16     2.488 Gbps
STS-192    STM-64     9.953 Gbps
STS-768    STM-256    39.8 Gbps
Faster SONET lines have been defined, but a simpler way to achieve very high data rates over optical fiber is
to use wavelength-division multiplexing (that is, frequency-division multiplexing at optical frequencies);
this means we have separate SONET channels at different wavelengths of light.
SONET provides a wide variety of leasing options at various bandwidths. High-volume customers can lease
an entire STS-1 or larger unit. Alternatively, the 84 columns of an STS-1 frame can be divided into seven
virtual tributary groups, each of twelve columns; these groups can be leased individually or in multiples,
or be further divided into as few as three columns (which works out to be just over the T1 data rate).
4.2.2.1 Other Optical Fiber
4.3 Epilog
This completes our discussion of common physical links. Perhaps the main takeaway point is that transmitting bits over any distance is not quite as simple as it may appear; simple NRZ transmission is not effective.
4.4 Exercises
1. What is encoded by the following NRZI signal? The first two bits are shown.
2. Argue that sending 4 0-bits via NRZI requires a clock accurate to within 1 part in 8. Assume that the
receiver resynchronizes its clock whenever a 1-bit transition is received, but that otherwise it attempts to
sample a bit in the middle of the bit's timeslot.
3.(a) What bits are encoded by the following Manchester-encoded sequence?
(b). Why is there no ambiguity as to whether the first transition is a clock transition or a data (1-bit)
transition?
(c). Give an example of a signal pattern consisting of an NRZI encoding of 0-bits and 1-bits that does not
contain two consecutive 0-bits and which is not a valid Manchester encoding of data. Such a pattern could
thus be used as a special non-data marker.
4. What three ASCII letters (bytes) are encoded by the following 4B/5B pattern? (Be careful about uppercase
vs lowercase.)
010110101001110101010111111110
5.(a) Suppose a device is forwarding SONET STS-1 frames. How much clock drift, as a percentage, on the incoming line would mean that the output payload envelopes must slip backwards by one byte per three physical frames?
(b). In 4.2.2 SONET it was claimed that sending 250 0-bits required a clock accurate to within 1 part in 500. Describe how a SONET clock might meet the requirement of part (a) above, and yet fail at this second requirement. (Hint: in part (a) the requirement is a long-term average.)
5 PACKETS
In this chapter we address a few abstract questions about packets, and take a close look at transmission
times. We also consider how big packets should be, and how to detect transmission errors. These issues are
independent of any particular set of protocols.
Finally, a switch may or may not also introduce queuing delay; this will often depend on competing traffic.
We will look at this in more detail in 14 Dynamics of TCP Reno, but for now note that a steady queuing delay
(eg due to a more-or-less constant average queue utilization) looks to each sender more like propagation
delay than bandwidth delay, in that if two packets are sent back-to-back and arrive that way at the queue,
then the pair will experience only a single queuing delay.
Case 2: Like the previous example except that the propagation delay is increased to 4 ms.

The total transmit time is now 4200 µsec = 200 µsec + 4000 µsec.

Case 3: A–R–B

We now have two links, each with propagation delay 40 µsec; bandwidth and packet size as in Case 1.

The total transmit time for one 200-byte packet is now 480 µsec = 240 + 240. There are two propagation
delays of 40 µsec each; A introduces a bandwidth delay of 200 µsec and R introduces a store-and-forward
delay (or second bandwidth delay) of 200 µsec.

Case 4: A–R–B
These ladder diagrams represent the full transmission; a snapshot state of the transmission at any one
instant can be obtained by drawing a horizontal line. In the middle (case 3) diagram, for example, at no
instant are both links active. Note that sending two smaller packets is faster than one large packet. We
expand on this important point below.
Now let us consider the situation when the propagation delay is the most significant component. The cross-continental US roundtrip delay is typically around 50–100 ms (propagation speed 200 km/ms in cable, 5,000–10,000 km cable route, or about 3,000–6,000 miles); we will use 100 ms in the examples here. At 1.0 Mbit, 100 ms
is about 12 KB, or eight full-sized Ethernet packets. At this bandwidth, we would have four packets and four
returning ACKs strung out along the path. At 1.0 Gbit, in 100 ms we can send 12,000 KB, or 800 Ethernet
packets, before the first ACK returns.
At most non-LAN scales, the delay is typically simplified to the round-trip time, or RTT: the time between
sending a packet and receiving a response.
Different delay scenarios have implications for protocols: if a network is bandwidth-limited then protocols
are easier to design. Extra RTTs do not cost much, so we can build in a considerable amount of back-andforth exchange. However, if a network is delay-limited, the protocol designer must focus on minimizing
extra RTTs. As an extreme case, consider wireless transmission to the moon (0.3 sec RTT), or to Jupiter (1
hour RTT).
At my home I formerly had satellite Internet service, which had a roundtrip propagation delay of ~600 ms.
This is remarkably high when compared to purely terrestrial links.
When dealing with reasonably high-bandwidth large-scale networks (eg the Internet), to good approximation most of the non-queuing delay is propagation, and so bandwidth and total delay are effectively
independent. Only when propagation delay is small are the two interrelated. Because propagation delay
dominates at this scale, we can often make simplifications when diagramming. In the illustration below, A
sends a data packet to B and receives a small ACK in return. In (a), we show the data packet traversing
several switches; in (b) we show the data packet as if it were sent along one long unswitched link, and in (c)
we introduce the idealization that bandwidth delay (and thus the width of the packet line) no longer matters.
(Most later ladder diagrams in this book are of this type.)
The bandwidth × delay product (usually involving round-trip delay, or RTT) represents how much we can
send before we hear anything back, or how much is pending in the network at any one time if we send
continuously. Note that, if we use RTT instead of one-way time, then half the pending packets will be
returning ACKs. Here are a few values:

RTT       bandwidth    bandwidth × delay
1 ms      10 Mbps      1.2 KB
100 ms    1.5 Mbps     20 KB
100 ms    600 Mbps     8,000 KB
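These values are simple to compute; the following small Python sketch (variable and function names are illustrative) reproduces the table:

    # Illustrative sketch: bandwidth x delay products for the table above.
    def bdp_kbytes(rtt_ms, bandwidth_mbps):
        """Return bandwidth x RTT in kilobytes (1 KB = 1000 bytes here)."""
        bits = bandwidth_mbps * 1e6 * (rtt_ms / 1000.0)
        return bits / 8 / 1000

    for rtt, bw in [(1, 10), (100, 1.5), (100, 600)]:
        print(rtt, 'ms,', bw, 'Mbps ->', round(bdp_kbytes(rtt, bw), 1), 'KB')
    # 1 ms, 10 Mbps -> 1.2 KB
    # 100 ms, 1.5 Mbps -> 18.8 KB   (about the 20 KB in the table)
    # 100 ms, 600 Mbps -> 7500.0 KB (about the 8,000 KB in the table)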
Alternatively, perhaps routers are allowed to reserve a varying amount of bandwidth for high-priority traffic,
depending on demand, and so the bandwidth allocated to the best-effort traffic can vary. Perceived link
bandwidth can also vary over time if packets are compressed at the link layer, and some packets are able to
be compressed more than others.
Finally, if mobile nodes are involved, then the distance and thus the propagation delay can change. This can
be quite significant if one is communicating with a wireless device that is being taken on a cross-continental
road trip.
Despite these sources of fluctuation, we will usually assume that RTTnoLoad is fixed and well-defined, especially when we wish to focus on the queuing component of delay.
[network diagram: A–R1–R2–R3–R4–B]
Suppose we send either one big packet or five smaller packets. The relative times from A to B are illustrated
in the following figure:
The point is that we can take advantage of parallelism: while the R4–B link above is handling packet 1,
the R3–R4 link is handling packet 2 and the R2–R3 link is handling packet 3 and so on. The five smaller
packets would have five times the header capacity, but as long as headers are small relative to the data, this
is not a significant issue.
The sliding-windows algorithm, used by TCP, uses this idea as a continuous process: the sender sends a
continual stream of packets which travel link-by-link so that, in the full-capacity case, all links may be in
use at all times.
average seven times; lossless transmission would require 50 packets but we in fact need 7×50 = 350 packets,
or 7,000,000 bits.
Moral: choose the packet size small enough that most packets do not encounter errors.
To be fair, very large packets can be sent reliably on most cable links (eg TDM and SONET). Wireless,
however, is more of a problem.
in binary; if one adds two positive integers and the sum does not overflow the hardware word size, then
ones-complement and the now-universal twos-complement are identical. To form the ones-complement sum
of 16-bit words A and B, first take the ordinary twos-complement sum A+B. Then, if there is an overflow
bit, add it back in as the low-order bit. Thus, if the word size is 4 bits, the ones-complement sum of 0101 and
0011 is 1000 (no overflow). Now suppose we want the ones-complement sum of 0101 and 1100. First we
take the exact sum and get 1|0001, where the leftmost 1 is an overflow bit past the 4-bit wordsize. Because
of this overflow, we add this bit back in, and get 0010.
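Here is a minimal Python sketch of ones-complement addition with the end-around carry, reproducing the 4-bit examples above; the function names are illustrative, and the final complement step shown is how the Internet checksum uses the sum:

    # Illustrative sketch: ones-complement addition with the end-around carry.

    def ones_complement_add(a, b, bits=16):
        """Add two words; if the sum overflows the word size, add the carry back in."""
        mask = (1 << bits) - 1
        s = a + b
        return (s & mask) + (s >> bits)       # end-around carry (0 or 1)

    def internet_checksum(words):
        """The Internet checksum: the complement of the ones-complement sum of the 16-bit words."""
        total = 0
        for w in words:
            total = ones_complement_add(total, w)
        return total ^ 0xFFFF

    # The 4-bit examples from the text:
    print(format(ones_complement_add(0b0101, 0b0011, bits=4), '04b'))   # 1000
    print(format(ones_complement_add(0b0101, 0b1100, bits=4), '04b'))   # 0010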
The ones-complement numeric representation has two forms for zero: 0000 and 1111 (it is straightforward
to verify that any 4-bit quantity plus 1111 yields the original quantity; in twos-complement notation 1111
represents -1, and an overflow is guaranteed, so adding back the overflow bit cancels the -1 and leaves us
with the original number). It is a fact that the ones-complement sum is never 0000 unless all bits of all the
summands are 0; if the summands add up to zero by coincidence, then the actual binary representation will
be 1111. This means that we can use 0000 in the checksum to represent checksum not calculated, which
the UDP protocol used to permit.
Ones-complement
Long ago, before Loyola had any Internet connectivity, I wrote a primitive UDP/IP stack to allow me
to use the Ethernet to back up one machine that did not have TCP/IP to another machine that did. We
used private IP addresses of the form 10.0.0.x. I set as many header fields to zero as I could. I paid
no attention to how to implement ones-complement addition; I simply used twos-complement, for the
IP header only, and did not use a UDP checksum at all. Hey, it worked.
Then we got a real Class B address block 147.126.0.0/16, and changed IP addresses. My software no
longer worked. It turned out that, in the original version, the IP header bytes were all small enough
that when I added up the 16-bit words there were no carries, and so ones-complement was the same as
twos-complement. With the new addresses, this was no longer true. As soon as I figured out how to
implement ones-complement addition properly, my backups worked again.
There is another way to look at the (16-bit) ones-complement sum: it is in fact the remainder upon dividing
the message (seen as a very long binary number) by 2^16 − 1. This is similar to the decimal casting out nines
rule: if we add up the digits of a base-10 number, and repeat the process until we get a single digit, then that
digit is the remainder upon dividing the original number by 10−1 = 9. The analogy here is that the message
is looked at as a very large number written in base 2^16, where the digits are the 16-bit words. The process
of repeatedly adding up the digits until we get a single digit amounts to taking the ones-complement
sum of the words.
A weakness of any error-detecting code based on sums is that transposing words leads to the same sum, and
the error is not detected. In particular, if a message is fragmented and the fragments are reassembled in the
wrong order, the ones-complement sum will likely not detect it.
While some error-detecting codes are better than others at detecting certain kinds of systematic errors (for
example, CRC, below, is usually better than the Internet checksum at detecting transposition errors), ultimately the effectiveness of an error-detecting code depends on its length. Suppose a packet P1 is corrupted
randomly into P2, but still has its original N-bit error code EC(P1). This N-bit code will fail to detect the
error that has occurred if EC(P2) is, by chance, equal to EC(P1). The probability that two random N-bit
codes will match is 1/2^N (though a small random change in P1 might not lead to a uniformly distributed
random change in EC(P1); see the tail end of the CRC section below).
This does not mean, however, that one packet in 2^N will be received incorrectly, as most packets are error-free. If we use a 16-bit error code, and only 1 packet in 100,000 is actually corrupted, then the rate at which
corrupted packets will sneak by is only 1 in 100,000 × 65536, or about one in 6×10^9. If packets are 1500
bytes, you have a good chance (90+%) of accurately transferring a terabyte, and a 37% chance (1/e) at ten
terabytes.
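The arithmetic behind these estimates can be checked directly; here is a small Python sketch using the numbers from the paragraph above (and the approximation (1−p)^n ≈ e^(−pn)):

    # Illustrative sketch: the chance of an undetected error, using the estimates above.
    import math

    p_corrupt = 1 / 100_000                 # fraction of packets actually corrupted
    p_sneak = p_corrupt / 2**16             # corrupted AND the 16-bit code happens to match
    packet_bytes = 1500

    for total_bytes in (1e12, 1e13):        # one terabyte, ten terabytes
        n_packets = total_bytes / packet_bytes
        p_all_ok = math.exp(-p_sneak * n_packets)    # (1-p)^n is about e^(-pn) for small p
        print(total_bytes, 'bytes:', round(p_all_ok, 2))
    # 1000000000000.0 bytes: 0.9
    # 10000000000000.0 bytes: 0.36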
P2 guarantees that CRC(P1) ≠ CRC(P2). For the Internet checksum, this is not guaranteed even if we know
only two bits were changed.
Finally, there are also secure hashes, such as MD-5 and SHA-1 and their successors. Nobody knows
(or admits to knowing) how to produce two messages with the same hash here. However, these secure-hash
codes are generally not used in network error-correction as they take considerable time to compute; they are
generally used only for secure authentication and other higher-level functions.
Now suppose one bit is corrupted; for simplicity, assume it is one of the data bits. Then exactly one column-parity bit will be incorrect, and exactly one row-parity bit will be incorrect. These two incorrect bits mark
We can make N large, but an essential requirement here is that there be only a single corrupted bit per square.
We are thus likely either to keep N small, or to choose a different code entirely that allows correction of
multiple bits. Either way, the addition of error-correcting codes can easily increase the size of a packet
significantly; some codes double or even triple the total number of bits sent.
The Hamming code is another popular error-correction code; it adds O(log N) additional bits, though if N is
large enough for this to be a material improvement over the O(N^1/2) performance of 2-D parity then errors
must be very infrequent. If we have 8 data bits, let us number the bit positions 0 through 7. We then write
each bit's position as a binary value between 000 and 111; we will call these the position bits of the given
data bit. We now add four code bits as follows:
1. a parity bit over all 8 data bits
2. a parity bit over those data bits for which the first digit of the position bits is 1 (these are positions 4,
5, 6 and 7)
3. a parity bit over those data bits for which the second digit of the position bits is 1 (these are positions
010, 011, 110 and 111, or 2, 3, 6 and 7)
4. a parity bit over those data bits for which the third digit of the position bits is 1 (these are positions
001, 011, 101, 111, or 1, 3, 5 and 7)
We can tell whether or not an error has occurred by the first code bit; the remaining three code bits then tell
us the respective three position bits of the incorrect bit. For example, if the #2 code bit above is correct, then
the first digit of the position bits is 0; otherwise it is one. With all three position bits, we have identified the
incorrect data bit.
As a concrete example, suppose the data word is 10110010. The four code bits are thus

1. 0, the (even) parity bit over all eight bits
2. 1, the parity bit over the second half (positions 4–7), the bits 0010 of 10110010
3. 1, the parity bit over positions 2, 3, 6 and 7, the bits 1,1,1,0 of 10110010
4. 1, the parity bit over positions 1, 3, 5 and 7, the bits 0,1,0,0 of 10110010

If the received data+code is now 10111010 0111, with the bit in position 4 flipped, then the fact that the first code
bit is wrong tells the receiver there was an error. The second code bit is also wrong, so the first bit of the
position bits must be 1. The third code bit is right, so the second bit of the position bits must be 0. The
fourth code bit is also right, so the third bit of the position bits is 0. The position bits are thus binary 100, or
4, and so the receiver knows that the incorrect bit is in position 4 (counting from 0) and can be flipped to the
correct state.
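The example can be mechanized as follows. This is a minimal Python sketch using the conventions of the text (8 data bits in positions 0–7 counted from the left, and the four code bits as defined above); the function names are illustrative, and it assumes at most one data bit has been flipped:

    # Illustrative sketch: the 8-bit Hamming example above; positions are counted from the left.

    def parity(bits, positions):
        return sum(bits[p] for p in positions) % 2

    def hamming_code_bits(data):
        """Return the four code bits for an 8-bit data word (a list of 0s and 1s)."""
        return [
            parity(data, range(8)),        # 1: parity over all eight data bits
            parity(data, [4, 5, 6, 7]),    # 2: positions whose first position-bit is 1
            parity(data, [2, 3, 6, 7]),    # 3: positions whose second position-bit is 1
            parity(data, [1, 3, 5, 7]),    # 4: positions whose third position-bit is 1
        ]

    def locate_error(received_data, received_code):
        """Return the position of a single flipped data bit, or None if the code checks out."""
        expected = hamming_code_bits(received_data)
        syndrome = [e ^ r for e, r in zip(expected, received_code)]
        if syndrome[0] == 0:
            return None
        return syndrome[1] * 4 + syndrome[2] * 2 + syndrome[3]

    data = [1, 0, 1, 1, 0, 0, 1, 0]                 # 10110010
    print(hamming_code_bits(data))                  # [0, 1, 1, 1]
    received = [1, 0, 1, 1, 1, 0, 1, 0]             # the bit in position 4 flipped
    print(locate_error(received, [0, 1, 1, 1]))     # 4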
5.5 Epilog
The issues presented here are perhaps not very glamorous, and often play a supporting, behind-the-scenes
role in protocol design. Nonetheless, their influence is pervasive; we may even think of them as part of the
underlying physics of the Internet.
As the early Internet became faster, for example, and propagation delay became the dominant limiting
factor, protocols were often revised to limit the number of back-and-forth exchanges. A classic example is
the Simple Mail Transport Protocol (SMTP), amended by RFC 1854 to allow multiple commands to be sent
together (called pipelining) instead of individually.
While there have been periodic calls for large-packet support in IPv4, and IPv6 protocols exist for jumbograms in excess of a megabyte, these are very seldom used, due to the store-and-forward costs of large
packets as described in 5.3 Packet Size.
Almost every LAN-level protocol, from Ethernet to Wi-Fi to point-to-point links, incorporates an error-detecting code chosen to reflect the underlying transportation reliability. Ethernet includes a 32-bit CRC
code, for example, while Wi-Fi includes extensive error-correcting codes due to the noisier wireless environment. The Wi-Fi fragmentation option (3.3.2.3 Wi-Fi Fragmentation) is directly tied to 5.3.1 Error
Rates and Packet Size.
5.6 Exercises
1. Suppose a link has a propagation delay of 20 µsec and a bandwidth of 2 bytes/µsec.
(a). How long would it take to transmit a 600-byte packet over such a link?
(b). How long would it take to transmit the 600-byte packet over two such links, with a store-and-forward
switch in between?
2. Suppose the path from A to B has a single switch S in between: A–S–B. Each link has a propagation delay of 60 µsec and a bandwidth of 2 bytes/µsec.

(a). How long would it take to send a single 600-byte packet from A to B?
(b). How long would it take to send two back-to-back 300-byte packets from A to B?
(c). How long would it take to send three back-to-back 200-byte packets from A to B?
3. Repeat parts (a) and (b) of the previous exercise, except change the per-link propagation delay from 60
µsec to 600 µsec.
4. Again suppose the path from A to B has a single switch S in between: A–S–B. The bandwidth and propagation delays are as follows:

link    bandwidth       propagation delay
A–S     5 bytes/µsec    24 µsec
S–B     3 bytes/µsec    13 µsec
(a). How long would it take to send a single 600-byte packet from A to B?
(b). How long would it take to send two back-to-back 300-byte packets from A to B? Note that, because
the S–B link is slower, packet 2 arrives at S from A well before S has finished transmitting packet 1 to B.
5. Suppose in the previous exercise, the A–S link has the smaller bandwidth of 3 bytes/µsec and the S–B
link has the larger bandwidth of 5 bytes/µsec. Now how long does it take to send two back-to-back 300-byte
packets from A to B?
6. Suppose we have five links, A–R1–R2–R3–R4–B, each with the same bandwidth in bytes/ms. Assume we model the per-link propagation delay as 0.
7. Suppose there are N equal-bandwidth links on the path between A and B, as in the diagram below, and
we wish to send M consecutive packets.

A–S1– ... –SN-1–B

Let BD be the bandwidth delay of a single packet on a single link, and let PD be the propagation delay on a
single link. Show that the total bandwidth delay is (M+N−1)×BD, and the total propagation delay is N×PD.
Hint: When does the Mth packet leave A? What is its total transit time to B? Why do no packets have to wait
at any Si for the completion of the transmission of an earlier packet?
8. Repeat the analysis in 5.3.1 Error Rates and Packet Size to compare the probable total number of bytes
that need to be sent to transmit 10^7 bytes using

Assume the bit error rate is 1 in 16×10^5, making the error rate per byte about 1 in 2×10^5.
9. In the text it is claimed there is no N-bit error code that catches all N-bit errors for N≥2 (for N=1, a
parity bit works). Prove this claim for N=2. Hint: pick a length M, and consider all M-bit messages with a
single 1-bit. Any such message can be converted to any other with a 2-bit error. Show, using the Pigeonhole
Principle, that for large enough M two messages m1 and m2 must have the same error code, that is, e(m1 ) =
e(m2 ). If this occurs, then the error code fails to detect the error that converted m1 into m2 .
10. In the description in the text of the Internet checksum, the overflow bit was added back in after each
ones-complement addition. Show that the same final result will be obtained if we add up the 16-bit words
using 32-bit twos-complement arithmetic (the normal arithmetic on all modern hardware), and then add the
upper 16 bits of the sum to the lower 16 bits. (If there is an overflow at this last step, we have to add that
back in as well.)
11. Suppose a message is 110010101. Calculate the CRC-3 checksum using the polynomial X^3 + 1, that is,
find the 3-bit remainder using divisor 1001.
12. The CRC algorithm presented above requires that we process one bit at a time. It is possible to do the
algorithm N bits at a time (eg N=8), with a precomputed lookup table of size 2^N. Complete the steps in the
following description of this strategy for N=3 and polynomial X^3 + X + 1, or 1011.
13. Consider the following set of bits sent with 2-D even parity; the data bits are in the 4×4 upper-left block
and the parity bits are in the rightmost column and bottom row. Which bit is corrupted?

[table of bits omitted]
14. (a) Show that 2-D parity can detect any three errors.
(b). Find four errors that cannot be detected by 2-D parity.
(c). Show that 2-D parity cannot correct all two-bit errors. Hint: put both bits in the same row or
column.
15. Each of the following 8-bit messages with 4-bit Hamming code contains a single error. Correct the
message.
16. (a) What happens in 2-D parity if the corrupted bit is in the parity column or parity row?
(b). In the following 8-bit message with 4-bit Hamming code, there is an error in the code portion. How
can this be determined?
1001 1110 0100
6 ABSTRACT SLIDING WINDOWS
In this chapter we take a general look at how to build reliable data-transport layers on top of unreliable
lower layers. This is achieved through a retransmit-on-timeout policy; that is, if a packet is transmitted
and there is no acknowledgment received during the timeout interval then the packet is resent. As a class,
protocols where one side implements retransmit-on-timeout are known as ARQ protocols, for Automatic
Repeat reQuest.
In addition to reliability, we also want to keep as many packets in transit as the network can support. The
strategy used to achieve this is known as sliding windows. It turns out that the sliding-windows algorithm
is also the key to managing congestion; we return to this in 13 TCP Reno and Congestion Management.
The End-to-End principle, 12.1 The End-to-End Principle, suggests that trying to achieve a reliable transport layer by building reliability into a lower layer is a misguided approach; that is, implementing reliability
at the endpoints of a connection as is described here is in fact the correct mechanism.
The right half of the diagram, by comparison, illustrates the case of a lost ACK. The receiver has received
a duplicate Data[N]. We have assumed here that the receiver has implemented a retransmit-on-duplicate
strategy, and so its response upon receipt of the duplicate Data[N] is to retransmit ACK[N].
As a final example, note that it is possible for ACK[N] to have been delayed (or, similarly, for the first
Data[N] to have been delayed) longer than the timeout interval. Not every packet that times out is actually
lost!
In this case we see that, after sending Data[N], receiving a delayed ACK[N] (rather than the expected
ACK[N+1]) must be considered a normal event.
In principle, either side can implement retransmit-on-timeout if nothing is received. Either side can also
implement retransmit-on-duplicate; this was done by the receiver in the second example above but not by
the sender in the third example (the sender received a second ACK[N] but did not retransmit Data[N+1]).
At least one side must implement retransmit-on-timeout; otherwise a lost packet leads to deadlock as the
sender and the receiver both wait forever. The other side must implement at least one of retransmit-onduplicate or retransmit-on-timeout; usually the former alone. If both sides implement retransmit-on-timeout
with different timeout values, generally the protocol will still work.
Sorcerer's Apprentice

The Sorcerer's Apprentice bug is named for the legend in which the apprentice casts a spell
on a broom to carry water, one bucket at a time. When the basin is full, the apprentice
chops the broom in half, only to find both halves carrying water. See Disney's Fantasia,
http://www.youtube.com/watch?v=XChxLGnIwCU, at around T = 5:35.
A strange thing happens if one side implements retransmit-on-timeout but both sides implement retransmit-on-duplicate, as can happen if the implementer takes the naive view that retransmitting on duplicates is
safer; the moral here is that too much redundancy can be the Wrong Thing. Let us imagine that an
implementation uses this strategy, and that the initial ACK[3] is delayed until after Data[3] is retransmitted
on timeout. In the following diagram, the only packet retransmitted due to timeout is the second Data[3]; all
the other duplications are due to the retransmit-on-duplicate strategy.
All packets are sent twice from Data[3] on. The transfer completes normally, but takes double the normal
bandwidth.
Window Size
In this chapter we will assume winsize does not change. TCP, however, varies winsize up and down
with the goal of making it as large as possible without introducing congestion; we will return to this in
13 TCP Reno and Congestion Management.
At any instant, the sender may send packets numbered last_ACKed + 1 through last_ACKed + winsize; this
packet range is known as the window. Generally, if the first link in the path is not the slowest one, the sender
will most of the time have sent all these.
If ACK[N] arrives with N > last_ACKed (typically N = last_ACKed+1), then the window slides forward; we
set last_ACKed = N. This also increments the upper edge of the window, and frees the sender to send more
packets. For example, with winsize = 4 and last_ACKed = 10, the window is [11,12,13,14]. If ACK[11]
arrives, the window slides forward to [12,13,14,15], freeing the sender to send Data[15]. If instead ACK[13]
arrives, then the window slides forward to [14,15,16,17] (recall that ACKs are cumulative), and three more
packets become eligible to be sent. If there is no packet reordering and no packet losses (and every packet
is ACKed individually) then the window will slide forward in units of one packet at a time; the next arriving
ACK will always be ACK[last_ACKed+1].
Note that the rate at which ACKs are returned will always be exactly equal to the rate at which the slowest
link is delivering packets. That is, if the slowest link (the bottleneck link) is delivering a packet every
50 ms, then the receiver will receive those packets every 50 ms and the ACKs will return at a rate of
one every 50 ms. Thus, new packets will be sent at an average rate exactly matching the delivery rate;
this is the sliding-windows self-clocking property. Self-clocking has the effect of reducing congestion by
automatically reducing the sender's rate whenever the available fraction of the bottleneck bandwidth is
reduced.
Here is a video of sliding windows in action, with winsize = 5. The second link, R–B, has a capacity of five
packets in transit either way; the A–R link has a capacity of one packet in transit either way.
The packets in the very first RTT represent connection setup. This particular video also demonstrates TCP
Slow Start: in the first data-packet RTT, two packets are sent, and in the second data RTT four packets
are sent. The full window size of five is reached in the third data RTT. For the rest of the connection, at
any moment (except those instants where packets have just been received) there are five packets in flight,
either being transmitted on a link as data, or being transmitted as an ACK, or sitting in a queue (this last
does not happen in the video).
As will become clearer below, a winsize smaller than bandwidth × RTT means underutilization of the network, while a larger
winsize means each packet spends time waiting in a queue somewhere.
Below are simplified diagrams for sliding windows with window sizes of 1, 4 and 6, each with a path
bandwidth of 6 packets/RTT (so bandwidth × RTT = 6 packets). The diagram shows the initial packets sent
as a burst; these then would be spread out as they pass through the bottleneck link so that, after the first
burst, packet spacing is uniform. (Real sliding-windows protocols such as TCP generally attempt to avoid
such initial bursts.)
With winsize=1 we send 1 packet per RTT; with winsize=4 we always average 4 packets per RTT. To put
this another way, the three window sizes lead to bottleneck-link utilizations of 1/6, 4/6 and 6/6 = 100%,
respectively.
While it is tempting to envision setting winsize to bandwidth × RTT, in practice this can be complicated;
neither bandwidth nor RTT is constant. Available bandwidth can fluctuate in the presence of competing
traffic. As for RTT, if a sender sets winsize too large then the RTT is simply inflated to the point that
bandwidth × RTT matches winsize; that is, a connection's own traffic can inflate RTTactual to well above
RTTnoLoad. This happens even in the absence of competing traffic.
The slow links are R2–R3 and R3–R4. We will refer to the slowest link as the bottleneck link; if there are
(as here) ties for the slowest link, then the first such link is the bottleneck. The bottleneck link is where the
queue will form. If traffic is sent at a rate of 4 packets/ms from A to B, it will pile up in an ever-increasing
queue at R2. Traffic will not pile up at R3; it arrives at R3 at the same rate by which it departs.
Furthermore, if sliding windows is used (rather than a fixed-rate sender), traffic will eventually not queue up
at any router other than R2: data cannot reach B faster than the 3 packets/ms rate, and so B will not return
ACKs faster than this rate, and so A will eventually not send data faster than this rate. At this 3 packets/ms
rate, traffic will not pile up at R1 (or R3 or R4).
There is a significant advantage in speaking in terms of winsize rather than transmission rate. If A sends to
B at any rate greater than 3 packets/ms, then the situation is unstable as the bottleneck queue grows without
bound and there is no convergence to a steady state. There is no analogous instability, however, if A uses
sliding windows, even if the winsize chosen is quite large (although a large-enough winsize will overflow
the bottleneck queue). If a sender specifies a sending window size rather than a rate, then the network will
converge to a steady state in relatively short order; if a queue develops it will be steadily replenished at the
same rate that packets depart, and so will be of fixed size.
We will assume that in the backward B→A direction, all connections are infinitely fast, meaning zero
delay; this is often a good approximation because ACK packets are what travel in that direction and they
are negligibly small. In the A→B direction, we will assume that the A–R1 link is infinitely fast, but
the other four each have a bandwidth of 1 packet/second (and no propagation-delay component). This
makes the R1–R2 link the bottleneck link; any queue will now form at R1. The path bandwidth is 1
packet/second, and the RTT is 4 seconds.
As a roughly equivalent alternative example, we might use the following:
[diagram: C–S1–S2–D]
with the following assumptions: the C–S1 link is infinitely fast (zero delay), S1–S2 and S2–D each
take 1.0 sec bandwidth delay (so two packets take 2.0 sec, per link, etc), and ACKs also have a 1.0 sec
bandwidth delay in the reverse direction.
In both scenarios, if we send one packet, it takes 4.0 seconds for the ACK to return, on an idle network. This
means that the no-load delay, RTTnoLoad , is 4.0 seconds.
(These models will change significantly if we replace the 1 packet/sec bandwidth delay with a 1-second
propagation delay; in the former case, 2 packets take 2 seconds, while in the latter, 2 packets take 1 second.
See exercise 4.)
We assume a single connection is made; ie there is no competition. Bandwidth × delay here is 4 packets (1
packet/sec × 4 sec RTT).
6.3.1.1 Case 1: winsize = 2
In this case winsize < bandwidth × delay (where delay = RTT). The table below shows what is sent by A and
each of R1-R4 for each second. Every packet is acknowledged 4 seconds after it is sent; that is, RTTactual
= 4 sec, equal to RTTnoLoad ; this will remain true as the winsize changes by small amounts (eg to 1 or 3).
Throughput is proportional to winsize: when winsize = 2, throughput is 2 packets in 4 seconds, or 2/4 =
1/2 packet/sec. During each second, two of the routers R1-R4 are idle. The overall path will have less than
100% utilization.
T    A sends    R1 queues    R1 sends    R2 sends    R3 sends    R4 sends    B ACKs
0    1,2        2            1
1                            2           1
2                                        2           1
3                                                    2           1
4    3                       3                                   2           1
5    4                       4           3                                   2
6                                        4           3
7                                                    4           3
8    5                       5                                   4           3
9    6                       6           5                                   4
Note the brief pile-up at R1 (the bottleneck link!) on startup. However, in the steady state, there is no
queuing. Real sliding-windows protocols generally have some way of minimizing this initial pileup.
6.3.1.2 Case 2: winsize = 4
When winsize=4, at each second all four slow links are busy. There is again an initial burst leading to a brief
surge in the queue; RTTactual for Data[4] is 7 seconds. However, RTTactual for every subsequent packet is 4
seconds, and there are no queuing delays (and nothing in the queue) after T=2. The steady-state connection
throughput is 4 packets in 4 seconds, ie 1 packet/second. Note that overall path throughput now equals the
bottleneck-link bandwidth, so this is the best possible throughput.
T    A sends    R1 queues    R1 sends    R2 sends    R3 sends    R4 sends    B ACKs
0    1,2,3,4    2,3,4        1
1               3,4          2           1
2               4            3           2           1
3                            4           3           2           1
4    5                       5           4           3           2           1
5    6                       6           5           4           3           2
6    7                       7           6           5           4           3
7    8                       8           7           6           5           4
8    9                       9           8           7           6           5
At T=4, R1 has just finished sending Data[4] as Data[5] arrives from A; R1 can begin sending packet 5
immediately. No queue will develop.
Case 2 is the congestion knee of Chiu and Jain [CJ89], defined here in 1.7 Congestion.
6.3.1.3 Case 3: winsize = 6
T     A sends         R1 queues     R1 sends    R2 sends    R3 sends    R4 sends    B ACKs
0     1,2,3,4,5,6     2,3,4,5,6     1
1                     3,4,5,6       2           1
2                     4,5,6         3           2           1
3                     5,6           4           3           2           1
4     7               6,7           5           4           3           2           1
5     8               7,8           6           5           4           3           2
6     9               8,9           7           6           5           4           3
7     10              9,10          8           7           6           5           4
8     11              10,11         9           8           7           6           5
9     12              11,12         10          9           8           7           6
10    13              12,13         11          10          9           8           7
Note that packet 7 is sent at T=4 and the acknowledgment is received at T=10, for an RTT of 6.0 seconds. All
later packets have the same RTTactual . That is, the RTT has risen from RTTnoLoad = 4 seconds to 6 seconds.
Note that we continue to send one windowful each RTT; that is, the throughput is still winsize/RTT, but RTT
is now 6 seconds.
One might initially conjecture that if winsize is greater than the bandwidth×RTTnoLoad product, then the
entire window cannot be in transit at one time. In fact this is not the case; the sender does usually have the
entire window sent and in transit, but RTT has been inflated so it appears to the sender that winsize equals
the bandwidth×RTT product.
In general, whenever winsize > bandwidth×RTTnoLoad, what happens is that the extra packets pile up at
a router somewhere along the path (specifically, at the router in front of the bottleneck link). RTTactual is
inflated by queuing delay to winsize/bandwidth, where bandwidth is that of the bottleneck link; this means
winsize = bandwidth×RTTactual. Total throughput is equal to that bandwidth. Of the 6 seconds of RTTactual
in the example here, a packet spends 4 of those seconds being transmitted on one link or another because
RTTnoLoad = 4. The other two seconds, therefore, must be spent in a queue; there is no other place for packets
to wait. Looking at the table, we see that each second there are indeed two packets in the queue at R1.
If the bottleneck link is the very first link, packets may begin returning before the sender has sent the entire
windowful. In this case we may argue that the full windowful has at least been queued by the sender, and
thus has in this sense been sent. Suppose the network, for example, is
where, as before, each link transports 1 packet/sec from A to B and is infinitely fast in the reverse direction.
Then, if A sets winsize = 6, a queue of 2 packets will form at A.
If there is no competing traffic and winsize is below the congestion knee (winsize < bandwidth × RTTnoLoad), then
winsize is the limiting factor in throughput. Finally, if there is no competition and winsize ≥ bandwidth ×
RTTnoLoad, then the connection is using 100% of the capacity of the bottleneck link and throughput is equal
to the bottleneck-link physical bandwidth. To put this another way,
4. RTTactual = winsize/bottleneck_bandwidth
   queue_usage = winsize − bandwidth × RTTnoLoad

Dividing the first equation by RTTnoLoad, and noting that bandwidth × RTTnoLoad = winsize − queue_usage = transit_capacity, we get

5. RTTactual/RTTnoLoad = winsize/transit_capacity = (transit_capacity + queue_usage) / transit_capacity
Regardless of the value of winsize, in the steady state the sender never sends faster than the bottleneck
bandwidth. This is because the bottleneck bandwidth determines the rate of packets arriving at the far
end, which in turn determines the rate of ACKs arriving back at the sender, which in turn determines the
continued sending rate. This illustrates the self-clocking nature of sliding windows.
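The relationships above are easy to capture in a few lines. Here is a small Python sketch (the function name is illustrative) using the values of the examples in 6.3.1, with bottleneck bandwidth 1 packet/sec and RTTnoLoad = 4 sec:

    # Illustrative sketch: steady-state throughput, RTT and queue usage for a given winsize.

    def steady_state(winsize, bottleneck_bw, rtt_noload):
        """bottleneck_bw in packets/sec, rtt_noload in seconds."""
        transit_capacity = bottleneck_bw * rtt_noload     # the congestion knee
        if winsize <= transit_capacity:
            return winsize / rtt_noload, rtt_noload, 0            # below the knee: no queue
        throughput = bottleneck_bw                                # at or above the knee
        rtt_actual = winsize / bottleneck_bw                      # equation 4
        queue = winsize - transit_capacity                        # packets waiting at the bottleneck
        return throughput, rtt_actual, queue

    for w in (2, 4, 6):
        print(w, steady_state(w, bottleneck_bw=1.0, rtt_noload=4.0))
    # 2 (0.5, 4.0, 0)    -- Case 1: half the bottleneck bandwidth, no queue
    # 4 (1.0, 4.0, 0)    -- Case 2: the knee
    # 6 (1.0, 6.0, 2.0)  -- Case 3: RTT inflated to 6 sec, 2 packets queued at R1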
We will return in 14 Dynamics of TCP Reno to the issue of bandwidth in the presence of competing traffic.
For now, suppose a sliding-windows sender has winsize > bandwidth × RTTnoLoad, leading as above to a
fixed amount of queue usage, and no competition. Then another connection starts up and competes for the
bottleneck link. The first connection's effective bandwidth will thus decrease. This means that bandwidth ×
RTTnoLoad will decrease, and hence the connection's queue usage will increase.
The critical winsize value is equal to bandwidth × RTTnoLoad; this is known as the congestion knee. For
winsize below this, we have:

• throughput is proportional to winsize
• delay is constant
• queue utilization in the steady state is zero

For winsize larger than the knee, we have:

• throughput is constant (equal to the bottleneck bandwidth)
• delay increases linearly with winsize
• queue utilization increases linearly with winsize
Ideally, winsize will be at the critical knee. However, the exact value varies with time: available bandwidth
changes due to the starting and stopping of competing traffic, and RTT changes due to queuing. Standard
TCP makes an effort to stay well above the knee much of the time, presumably on the theory that maximizing
throughput is more important than minimizing queue use.
The power of a connection is defined to be throughput/RTT. For sliding windows below the knee, RTT is
constant and power is proportional to the window size. For sliding windows above the knee, throughput is
constant and delay is proportional to winsize; power is thus proportional to 1/winsize. Here is a graph, akin
to those above, of winsize versus power:
}
send ACK[LA]
There are a couple details omitted.
A possible implementation of EA is as an array of packet objects, of size W. We always put packet Data[K]
into position K % W.
At any point between packet arrivals, Data[LA+1] is not in EA, but some later packets may be present.
For the sender side, we begin by sending a full windowful of packets Data[1] through Data[W], and setting
LA=0. When ACK[M] arrives:
if M ≤ LA or M > LA+W, ignore the packet
otherwise:
set K = LA+1
set LA = M, the new bottom edge of the window
for (i=K; i ≤ LA; i++) send Data[i+W]
Note that new ACKs may arrive while we are in the loop at that last line. We assume here that the sender
stolidly sends what it may send and only after that does it start to process arriving ACKs. Some implementations may take a more asynchronous approach, perhaps with one thread processing arriving ACKs and
incrementing LA and another thread sending everything it is allowed to send.
6.4 Epilog
This completes our discussion of the sliding-windows algorithm in the abstract setting. We will return to
concrete implementations of this in 11.3.2 TFTP Stop-and-Wait (stop-and-wait) and in 12.13 TCP Sliding
Windows; the latter is one of the most important mechanisms on the Internet.
6.5 Exercises
1. Sketch a ladder diagram for stop-and-wait if Data[3] is lost the first time it is sent. Continue the diagram
to the point where Data[4] is successfully transmitted. Assume an RTT of 1 second, no sender timeout (but
the sender retransmits on duplicate), and a receiver timeout of 2 seconds.
2. Suppose a stop-and-wait receiver has an implementation flaw. When Data[1] arrives, ACK[1] and ACK[2]
are sent, separated by a brief interval; after that, the receiver transmits ACK[N+1] when Data[N] arrives,
rather than the correct ACK[N].
(a). Assume the sender responds to each ACK as it arrives. What is the first ACK[N] that it will be able to
determine is incorrect? Assume no packets are lost.
(b). Is there anything the transmitter can do to detect this receiver problem earlier?
3. Create a table as in 6.3.1 Simple fixed-window-size analysis for a network A–R1–R2–R3–R4–B, with a 1
packet/sec bandwidth delay for the R1–R2, R2–R3, R3–R4 and R4–B links. The A–R1 link and
all reverse links (from B to A) are infinitely fast. Carry out the table for 10 seconds.

4. Create a table as in 6.3.1 Simple fixed-window-size analysis for a network A–R1–R2–B. The A–R1 link is infinitely fast; the R1–R2 and R2–B links each have a 1-second propagation delay, in each direction,
and zero bandwidth delay (that is, one packet takes 1.0 sec to travel from R1 to R2; two packets also take
1.0 sec to travel from R1 to R2). Assume winsize=6. Carry out the table for 8 seconds. Note that with zero
bandwidth delay, multiple packets sent together will remain together until the destination.
5. Suppose RTTnoLoad = 4 seconds and the bottleneck bandwidth is 1 packet every 2 seconds.
(a). What window size is needed to remain just at the knee of congestion?
(b). Suppose winsize=6. How many packets are in the queue, at the steady state, and what is RTTactual ?
6. R2 sends: 1, 1, 2, 2, 3, 3, ...
7. Argue that, if A sends to B using sliding windows, and in the path from A to B the slowest link is not the
first link out of A, then eventually A will have the entire window outstanding (except at the instant just after
each new ACK comes in).
8. Suppose RTTnoLoad is 50 ms and the available bandwidth is 2,000 packets/sec. Sliding windows is used
for transmission.
(a). What window size is needed to remain just at the knee of congestion?
(b). If RTTactual rises to 60 ms (due to use of a larger winsize), how many packets are in a queue at any one
time?
(c). What value of winsize would lead to RTTactual = 60 ms?
(d). What value of winsize would make RTTactual rise to 100 ms?
9. Suppose winsize=4 in a sliding-windows connection, and assume that while packets may be lost, they
are never reordered (that is, if two packets P1 and P2 are sent in that order, and both arrive, then they arrive
in that order). Show that if Data[8] is in the receiver's window (meaning that everything up through Data[4]
has been received and acknowledged), then it is no longer possible for even a late Data[0] to arrive at the
receiver. (A consequence of the general principle here is that in the absence of reordering we can replace
the packet sequence number with (sequence_number) mod (2×winsize + 1) without ambiguity.)
10. Suppose winsize=4 in a sliding-windows connection, and assume as in the previous exercise that while
packets may be lost, they are never reordered. Give an example in which Data[8] is in the receiver's window
(so the receiver has presumably sent ACK[4]), and yet Data[1] legitimately arrives. (Thus, the late-packet
bound in the previous exercise is the best possible.)
11. Suppose the network is A───R1───R2───B, where the A──R1 link is infinitely fast and the R1──R2
link has a bandwidth of 1 packet/second each way, for an RTTnoLoad of 2 seconds. Suppose also that A
begins sending with winsize = 6. By the analysis in 6.3.1.3 Case 3: winsize = 6, RTT should rise to
winsize/bandwidth = 6 seconds. Give the RTTs of the first eight packets. How long does it take for RTT to
rise to 6 seconds?
7 IP VERSION 4
There are multiple LAN protocols below the IP layer and multiple transport protocols above, but IP itself
stands alone. The Internet is essentially the IP Internet. If you want to run your own LAN somewhere or if
you want to run your own transport protocol, the Internet backbone will still work for you. But if you want
to change the IP layer, you may encounter difficulty. (Just talk to the IPv6 people, or the IP-multicasting or
IP-reservations groups.)
IP is, in effect, a universal routing and addressing protocol. The two are developed together; every node
has an IP address and every router knows how to handle IP addresses. IP was originally seen as a way to
interconnect multiple LANs, but it may make more sense now to view IP as a virtual LAN overlaying all
the physical LANs.
A crucial aspect of IP is its scalability. As the Internet has grown to ~10^9 hosts, the forwarding tables are
not much larger than 10^5 (perhaps now 10^5.5). Ethernet, in comparison, scales poorly.
Furthermore, IP, unlike Ethernet, offers excellent support for multiple redundant links. If the network
below were an IP network, each node would communicate with each immediate neighbor via their shared
direct link. If, on the other hand, this were an Ethernet network with the spanning-tree algorithm, then one
of the four links would simply be disabled completely. (The figure, a small network containing a loop of redundant links, is omitted here.)
The IP network service model is to act like a LAN. That is, there are no acknowledgments; delivery is
generally described as best-effort. This design choice is perhaps surprising, but it has also been quite fruitful.
Currently the Internet uses (almost exclusively) IP version 4, with its 32-bit address size. As the Internet has
run out of new large blocks of IPv4 addresses (1.10 IP - Internet Protocol), there is increasing pressure to
convert to IPv6, with its 128-bit address size. Progress has been slow, however, and delaying tactics such
as IPv4-address markets and NAT have allowed IPv4 to continue. Aside from the major change in address
structure, there are relatively few differences in the routing models of IPv4 and IPv6. We will study IPv4 in
this chapter and IPv6 in the following.
If you want to provide a universal service for delivering any packet anywhere, what else do you need besides
routing and addressing? Every network (LAN) needs to be able to carry any packet. The protocols spell
out the use of octets (bytes), so the only possible compatibility issue is that a packet is too large for a given
network. IP handles this by supporting fragmentation: a network may break a too-large packet up into
units it can transport successfully. While IP fragmentation is inefficient and clumsy, it does guarantee that
any packet can potentially be delivered to any node.
The IP header, and basics of IP protocol operation, were defined in RFC 791; some minor changes have
since occurred. Most of these changes were documented in RFC 1122, though the DS field was defined in
RFC 2474 and the ECN bits were first proposed in RFC 2481.
The Version field is, for IPv4, the number 4: 0100. The IHL field represents the total IP Header Length, in
32-bit words; an IP header can thus be at most 15 words long. The base header takes up five words, so the
IP Options can consist of at most ten words. If one looks at IPv4 packets using a packet-capture tool that
displays the packets in hex, the first byte will most often be 0x45.
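As a small illustration, here is how that first byte splits into the Version and IHL fields; the 0x45 value is just the common case mentioned above.

    public class IHLDemo {
        public static void main(String[] args) {
            int firstByte = 0x45;                   // typical first byte of an IPv4 header
            int version = (firstByte >> 4) & 0xF;   // high nibble: 4 for IPv4
            int ihl = firstByte & 0xF;              // low nibble: header length in 32-bit words
            System.out.println("version=" + version + ", header length=" + (ihl * 4) + " bytes");
        }
    }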
The Differentiated Services (DS) field is used by the Differentiated Services suite to specify preferential handling for designated packets, eg those involved in VoIP or other real-time protocols. The Explicit
Congestion Notification bits are there to allow routers experiencing congestion to mark packets, thus indicating to the sender that the transmission rate should be reduced. We will address these in 14.8.2 Explicit
Congestion Notification (ECN). These two fields together replace the old 8-bit Type of Service field.
The Total Length field is present because an IP packet may be smaller than the minimum LAN packet
size (see Exercise 1) or larger than the maximum (if the IP packet has been fragmented over several LAN
packets). The IP packet length, in other words, cannot be inferred from the LAN-level packet size. Because
the Total Length field is 16 bits, the maximum IP packet size is 2^16 bytes. This is probably much too large,
even if fragmentation were not something to be avoided.
The second word of the header is devoted to fragmentation, discussed below.
The Time-to-Live (TTL) field is decremented by 1 at each router; if it reaches 0, the packet is discarded. A
typical initial value is 64; it must be larger than the total number of hops in the path. In most cases, a value of
32 would work. The TTL field is there to prevent routing loops (always a serious problem should they occur)
from consuming resources indefinitely. Later we will look at various IP routing-table update protocols and
how they minimize the risk of routing loops; they do not, however, eliminate it. By comparison, Ethernet
headers have no TTL field, but Ethernet also disallows cycles in the underlying topology.
The Protocol field contains a value to indicate if the body of the IP packet represents a TCP packet or a
UDP packet, or, in unusual cases, something else altogether.
The Header Checksum field is the Internet checksum applied to the header only, not the body. Its only
purpose is to allow the discarding of packets with corrupted headers. When the TTL value is decremented
the router must update the header checksum. This can be done algebraically by adding a 1 in the correct
place to compensate, but it is not hard simply to re-sum the 10 halfwords of the average header.
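Here is a sketch of the re-summing approach in Java: the Internet checksum is the ones-complement of the ones-complement sum of the header's 16-bit halfwords, computed with the checksum field itself taken as zero. The sample halfword values below are made up for illustration.

    public class HeaderChecksum {
        // ones-complement sum of 16-bit values, then complemented
        static int checksum(int[] halfwords) {
            int sum = 0;
            for (int hw : halfwords) {
                sum += hw & 0xFFFF;
                if (sum > 0xFFFF) sum = (sum & 0xFFFF) + 1;   // end-around carry
            }
            return (~sum) & 0xFFFF;
        }
        public static void main(String[] args) {
            // ten halfwords of a sample 20-byte header, with the checksum field (index 5) zeroed
            int[] h = { 0x4500, 0x0054, 0x1c46, 0x4000, 0x4001, 0x0000,
                        0xc0a8, 0x0001, 0xc0a8, 0x00c7 };
            System.out.printf("checksum = 0x%04x%n", checksum(h));
        }
    }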
The Source and Destination Address fields contain, of course, the IP addresses. These would be updated
only by NAT firewalls.
One option is the Record Route option, in which routers are to insert their own IP address into the IP
header option area. Unfortunately, with only ten words available, there is not enough space to record most
longer routes (but see 7.9.1 Traceroute and Time Exceeded, below). Another option, now deprecated as a
security risk, is to support source routing. The sender would insert into the IP header option area a list of IP
addresses; the packet would be routed to pass through each of those IP addresses in turn. With strict source
routing, the IP addresses had to represent adjacent neighbors; no router could be used if its IP address were
not on the list. With loose source routing, the listed addresses did not have to represent adjacent neighbors
and ordinary IP routing was used to get from one listed IP address to the next. Both forms are essentially
never used, again for security reasons: if a packet has been source-routed, it may have been routed outside
of the at-least-somewhat trusted zone of the Internet backbone.
7.2 Interfaces
IP addresses (IPv4 and IPv6) are, strictly speaking, assigned not to hosts or nodes, but to interfaces. In the
most common case, where each node has a single LAN interface, this is a distinction without a difference. In
a room full of workstations each with a single Ethernet interface eth0 (or perhaps Ethernet adapter
Local Area Connection), we might as well view the IP address assigned to the interface as assigned
to the workstation itself.
Each of those workstations, however, likely also has a loopback interface (at least conceptually), providing
a way to deliver IP packets to other processes on the same machine. On many systems, the name localhost
resolves to the IPv4 address 127.0.0.1 (although the IPv6 address ::1 is also used). Delivering packets to
the localhost address is simply a form of interprocess communication; a functionally similar alternative is
named pipes. Loopback delivery avoids the need to use the LAN at all, or even the need to have a LAN.
For simple client/server testing, it is often convenient to have both client and server on the same machine,
in which case the loopback interface is convenient and fast. On unix-based machines the loopback interface
represents a genuine logical interface, commonly named lo. On Windows systems the interface may not
represent an actual entity, but this is of practical concern only to those interested in sniffing all loopback
traffic; packets sent to the loopback address are still delivered as expected.
Workstations often have special other interfaces as well. Most recent versions of Microsoft Windows have
a Teredo Tunneling pseudo-interface and an Automatic Tunneling pseudo-interface; these are both intended
(when activated) to support IPv6 connectivity when the local ISP supports only IPv4. The Teredo protocol
is documented in RFC 4380.
When VPN connections are created, as in 3.1 Virtual Private Network, each end of the logical connection
typically terminates at a virtual interface (one of these is labeled tun0 in the diagram of 3.1 Virtual Private
Network). These virtual interfaces appear, to the systems involved, to be attached to a point-to-point link
that leads to the other end.
When a computer hosts a virtual machine, there is almost always a virtual network to connect the host and
virtual systems. The host will have a virtual interface to connect to the virtual network. The host may act as
a NAT router for the virtual machine, hiding that virtual machine behind its own IP address, or it may act
as an Ethernet switch, in which case the virtual machine will need an additional public IP address.
What's My IP Address?
This simple-seeming question is in fact not very easy to answer, if by "my IP address" one means the IP
address assigned to the interface that connects directly to the Internet. One strategy is to find the address
of the default router, and then iterate through all interfaces (eg with the Java NetworkInterface
class) to find an IP address with a matching network prefix. Unfortunately, finding the default router
is hard to do in an OS-independent way, and even then this approach can fail if the Wi-Fi and Ethernet
interfaces both are assigned IP addresses on the same network, but only one is actually connected.
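Here is a fragment of that strategy in Java: iterating over all interfaces with java.net.NetworkInterface and listing their non-loopback addresses. Choosing among the results still requires the default-router prefix comparison described above.

    import java.net.InetAddress;
    import java.net.NetworkInterface;
    import java.util.Collections;

    public class ListAddresses {
        public static void main(String[] args) throws Exception {
            for (NetworkInterface ni : Collections.list(NetworkInterface.getNetworkInterfaces())) {
                for (InetAddress addr : Collections.list(ni.getInetAddresses())) {
                    if (!addr.isLoopbackAddress())
                        System.out.println(ni.getName() + ": " + addr.getHostAddress());
                }
            }
        }
    }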
Many workstations have both an Ethernet interface and a Wi-Fi interface. Both of these can be used simultaneously (with different IP addresses assigned to each), either on the same IP network or on different IP
networks.
Routers always have at least two interfaces on two separate LANs. Generally this means a separate IP
address for each interface, though some point-to-point interfaces can be used without being assigned an IP
address (7.10 Unnumbered Interfaces).
Finally, it is usually possible to assign multiple IP addresses to a single interface. Sometimes this is done
to allow two IP networks to share a single LAN; the interface might be assigned one IP address for each
IP network. Other times a single interface is assigned multiple IP addresses that are on the same LAN; this
is often done so that one physical machine can act as a server (eg a web server) for multiple distinct IP
addresses corresponding to multiple distinct domain names.
While it is important to be at least vaguely aware of all these special cases, we emphasize again that in most
ordinary contexts each end-user workstation has one IP address that corresponds to a LAN connection.
10.0.0.0/8
172.16.0.0/12
192.168.0.0/16
The last block is the one from which addresses are most commonly allocated by DHCP servers
(7.8.1 DHCP and the Small Office) built into NAT routers.
Broadcast addresses are a special form of IP address intended to be used in conjunction with LAN-layer
broadcast. The most common forms are broadcast to this network, consisting of all 1-bits, and broadcast
to network D, consisting of D's network-address bits followed by all 1-bits for the host bits. If you try to
send a packet to the broadcast address of a remote network D, the odds are that some router involved will
refuse to forward it, and the odds are even higher that, once the packet arrives at a router actually on network
D, that router will refuse to broadcast it. Even addressing a broadcast to one's own network will fail if the
underlying LAN does not support LAN-level broadcast (eg ATM).
The highly influential early Unix implementation Berkeley 4.2 BSD used 0-bits for the broadcast bits, instead of 1s. As a result, to this day host bits cannot be all 1-bits or all 0-bits in order to avoid confusion with
the IP broadcast address. One consequence of this is that a Class C network has 254 usable host addresses,
not 256.
7.4 Fragmentation
If you are trying to interconnect two LANs (as IP does), what else might be needed besides Routing and
Addressing? IP explicitly assumes all packets are composed of 8-bit bytes (something not universally true
in the early days of IP; to this day the RFCs refer to octets to emphasize this requirement). IP also defines
bit-order within a byte, and it is left to the networking hardware to translate properly. Neither byte size nor
bit order, therefore, can interfere with packet forwarding.
There is one more feature IP must provide, however, if the goal is universal connectivity: it must accommodate networks for which the maximum packet size, or Maximum Transfer Unit, MTU, is smaller than
the packet that needs forwarding. Otherwise, if we were using IP to join Token Ring (MTU = 4KB, at least
originally) to Ethernet (MTU = 1500B), the token-ring packets might be too large to deliver to the Ethernet side, or to traverse an Ethernet backbone en route to another Token Ring. (Token Ring, in its day, did
commonly offer a configuration option to allow Ethernet interoperability.)
So, IP must support fragmentation, and thus also reassembly. There are two major strategies here: per-link
fragmentation and reassembly, where the reassembly is done at the opposite end of the link (as in ATM), and
path fragmentation and reassembly, where reassembly is done at the far end of the path. The latter approach
is what is taken by IP, partly because intermediate routers are too busy to do reassembly (this is as true today
as it was thirty years ago), and partly because IP fragmentation is seen as the strategy of last resort.
An IP sender is supposed to use a different value for the IDENT field for different packets, at least up until
the field wraps around. When an IP datagram is fragmented, the fragments keep the same IDENT field, so
this field in effect indicates which fragments belong to the same packet.
After fragmentation, the Fragment Offset field marks the start position of the data portion of this fragment
within the data portion of the original IP packet. Note that the start position can be a number up to 2^16, the
maximum IP packet length, but the FragOffset field has only 13 bits. This is handled by requiring the data
portions of fragments to have sizes a multiple of 8 (three bits), and left-shifting the FragOffset value by 3
bits before using it.
As an example, consider the following network (the figure is omitted here): a chain A───R1───R2───R3───B in which, excluding the LAN header, the MTU of the link leaving R1 is 1000 bytes and the MTU of the link leaving R2 is 400 bytes.
Suppose A addresses a packet of 1500 bytes to B, and sends it via the LAN to the first router R1. The packet
contains 20 bytes of IP header and 1480 of data.
R1 fragments the original packet into two packets of sizes 20+976 = 996 and 20+504 = 524. Having 980
bytes of payload in the first fragment would fit, but violates the rule that the sizes of the data portions be
divisible by 8. The first fragment packet has FragOffset = 0; the second has FragOffset = 976.
R2 refragments the first fragment into three packets as follows:
first: size = 20+376=396, FragOffset = 0
second: size = 20+376=396, FragOffset = 376
third: size = 20+224 = 244 (note 376+376+224=976), FragOffset = 752.
R2 refragments the second fragment into two:
first: size = 20+376 = 396, FragOffset = 976+0 = 976
second: size = 20+128 = 148, FragOffset = 976+376=1352
R3 then sends the fragments on to B, without reassembly.
Note that it would have been slightly more efficient to have fragmented into four fragments of sizes 376,
376, 376, and 352 in the beginning. Note also that the packet format is designed to handle fragments of
different sizes easily. The algorithm is based on multiple fragmentation with reassembly only at the final
destination.
Each fragment has its IP-header Total Length field set to the length of that fragment.
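Here is a small Java sketch of the fragmentation arithmetic above, applied to R1's step (1480 data bytes, an outgoing MTU of 1000, and a 20-byte header with no options); it prints the fragment sizes, offsets and More-Fragments flags. It handles only the initial fragmentation of an unfragmented packet.

    public class FragmentDemo {
        public static void main(String[] args) {
            fragment(1480, 1000);       // R1's step in the example above
        }
        // dataLen bytes of payload, sent over a link with the given MTU; 20-byte header, no options
        static void fragment(int dataLen, int mtu) {
            int maxData = ((mtu - 20) / 8) * 8;     // per-fragment payload, a multiple of 8
            int offset = 0;
            while (offset < dataLen) {
                int len = Math.min(maxData, dataLen - offset);
                boolean more = offset + len < dataLen;          // More Fragments flag
                System.out.println("size=" + (20 + len) + "  FragOffset=" + offset
                        + "  MF=" + (more ? 1 : 0));
                offset += len;
            }
        }
    }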
We have not yet discussed the three flag bits. The first bit is reserved, and must be 0. The second bit is the
Don't Fragment bit. If it is set to 1 by the sender then a router must not fragment the packet and must
drop it instead; see 12.12 Path MTU Discovery for an application of this. The third bit is set to 1 for all
fragments except the final one (this bit is thus set to 0 if no fragmentation has occurred). The third bit tells
the receiver where the fragments stop.
The receiver must take the arriving fragments and reassemble them into a whole packet. The fragments
may not arrive in order (unlike in ATM networks) and may have unrelated packets interspersed. The
reassembler must identify when different arriving packets are fragments of the same original, and must
figure out how to reassemble the fragments in the correct order; both these problems were essentially trivial
for ATM.
Fragments are considered to belong to the same packet if they have the same IDENT field and also the same
source and destination addresses and same protocol.
As all fragment sizes are a multiple of 8 bytes, the receiver can keep track of whether all fragments have
been received with a bitmap in which each bit represents one 8-byte fragment chunk. A 1 KB packet could
have up to 128 such chunks; the bitmap would thus be 16 bytes.
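A sketch of this bookkeeping, using java.util.BitSet as the bitmap, might look like the following; the fragment offsets and lengths are hypothetical, and knowing the total data size of course requires having seen the final fragment.

    import java.util.BitSet;

    public class ReassemblyMap {
        public static void main(String[] args) {
            int totalData = 1024;                          // 1 KB of data = 128 eight-byte chunks
            BitSet received = new BitSet(totalData / 8);
            mark(received, 0, 512);                        // hypothetical first fragment
            mark(received, 512, 512);                      // hypothetical second fragment
            System.out.println("complete: " + (received.cardinality() == totalData / 8));
        }
        static void mark(BitSet map, int fragOffset, int fragLen) {
            map.set(fragOffset / 8, (fragOffset + fragLen) / 8);   // sets bits [from, to)
        }
    }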
If a fragment arrives that is part of a new (and fragmented) packet, a buffer is allocated. While the receiver
cannot know the final size of the buffer, it can usually make a reasonable guess. Because of the FragOffset
field, the fragment can then be stored in the buffer in the appropriate position. A new bitmap is also allocated,
and a reassembly timer is started.
As subsequent fragments arrive, not necessarily in order, they too can be placed in the proper buffer in the
proper position, and the appropriate bits in the bitmap are set to 1.
If the bitmap shows that all fragments have arrived, the packet is sent on up as a completed IP packet. If, on
the other hand, the reassembly timer expires, then all the pieces received so far are discarded.
TCP connections usually engage in Path MTU Discovery, and figure out the largest packet size they can
send that will not entail fragmentation (12.12 Path MTU Discovery). But it is not unusual, for example,
for UDP protocols to use fragmentation, especially over the short haul. In the Network File System (NFS)
protocol, for example, UDP is used to carry 8KB disk blocks. These are often sent as a single 8+ KB IP
packet, fragmented over Ethernet to five full packets and a fraction. Fragmentation works reasonably well
here because most of the time the packets do not leave the Ethernet they started on. Note that this is an
example of fragmentation done by the sender, not by an intermediate router.
Finally, any given IP link may provide its own link-layer fragmentation and reassembly; we saw in
3.8.1 ATM Segmentation and Reassembly that ATM does just this. Such link-layer mechanisms are, however, generally invisible to the IP layer.
7.5 The Classless IP Delivery Algorithm
We will refer to the forwarding-table operation as a lookup, and assume there is a lookup() method available that, when given a destination address, returns
the next_hop neighbor.
Instead of class-based divisions, we will assume that each of the IP addresses assigned to a node's interfaces
is configured with an associated length of the network prefix; following the slash notation of 1.10 IP - Internet Protocol, if B is an address and the prefix length is k = kB then the prefix itself is B/k. As usual, an
ordinary host may have only one IP interface, while a router will always have multiple interfaces.
Let D be the given IP destination address; we want to decide if D is local or nonlocal. The host or router
involved may have multiple IP interfaces, but for each interface the length of the network portion of the
address will be known. For each network address B/k assigned to one of the host's interfaces, we compare
the first k bits of B and D; that is, we ask if D matches B/k.
If one of these comparisons yields a match, delivery is local; the host delivers the packet to its final
destination via the LAN connected to the corresponding interface. This means looking up the LAN
address of the destination, if applicable, and sending the packet to that destination via the interface.
If there is no match, delivery is nonlocal, and the host passes D to the lookup() routine of the
forwarding table and sends to the associated next_hop (which must represent a physically connected
neighbor). It is now up to lookup() routine to make any necessary determinations as to how D
might be split into Dnet and Dhost .
The forwarding table is, abstractly, a set of network addresses, now also with lengths, each of the form
B/k, with an associated next_hop destination for each. The lookup() routine will, in principle, compare
D with each table entry B/k, looking for a match (that is, equality of the first k bits). As with the interfaces
check above, the net/host division point (that is, k) will come from the table entry; it will not be inferred
from D or from any other information borne by the packet. There is, in fact, no place in the IP header to
store a net/host division point, and furthermore different routers along the path may use different values of k
with the same destination address D. In 10 Large-Scale IP Routing we will see that in some cases multiple
matches in the forwarding table may exist; the longest-match rule will be introduced to pick the best match.
Here is a simple example for a router with immediate neighbors A-E:
destination
10.3.0.0/16
10.4.1.0/24
10.4.2.0/24
10.4.3.0/24
10.3.37.0/24
next_hop
A
B
C
D
E
The IP addresses 10.3.67.101 and 10.3.59.131 both route to A. The addresses 10.4.1.101, 10.4.2.157 and
10.4.3.233 route to B, C and D respectively. Finally, 10.3.37.103 matches both A and E, but the E match is
longer so the packet is routed that way.
The forwarding table may also contain a default entry for the next_hop, which it may return in cases when
the destination D does not match any known network. We take the view here that returning such a default
entry is a valid result of the routing-table lookup() operation, rather than a third option to the algorithm
above; one approach is for the default entry to be the next_hop corresponding to the destination 0.0.0.0/0,
which does indeed match everything (use of this would definitely require the above longest-match rule,
though).
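Here is a Java sketch of a lookup() doing longest-prefix matching over the example table above, with a 0.0.0.0/0 default entry added; a real router would use a much faster data structure than a linear scan.

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class Lookup {
        static long toBits(String dotted) {             // IPv4 dotted-quad to 32-bit value
            long v = 0;
            for (String s : dotted.split("\\.")) v = (v << 8) | Integer.parseInt(s);
            return v;
        }
        static boolean matches(long addr, long prefix, int k) {
            if (k == 0) return true;                    // 0.0.0.0/0 matches everything
            long mask = (0xFFFFFFFFL << (32 - k)) & 0xFFFFFFFFL;
            return (addr & mask) == (prefix & mask);
        }
        public static void main(String[] args) {
            // destination prefix -> next_hop, as in the example table (plus a default)
            Map<String, String> table = new LinkedHashMap<>();
            table.put("10.3.0.0/16", "A");
            table.put("10.4.1.0/24", "B");
            table.put("10.4.2.0/24", "C");
            table.put("10.4.3.0/24", "D");
            table.put("10.3.37.0/24", "E");
            table.put("0.0.0.0/0", "default");
            long D = toBits("10.3.37.103");
            String best = null; int bestLen = -1;
            for (Map.Entry<String, String> e : table.entrySet()) {
                String[] parts = e.getKey().split("/");
                int k = Integer.parseInt(parts[1]);
                if (matches(D, toBits(parts[0]), k) && k > bestLen) {
                    best = e.getValue(); bestLen = k;
                }
            }
            System.out.println("10.3.37.103 -> " + best);   // prints E: the longest match wins
        }
    }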
Default routes are hugely important in keeping leaf forwarding tables small. Even backbone routers sometimes expend considerable effort to keep the network address prefixes in their forwarding tables as short as
possible, through consolidation.
Routers may also be configured to allow passing quality-of-service information to the lookup() method,
as mentioned in Chapter 1, to support different routing paths for different kinds of traffic (eg bulk file-transfer
versus real-time).
For a modest exception to the local-delivery rule described here, see below in 7.10 Unnumbered Interfaces.
7.6 IP Subnets
Subnets were the first step away from Class A/B/C routing: a large network (eg a class A or B) could be
divided into smaller IP networks called subnets. Consider, for example, a typical Class B network such
as Loyola University's (originally 147.126.0.0/16); the underlying assumption is that any packet can be
delivered via the underlying LAN to any internal host. This would require a rather large LAN, and would
require that a single physical LAN be used throughout the site. What if our site has more than one physical
LAN? Or is really too big for one physical LAN? It did not take long for the IP world to run into this
problem.
Subnets were first proposed in RFC 917, and became official with RFC 950.
Getting a separate IP network prefix for each subnet is bad for routers: the backbone forwarding tables now
must have an entry for every subnet instead of just for every site. What is needed is a way for a site to appear
to the outside world as a single IP network, but for further IP-layer routing to be supported inside the site.
This is what subnets accomplish.
Subnets introduce hierarchical routing: first we route to the primary network, then inside that site we route
to the subnet, and finally the last hop delivers to the host.
Routing with subnets involves in effect moving the IP net/host division line rightward. (Later, when we consider
CIDR, we will see the complementary case of moving the division line to the left.) For now, observe that
moving the line rightward within a site does not affect the outside world at all; outside routers are not even
aware of site-internal subnetting.
In the following diagram, the outside world directs traffic addressed to 147.126.0.0/16 to the router R. Internally, however, the site is divided into subnets. The idea is that traffic from 147.126.1.0/24 to 147.126.2.0/24
is routed, not switched; the two LANs involved may not even be compatible. Most of the subnets shown
are of size /24, meaning that the third byte of the IP address has become part of the network portion of the
subnet's address; one /20 subnet is also shown. RFC 950 would have disallowed the subnet with third byte
0, but having 0 for the subnet bits generally does work.
What we want is for the internal routing to be based on the extended network prefixes shown, while externally continuing to use only the single routing entry for 147.126.0.0/16.
To implement subnets, we divide the site's IP network into some combination of physical LANs (the subnets), and assign each a subnet address: an IP network address which has the site's IP network address as prefix.
To put this more concretely, suppose the site's IP network address is A, and consists of n network bits (so
the site address may be written with the slash notation as A/n); in the diagram above, A/n = 147.126.0.0/16.
A subnet address is an IP network address B/k such that:
The address B/k is within the site: the first n bits of B are the same as A/n's
B/k extends A/n: k≥n
An example B/k in the diagram above is 147.126.1.0/24. (There is a slight simplification here in that subnet
addresses do not absolutely have to be prefixes; see below.)
We now have to figure out how packets will be routed to the correct subnet. For incoming packets we could
set up some proprietary protocol at the entry router to handle this. However, the more complicated situation
is all those existing internal hosts that, under the class A/B/C strategy, would still believe they can deliver
via the LAN to any site host, when in fact they can now only do that for hosts on their own subnet. We need
a more general solution.
We proceed as follows. For each subnet address B/k, we create a subnet mask for B consisting of k 1-bits
followed by enough 0-bits to make a total of 32. We then make sure that every host and router in the site
knows the subnet mask for every one of its interfaces. Hosts usually find their subnet mask the same way
they find their IP address (by static configuration if necessary, but more likely via DHCP, below).
Hosts and routers now apply the IP delivery algorithm of the previous section, with the proviso that, if a
subnet mask for an interface is present, then the subnet mask is used to determine the number of address bits
rather than the Class A/B/C mechanism. That is, we determine whether a packet addressed to destination
D is deliverable locally via an interface with subnet address B/k and corresponding mask M by comparing
D&M with B&M, where & represents bitwise AND; if the two match, the packet is local. This will generally
136
7 IP version 4
involve a match of more bits than if we used the Class A/B/C strategy to determine the network portion of
addresses D and B.
As stated previously, given an address D with no other context, we will not be able to determine the network/host division point in general (eg for outbound packets). However, that division point is not in fact
what we need. All that is needed is a way to tell if a given destination host address D belongs to the current
subnet, say B; that is, we need to compare the first k bits of D and B where k is the (known) length of B.
In the diagram above, the subnet mask for the /24 subnets would be 255.255.255.0; bitwise ANDing any IP
address with the mask is the same as extracting the first 24 bits of the IP address, that is, the subnet portion.
The mask for the /20 subnet would be 255.255.240.0 (240 in binary is 1111 0000).
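Here is the corresponding mask computation and local-delivery test in Java; the two destination addresses are hypothetical examples.

    public class SubnetMatch {
        static int mask(int k) { return k == 0 ? 0 : 0xFFFFFFFF << (32 - k); }   // k leading 1-bits
        static int addr(int a, int b, int c, int d) { return (a << 24) | (b << 16) | (c << 8) | d; }
        public static void main(String[] args) {
            int B = addr(147, 126, 16, 0);     // subnet address 147.126.16.0/20
            int M = mask(20);                  // 255.255.240.0
            int D1 = addr(147, 126, 17, 48);   // hypothetical destination on the /20 subnet
            int D2 = addr(147, 126, 65, 48);   // hypothetical destination elsewhere
            // prints "true false": D1 is local to the /20 subnet, D2 is not
            System.out.println(((D1 & M) == (B & M)) + " " + ((D2 & M) == (B & M)));
        }
    }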
In the diagram above none of the subnets overlaps or conflicts: the subnets 147.126.0.0/24 and
147.126.1.0/24 are disjoint. It takes a little more effort to realize that 147.126.16.0/20 does not overlap
with the others, but note that an IP address matches this network prefix only if the first four bits of the third
byte are 0001, so the third byte itself ranges from decimal 16 to decimal 31 = binary 0001 1111.
Note also that if host A = 147.126.0.1 wishes to send to destination D = 147.126.1.1, and A is not subnet-aware, then delivery will fail: A will infer that the interface is a Class B, and therefore compare the first
two bytes of A and D, and, finding a match, will attempt direct LAN delivery. But direct delivery is now
likely impossible, as the subnets are not joined by a switch. Only with the subnet mask will A realize that
its network is 147.126.0.0/24 while D's is 147.126.1.0/24 and that these are not the same. A would still be
able to send packets to its own subnet. In fact A would still be able to send packets to the outside world:
it would realize that the destination in that case does not match 147.126.0.0/16 and will thus forward to its
router. Hosts on other subnets would be the only unreachable ones.
Properly, the subnet address is the entire prefix, eg 147.126.65.0/24. However, it is often convenient to
identify the subnet address with just those bits that represent the extension of the site IP-network address;
we might thus say casually that the subnet address here is 65.
The class-based IP-address strategy allowed any host anywhere on the Internet to properly separate any
address into its net and host portions. With subnets, this division point is now allowed to vary; for example,
the address 147.126.65.48 divides into 147.126 | 65.48 outside of Loyola, but into 147.126.65 | 48 inside.
This means that the net-host division is no longer an absolute property of addresses, but rather something
that depends on where the packet is on its journey.
Technically, we also need the requirement that given any two subnet addresses of different, disjoint subnets,
neither is a proper prefix of the other. This guarantees that if A is an IP address and B is a subnet address
with mask M (so B = B&M), then A&M = B implies A does not match any other subnet. Regardless of
the net/host division rules, we cannot possibly allow subnet 147.126.16.0/20 to represent one LAN while
147.126.16.0/24 represents another; the second subnet address block is a subset of the first. (We can,
and sometimes do, allow the first LAN to correspond to everything in 147.126.16.0/20 that is not also in
147.126.16.0/24; this is the longest-match rule.)
The strategy above is actually a slight simplification of what the subnet mechanism actually allows: subnet
address bits do not in fact have to be contiguous, and masks do not have to be a series of 1-bits followed by
0-bits. The mask can be any bit-mask; the subnet address bits are by definition those where there is a 1 in
the mask bits. For example, we could at a Class-B site use the fourth byte as the subnet address, and the
third byte as the host address. The subnet mask would then be 255.255.0.255. While this generality was
once sometimes useful in dealing with legacy IP addresses that could not easily be changed, life is simpler
when the subnet bits precede the host bits.
As an example, here is one division of a /24 address block into four subnets:
size      decimal range     subnet address bits
128       128-255           1
64        0-63              00
32        64-95             010
32        96-127            011
As desired, none of the subnet addresses in the third column is a prefix of any other subnet address.
The end result of all of this is that routing is now hierarchical: we route on the site IP address to get to a
site, and then route on the subnet address within the site.
containing DLAN. Because the original request contained ALAN, D's response can be sent directly to A, that
is, unicast.
Additionally, all hosts maintain an ARP cache, consisting of ⟨IP,LAN⟩ address pairs for other hosts on the
network. After the exchange above, A has ⟨DIP,DLAN⟩ in its table; anticipating that A will soon send it a
packet to which it needs to respond, D also puts ⟨AIP,ALAN⟩ into its cache.
ARP-cache entries eventually expire. The timeout interval used to be on the order of 10 minutes, but linux
systems now use a much smaller timeout (~30 seconds observed in 2012). Somewhere along the line, and
probably related to this shortened timeout interval, repeat ARP queries about a timed-out entry are first sent
unicast, not broadcast, to the previous Ethernet address on record. This cuts down on the total amount of
broadcast traffic; LAN broadcasts are, of course, still needed for new hosts. The ARP cache on a linux
system can be examined with the command ip -s neigh; the corresponding Windows command is arp
-a.
The above protocol is sufficient, but there is one further point. When A sends its broadcast who-has D?
ARP query, all other hosts C check their own cache for an entry for A. If there is such an entry (that is, if
AIP is found there), then the value for ALAN is updated with the value taken from the ARP message; if there
is no pre-existing entry then no action is taken. This update process serves to avoid stale ARP-cache entries,
which can arise if a host has had its Ethernet card replaced.
make any necessary cache updates. Finally, ACD requires that hosts that do detect a duplicate address must
discontinue using it.
It is also possible for other stations to answer an ARP query on behalf of the actual destination D; this is
called proxy ARP. An early common scenario for this was when host C on a LAN had a modem connected
to a serial port. In theory a host D dialing in to this modem should be on a different subnet, but that requires
allocation of a new subnet. Instead, many sites chose a simpler arrangement. A host that dialed in to C's
serial port might be assigned IP address DIP, from the same subnet as C. C would be configured to route
packets to D; that is, packets arriving from the serial line would be forwarded to the LAN interface, and
packets sent to CLAN addressed to DIP would be forwarded to D. But we also have to handle ARP, and as
D is not actually on the LAN it will not receive broadcast ARP queries. Instead, C would be configured to
answer on behalf of D, replying with ⟨DIP,CLAN⟩. This generally worked quite well.
Proxy ARP is also used in Mobile IP, for the so-called home agent to intercept traffic addressed to the
home address of a mobile device and then forward it (eg via tunneling) to that device. See 7.11 Mobile
IP.
One delicate aspect of the ARP protocol is that stations are required to respond to a broadcast query. In the
absence of proxies this theoretically should work quite well: there should be only one respondent. However,
there were anecdotes from the Elder Days of networking when a broadcast ARP query would trigger an
avalanche of responses. The protocol-design moral here is that determining who is to respond to a broadcast
message should be done with great care. (RFC 1122 section 3.2.2 addresses this same point in the context
of responding to broadcast ICMP messages.)
ARP-query implementations also need to include a timeout and some queues, so that queries can be resent
if lost and so that a burst of packets does not lead to a burst of queries. A naive ARP algorithm without these
might be:
To send a packet to destination DIP , see if DIP is in the ARP cache. If it is, address the packet
to DLAN ; if not, send an ARP query for D
To see the problem with this approach, imagine that a 32KB packet arrives at the IP layer, to be sent over
Ethernet. It will be fragmented into 22 fragments (assuming an Ethernet MTU of 1500 bytes), all sent at
once. The naive algorithm above will likely send an ARP query for each of these. What we need instead is
something like the following:
To send a packet to destination DIP :
If DIP is in the ARP cache, send to DLAN and return
If not, see if an ARP query for DIP is pending.
If it is, put the current packet in a queue for D.
If there is no pending ARP query for DIP , start one,
again putting the current packet in the (new) queue for D
We also need:
If an ARP query for some CIP times out, resend it (up to a point)
If an ARP query for CIP is answered, send off any packets in C's queue
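Here is a Java sketch of this queue-based strategy; sendTo() and sendArpQuery() are hypothetical stand-ins, and the timeout-and-resend part is omitted.

    import java.util.ArrayDeque;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Queue;

    public class ArpSender {
        Map<String, String> arpCache = new HashMap<>();            // IP -> LAN address
        Map<String, Queue<Packet>> pending = new HashMap<>();      // IP -> packets awaiting a reply

        void send(String dstIP, Packet p) {
            String lan = arpCache.get(dstIP);
            if (lan != null) { sendTo(lan, p); return; }           // cache hit
            Queue<Packet> q = pending.get(dstIP);
            if (q == null) {                                       // no query outstanding yet
                q = new ArrayDeque<>();
                pending.put(dstIP, q);
                sendArpQuery(dstIP);                               // one query, not one per packet
            }
            q.add(p);                                              // hold the packet until a reply arrives
        }

        void arpReplyArrived(String ip, String lanAddr) {
            arpCache.put(ip, lanAddr);
            Queue<Packet> q = pending.remove(ip);
            if (q != null) for (Packet p : q) sendTo(lanAddr, p);  // flush the queue
        }

        void sendTo(String lanAddr, Packet p) { /* hand to the LAN layer */ }
        void sendArpQuery(String ip) { /* broadcast who-has ip */ }
        static class Packet { }
    }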
Here is an ARP-based strategy, sometimes known as ARP Spoofing. First, B makes sure the real S is down,
either by waiting until scheduled downtime or by launching a denial-of-service attack against S.
When A tries to connect, it will begin with an ARP who-has S?. All B has to do is answer, S is-at B.
There is a trivial way to do this: B simply needs to set its own IP address to that of S.
A will connect, and may be convinced to give its password to B. B now simply responds with something
plausible like "backup in progress; try later", and meanwhile use A's credentials against the real S.
This works even if the communications channel A uses is encrypted! If A is using the SSH protocol
(21.9.1 SSH), then A will get a message that the other side's key has changed (B will present its own
SSH key, not S's). Unfortunately, many users (and even some IT departments) do not recognize this as a
serious problem. Some organizations (especially schools and universities) use personal workstations with
frozen configuration, so that the filesystem is reset to its original state on every reboot. Such systems may
be resistant to viruses, but in these environments the user at A will always get a message to the effect that
S's credentials are not known.
Recall that ARP is based on the idea of someone broadcasting an ARP query for a host, containing the host's
IP address, and the host answering it with its LAN address. DHCP involves a host, at startup, broadcasting a
query containing its own LAN address, and having a server reply telling the host what IP address is assigned
to it. The DHCP response is likely to contain several other essential startup options as well, including:
IP address
subnet mask
default router
DNS Server
These four items are a pretty standard minimal network configuration.
Default Routers and DHCP
If you lose your default router, you cannot communicate. Here is something that used to happen to me,
courtesy of DHCP:
1. I am connected to the Internet via Ethernet, and my default router is via my Ethernet interface
2. I connect to my institution's wireless network.
3. Their DHCP server sends me a new default router on the wireless network. However, this default
router will only allow access to a tiny private network, because I have neglected to complete the
Wi-Fi network registration process.
4. I therefore disconnect from the wireless network, and my wireless-interface default router goes
away. However, my system does not automatically revert to my Ethernet default-router entry;
DHCP does not work that way. As a result, I will have no router at all until the next scheduled
DHCP lease renegotiation, and must fix things manually.
The DHCP server has a range of IP addresses to hand out, and maintains a database of which IP address has
been assigned to which LAN address. Reservations can either be permanent or dynamic; if the latter, hosts
typically renew their DHCP reservation periodically (typically one to several times a day).
Sometimes two DHCP servers end up on the same subnet. This often results in chaos, as two different hosts may be assigned the
same IP address, or a host's IP address may suddenly change if it gets a new IP address from the other server.
Disabling one of the DHCP servers fixes this.
While omnipresent DHCP servers have made IP autoconfiguration work out of the box in many cases, in
the era in which IP was designed the need for such servers would have been seen as a significant drawback in
terms of expense and reliability. IPv6 has an autoconfiguration strategy (8.7.2 Stateless Autoconfiguration
(SLAAC)) that does not require DHCP, though DHCPv6 may well end up displacing it.
7.9 Internet Control Message Protocol
Type                           Description
Echo Request                   ping queries
Echo Reply                     ping responses
Destination Unreachable        Destination network unreachable
                               Destination host unreachable
                               Destination port unreachable
                               Fragmentation required but DF flag set
                               Network administratively prohibited
Source Quench                  Congestion control
Redirect Message               Redirect datagram for the network
                               Redirect datagram for the host
                               Redirect for TOS and network
                               Redirect for TOS and host
Router Solicitation            Router discovery/selection/solicitation
Time Exceeded                  TTL expired in transit
                               Fragment reassembly time exceeded
Bad IP Header or Parameter     Pointer indicates the error
                               Missing a required option
                               Bad length
ICMP is perhaps best known for Echo Request/Reply, on which the ping tool is based. Ping remains very
useful for network troubleshooting: if you can ping a host, then the network is reachable, and any problems
are higher up the protocol chain. Unfortunately, ping replies are blocked by default by many firewalls, on
the theory that revealing even the existence of computers is a security risk. While this may be an appropriate
decision, it does significantly impair the utility of ping. Most routers do still pass ping requests, but some
site routers block them.
Source Quench was used to signal that congestion has been encountered. A router that dropped a packet due to
congestion was encouraged to send ICMP Source Quench to the originating host. Generally the
TCP layer would handle these appropriately (by reducing the overall sending rate), but UDP applications
never receive them. ICMP Source Quench did not quite work out as intended, and was formally deprecated
by RFC 6633. (Routers can inform TCP connections of impending congestion by using the ECN bits.)
The Destination Unreachable type has a large number of subtypes:
Network unreachable: some router had no entry for forwarding the packet, and no default route
Host unreachable: the packet reached a router that was on the same LAN as the host, but the host
failed to respond to ARP queries
Port unreachable: the packet was sent to a UDP port on a given host, but that port was not open.
TCP, on the other hand, deals with this situation by replying to the connecting endpoint with a reset
packet. Unfortunately, the UDP Port Unreachable message is sent to the host, not to the application on
that host that sent the undeliverable packet, and so is close to useless as a practical way for applications
to be informed when packets cannot be delivered.
Fragmentation required but DF flag set: a packet arrived at a router and was too big to be forwarded
without fragmentation. However, the Dont Fragment bit in the IP header was set, forbidding fragmentation. Later we will see how TCP uses this option as part of Path MTU Discovery, the process
of finding the largest packet we can send to a specific destination without fragmentation. The basic
idea is that we set the DF bit on some of the packets we send; if we get back this message, that packet
was too big.
Administratively Prohibited: this is sent by a router that knows it can reach the network in question,
but has been configured to drop the packet and send back Administratively Prohibited messages. A router
can also be configured to blackhole messages: to drop the packet and send back nothing.
7.9.2 Redirects
Most non-router hosts start up with an IP forwarding table consisting of a single (default) router, discovered
along with their IP address through DHCP. ICMP Redirect messages help hosts learn of other useful routers.
Here is a classic example:
A is configured so that its default router is R1. It addresses a packet to B, and sends it to R1. R1 receives
the packet, and forwards it to R2. However, R1 also notices that R2 and A are on the same network, and
so A could have sent the packet to R2 directly. So R1 sends an appropriate ICMP redirect message to A
(Redirect Datagram for the Network), and A adds a route to B via R2 to its own forwarding table.
7.10 Unnumbered Interfaces
The endpoints of L could always be assigned private IP addresses (7.3 Special Addresses), such as 10.0.0.1
and 10.0.0.2. To do this we would need to create a subnet; because the host bits cannot be all 0s or all 1s,
the minimum subnet size is four (eg 10.0.0.0/30). Furthermore, the routing protocols to be introduced in
9 Routing-Update Algorithms will distribute information about the subnet throughout the organization or
routing domain, meaning care must be taken to ensure that each link's subnet is unique. Use of unnumbered links avoids this.
If R1 were to originate a packet to be sent to (or forwarded via) R2, the standard strategy is for it to treat
its link0 interface as if it shared the IP address of its Ethernet interface eth0, that is, 200.0.0.1; R2
would do likewise. This still leaves R1 and R2 violating the IP local-delivery rule of 7.5 The Classless
IP Delivery Algorithm; R1 is expected to deliver packets via local delivery to 201.1.1.1 but has no interface
that is assigned an IP address on the destination subnet 201.1.1.0/24. The necessary dispensation, however,
is granted by RFC 1812. All that is necessary by way of configuration is that R1 be told R2 is a directly
connected neighbor reachable via its link0 interface. On linux systems this might be done with the ip
route command on R1 as follows:
ip route
The linux ip route command illustrated here was tested on a virtual point-to-point link created with
ssh and pppd; the link interface name was in fact ppp0. While the command appeared to work as
advertised, it was only possible to create the link if endpoint IP addresses were assigned at the time
of creation; these were then removed with ip route del and then re-assigned with the command
shown here.
7.11 Mobile IP
In the original IPv4 model, there was a strong if implicit assumption that each IP host would stay put. One
role of an IP address is simply as a unique endpoint identifier, but another role is as a locator: some prefix
of the address (eg the network part, in the class-A/B/C strategy, or the provider prefix) represents something
about where the host is physically located. Thus, if a host moves far enough, it may need a new address.
When laptops are moved from site to site, it is common for them to receive a new IP address at each
location, eg via DHCP as the laptop connects to the local Wi-Fi. But what if we wish to support devices like
smartphones that may remain active and communicating while moving for thousands of miles? Changing
IP addresses requires changing TCP connections; life (and application development) might be simpler if a
device had a single, unchanging IP address.
One option, commonly used with smartphones connected to some so-called 3G networks, is to treat the
phone's data network as a giant wireless LAN. The phone's IP address need not change as it moves within
this LAN, and it is up to the phone provider to figure out how to manage LAN-level routing, much as is
done in 3.3.5 Wi-Fi Roaming.
But Mobile IP is another option, documented in RFC 5944. In this scheme, a mobile host has a permanent
home address and, while roaming about, will also have a temporary care-of address, which changes from
place to place. The care-of address might be, for example, an IP address assigned by a local Wi-Fi network,
and which in the absence of Mobile IP would be the IP address for the mobile host. (This kind of care-of
address is known as co-located; the care-of address can also be associated with some other device known
as a foreign agent in the vicinity of the mobile host.) The goal of Mobile IP is to make sure that the mobile
host is always reachable via its home address.
To maintain connectivity to the home address, a Mobile IP host needs to have a home agent back on the
home network; the job of the home agent is to maintain an IP tunnel that always connects to the device's
current care-of address. Packets arriving at the home network addressed to the home address will be forwarded to the mobile device over this tunnel by the home agent. Similarly, if the mobile device wishes to
send packets from its home address (that is, with the home address as IP source address) it can use the
tunnel to forward the packet to the home agent.
The home agent may use proxy ARP (7.7.1 ARP Finer Points) to declare itself to be the appropriate
destination on the home LAN for packets addressed to the home (IP) address; it is then straightforward for
the home agent to forward the packets.
An agent discovery process is used for the mobile host to decide whether it is mobile or not; if it is, it then
needs to notify its home agent of its current care-of address.
There are several forms of packet encapsulation that can be used for Mobile IP tunneling, but the default one
is IP-in-IP encapsulation, defined in RFC 2003. In this process, the entire original IP packet (with header
addressed to the home address) is used as data for a new IP packet, with a new IP header (the outer header)
addressed to the care-of address.
A special value in the IP-header Protocol field indicates that IP-in-IP tunneling was used, so the receiver
knows to forward the packet on using the information in the inner header. The MTU of the tunnel will be
the original MTU of the path to the care-of address, minus the size of the outer header.
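Here is a minimal Java sketch of the encapsulation itself: the inner packet becomes the payload of a new packet whose outer header carries Protocol = 4 (IP-in-IP, per RFC 2003) and is addressed to the care-of address. The checksum, IDENT and flag fields are left unfilled, and the address arguments are 4-byte arrays.

    public class IpInIp {
        static final int OUTER_HEADER = 20;      // outer header size, no options

        static byte[] encapsulate(byte[] innerPacket, byte[] careOfAddr, byte[] homeAgentAddr) {
            byte[] outer = new byte[OUTER_HEADER + innerPacket.length];
            outer[0] = 0x45;                                         // version 4, IHL 5
            int totalLen = outer.length;
            outer[2] = (byte) (totalLen >> 8);                       // Total Length
            outer[3] = (byte) totalLen;
            outer[8] = 64;                                           // TTL
            outer[9] = 4;                                            // Protocol 4 = IP-in-IP
            System.arraycopy(homeAgentAddr, 0, outer, 12, 4);        // outer source address
            System.arraycopy(careOfAddr, 0, outer, 16, 4);           // outer destination address
            System.arraycopy(innerPacket, 0, outer, OUTER_HEADER, innerPacket.length);
            return outer;                                            // header checksum still to be filled in
        }

        static int tunnelMTU(int pathMTU) { return pathMTU - OUTER_HEADER; }
    }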
7.12 Epilog
At this point we have concluded the basic mechanics of IPv4. Still to come is a discussion of how IP
routers build their forwarding tables. This turns out to be a complex topic, divided into routing within single
organizations and ISPs (9 Routing-Update Algorithms) and routing between organizations (10 Large-Scale IP Routing).
But before that, in the next chapter, we compare IPv4 with IPv6, now twenty years old but still seeing very
limited adoption. The biggest issue fixed by IPv6 is IPv4's lack of address space, but there are also several
other less dramatic improvements.
7.13 Exercises
1. Suppose an Ethernet packet represents a TCP acknowledgment; that is, the packet contains an IP header
and a 20-byte TCP header but nothing else. Is the IP packet here smaller than the Ethernet minimum-packet
size, and, if so, by how much?
2. How can a receiving host tell if an arriving IP packet is unfragmented?
3. How long will it take the IDENT field of the IP header to wrap around, if the sender host A sends a stream
of packets to host B as fast as possible? Assume the packet size is 1500 bytes and the bandwidth is 600
Mbps.
4. The following diagram has routers A, B, C, D and E; E is the border router connecting the site to
the Internet. All router-to-router connections are via Ethernet-LAN /24 subnets with addresses of the form
200.0.x. Give forwarding tables for each of A, B, C and D. Each table should include each of the listed
subnets and also a default entry that routes traffic toward router E.
(Figure omitted: routers A, B, C, D and E joined by the Ethernet /24 subnets 200.0.5.0/24 through 200.0.10.0/24, with E connected to the Internet.)
5. (This exercise is an attempt at modeling Internet-2 routing.) Suppose sites S1 ... Sn each have a single
connection to the standard Internet, and each site Si has a single IP address block Ai. Each site's connection
to the Internet is through a single router Ri; each Ri's default route points towards the standard Internet. The
sites also maintain a separate, higher-speed network among themselves; each site has a single link to this
separate network, also through Ri . Describe what the forwarding tables on each Ri will have to look like so
that traffic from one Si to another will always use the separate higher-speed network.
6. For each IP network prefix given (with length), identify which of the subsequent IP addresses are part of
the same subnet.
7. Suppose that the subnet bits below for the following five subnets A-E all come from the beginning of the
fourth byte of the IP address; that is, these are subnets of a /24 block.
A: 00
B: 01
C: 110
D: 111
E: 1010
(a). What are the sizes of each subnet, and the corresponding decimal ranges? Count the addresses with
host bits all 0s or with host bits all 1s as part of the subnet.
(b). How many IP addresses in the class-C block do not belong to any of the subnets A, B, C, D and E?
8. In 7.7 Address Resolution Protocol: ARP it was stated that, in newer implementations, repeat ARP
queries about a timed out entry are first sent unicast, in order to reduce broadcast traffic. Why is this unicast
approach likely to succeed most of the time? Can you give an example of a situation in which the unicast
query would fail, but a followup broadcast query would succeed?
9. Suppose A broadcasts an ARP query who-has B?, receives B's response, and proceeds to send B a
regular IP packet. If B now wishes to reply, why is it likely that A will already be present in B's ARP cache?
Identify a circumstance under which this can fail.
10. Suppose A broadcasts an ARP request who-has B, but inadvertently lists the physical address of
another machine C instead of its own (that is, A's ARP query has IPsrc = A, but LANsrc = C). What will
happen? Will A receive a reply? Will any other hosts on the LAN be able to send to A? What entries will
be made in the ARP caches on A, B and C?
8 IP VERSION 6
What has been learned from experience with IPv4? First and foremost, more than 32 bits are needed for
addresses; the primary motive in developing IPv6 was the specter of running out of IPv4 addresses (something which, at the highest level, has already happened; see the discussion at the end of 1.10 IP - Internet
Protocol). Another important issue is that IPv4 requires a modest amount of effort at configuration; IPv6
was supposed to improve this.
By 1990 the IETF was actively interested in proposals to replace IPv4. A working group for the so-called
IP next generation, or IPng, was created in 1993 to select the new version; RFC 1550 was this group's
formal solicitation of proposals. In July 1994 the IPng directors voted to accept a modified version of the
Simple Internet Protocol, or SIP (unrelated to the Session Initiation Protocol), as the basis for IPv6.
SIP addresses were originally to be 64 bits in length, but in the month leading up to adoption this was
increased to 128. 64 bits would probably have been enough, but the problem is less the actual number than
the simplicity with which addresses can be allocated; the more bits, the easier this becomes, as sites can be
given relatively large address blocks without fear of waste. A secondary consideration in the 64-to-128 leap
was the potential to accommodate now-obsolete CLNP addresses, which were up to 160 bits in length (but
compressible).
IPv6 has to some extent returned to the idea of a fixed division between network and host portions: in most
ordinary-host cases, the first 64 bits of the address is the network portion (including any subnet portion)
and the remaining 64 bits represent the host portion. While there are some configuration alternatives here,
and while the IETF occasionally revisits the issue, at the present time the 64/64 split seems here to stay.
Routing, however, can, as in IPv4, be done on different prefixes of the address at different points of the
network. Thus, it is misleading to think of IPv6 as a return to Class A/B/C address allocation.
IPv6 is now twenty years old, and yet usage remains quite modest. However, the shortage in IPv4 addresses
has begun to loom ominously; IPv6 adoption rates may rise quickly if IPv4 addresses begin to climb in
price.
Addresses of the form ::FFFF:w.x.y.z are the standard IPv6 format for representing IPv4 addresses. A separate representation of IPv4 addresses, with the FFFF block replaced by 0-bits, is used for tunneling IPv6 traffic over IPv4. The IPv6 loopback address is ::1 (that is, 127 0-bits followed by a 1-bit).
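As a quick illustration, Python's standard ipaddress module understands the IPv4-mapped form; the specific addresses below are chosen arbitrarily for the sketch:

import ipaddress

# The IPv4-mapped form: 80 0-bits, 16 1-bits (FFFF), then the 32-bit IPv4 address.
mapped = ipaddress.IPv6Address("::ffff:203.0.113.5")
print(mapped.ipv4_mapped)                    # 203.0.113.5

# The loopback address ::1 is 127 0-bits followed by a 1-bit.
print(int(ipaddress.IPv6Address("::1")))     # 1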
Network address prefixes may be written with the / notation, as in IPv4:
12AB:0:0:CD30::/60
RFC 3513 suggested that initial IPv6 unicast-address allocation be limited to addresses beginning with the bits 001, that is, the 2000::/3 block (the first byte 20 in hexadecimal is 0010 0000 in binary).
Generally speaking, IPv6 addresses consist of a 64-bit network prefix (perhaps including subnet bits) and a
64-bit host identifier. See 8.5 Network Prefixes.
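A small sketch, again using Python's ipaddress module, of the prefix notation above; the host address is a hypothetical member of the 12AB:0:0:CD30::/60 block:

import ipaddress

block = ipaddress.IPv6Network("12AB:0:0:CD30::/60")
addr  = ipaddress.IPv6Address("12ab::cd37:0:0:0:99")    # hypothetical host address
print(addr in block)                                     # True: the first 60 bits match

# All initial unicast allocations were to come from 2000::/3:
print(addr in ipaddress.IPv6Network("2000::/3"))         # False: 12AB... begins with bits 000, not 001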
Generally speaking, fragmentation should be avoided at the application layer when possible. UDP-based
applications that attempt to transmit filesystem-sized (usually 8 KB) blocks of data remain persistent users
of fragmentation.
For example, if I use IPv4 block 10.0.0.0/8 at home, and connect using VPN to a site also using 10.0.0.0/8, it is possible that my printer will have the same IPv4 address as their application server.
another via a router on the LAN, even though they should in principle be able to communicate directly. IPv6
drops this restriction.
The Router Advertisement packets sent by the router should contain a complete list of valid network-address
prefixes, as the Prefix Information option. In simple cases this list may contain a single externally routable
64-bit prefix. If a particular LAN is part of multiple (overlapping) physical subnets, the prefix list will
contain an entry for each subnet; these 64-bit prefixes will themselves likely have a common prefix of length
N<64. For multihomed sites the prefix list may contain multiple unrelated prefixes corresponding to the
different address blocks. Finally, private and local IPv6 address prefixes may also be included.
Each prefix will have an associated lifetime; nodes receiving a prefix from an RA packet are to use it only
for the duration of this lifetime. On expiration (and likely much sooner) a node must obtain a newer RA
packet with a newer prefix list. The rationale for inclusion of the prefix lifetime is ultimately to allow sites
to easily renumber; that is, to change providers and switch to a new network-address prefix provided by a
new router. Each prefix is also tagged with a bit indicating whether it can be used for autoconfiguration, as
in 8.7.2 Stateless Autoconfiguration (SLAAC) below.
8.7.2 Stateless Autoconfiguration (SLAAC)
SLAAC was designed to support complete plug-and-play network setup: hosts on a completely isolated LAN could talk to one another out of the box, and if a router were introduced connecting the LAN to the Internet, then hosts would be able to determine unique, routable addresses from information available from the router.
In the early days of IPv6 development, in fact, DHCPv6 may have been intended only for address assignments to routers and servers, with SLAAC meant for ordinary hosts. In that era, it was still common for
IPv4 addresses to be assigned statically, via per-host configuration files. RFC 4862 states that SLAAC is
to be used when a site is not particularly concerned with the exact addresses hosts use, so long as they are
unique and properly routable.
SLAAC and DHCPv6 evolved to some degree in parallel. While SLAAC solves the autoconfiguration problem quite neatly, at this point DHCPv6 solves it just as effectively, and provides for greater administrative
control. For this reason, SLAAC may end up less widely deployed. On the other hand, SLAAC gives hosts
greater control over their IPv6 addresses, and so may end up offering hosts a greater degree of privacy by
allowing endpoint management of the use of private and temporary addresses (below).
When a host first begins the Neighbor Discovery process, it receives a Router Advertisement packet. In this
packet are two special bits: the M (managed) bit and the O (other configuration) bit. The M bit is set to
indicate that DHCPv6 is available on the network for address assignment. The O bit is set to indicate that
DHCPv6 is able to provide additional configuration information (eg the name of the DNS server) to hosts
that are using SLAAC to obtain their addresses.
161
The next step is to see if there is a router available. The host sends a Router Solicitation (RS) message
to the all-routers multicast address. A router, if present, should answer with a Router Advertisement
(RA) message that also contains a Prefix Information option; that is, a list of IPv6 network-address prefixes
(8.6.2 Prefix Discovery). The RA message will mark with a flag those prefixes eligible for use with
SLAAC; if no prefixes are so marked, then SLAAC should not be used. All prefixes will also be marked
with a lifetime, indicating how long the host may continue to use the prefix; once the prefix expires, the host
must obtain a new one via a new RA message.
The host chooses an appropriate prefix, stores the prefix-lifetime information, and, in the original version
of SLAAC, appends the prefix to the front of its host identifier to create what should now be a routable
address. The prefix length plus the host-identifier length must equal 128 bits; in the most common case each
is 64 bits. The address so formed must now be verified through the duplicate-address-detection mechanism
above.
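As a concrete illustration, here is a minimal sketch in Python of the prefix-plus-identifier concatenation just described; both the /64 prefix and the 64-bit host identifier are hypothetical values:

import ipaddress

# Hypothetical 64-bit network prefix and 64-bit host identifier:
prefix  = int(ipaddress.IPv6Address("2001:db8:0:1::"))   # the high-order 64 bits
host_id = 0x021122fffe334455                              # the low-order 64 bits

# SLAAC appends the prefix to the front of the host identifier (64 + 64 = 128 bits):
print(ipaddress.IPv6Address(prefix | host_id))            # 2001:db8:0:1:211:22ff:fe33:4455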
An address generated in this way will, because of the embedded host identifier, uniquely identify the host
for all time. This includes identifying the host even when it is connected to a new network and is given
a different network prefix. Therefore, RFC 4941 defines a set of privacy extensions to SLAAC: optional
mechanisms for the generation of alternative host identifiers, based on pseudorandom generation using the
original LAN-address-based host identifier as a seed value. The probability of two hosts accidentally
choosing the same host identifier in this manner is very small; the Neighbor Solicitation mechanism with
DAD must, however, still be used to verify that the address is in fact unique. DHCPv6 also provides an
option for temporary address assignments, also to improve privacy, but one of the potential advantages of
SLAAC is that this process is entirely under the control of the end system.
Regularly (eg every few hours, or less) changing the host portion of an IPv6 address will make external
tracking of a host slightly more difficult. However, for a residential site with only a handful of hosts, a
considerable degree of tracking may be obtained simply by using the common 64-bit prefix.
In theory, if another host B on the LAN wishes to contact host A with a SLAAC-configured address containing the original host identifier, and B knows A's IPv6 address AIPv6, then B might extract A's LAN address from the low-order bits of AIPv6. This was never actually allowed, however, even before the RFC 4941 privacy options, as there is no way for B to know that A's address was generated via SLAAC at all. B would always find A's LAN address through the usual process of IPv6 Neighbor Solicitation.
A host using SLAAC may receive multiple network prefixes, and thus generate for itself multiple addresses.
RFC 6724 defines a process for a host to determine, when it wishes to connect to destination address D,
which of its own multiple addresses to use. For example, if D is a site-local address, not globally visible,
then the host will likely want to use an address that is also site-local. RFC 6724 also includes mechanisms
to allow a host with a permanent public address (eg corresponding to a DNS entry) to prefer alternative
temporary or privacy addresses for outbound connections.
At the end of the SLAAC process, the host knows its IPv6 address (or set of addresses) and its default router.
In IPv4, these would have been learned through DHCP along with the identity of the host's DNS server; one
concern with SLAAC is that there is no obvious way for a host to find its DNS server. One strategy is to fall
back on DHCPv6 for this. However, RFC 6106 now defines a process by which IPv6 routers can include
DNS-server information in the RA packets they send to hosts as part of the SLAAC process; this completes
the final step of the autoconfiguration process.
How to get DNS names for SLAAC-configured IPv6 hosts into the DNS servers is an entirely separate
issue. One approach is simply not to give DNS names to such hosts. In the NAT-router model for IPv4
autoconfiguration, hosts on the inward side of the NAT router similarly do not have DNS names (although
they are also not reachable directly, while SLAAC IPv6 hosts would be reachable). If DNS names are needed
for hosts, then a site might choose DHCPv6 for address assignment instead of SLAAC. It is also possible to
figure out the addresses SLAAC would use (by identifying the host-identifier bits) and then creating DNS
entries for these hosts. Hosts can also use Dynamic DNS (RFC 2136) to update their own DNS records.
8.7.3 DHCPv6
The job of the DHCPv6 server is to tell an inquiring host its network prefix(es) and also supply a 64-bit host-identifier.
Hosts begin the process by sending a DHCPv6 request to the
All_DHCP_Relay_Agents_and_Servers multicast IPv6 address FF02::1:2 (versus the broadcast address for
IPv4). As with DHCPv4, the job of a relay agent is to tag a DHCP request with the correct current subnet, and then to forward it to the actual DHCPv6 server. This allows the DHCP server to be on a different subnet from the requester. Note that the use of multicast does nothing to diminish the need for relay agents; use of the multicast group does not necessarily identify a requester's subnet. In fact, the
All_DHCP_Relay_Agents_and_Servers multicast address scope is limited to the current link; relay agents
then forward to the actual DHCP server using the site-scoped address All_DHCP_Servers.
Hosts using SLAAC to obtain their address can still use a special Information-Request form of DHCPv6 to
obtain their DNS server and any other static DHCPv6 information.
Clients may ask for temporary addresses. These are identified as such in the DHCPv6 request, and are
handled much like permanent address requests, except that the client may ask for a new temporary address
only a short time later. When the client does so, a different temporary address will be returned; a repeated
request for a permanent address, on the other hand, would usually return the same address as before.
When the DHCPv6 server returns a temporary address, it may of course keep a log of this address. The
absence of such a log is one reason SLAAC may provide a greater degree of privacy. Another concern
is that the DHCPv6 temporary-address sequence might have a flaw that would allow a remote observer to
infer a relationship between different temporary addresses; with SLAAC, a host is responsible itself for the
security of its temporary-address sequence and is not called upon to trust an external entity.
A DHCPv6 response contains a list (perhaps of length 1) of IPv6 addresses. Each separate address has an
expiration date. The client must send a new request before the expiration of any address it is actually using;
unlike for DHCPv4, there is no separate address lease lifetime.
In DHCPv4, the host portion of addresses typically comes from address pools representing small ranges of integers such as 64-254; these values are generally allocated consecutively. A DHCPv6 server, on the other hand, should take advantage of the enormous range (2^64) of possible host portions by allocating values more sparsely, through the use of pseudorandomness. This makes it very difficult for an outsider who knows one of a site's host addresses to guess the addresses of other hosts. Some DHCPv6 servers, however, do not yet support this; such servers make the SLAAC approach more attractive.
While it might be convenient to distribute only the /64 prefix via manual configuration, and have SLAAC supply the low-order 64 bits, this option is not described in the SLAAC RFCs and seems not to be available in common implementations.
8.9 ICMPv6
RFC 4443 defines an updated version of the ICMP protocol. It includes an IPv6 version of ICMP Echo
Request / Echo Reply, upon which the ping command is based. It also handles the error conditions below;
this list is somewhat cleaner than the corresponding ICMPv4 list:
Destination Unreachable
In this case, one of the following numeric codes is returned:
0. No route to destination, returned when a router has no next_hop entry.
1. Communication with destination administratively prohibited, returned when a router has a
next_hop entry, but declines to use it for policy reasons. Codes 5 and 6 are special cases; these
more-specific codes are returned when appropriate.
2. Beyond scope of source address, returned when a router is, for example, asked to route a packet
to a global address, but the return address is site-local. In IPv4, when a host with a private address
attempts to connect to a global address, NAT is almost always involved.
3. Address unreachable, a catchall category for routing failure not covered by any other message. An
example is if the packet was successfully routed to the last_hop router, but Neighbor Discovery failed
to find a LAN address corresponding to the IPv6 address.
4. Port unreachable, returned when, as in ICMPv4, the destination host does not have the requested
UDP port open.
5. Source address failed ingress/egress policy, see code 1.
6. Reject route to destination, see code 1.
Packet Too Big
This is like ICMPv4's Fragmentation Required but DontFragment flag set; IPv6 however has no router-based fragmentation.
Time Exceeded
This is used for cases where the Hop Limit was exceeded, and also where source-based fragmentation was
used and the fragment-reassembly timer expired.
Parameter Problem
This is used when there is a malformed entry in the IPv6 header, eg an unrecognized Next Header value.
8.10.1 ping6
We will start with the linux version of ping6, the IPv6 analogue of the familiar ping command. It is
used to send ICMPv6 Echo Requests. The ping6 command supports an option to specify the interface (-I
eth0); as noted above, this is mandatory when sending to link-local addresses.
ping6 ::1: This allows me to ping my own loopback address.
ping6 -I eth0 ff02::1: This pings the all-nodes multicast group on interface eth0. I get these answers:
64 bytes from fe80::3e97:eff:fe2c:2beb (this is the host I am pinging from)
64 bytes from fe80::2a0:ccff:fe24:b0e4 (another linux host)
My VoIP phone on the same subnet but apparently supporting IPv4 only remains mute.
ping6 -I eth0 fe80::2a0:ccff:fe24:b0e4: This pings the link-local address of the other linux host answering the previous query. Note the ff:fe in the host identifier. Also note the flipped seventh bit of the two bytes 02a0; the other linux host has Ethernet address 00:a0:cc:24:b0:e4.
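The host-identifier construction can be checked directly. Here is a minimal sketch in Python of the modified EUI-64 conversion described above (flip the seventh bit of the first byte and insert ff:fe in the middle), applied to the Ethernet address just mentioned; the function name is made up for the sketch:

def eui64_interface_id(mac: str) -> str:
    """Convert an Ethernet address to the modified EUI-64 host identifier used by SLAAC."""
    b = bytearray(int(x, 16) for x in mac.split(":"))
    b[0] ^= 0x02                      # flip the seventh bit (the universal/local bit)
    b[3:3] = b"\xff\xfe"              # insert ff:fe between the two 3-byte halves
    groups = [f"{b[i] << 8 | b[i+1]:x}" for i in range(0, 8, 2)]
    return ":".join(groups)

print(eui64_interface_id("00:a0:cc:24:b0:e4"))   # 2a0:ccff:fe24:b0e4, as in fe80::2a0:ccff:fe24:b0e4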
8.12 Epilog
IPv4 has run out of large address blocks, as of 2011. IPv6 has reached a mature level of development. Most
common operating systems provide excellent IPv6 support.
Yet conversion has been slow. Many ISPs still provide limited (to nonexistent) support, and inexpensive
IPv6 firewalls to replace the ubiquitous consumer-grade NAT routers do not really exist. Time will tell how
all this evolves. However, while IPv6 has now been around for twenty years, top-level IPv4 address blocks
disappeared just three years ago. It is quite possible that this will prove to be just the catalyst IPv6 needs.
8.13 Exercises
1. Each IPv6 address is associated with a specific solicited-node multicast address.
(a). Explain why, on a typical Ethernet, if the original IPv6 host address was obtained via SLAAC then the LAN multicast group corresponding to the host's solicited-node multicast address is likely to be small, in many cases consisting of one host only. (Packet delivery to small LAN multicast groups can be much more efficient than delivery to large multicast groups.)
(b). What steps might a DHCPv6 server take to ensure that, for the IPv6 addresses it hands out, the LAN multicast groups corresponding to the host addresses' solicited-node multicast addresses will be small?
2. If an attacker sends a large number of probe packets via IPv4, you can block them by blocking the attacker's IP address. Now suppose the attacker uses IPv6 to launch the probes; for each probe, the attacker changes the low-order 64 bits of the address. Can these probes be blocked efficiently? If so, what do you have to block? Might you also be blocking other users?
3. Suppose someone tried to implement ping6 so that, if the address was a link-local address and no interface
was specified, the ICMPv6 Echo Request was sent out all non-loopback interfaces. Could the end result be
different than conventional ping6 with the correct interface supplied? If so, how likely is this?
4. Create an IPv6 ssh connection as in 8.10 Routerless Connection Examples. Examine the connection's packets using WireShark or the equivalent. Does the TCP handshake (12.3 TCP Connection Establishment)
look any different over IPv6?
5. Create an IPv6 ssh connection using manually configured addresses as in 8.10.3 Manual address
configuration. Again use WireShark or the equivalent to monitor the connection. Is DAD (8.7.1 Duplicate
Address Detection) used?
9 ROUTING-UPDATE ALGORITHMS
Routers identify their router neighbors (through some sort of neighbor-discovery mechanism), and
add a third column to their forwarding tables for cost; table entries are thus of the form
destination,next_hop,cost. The simplest case is to assign a cost of 1 to each link (the hopcount metric); it is also possible to assign more complex numbers.
Each router then reports the destination,cost portion of its table to its neighboring routers at regular
intervals (these table portions are the vectors of the algorithm name). It does not matter if neighbors
exchange reports at the same time, or even at the same rate.
Each router also monitors its continued connectivity to each neighbor; if neighbor N becomes unreachable
then its reachability cost is set to infinity.
Actual destinations in IP would be networks attached to routers; one router might be directly connected to several such destinations. In the following, however, we will identify all a router's directly connected
networks with the router itself. That is, we will build forwarding tables to reach every router. While
it is possible that one destination network might be reachable by two or more routers, thus breaking our
identification of a router with its set of attached networks, in practice this is of little concern. See exercise 4
for an example in which networks are not identified with adjacent routers.
9.1.2 Example 1
For our first example, no links will break and thus only the first two rules above will be used. We will start
out with the network below with empty forwarding tables; all link costs are 1.
(figure omitted: the network has links A-B, A-C, A-D, B-C, C-E and D-E, each of cost 1)
After initial neighbor discovery, here are the forwarding tables. Each node has entries only for its directly
connected neighbors:
A: B,B,1 C,C,1 D,D,1
B: A,A,1 C,C,1
C: A,A,1 B,B,1 E,E,1
D: A,A,1 E,E,1
E: C,C,1 D,D,1
Now let D report to A; it sends records A,1 and E,1. A ignores D's A,1 record, but E,1 represents a new destination; A therefore adds E,D,2 to its table. Similarly, let A now report to D, sending B,1 C,1 D,1 E,2 (the last is the record we just added). D ignores A's records D,1 and E,2 but A's records B,1 and C,1 cause D to create entries B,A,2 and C,A,2. A and D's tables are now, in fact, complete.
Now suppose C reports to B; this gives B an entry E,C,2. If C also reports to E, then E's table will have
A,C,2 and B,C,2. The tables are now:
A: B,B,1 C,C,1 D,D,1 E,D,2
B: A,A,1 C,C,1 E,C,2
C: A,A,1 B,B,1 E,E,1
D: A,A,1 E,E,1 B,A,2 C,A,2
E: C,C,1 D,D,1 A,C,2 B,C,2
We have two missing entries: B and C do not know how to reach D. If A reports to B and C, the tables will
be complete; B and C will each reach D via A at cost 2. However, the following sequence of reports might
also have occurred:
E reports to C, causing C to add D,E,2
C reports to B, causing B to add D,C,3
In this case we have 100% reachability but B routes to D via the longer-than-necessary path B-C-E-D. However, one more report will fix this: suppose A reports to B. B will receive D,1 from A, and will update its entry D,C,3 to D,A,2.
Note that A routes to E via D while E routes to A via C; this asymmetry was due to indeterminateness in the
order of initial table exchanges.
If all link weights are 1, and if each pair of neighbors exchange tables once before any pair starts a second
exchange, then the above process will discover the routes in order of length, ie the shortest paths will be the
first to be discovered. This is not, however, a particularly important consideration.
9.1.3 Example 2
The next example illustrates link weights other than 1. The first route discovered between A and B is the direct route with cost 8; eventually we discover the longer A-C-D-B route with cost 2+1+3=6.
9.1.4 Example 3
Our third example will illustrate how the algorithm proceeds when a link breaks. We return to the first
diagram, with all tables completed, and then suppose the DE link breaks. This is the bad-news case: a
link has broken, and is no longer available; this will bring the third rule into play.
We shall assume, as above, that A reaches E via D, but we will here assume contrary to Example 1 that
C reaches D via A (see exercise 3.5 for the original case).
Initially, upon discovering the break, D and E update their tables to E,-,∞ and D,-,∞ respectively (whether or not they actually enter ∞ into their tables is implementation-dependent; we may consider this as equivalent to removing their entries for one another; the - as next_hop indicates there is no next_hop). Eventually D and E will report the break to their respective neighbors A and C. A will apply the bad-news rule above and update its entry for E to E,-,∞. We have assumed that C, however, routes to D via A, and so it will ignore E's report.
We will suppose that the next steps are for C to report to E and to A. When C reports its route D,2 to E,
E will add the entry D,C,3, and will again be able to reach D. When C reports to A, A will add the route
E,C,2. The final step will be when A next reports to D, and D will have E,A,3. Connectivity is restored.
9.1.5 Example 4
The previous examples have had a global perspective in that we looked at the entire network. In the next
example, we look at how one specific router, R, responds when it receives a distance-vector report from its
neighbor S. Neither R nor S nor we have any idea of what the entire network looks like. Suppose R's table is initially as follows, and the S-R link has cost 1:
destination   next_hop   cost
A             S          3
B             T          4
C             S          5
D             U          6
S now sends R the following report, containing only destinations and its costs:
destination   cost
A             2
B             3
C             5
D             4
E             2
R then updates its table as follows:

destination   next_hop   cost   reason
A             S          3      No change; S probably sent this report before
B             T          4      No change; R's cost via S is tied with R's cost via T
C             S          6      Next_hop increase
D             S          5      Lower-cost route via S
E             S          3      New destination
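The update logic just applied can be sketched in a few lines of Python; the dictionary-based table layout is an assumption made for illustration, not part of any real protocol:

def dv_update(table, neighbor, report, link_cost):
    """table: dest -> (next_hop, cost); report: dest -> cost as received from the neighbor."""
    for dest, reported_cost in report.items():
        new_cost = reported_cost + link_cost
        if dest not in table:                       # new destination
            table[dest] = (neighbor, new_cost)
        else:
            next_hop, cost = table[dest]
            if new_cost < cost:                     # lower-cost route via the reporting neighbor
                table[dest] = (neighbor, new_cost)
            elif next_hop == neighbor:              # cost change reported by the current next_hop
                table[dest] = (neighbor, new_cost)
    return table

R = {"A": ("S", 3), "B": ("T", 4), "C": ("S", 5), "D": ("U", 6)}
S_report = {"A": 2, "B": 3, "C": 5, "D": 4, "E": 2}
print(dv_update(R, "S", S_report, 1))
# {'A': ('S', 3), 'B': ('T', 4), 'C': ('S', 6), 'D': ('S', 5), 'E': ('S', 3)}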
If A immediately reports to B that D is no longer reachable (distance = ∞), then all is well. However, it is possible that B reports to A first, telling A that it has a route to D, with cost 2, which B still believes it has. This means A now installs the entry D,B,3. At this point we have what we called in 1.6 Routing Loops a linear routing loop: if a packet is addressed to D, A will forward it to B and B will forward it back to A.
Worse, this loop will be with us a while. At some point A will report D,3 to B, at which point B will update its entry to D,A,4. Then B will report D,4 to A, and A's entry will be D,B,5, etc. This process is known as slow convergence to infinity. If A and B each report to the other once a minute, it will take 2,000 years for the costs to overflow an ordinary 32-bit integer.
Suppose the A-D link breaks, and A updates to D,-,∞. A then reports D,∞ to B, which updates its table to D,-,∞. But then, before A can also report D,∞ to C, C reports D,2 to B. B then updates to D,C,3, and reports D,3 back to A; neither this nor the previous report violates split-horizon. Now A's entry is D,B,4. Eventually A will report to C, at which point C's entry becomes D,A,5, and the numbers keep increasing as the reports circulate counterclockwise. The actual routing proceeds in the other direction, clockwise.
Split horizon often also includes poison reverse: if A uses N as its next_hop to D, then A in fact reports D,∞ to N, which is a more definitive statement that A cannot reach D by itself. However, coming up with a scenario where poison reverse actually affects the outcome is not trivial.
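As a rough sketch (in Python, with an assumed table representation), the report a router sends to one particular neighbor under split horizon with poison reverse might be built like this:

INFINITY = float("inf")

def report_for_neighbor(table, neighbor):
    """table: dest -> (next_hop, cost). Build the report sent to one particular neighbor."""
    report = {}
    for dest, (next_hop, cost) in table.items():
        if next_hop == neighbor:
            report[dest] = INFINITY      # poison reverse: advertise infinity back to our next_hop
            # (plain split horizon would simply omit the entry instead)
        else:
            report[dest] = cost
    return report

A = {"D": ("B", 3), "C": ("C", 1)}
print(report_for_neighbor(A, "B"))       # {'D': inf, 'C': 1}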
9.2.1.2 Triggered Updates
In the original example, if A was first to report to B then the loop resolved immediately; the loop occurred
if B was first to report to A. Nominally each outcome has probability 50%. Triggered updates means that
any router should report immediately to its neighbors whenever it detects any change for the worse. If A
reports first to B in the first example, the problem goes away. Similarly, in the second example, if A reports
to both B and C before B or C report to one another, the problem goes away. There remains, however, a
small window where B could send its report to A just as A has discovered the problem, before A can report
to B.
9.2.1.3 Hold Down
Hold down is sort of a receiver-side version of triggered updates: the receiver does not use new alternative
routes for a period of time (perhaps two router-update cycles) following discovery of unreachability. This
gives time for bad news to arrive. In the first example, it would mean that when A received B's report D,2, it would set this aside. It would then report D,∞ to B as usual, at which point B would now report D,∞ back to A, at which point B's earlier report D,2 would be discarded. A significant drawback of hold down
is that legitimate new routes are also delayed by the hold-down period.
These mechanisms for preventing slow convergence are, in the real world, quite effective. The Routing
Information Protocol (RIP, RFC 2453) implements all but hold-down, and has been widely adopted at
smaller installations.
However, the potential for routing loops and the limited value for infinity led to the development of alternatives. One alternative is the link-state strategy, 9.5 Link-State Routing-Update Algorithm. Another
alternative is Ciscos Enhanced Interior Gateway Routing Protocol, or EIGRP, 9.4.2 EIGRP. While part of
the distance-vector family, EIGRP is provably loop-free, though to achieve this it must sometimes suspend
forwarding to some destinations while tables are in flux.
Now suppose that A and B use distance-vector but are allowed to choose the shortest route to within 10%. A
would get a report from C that D could be reached with cost 1, for a total cost of 21. The forwarding entry
via C would be D,C,21. Similarly, A would get a report from B that D could be reached with cost 21, for
a total cost of 22: D,B,22. Similarly, B has choices D,C,21 and D,A,22.
If A and B both choose the minimal route, no loop forms. But if A and B both use the 10%-overage rule,
they would be allowed to choose the other route: A could choose D,B,22 and B could choose D,A,22.
If this happened, we would have a routing loop: A would forward packets for D to B, and B would forward
them right back to A.
As we apply distance-vector routing, each router independently builds its tables. A router might have some notion of the path its packets would take to their destination; for example, in the case above A might believe that with forwarding entry D,B,22 its packets would take the path A-B-C-D (though in distance-vector routing, routers do not particularly worry about the big picture). Consider again the accurate-cost question above. This fails in the 10%-overage example, because the actual path is now infinite.
We now prove that, in distance-vector routing, the network will have accurate costs, provided
each router selects what it believes to be the shortest path to the final destination, and
the network is stable, meaning that further dissemination of any reports would not result in changes
To see this, suppose the actual route taken by some packet from source to destination, as determined by
application of the distributed algorithm, is longer than the cost calculated by the source. Choose an example
of such a path with the fewest number of links, among all such paths in the network. Let S be the source,
D the destination, and k the number of links in the actual path P. Let Ss forwarding entry for D be D,N,c,
where N is Ss next_hop neighbor. To have obtained this route through the distance-vector algorithm, S must
have received report D,cD from N, where we also have the cost of the SN link as cN and c = cD + cN . If
we follow a packet from N to D, it must take the same path P with the first link deleted; this sub-path has
length k-1 and so, by our hypothesis that k was the length of the shortest path with non-accurate costs, the
cost from N to D is cD . But this means that the cost along path P, from S to D via N, must be cD + cN = c,
contradicting our selection of P as a path longer than its advertised cost.
There is one final observation to make about route costs: any cost-minimization can occur only within
a single routing domain, where full information about all links is available. If a path traverses multiple
routing domains, each separate routing domain may calculate the optimum path traversing that domain. But
these local minimums do not necessarily add up to a globally minimal path length, particularly when
one domain calculates the minimum cost from one of its routers only to the other domain rather than to a
router within that other domain. Here is a simple example. Routers BR1 and BR2 are the border routers
connecting the domain LD to the left of the vertical dotted line with domain RD to the right. From A to B,
LD will choose the shortest path to RD (not to B, because LD is not likely to have information about links
within RD). This is the path of length 3 through BR2. But this leads to a total path length of 3+8=11 from
A to B; the global minimum path length, however, is 4+1=5, through BR1.
In this example, domains LD and RD join at two points. For a route across two domains joined at only a
single point, the domain-local shortest paths do add up to the globally shortest path.
9.4.1 DSDV
DSDV, or Destination-Sequenced Distance Vector, was proposed in [PB94]. It avoids routing loops by the introduction of sequence numbers: each router will always prefer routes with the most recent sequence number, and bad-news information will always have a lower sequence number than the next cycle of corrected information.
DSDV was originally proposed for MANETs (3.3.8 MANETs) and has some additional features for traffic
minimization that, for simplicity, we ignore here. It is perhaps best suited for wired networks and for small,
relatively stable MANETs.
DSDV forwarding tables contain entries for every other reachable node in the system. One successor of
DSDV, Ad Hoc On-Demand Distance Vector routing or AODV, allows forwarding tables to contain only
those destinations in active use; a mechanism is provided for discovery of routes to newly active destinations.
See [PR99] and RFC 3561.
Under DSDV, each forwarding table entry contains, in addition to the destination, cost and next_hop, the current sequence number for that destination. When neighboring nodes exchange their distance-vector reachability reports, the reports include these per-destination sequence numbers.
When a router R receives a report from neighbor N for destination D, and the report contains a sequence
number larger than the sequence number for D currently in R's forwarding table, then R always updates to
use the new information. The three cost-minimization rules of 9.1.1 Distance-Vector Update Rules above
are used only when the incoming and existing sequence numbers are equal.
Each time a router R sends a report to its neighbors, it includes a new value for its own sequence number,
which it always increments by 2. This number is then entered into each neighbor's forwarding-table entry for R, and is then propagated throughout the network via continuing report exchanges. Any sequence number originating this way will be even, and whenever another node's forwarding-table sequence number for R is even, then its cost for R will be finite.
Infinite-cost reports are generated in the usual way when former neighbors discover they can no longer reach
one another; however, in this case each node increments the sequence number for its former neighbor by 1,
thus generating an odd value. Any forwarding-table entry with infinite cost will thus always have an odd
sequence number. If A and B are neighbors, and A's current sequence number is s, and the A-B link breaks, then B will start reporting A at distance ∞ with sequence number s+1 while A will start reporting its own new sequence number s+2. Any other node now receiving a report originating with B (with sequence number s+1) will mark A as having cost ∞, but will obtain a valid route to A upon receiving a report originating
from A with new (and larger) sequence number s+2.
The triggered-update mechanism is used: if a node receives a report with some destinations newly marked
with infinite cost, it will in turn forward this information immediately to its other neighbors, and so on. This
is, however, not essential; bad and good reports are distinguished by sequence number, not by relative
arrival time.
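A minimal sketch, in Python, of the DSDV acceptance rule described above: a newer sequence number always wins, and the ordinary distance-vector comparison applies only on a tie. The table layout is an assumption made for illustration:

def dsdv_accept(entry, neighbor, reported_cost, reported_seq, link_cost):
    """entry: current (next_hop, cost, seq) for some destination, or None; return the new entry."""
    new_cost = reported_cost + link_cost
    if entry is None or reported_seq > entry[2]:          # newer sequence number always wins
        return (neighbor, new_cost, reported_seq)
    next_hop, cost, seq = entry
    if reported_seq == seq and (new_cost < cost or next_hop == neighbor):
        return (neighbor, new_cost, seq)                  # ordinary distance-vector rules on a tie
    return entry

# An odd sequence number marks unreachability reported after a link break:
print(dsdv_accept(("B", 2, 100), "C", float("inf"), 101, 1))   # ('C', inf, 101)
print(dsdv_accept(("C", float("inf"), 101), "B", 3, 102, 1))   # ('B', 4, 102): a newer even number restores the route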
It is now straightforward to verify that the slow-convergence problem is solved. After a link break, if there is some alternative path from router R to destination D, then R will eventually receive D's latest even sequence number, which will be greater than any sequence number associated with any report listing D as unreachable. If, on the other hand, the break partitioned the network and there is no longer any path to D from R, then the highest sequence number circulating in R's half of the original network will be odd and the associated table entries will all list D at cost ∞. One way or another, the network will quickly settle down to a state where every destination's reachability is accurately described.
In fact, a stronger statement is true: not even transient routing loops are created. We outline a proof. First, whenever router R has next_hop N for a destination D, then N's sequence number for D must be greater than or equal to R's, as R must have obtained its current route to D from one of N's reports. A consequence is that all routers participating in a loop for destination D must have the same (even) sequence number s for D throughout. This means that the loop would have been created even if only the reports with sequence number s were circulating. As we noted in 9.1.1 Distance-Vector Update Rules, any application of the next_hop-increase rule must trace back to a broken link, and thus must involve an odd sequence number. Thus, the loop must have formed from the sequence-number-s reports by the application of the first two rules only. But this violates the claim in Exercise 10.
There is one drawback to DSDV: nodes may sometimes briefly switch to routes that are longer than optimum
(though still correct). This is because a router is required to use the route with the newest sequence number,
even if that route is longer than the existing route. If A and B are two neighbors of router R, and B is closer to destination D but slower to report, then every time D's sequence number is incremented R will receive A's longer route first, and switch to using it, and B's shorter route shortly thereafter.
DSDV implementations usually address this by having each router R keep track of the time interval between
the first arrival at R of a new route to a destination D with a given sequence number, and the arrival of the
best route with that sequence number. During this interval following the arrival of the first report with a new
sequence number, R will use the new route, but will refrain from including the route in the reports it sends
to its neighbors, anticipating that a better route will soon arrive.
This works best when the hopcount cost metric is being used, because in this case the best route is likely
to arrive first (as the news had to travel the fewest hops), and at the very least will arrive soon after the first
route. However, if the network's cost metric is unrelated to the hop count, then the time interval between first-route and best-route arrivals can involve multiple update cycles, and can be substantial.
9.4.2 EIGRP
EIGRP, or the Enhanced Interior Gateway Routing Protocol, is a once-proprietary Cisco distance-vector
protocol that was released as an Internet Draft in February 2013. As with DSDV, it eliminates the risk of
routing loops, even ephemeral ones. It is based on the distributed update algorithm (DUAL) of [JG93].
EIGRP is an actual protocol; we present here only the general algorithm. Our discussion follows [CH99].
Each router R keeps a list of neighbor routers NR , as with any distance-vector algorithm. Each R also
maintains a data structure known (somewhat misleadingly) as its topology table. It contains, for each
destination D and each N in NR , an indication of whether N has reported the ability to reach D and, if so, the
reported cost c(D,N). The router also keeps, for each N in NR , the cost cN of the link from R to N. Finally,
the forwarding-table entry for any destination can be marked passive, meaning safe to use, or active,
meaning updates are in process and the route is temporarily unavailable.
179
Initially, we expect that for each router R and each destination D, Rs next_hop to D in its forwarding table
is the neighbor N for which the following total cost is a minimum:
c(D,N) + cN
Now suppose R receives a distance-vector report from neighbor N1 that it can reach D with cost c(D,N1). This is processed in the usual distance-vector way, unless it represents an increased cost and N1 is R's next_hop to D; this is the third case in 9.1.1 Distance-Vector Update Rules. In this case, let C be R's current cost to D, and let us say that neighbor N of R is a feasible next_hop (feasible successor in Cisco's terminology) if N's cost to D (that is, c(D,N)) is strictly less than C. R then updates its route to D to use the feasible neighbor N for which c(D,N) + cN is a minimum. Note that this may not in fact be the shortest path; it is possible that there is another neighbor M for which c(D,M)+cM is smaller, but c(D,M) ≥ C. However, because N's path to D is loop-free, and because c(D,N) < C, this new path through N must also be loop-free; this is sometimes summarized by the statement one cannot create a loop by adopting a shorter route.
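A minimal sketch, in Python, of the feasible-successor selection just described; the data layout is an assumption made for illustration:

def feasible_successor(current_cost, neighbors):
    """neighbors: list of (name, c_D_N, c_N), where c_D_N is N's reported cost to D
    and c_N is the cost of the R-N link. Return the best feasible next_hop, or None."""
    feasible = [(c_D_N + c_N, name) for name, c_D_N, c_N in neighbors
                if c_D_N < current_cost]            # feasibility: N's own cost to D is < R's current cost
    return min(feasible)[1] if feasible else None   # among feasible neighbors, minimize the total cost

# R's current cost to D is 5; M's total cost is smaller but M is not feasible (c(D,M) >= 5).
print(feasible_successor(5, [("N", 4, 3), ("M", 5, 1)]))   # N
print(feasible_successor(5, [("M", 5, 1)]))                # None: DUAL must be invoked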
If no neighbor N of R is feasible (which would be the case in the D-A-B example of 9.2 Distance-Vector Slow-Convergence Problem), then R invokes the DUAL algorithm. This is sometimes called a diffusion algorithm as it invokes a diffusion-like spread of table changes proceeding away from R.
Let C in this case denote the new cost from R to D as based on N1's report. R marks destination D as active (which suppresses forwarding to D) and sends a special query to each of its neighbors, in the form of a distance-vector report indicating that its cost to D has now increased to C. The algorithm terminates when all R's neighbors reply back with their own distance-vector reports; at that point R marks its entry for D as passive again.
Some neighbors may be able to process R's report without further diffusion to other nodes, remain passive, and reply back to R immediately. However, other neighbors may, like R, now become active and continue the DUAL algorithm. In the process, R may receive other queries that elicit its distance-vector report; as long as R is active it will report its cost to D as C. We omit the argument that this process and thus the
network must eventually converge.
that every router sees every LSP, and also that no LSPs circulate repeatedly. (The acronym LSP is used by
a link-state implementation known as IS-IS; the preferred acronym used by the Open Shortest Path First
(OSPF) implementation is LSA, where A is for advertisement.) LSPs are sent immediately upon link-state
changes, like triggered updates in distance-vector protocols except there is no race between bad news
and good news.
It is possible for ephemeral routing loops to exist; for example, if one router has received a LSP but another
has not, they may have an inconsistent view of the network and thus route to one another. However, as soon
as the LSP has reached all routers involved, the loop should vanish. There are no race conditions, as with
distance-vector routing, that can lead to persistent routing loops.
The link-state flooding algorithm avoids the usual problems of broadcast in the presence of loops by having
each node keep a database of all LSP messages. The originator of each LSP includes its identity, information
about the link that has changed status, and also a sequence number. Other routers need only keep in their
databases the LSP packet with the largest sequence number; older LSPs can be discarded. When a router
receives a LSP, it first checks its database to see if that LSP is old, or is current but has been received before;
in these cases, no further action is taken. If, however, an LSP arrives with a sequence number not seen
before, then in typical broadcast fashion the LSP is retransmitted over all links except the arrival interface.
As an example, consider the following arrangement of routers:
Suppose the A-E link status changes. A sends LSPs to C and B. Both these will forward the LSPs to D; suppose B's arrives first. Then D will forward the LSP to C; the LSP traveling C-to-D and the LSP traveling D-to-C might even cross on the wire. D will ignore the second LSP copy that it receives from C and C will ignore the second copy it receives from D.
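A minimal sketch, in Python, of the flooding rule described above; the data structures and function name are assumptions made for illustration:

lsp_db = {}    # originator -> highest sequence number seen so far

def receive_lsp(originator, seq, arrival_link, links):
    """Process an incoming LSP; return the list of links on which to re-flood it."""
    if originator in lsp_db and seq <= lsp_db[originator]:
        return []                                   # old, or already seen: take no further action
    lsp_db[originator] = seq                        # record the newest LSP from this originator
    return [l for l in links if l != arrival_link]  # re-flood on all links except the arrival link

print(receive_lsp("A", 7, "B", ["B", "C"]))         # ['C']: first copy at D, forwarded to C
print(receive_lsp("A", 7, "C", ["B", "C"]))         # []: the duplicate copy from C is ignored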
It is important that LSP sequence numbers not wrap around. (Protocols that do allow a numeric field to wrap around usually have a clear-cut idea of the active range that can be used to conclude that the numbering has wrapped rather than restarted; this is harder to do in the link-state context.) OSPF uses lollipop sequence-numbering here: sequence numbers begin at -2^31 and increment to 2^31-1. At this point they wrap around back to 0. Thus, as long as a sequence number is less than zero, it is guaranteed unique; at the same time, routing will not cease if more than 2^31 updates are needed. Other link-state implementations use 64-bit sequence numbers.
Actual link-state implementations often give link-state records a maximum lifetime; entries must be periodically renewed.
We start with current = A. At the end of the first stage, B,B,3 is moved into R, T is {D,D,12}, and
current is B. The second stage adds C,B,5 to T, and then moves this to R; current then becomes C. The
third stage introduces the route (from A) D,B,10; this is an improvement over D,D,12 and so replaces it
in T; at the end of the stage this route to D is moved to R.
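This walk-through follows the shortest-path-first (Dijkstra) computation with its sets R (finalized routes) and T (tentative routes). Here is a minimal sketch in Python; the adjacency dictionary is a hypothetical network chosen so that the stages match those described above:

def spf(graph, source):
    """graph: node -> {neighbor: link cost}. Returns R: dest -> (next_hop, cost)."""
    R = {}                                       # finalized routes
    T = {}                                       # tentative routes: dest -> (next_hop, cost)
    current, current_hop, current_cost = source, None, 0
    while True:
        for nbr, w in graph[current].items():
            if nbr in R or nbr == source:
                continue
            hop = nbr if current == source else current_hop
            cost = current_cost + w
            if nbr not in T or cost < T[nbr][1]:
                T[nbr] = (hop, cost)             # new or improved tentative route
        if not T:
            return R
        best = min(T, key=lambda d: T[d][1])     # move the cheapest tentative route into R
        R[best] = T.pop(best)
        current = best
        current_hop, current_cost = R[best]

graph = {"A": {"B": 3, "D": 12}, "B": {"A": 3, "C": 2},
         "C": {"B": 2, "D": 5}, "D": {"A": 12, "C": 5}}
print(spf(graph, "A"))   # {'B': ('B', 3), 'C': ('B', 5), 'D': ('B', 10)}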
A link-state source node S computes the entire path to a destination D. But as far as the actual path that a
packet sent by S will take to D, S has direct control only as far as the first hop N. While the accurate-cost
rule we considered in distance-vector routing will still hold, the actual path taken by the packet may differ
from the path computed at the source, in the presence of alternative paths of the same length. For example, S may calculate a path S-N-A-D, and yet a packet may take path S-N-B-D, so long as the N-A-D and N-B-D paths have the same length.
Link-state routing allows calculation of routes on demand (results are then cached), or larger-scale calculation. Link-state also allows routes calculated with quality-of-service taken into account, via straightforward
extension of the algorithm above.
routers knew nothing about. Instead, the forwarding table is split up into multiple dest, next_hop (or dest,
QoS, next_hop) tables. One of these tables is the main table, and is the table that is updated by routingupdate protocols interacting with neighbors. Before a packet is forwarded, administratively supplied rules
are consulted to determine which table to apply; these rules are allowed to consult other packet attributes.
The collection of tables and rules is known as the routing policy database.
As a simple example, in the situation above the main table would have an entry default, L1 (more precisely,
it would have the IP address of the far end of the L1 link instead of L1 itself). There would also be another
table, perhaps named slow, with a single entry default, L2. If a rule is created to have a packet routed
using the slow table, then that packet will be forwarded via L2. Here is one such linux rule, applying to
traffic from host 10.0.0.17:
ip rule add from 10.0.0.17 table slow
Now suppose we want to route traffic to port 25 (the SMTP port) via L2. This is harder; linux provides no
support here for routing based on port numbers. However, we can instead use the iptables mechanism to
mark all packets destined for port 25, and then create a routing-policy rule to have such marked traffic
use the slow table. The mark is known as the forwarding mark, or fwmark; its value is 0 by default. The
fwmark is not actually part of the packet; it is associated with the packet only while the latter remains
within the kernel.
iptables --table mangle --append PREROUTING \
--protocol tcp --destination-port 25 --jump MARK --set-mark 1
ip rule add fwmark 1 table slow
9.7 Epilog
At this point we have concluded the basics of IP routing, involving routing within large (relatively) homogeneous organizations such as multi-site corporations or Internet Service Providers. Every router involved
must agree to run the same protocol, and must agree to a uniform assignment of link costs.
At the very largest scales, these requirements are impractical. The next chapter is devoted to this issue of
very-large-scale IP routing, eg on the global Internet.
9.8 Exercises
1. Suppose the network is as follows, where distance-vector routing update is used. Each link has cost 1,
and each router has entries in its forwarding table only for its immediate neighbors (so A's table contains B,B,1, D,D,1 and B's table contains A,A,1, C,C,1).
(figure omitted: the network diagram for this exercise)
(a). Suppose each node creates a report from its initial configuration and sends that to each of its neighbors. What will each node's forwarding table be after this set of exchanges? The exchanges, in other words, are all conducted simultaneously; each node first sends out its own report and then processes the reports arriving from its two neighbors.
(b). What will each node's table be after the simultaneous-and-parallel exchange process of part (a) is repeated a second time?
Hint: you do not have to go through each exchange in detail; the only information added by an exchange is
additional reachability information.
2. Now suppose the configuration of routers has the link weights shown below.
(figure omitted: the routers of exercise 1 with the link weights shown, including a link of cost 2 and a link of cost 12)
(a). As in the previous exercise, give each node's forwarding table after each node exchanges with its immediate neighbors simultaneously and in parallel.
(b). How many iterations of such parallel exchanges will it take before C learns to reach F via B; that is,
before it creates the entry F,B,11? Count the answer to part (a) as the first iteration.
3. A router R has the following distance-vector forwarding table:

destination   cost   next_hop
A             5      R1
B             6      R1
C             7      R2
D             8      R2
E             9      R3
R now receives the following report from R1; the cost of the R-R1 link is 1.
destination   cost
A             4
B             7
C             7
D             6
E             8
F             8
Give R's updated table after it processes R1's report. For each entry that changes, give a brief explanation, in the style of 9.1.5 Example 4.
3.5. At the start of Example 3 (9.1.4 Example 3), we changed C's routing table so that it reached D via A instead of via E: C's entry D,E,2 was changed to D,A,2. This meant that C had a valid route to D at the start.
How might the scenario of Example 3 play out if Cs table had not been altered? Give a sequence of reports
that leads to correct routing between D and E.
4. In the following exercise, A-D are routers and the attached networks N1-N6, which are the ultimate
destinations, are shown explicitly. Routers still exchange distance-vector reports with neighboring routers,
as usual. If a router has a direct connection to a network, you may report the next_hop as direct, eg, from A's table, N1,direct,0.
(figure omitted: routers A through D with attached networks N1 through N6)
(a). Give the initial tables for A through D, before any distance-vector exchanges.
(b). Give the tables after each router A-D exchanges with its immediate neighbors simultaneously and in
parallel.
(c). At the end of (b), what networks are not known by what routers?
5. Suppose A, B, C, D and E are connected as follows. Each link has cost 1, and so each forwarding table is uniquely determined; B's table is A,A,1, C,C,1, D,A,2, E,C,2. Distance-vector routing update is used.
Now suppose the D-E link fails, and so D updates its entry for E to E,-,∞.
6. Consider the network in 9.2.1.1 Split Horizon:, using distance-vector routing updates.
(figure omitted: the network of 9.2.1.1 Split Horizon, in which two routers each have the entry D,A,2)
(a). What reports (a pair should suffice) will lead to the formation of the routing loop?
(b). What (single) report will eliminate the possibility of the routing loop?
7. Suppose the network of 9.2 Distance-Vector Slow-Convergence Problem is changed to the following.
Distance-vector update is used; again, the AD link breaks.
(figure omitted: in the modified network, A has entry D,D,1 and B has entry D,E,2)
(a). Explain why B's report back to A, after A reports D,-,∞, is now valid.
(b). Explain why hold down (9.2.1.3 Hold Down) will delay the use of the new route ABED.
8. Suppose the routers are A, B, C, D, E and F, and all link costs are 1. The distance-vector forwarding
tables for A and F are below. Give the network with the fewest links that is consistent with these tables.
Hint: any destination reached at cost 1 is directly connected; if X reaches Y via Z at cost 2, then Z and Y
must be directly connected.
A's table

dest   cost   next_hop
B      1      B
C      1      C
D      2      C
E      2      C
F      3      B

F's table

dest   cost   next_hop
A      3      E
B      2      D
C      2      D
D      1      D
E      1      E
9. (a) Suppose routers A and B somehow end up with respective forwarding-table entries D,B,n and
D,A,m, thus creating a routing loop. Explain why the loop may be removed more quickly if A and B both
use poison reverse with split horizon, versus if A and B use split horizon only.
(b). Suppose the network looks like the following. The AB link is extremely slow.
(figure omitted: the network diagram for this exercise)
Suppose A and B send reports to each other advertising their routes to D, and immediately afterwards the
C-D link breaks and C reports to A and B that D is unreachable. After those unreachability reports are processed, A and B's reports sent to each other before the break finally arrive. Explain why the network is
now in the state described in part (a).
10. Suppose the distance-vector algorithm is run on a network and no links break (so by the last paragraph
of 9.1.1 Distance-Vector Update Rules the next_hop-increase rule is never applied).
(a). Prove that whenever A is B's next_hop to destination D, then A's cost to D is strictly less than B's.
Hint: show that, if this claim is true at some point, then it remains true after any application of the rules in 9.1.1 Distance-Vector Update Rules. If the lower-cost rule is applied to B after receiving a report from A, resulting in a change to B's cost to D, then one needs to show A's cost is less than B's, and also B's new cost is less than that of any neighbor C that uses B as its next_hop to D.
(b). Use (a) to prove that no routing loops ever form.
11. Give a scenario illustrating how a (very temporary!) routing loop might form in link-state routing.
12. Use the Shortest Path First algorithm to find the shortest path from A to E in the network below. Show
the sets R and T, and the node current, after each step.
13. Suppose you take a laptop, plug it into an Ethernet LAN, and connect to the same LAN via Wi-Fi. From
laptop to LAN there are now two routes. Which route will be preferred? How can you tell which way traffic
is flowing? How can you configure your OS to prefer one path or another?
10 LARGE-SCALE IP ROUTING
In the previous chapter we considered two classes of routing-update algorithms: distance-vector and link-state. Each of these approaches requires that participating routers have agreed not just to a common protocol,
but also to a common understanding of how link costs are to be assigned. We will address this further below
in 10.6 Border Gateway Protocol, BGP, but the basic problem is that if one site prefers the hop-count
approach, assigning every link a cost of 1, while another site prefers to assign link costs in proportion to
their bandwidth, then path cost comparisons between the two sites simply cannot be done. In general, we
cannot even translate costs from one site to another, because the paths themselves depend on the cost
assignment strategy.
The term routing domain is used to refer to a set of routers under common administration, using a common
link-cost assignment. Another term for this is autonomous system. While use of a common routing-update protocol within the routing domain is not an absolute requirement (for example, some subnets may internally use distance-vector while the site's backbone routers use link-state), we can assume that all routers have a uniform view of the site's topology and cost metrics.
One of the things included in the term large-scale IP routing is the coordination of routing between multiple routing domains. Even in the earliest Internet there were multiple routing domains, if for no other
reason than that how to measure link costs was (and still is) too unsettled to set in stone. However, another
component of large-scale routing is support for hierarchical routing, above the level of subnets; we turn to
this next.
By the year 2000, CIDR had essentially eliminated the Class A/B/C mechanism from the backbone Internet,
and had more-or-less completely changed how backbone routing worked. You purchased an address block
from a provider or some other IP address allocator, and it could be whatever size you needed, from /27 to
/15.
What CIDR enabled is IP routing based on an address prefix of any length; the Class A/B/C mechanism of
course used fixed prefix lengths of 8, 16 and 24 bits. Furthermore, CIDR allows different routers, at different
levels of the backbone, to route on prefixes of different lengths.
CIDR was formally introduced by RFC 1518 and RFC 1519. For a while there were strategies in place to
support compatibility with non-CIDR-aware routers; these are now obsolete. In particular, it is no longer
appropriate for large-scale routers to fall back on the Class A/B/C mechanism in the absence of CIDR
information; if the latter is missing, the routing should fail.
The basic strategy of CIDR is to consolidate multiple networks going to the same destination into a single
entry. Suppose a router has four class Cs all to the same destination:
200.7.0.0/24 foo
200.7.1.0/24 foo
200.7.2.0/24 foo
200.7.3.0/24 foo
The router can replace all these with the single entry
200.7.0.0/22 foo
It does not matter here if foo represents a single ultimate destination or if it represents four sites that just
happen to be routed to the same next_hop.
It is worth looking closely at the arithmetic to see why the single entry uses /22. This means that the first 22
bits must match 200.7.0.0; this is all of the first and second bytes and the first six bits of the third byte. Let
us look at the third byte of the network addresses above in binary:
200.7.000000 00.0/24 foo
200.7.000000 01.0/24 foo
200.7.000000 10.0/24 foo
200.7.000000 11.0/24 foo
The /24 means that the network addresses stop at the end of the third byte. The four entries above cover
every possible combination of the last two bits of the third byte; for an address to match one of the entries
above it suffices to begin 200.7 and then to have 0-bits as the first six bits of the third byte. This is another
way of saying the address must match 200.7.0.0/22.
Most implementations actually use a bitmask, eg FF.FF.FC.00 (in hex) rather than the number 22; note 0xFC
= 1111 1100 with 6 leading 1-bits, so FF.FF.FC.00 has 8+8+6=22 1-bits followed by 10 0-bits.
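As a concrete check of this arithmetic, here is a small Java sketch (an illustration, not part of the text's code) that converts a prefix length to its bitmask and verifies that each of the four /24 networks above lies within 200.7.0.0/22:

public class CidrMaskDemo {
    // convert a prefix length (0..32) to a 32-bit netmask, eg 22 -> 0xFFFFFC00 (FF.FF.FC.00)
    static int maskOf(int prefixLen) {
        return prefixLen == 0 ? 0 : 0xFFFFFFFF << (32 - prefixLen);
    }

    // pack four dotted-decimal bytes into one 32-bit value
    static int addr(int b1, int b2, int b3, int b4) {
        return (b1 << 24) | (b2 << 16) | (b3 << 8) | b4;
    }

    public static void main(String[] args) {
        int prefix = addr(200, 7, 0, 0);   // 200.7.0.0
        int mask   = maskOf(22);           // 22 leading 1-bits followed by 10 0-bits
        // the four /24 networks being aggregated
        int[] nets = { addr(200,7,0,0), addr(200,7,1,0), addr(200,7,2,0), addr(200,7,3,0) };
        for (int n : nets) {
            // each should satisfy (n & mask) == prefix, ie lie within 200.7.0.0/22
            System.out.println(Integer.toHexString(n) + " in 200.7.0.0/22: " + ((n & mask) == prefix));
        }
    }
}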
The IP delivery algorithm of 7.5 The Classless IP Delivery Algorithm still works with CIDR, with the understanding that the router's forwarding table can now have a network-prefix length associated with any entry. Given a destination D, we search the forwarding table for network-prefix destinations B/k until we find a match; that is, equality of the first k bits. In terms of masks, given a destination D and a list of table entries ⟨prefix,mask⟩ = ⟨B[i],M[i]⟩, we search for i such that (D & M[i]) = B[i].
It is possible to have multiple matches, and responsibility for avoiding this is much too distributed to be declared illegal by IETF mandate. Instead, CIDR introduced the longest-match rule: if destination D matches both B1/k1 and B2/k2, with k1 < k2, then the longer match B2/k2 is to be used. (Note that if D matches two distinct entries B1/k1 and B2/k2 then either k1 < k2 or k2 < k1.)
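The longest-match rule itself is easy to express in code. The following sketch is illustrative only (the Entry class and its field names are assumptions made for the example, not from any standard library); it scans a list of ⟨prefix, prefix-length, next_hop⟩ entries and returns the next_hop of the longest matching prefix:

import java.util.*;

public class LongestMatch {
    static class Entry {
        int prefix;       // network prefix B, as a 32-bit value
        int len;          // prefix length k
        String nextHop;
        Entry(int prefix, int len, String nextHop) {
            this.prefix = prefix; this.len = len; this.nextHop = nextHop;
        }
        int mask()                { return len == 0 ? 0 : 0xFFFFFFFF << (32 - len); }
        boolean matches(int dest) { return (dest & mask()) == prefix; }
    }

    // return the next_hop of the longest matching entry, or null if nothing matches
    static String lookup(List<Entry> table, int dest) {
        Entry best = null;
        for (Entry e : table)
            if (e.matches(dest) && (best == null || e.len > best.len))
                best = e;
        return best == null ? null : best.nextHop;
    }

    public static void main(String[] args) {
        List<Entry> table = Arrays.asList(
            new Entry(200 << 24, 8, "P0"),                              // 200.0.0.0/8
            new Entry((200 << 24) | (2 << 16) | (16 << 8), 20, "P1"));  // 200.2.16.0/20
        int dest = (200 << 24) | (2 << 16) | (16 << 8) | 37;            // 200.2.16.37
        System.out.println(lookup(table, dest));   // both entries match; /20 is longer, so P1
    }
}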
D: 201.0.0.0/16
E: 201.1.0.0/16
F: 202.0.0.0/16
G: 202.1.0.0/16
The routing model is that packets are first routed to the appropriate provider, and then to the customer.
While this model may not in general guarantee the shortest end-to-end path, it does in this case because
each provider has a single point of interconnection to the others. Here is the network diagram:
With this diagram, P0's forwarding table looks something like this:

destination      next_hop
200.0.0.0/16     A
200.1.0.0/16     B
200.2.16.0/20    C
201.0.0.0/8      P1
202.0.0.0/8      P2
P1's forwarding table, after C switches its connection to P1 while keeping its 200.2.16.0/20 address block, is then:

destination      next_hop
200.0.0.0/8      P0
202.0.0.0/8      P2
201.0.0.0/16     D
201.1.0.0/16     E
200.2.16.0/20    C
This does work, but all C's inbound traffic except for that originating in P1 will now be routed through C's ex-provider P0, which as an ex-provider may not be on the best of terms with C. Also, the routing is inefficient: C's traffic from P2 is routed P2→P0→P1 instead of the more direct P2→P1.
A better solution is for all providers other than P1 to add the route ⟨200.2.16.0/20, P1⟩. While traffic to 200.0.0.0/8 otherwise goes to P0, this particular sub-block is instead routed by each provider to P1. The important case here is P2, as a stand-in for all other providers and their routers: P2 routes 200.0.0.0/8 traffic to P0 except for the block 200.2.16.0/20, which goes to P1.
Having every other provider in the world need to add an entry for C is going to cost some money, and, one
way or another, C will be the one to pay. But at least there is a choice: C can consent to renumbering (which
is not difficult if they have been diligent in using DHCP and perhaps NAT too), or they can pay to keep their
old address block.
As for the second diagram above, with the various private links (shown as dashed lines), it is likely that the longest-match rule is not needed for these links to work. A's private link to P1 might only mean that
A can send outbound traffic via P1
P1 forwards A's traffic to A via the private link
P2, in other words, is still free to route to A via P0. P1 may not advertise its route to A to anyone else.
The globally shortest path between A and B is via the r2-s2 crossover, with total length 6+1+5=12. However, traffic from A to B will be routed by P1 to its closest crossover to P2, namely the r3-s3 link. The total path is 2+1+8+5=16. Traffic from B to A will be routed by P2 via the r1-s1 crossover, for a length of 2+1+7+6=16.
This routing strategy is sometimes called hot-potato routing; each provider tries to get rid of any traffic (the
potatoes) as quickly as possible, by routing to the closest exit point.
Not only are the paths taken inefficient, but the A→B and B→A paths are now asymmetric. This can be a problem if forward and reverse timings are critical, or if one of P1 or P2 has significantly more bandwidth or less congestion than the other. In practice, however, route asymmetry is of little consequence.
As for the route inefficiency itself, this also is not necessarily a significant problem; the primary reason
routing-update algorithms focus on the shortest path is to guarantee that all computed paths are loop-free.
As long as each half of a path is loop-free, and the halves do not intersect except at their common midpoint,
these paths too will be loop-free.
The BGP MED value (10.6.5.3 MULTI_EXIT_DISC) offers an optional mechanism for P1 to agree that A→B traffic should take the r1-s1 crossover. This might be desired if P1's network were better and customer A was willing to pay extra to keep its traffic within P1's network as long as possible.
than other (shorter) paths. It is much easier to string terrestrial cable than undersea cable. However, within
a continent physical distance does not always matter as much as might be supposed. Furthermore, a large
geographically spread-out provider can always divide up its address blocks by region, allowing internal
geographical routing to the correct region.
Here is a diagram of IP address allocation as of 2006: http://xkcd.com/195.
The BGP speakers must maintain a database of all routes received, not just of the routes actually used.
However, the speakers exchange with neighbors only the routes they (and thus their AS) use themselves;
this is a firm BGP rule.
The current BGP standard is RFC 4271.
10.6.1 AS-paths
At its most basic level, BGP involves the exchange of lists of reachable destinations, like distance-vector
routing without the distance information. But that strategy, alone, cannot avoid routing loops. BGP solves
the loop problem by having routers exchange, not just destination information, but also the entire path used
to reach each destination. Paths including each router would be too cumbersome; instead, BGP abbreviates
the path to the list of ASs traversed; this is called the AS-path. This allows routers to make sure their routes
do not traverse any AS more than once, and thus do not have loops.
As an example of this, consider the network below, in which we consider Autonomous Systems also to be destinations. Initially, we will assume that each AS discovers its immediate neighbors. AS3 and AS5 will then each advertise to AS4 their routes to AS2, but AS4 will have no reason at this level to prefer one route to the other (BGP does use the shortest AS-path as part of its tie-breaking rule, but, before falling back on that rule, AS4 is likely to have a commercial preference for which of AS3 and AS5 it uses to reach AS2).

Also, AS2 will advertise to AS3 its route to reach AS1; that advertisement will contain the AS-path ⟨AS2,AS1⟩. Similarly, AS3 will advertise this route to AS4 and then AS4 will advertise it to AS5. When AS5 in turn advertises this AS1-route to AS2, it will include the entire AS-path ⟨AS5,AS4,AS3,AS2,AS1⟩, and AS2 would know not to use this route because it would see that it is a member of the AS-path. Thus, BGP is spared the kind of slow-convergence problem that traditional distance-vector approaches were subject to.
It is theoretically possible that the shortest path (in the sense, say, of the hopcount metric) from one host to
another traverses some AS twice. If so, BGP will not allow this route.
AS-paths potentially add considerably to the size of the AS database. The number of paths a site must keep
track of is proportional to the number of ASs, because there will be one AS-path to each destination AS.
(Actually, an AS may have to record many times that many AS-paths, as an AS may hear of AS-paths that
it elects not to use.) Typically there are several thousand ASs in the world. Let A be the number of ASs.
Typically the average length of an AS-path is about log(A), although this depends on connectivity. The
amount of memory required by BGP is
C×A×log(A) + K×N,
where C and K are constants.
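As a rough numeric illustration (the figures here are hypothetical, chosen only to show the shape of the formula): with A = 10,000 ASs and log₂(10,000) ≈ 13, the C×A×log(A) term corresponds to storing on the order of 130,000 AS-hops' worth of path data, while the K×N term grows linearly in N, which we may take to be the number of distinct destination prefixes carried.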
The other major goal of BGP is to allow some degree of administrative input to what, for interior routing, is largely a technical calculation (though an interior-routing administrator can set link costs). BGP is the interface between large ISPs, and can be used to implement contractual agreements made regarding which ISPs will carry other ISPs' traffic. If ISP2 tells ISP1 it has a good route to destination D, but ISP1 chooses not to send traffic to ISP2, BGP can be used to implement this.
Despite the exchange of AS-path information, temporary routing loops may still exist. This is because BGP
may first decide to use a route and only then export the new AS-path; the AS on the other side may realize
there is a problem as soon as the AS-path is received but by then the loop will have at least briefly been in
existence. See the first example below in 10.6.8 Examples of BGP Instability.
BGPs predecessor was EGP, which guaranteed loop-free routes by allowing only a single route to any AS,
thus forcing the Internet into a tree topology, at least at the level of Autonomous Systems. The AS graph
could contain no cycles or alternative routes, and hence there could be no redundancy provided by alternative
paths. EGP also thus avoided having to make decisions as to the preferred path; there was never more than
one choice. EGP was sometimes described as a reachability protocol; its only concern was whether a given
network was reachable.
AS-sequence=AS2
AS-set={AS3,AS4}
AS2 thus both achieves the desired aggregation and also accurately reports the AS-path length.
The AS-path can in general be an arbitrary list of AS-sequence and AS-set parts, but in cases of simple
aggregation such as the example here, there will be one AS-sequence followed by one AS-set.
RFC 6472 now recommends against using AS-sets entirely, and recommends that aggregation as above be
avoided.
As an example of import filtering, a site might elect to ignore all routes from a particular neighbor, or to
ignore all routes whose AS-path contains a particular AS, or to ignore temporarily all routes from a neighbor
that has demonstrated too much recent route instability (that is, rapidly changing routes). Import filtering
can also be done in the best-path-selection stage. Finally, while it is not commonly useful, import filtering can involve rather strange criteria; for example, in 10.6.8 Examples of BGP Instability we will consider examples where AS1 prefers routes with AS-path ⟨AS3,AS2⟩ to the strictly shorter path ⟨AS2⟩.
The next stage is best-path selection, for which the first step is to eliminate AS-paths with loops. Even if the
neighbors have been diligent in not advertising paths with loops, an AS will still need to reject routes that
contain itself in the associated AS-path.
The next step in the best-path-selection stage, generally the most important in BGP configuration, is to assign
a local_preference, or weight, to each route received. An AS may have policies that add a certain amount to
the local_preference for routes that use a certain AS, etc. Very commonly, larger sites will have preferences
based on contractual arrangements with particular neighbors. Provider ASs, for example, will in general
prefer routes that go through their customers, as these are cheaper. A smaller ISP that connects to two
or more larger ones might be paying to route almost all its outbound traffic through a particular one of the
two; its local_preference values will then implement this choice. After BGP calculates the local_preference
value for every route, the routes with the best local_preference are then selected.
Domains are free to choose their local_preference rules however they wish. Some choices may lead to
instability, below, so domains are encouraged to set their rules in accordance with some standard principles,
also below.
In the event of ties (two routes to the same destination with the same local_preference), a first tie-breaker rule is to prefer the route with the shorter AS-path. While this superficially resembles a shortest-path algorithm, the real work should have been done in administratively assigning local_preference values.
Local_preference values are communicated internally via the LOCAL_PREF path attribute, below. They
are not shared with other Autonomous Systems.
The final significant step of the route-selection phase is to apply the Multi_Exit_Discriminator value; we
postpone this until below. A site may very well choose to ignore this value entirely. There may then be
additional trivial tie-breaker rules; note that if a tie-breaker rule assigns significant traffic to one AS over
another, then it may have significant economic consequences and shouldn't be considered trivial. If this
situation is detected, it would probably be addressed in the local-preferences phase.
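The selection order described above (discard looped AS-paths, then highest local_preference, then shortest AS-path, then optionally MED) can be summarized in a few lines of code. The sketch below is an illustration only; the Route fields, the honorMED flag, and the example AS numbers are assumptions made for the example, not part of any real BGP implementation:

import java.util.*;

public class BgpSelectSketch {
    static class Route {
        List<Integer> asPath;   // AS numbers in the AS-path, nearest first
        int localPref;          // higher is better
        int med;                // lower is better, if honored at all
        Route(List<Integer> asPath, int localPref, int med) {
            this.asPath = asPath; this.localPref = localPref; this.med = med;
        }
    }

    // pick the best route to one destination, among routes that survived import filtering
    static Route bestPath(List<Route> routes, int myAS, boolean honorMED) {
        Route best = null;
        for (Route r : routes) {
            if (r.asPath.contains(myAS)) continue;              // step 1: drop looped AS-paths
            if (best == null) { best = r; continue; }
            if (r.localPref != best.localPref) {                 // step 2: highest local_preference
                if (r.localPref > best.localPref) best = r;
            } else if (r.asPath.size() != best.asPath.size()) {  // step 3: shorter AS-path
                if (r.asPath.size() < best.asPath.size()) best = r;
            } else if (honorMED && r.med < best.med) {           // step 4: MED, only if configured
                best = r;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Route r1 = new Route(Arrays.asList(64512, 64500), 100, 0);  // longer path, higher local_pref
        Route r2 = new Route(Arrays.asList(64513), 90, 0);          // shorter path, lower local_pref
        Route best = bestPath(Arrays.asList(r1, r2), 64496, false);
        System.out.println(best == r1);   // true: local_preference outranks AS-path length
    }
}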
After the best-path-selection stage is complete, the BGP speaker has now selected the routes it will use. The final stage is to decide what routes will be exported to which neighbors. Only routes the BGP speaker will use (that is, routes that have made it to this point) can be exported; a site cannot route to destination D through AS1 but export a route claiming D can be reached through AS2.
It is at the export-filtering stage that an AS can enforce no-transit rules. If it does not wish to carry transit
traffic to destination D, it will not advertise D to any of its AS-neighbors.
The export stage can lead to anomalies. Suppose, for example, that AS1 reaches D and AS5 via AS2, and
announces this to AS4.
Later AS1 switches to reaching D via AS3, but AS1 is forbidden by policy to announce AS3-paths to AS4. Then AS1 must simply withdraw the announcement to AS4 that it could reach D at all, even though the route to D via AS2 is still there.
10.6.5.2 LOCAL_PREF
If one BGP speaker in an AS has been configured with local_preference values, used in the best-path-selection phase above, it uses the LOCAL_PREF path attribute to share those preferences with all other BGP speakers at a site.
10.6.5.3 MULTI_EXIT_DISC
The Multi-Exit Discriminator, or MED, attribute allows one AS to learn something of the internal structure
of another AS, should it elect to do so. Using the MED information provided by a neighbor has the potential
to cause an AS to incur higher costs, as it may end up carrying traffic for longer distances internally; MED
values received from a neighboring AS are therefore only recognized when there is an explicit administrative
decision to do so.
Specifically, if an autonomous system AS1 has multiple links to neighbor AS2, then AS1 can, when advertising an internal destination D to AS2, have each of its BGP speakers provide associated MED values so
that AS2 can know which link AS1 would prefer that AS2 use to reach D. This allows AS2 to route traffic
to D so that it is carried primarily by AS2 rather than by AS1. The alternative is for AS2 to use only the
closest gateway to AS1, which means traffic is likely carried primarily by AS1.
MED values are considered late in the best-path-selection process; in this sense the use of MED values is a
tie-breaker when two routes have the same local_preference.
As an example, consider the following network (from 10.4.3 Provider-Based Hierarchical Routing, with
providers now replaced by Autonomous Systems); the numeric values on links are their relative costs. We
will assume that border routers R1, R2 and R3 are also AS1's BGP speakers.
In the absence of the MED, AS1 will send traffic from A to B via the R3-S3 link, and AS2 will return the traffic via S1-R1. These are the links that are closest to R and S, respectively, representing AS1's and AS2's desire to hand off the outbound traffic as quickly as possible.
However, AS1's R1, R2 and R3 can provide MED values to AS2 when advertising destination A, indicating a preference for AS2→AS1 traffic to use the rightmost link:
R1: destination A has MED 200
R2: destination A has MED 150
the effect of allowing the original AS to configure itself without involving the receiving AS in the process. Communities are often used, for example, by (large) customers of an ISP to request specific routing
treatment.
A customer would have to find out from the provider what communities the provider defines, and what their
numeric codes are. At that point the customer can place itself into the provider's community at will.
Here are some of the community values once supported by a no-longer-extant ISP that we shall call AS1.
The full community value would have included AS1's AS-number.
value    action
90       set local_preference used by AS1 to 90
100      set local_preference used by AS1 to 100, the default
105      set local_preference used by AS1 to 105
110      set local_preference used by AS1 to 110
990      the route will not leave AS1's domain; equivalent to NO_EXPORT
991      route will only be exported to AS1's other customers
If A and B fully advertise link1, by exporting to their respective ISPs routes to each other, then ISP1 (paid by A) may end up carrying much of B's traffic or ISP2 (paid by B) may end up carrying much of A's traffic. Economically, these options are not desirable unless fully agreed to by both parties. The primary issue here is the use of the ISP1-A link by B, and the ISP2-B link by A; use of the shared link1 might be a secondary issue depending on the relative bandwidths and A and B's understandings of appropriate uses for link1.
Three common options A and B might agree to regarding link1 are no-transit, backup, and load-balancing.
For the no-transit option, A and B simply do not export the route to their respective ISPs at all. This is done
via export filtering. If ISP1 does not know A can reach B, it will not send any of Bs traffic to A.
For the backup option, the intent is that traffic to A will normally arrive via ISP1, but if the ISP1 link is down then A's traffic will be allowed to travel through ISP2 and B. To achieve this, A and B can export their link1 route to each other, but arrange for ISP1 and ISP2 respectively to assign this route a low local_preference value. As long as ISP1 hears of a route to B from its upstream provider, it will reach B that way, and will not advertise the existence of the link1 route to B; ditto ISP2. However, if the ISP2 route to B fails, then ISP1's upstream provider will stop advertising any route to B, and so ISP1 will begin to use the link1 route to B and begin advertising it to the Internet. The link1 route will be the primary route to B until ISP2's service is restored.
A and B must convince their respective ISPs to assign the link1 route a low local_preference; they cannot
mandate this directly. However, if their ISPs recognize community attributes that, as above, allow customers
to influence their local_preference value, then A and B can use this to create the desired local_preference.
For outbound traffic, A and B will need a way to send through one another if their own ISP link is down. One approach is to consider their default-route path (eg to 0.0.0.0/0) to be a concrete destination within BGP. ISP1 advertises this to A, using A's interior routing protocol, but so does B, and A has configured things so B's route has a higher cost. Then A will route to 0.0.0.0/0 through ISP1 (that is, will use ISP1 as its default route) as long as it is available, and will switch to B when it is not.
For inbound load balancing, there is no easy fix, in that if ISP1 and ISP2 both export routes to A, then A has
lost all control over how other sites will prefer one to the other. A may be able to make one path artificially
appear more expensive, and keep tweaking this cost until the inbound loads are comparable. Outbound
load-balancing is up to A and B.
Another basic policy question is which of the two available paths site (or regional AS) A uses to reach site
D, in the following diagram. B and C are Autonomous Systems.
How can A express preference for B over C, assuming B and C both advertise to A their routes to D?
Generally A will use a local_preference setting to make the carrier decision for A→D traffic, though it is D that makes the decision for the D→A traffic. It is possible (though not customary) for one of the transit providers to advertise to A that it can reach D, but not advertise to D that it can reach A.
Here is a similar diagram, showing two transit-providing Autonomous Systems B and C connecting at
Internet exchange points IXP1 and IXP2.
B and C each have routers within each IXP. B would probably like to make sure C does not attempt to save on its long-haul transit costs by forwarding A→D traffic over to B at IXP1, and D→A traffic over to B at IXP2. B can avoid this problem by not advertising to C that it can reach A and D. In general, transit
providers are often quite careful about advertising reachability to any other AS for whom they do not intend
to provide transit service, because to do so may implicitly mean getting stuck with that traffic.
If B and C were both to try to get away with this, a routing loop would be created within IXP1! But in that case in B's next advertisement to C at IXP1, B would state that it reaches D via AS-path ⟨C⟩ (or ⟨C,D⟩
if D were a full-fledged AS), and C would do similarly; the loop would not continue for long.
Following these rules creates a simplified BGP world. Special cases for special situations have the potential
to introduce non-convergence or instability.
The so-called tier-1 providers are those that are not customers of anyone; these represent the top-level
backbone providers. Each tier-1 AS must, as a rule, peer with every other tier-1 AS.
A consequence of the use of the above classification and attendant export rules is the no-valley theorem [LG01]: if every AS has BGP policies consistent with the scheme above, then when we consider the full AS-path from A to B, there is at most one peer-peer link. Those to the left of the peer-peer link are (moving from left to right) either customer→provider links or sibling→sibling links; that is, they are non-downwards (ie upwards or level). To the right of the peer-peer link, we see provider→customer or sibling→sibling links; that is, these are non-upwards. If there is no peer-peer link, then we can still divide the AS-path into a non-downwards first part and a non-upwards second part.
The above constraints are not quite sufficient to guarantee convergence of the BGP system to a stable set
of routes. To ensure convergence in the case without sibling relationships, it is shown in [GR01] that the
following simple local_preference rule suffices:
If AS1 gets two routes r1 and r2 to a destination D, and the first AS of the r1 route is a customer
of AS1, and the first AS of r2 is not, then r1 will be assigned a higher local_preference value
than r2.
More complex rules exist that allow for cases when the local_preference values can be equal; one such rule
states that strict inequality is only required when r2 is a provider route. Other straightforward rules handle
the case of sibling relationships, eg by requiring that siblings have local_preference rules consistent with the
use of their shared connection only for backup.
As a practical matter, unstable BGP arrangements appear rare on the Internet; most actual relationships and
configurations are consistent with the rules above.
That is, ⟨AS2,AS0⟩ is preferred to the direct path ⟨AS0⟩ (one way to express this preference might be "prefer routes for which the AS-PATH begins with AS2"; perhaps the AS1-AS0 link is more expensive). Similarly, we assume AS2 prefers paths to D in the order ⟨AS1,AS0⟩, ⟨AS0⟩. Both AS1 and AS2 start out using path ⟨AS0⟩; they advertise this to each other. As each receives the other's advertisement, they apply their preference order and therefore each switches to routing D's traffic to the other; that is, AS1 switches to the route with AS-path ⟨AS2,AS0⟩ and AS2 switches to ⟨AS1,AS0⟩. This, of course, causes a routing loop! However, as soon as they export these paths to one another, they will detect the loop in the AS-path and reject the new route, and so both will switch back to ⟨AS0⟩ as soon as they announce to each other the change in what they use.
This oscillation may continue indefinitely, as long as both AS1 and AS2 switch away from ⟨AS0⟩ at the same moment. If, however, AS1 switches to ⟨AS2,AS0⟩ while AS2 continues to use ⟨AS0⟩, then AS2 is stuck and the situation is stable. In practice, therefore, eventual convergence to a stable state is likely.
AS1 and AS2 might choose not to export their D-route to each other to avoid this instability.
Example 2: No stable state exists. This example is from [VGE00]. Assume that the destination D is attached to AS0, and that AS0 in turn connects to AS1, AS2 and AS3 as in the following diagram:
AS1-AS3 each have a direct route to AS0, but we assume each prefers the AS-path that takes their clockwise neighbor; that is, AS1 prefers ⟨AS3,AS0⟩ to ⟨AS0⟩; AS3 prefers ⟨AS2,AS0⟩ to ⟨AS0⟩, and AS2 prefers ⟨AS1,AS0⟩ to ⟨AS0⟩. This is a peculiar, but legal, example of input filtering.
Suppose all adopt ⟨AS0⟩, and advertise this, and AS1 is the first to look at the incoming advertisements. AS1 switches to the route ⟨AS3,AS0⟩, and announces this.
At this point, AS2 sees that AS1 uses ⟨AS3,AS0⟩; if AS2 switches to AS1 then its path would be ⟨AS1,AS3,AS0⟩ rather than ⟨AS1,AS0⟩ and so it does not make the switch.
But AS3 does switch: it prefers ⟨AS2,AS0⟩ and this is still available. Once it makes this switch, and advertises it, AS1 sees that the route it had been using, ⟨AS3,AS0⟩, has become ⟨AS3,AS1,AS0⟩. At this point AS1 switches back to ⟨AS0⟩.
Now AS2 can switch to using ⟨AS1,AS0⟩, and does so. After that, AS3 finds it is now using ⟨AS2,AS1,AS0⟩ and it switches back to ⟨AS0⟩. This allows AS1 to switch to the longer route, and then AS2 switches back to the direct route, and then AS3 gets the longer route, then AS2 again, etc, forever rotating clockwise.
10.7 Epilog
CIDR was a deceptively simple idea. At first glance it is a straightforward extension of the subnet concept,
moving the net/host division point to the left as well as to the right. But it has ushered in true hierarchical
routing, most often provider-based. While CIDR was originally offered as a solution to some early crises in
IPv4 address-space allocation, it has been adopted into the core of IPv6 routing as well.
Interior routing using either distance-vector or link-state protocols is neat and mathematical. Exterior
routing with BGP is messy and arbitrary. Perhaps the most surprising thing about BGP is that the Internet
works as well as it does, given the complexity of provider interconnections. The business side of routing
almost never has an impact on ordinary users. To an extent, BGP works well because providers voluntarily limit the complexity of their filtering preferences, but that seems to be largely because the business
relationships of real-world ISPs do not seem to require complex filtering.
10.8 Exercises
1. Consider the following IP forwarding table that uses CIDR. IP address bytes are in hexadecimal, so each
hex digit corresponds to four address bits.
destination      next_hop
81.30.0.0/12     A
81.3c.0.0/16     B
81.3c.50.0/20    C
81.40.0.0/12     D
81.44.0.0/14     E
2. Consider the following IP forwarding table, also using CIDR and with address bytes in hexadecimal:

destination      next_hop
00.0.0.0/2       A
40.0.0.0/2       B
80.0.0.0/2       C
c0.0.0.0/2       D
(a). To what next_hop would each of the following be routed? 63.b1.82.15, 9e.00.15.01, de.ad.be.ef
(b). Explain why every IP address is routed somewhere, even though there is no default entry.
3. Give an IPv4 forwarding table using CIDR that will route all Class A addresses to next_hop A, all
Class B addresses to next_hop B, and all Class C addresses to next_hop C.
4. Suppose a router using CIDR has the following entries. Address bytes are in decimal except for the third
byte, which is in binary.
destination              next_hop
37.119.0000 0000.0/18    A
37.119.0100 0000.0/18    A
37.119.1000 0000.0/18    A
37.119.1100 0000.0/18    B
These four entries cannot be consolidated into a single /16 entry, because they don't all go to the same next_hop. How could they be consolidated into two entries?
5. Suppose P, Q and R are ISPs with respective CIDR address blocks (with bytes in decimal) 51.0.0.0/8, 52.0.0.0/8 and 53.0.0.0/8. P has customers A and B and assigns them address blocks as follows:
A: 51.10.0.0/16
B: 51.23.0.0/16
Q has customers C and D and assigns them address blocks as follows:
C: 52.14.0.0/16
D: 52.15.0.0/16
(a). Give forwarding tables for P, Q and R assuming they connect to each other and to each of their own
customers.
(b). Now suppose A switches from provider P to provider Q, and takes its address block with it. Give the
forwarding tables for P, Q and R; the longest-match rule will be needed to resolve conflicts.
(c). Now suppose in addition to A switching from P to Q, C switches from provider Q to provider R. Give
the forwarding tables.
6. Suppose P, Q and R are ISPs as in the previous problem. P and R do not connect directly; they route traffic
to one another via Q. In addition, customer B is multi-homed and has a secondary connection to provider R;
customer D is also multi-homed and has a secondary connection to provider P. R and P use these secondary
connections to send to B and D respectively; however, these secondary connections are not advertised to
other providers. Give forwarding tables for P, Q and R.
7. Consider the following network of providers P-S, all using BGP. The providers are the horizontal lines;
each provider is its own AS.
(a). What routes to network NS will P receive, assuming there is no export filtering? For each route, list the
AS-path.
(b). What routes to network NQ will P receive? For each route, list the AS-path.
(c). Suppose R uses export filtering so as not to advertise to P any of its routes except those that involve S
in their AS-path. What routes to network NR will P receive, with AS-paths?
8. Consider the following network of Autonomous Systems AS1 through AS6, which double as destinations.
When AS1 advertises itself to AS2, for example, the AS-path it provides is ⟨AS1⟩.
[Figure: the network of Autonomous Systems AS1 through AS6; AS3 and AS6, adjacent to AS2 and AS5 respectively, are connected by the AS3-AS6 link]
(a). If neither AS3 nor AS6 exports their AS3-AS6 link to their neighbors AS2 and AS5 to the left, what routes will AS2 receive to reach AS5? Specify routes by AS-path.
(b). What routes will AS2 receive to reach AS6?
(c). Suppose AS3 exports to AS2 its link to AS6, but AS6 continues not to export the AS3-AS6 link to AS5. How will AS5 now reach AS2? How will AS2 now reach AS6? Assume that there are no local preferences in use in BGP best-path selection, and that the shortest AS-path wins.
9. Suppose that Internet routing in the US used geographical routing, and the first 12 bits of every IP
address represent a geographical area similar in size to a telephone area code. Megacorp gets the prefix
12.34.0.0/16, based geographically in Chicago, and allocates subnets from this prefix to its offices in all 50
states. Megacorp routes all its internal traffic over its own network.
(a). Assuming all Megacorp traffic must enter and exit in Chicago, what is the route of traffic to and from
the San Diego office to a client also in San Diego?
(b). Now suppose each office has its own link to a local ISP, but still uses its 12.34.0.0/16 IP addresses.
Now what is the route of traffic between the San Diego office and its neighbor?
(c). Suppose Megacorp gives up and gets a separate geographical prefix for each office. What must it do to
ensure that its internal traffic is still routed over its own network?
10. Suppose we try to use BGP's strategy of exchanging destinations plus paths as an interior routing-update strategy, perhaps replacing distance-vector routing. No costs or hop-counts are used, but routers attach to each destination a list of the routers used to reach that destination. Routers can also have route preferences, such as "prefer my link to B whenever possible".
(a). Consider the network of 9.2 Distance-Vector Slow-Convergence Problem:
The D-A link breaks, and B offers A what it thinks is its own route to D. Explain how exchanging path information prevents a routing loop here.
(b). Suppose the network is as below, and initially each router knows about itself and its immediately adjacent neighbors. What sequence of router announcements can lead to A reaching F via A→D→E→B→C→F, and what individual router preferences would be necessary? (Initially, for example, A would reach B directly; what preference might make it prefer A→D→E→B?)
(c). Explain why this method is equivalent to using the hopcount metric with either distance-vector or
link-state routing, if routers are not allowed to have preferences and if the router-path length is used as a
tie-breaker.
11 UDP TRANSPORT
The standard transport protocols riding above the IP layer are TCP and UDP. As we saw in Chapter 1, UDP provides simple datagram delivery to remote sockets, that is, to ⟨host,port⟩ pairs. TCP provides a much richer functionality for sending data, but requires that the remote socket first be connected. In this chapter, we start with the much-simpler UDP, including the UDP-based Trivial File Transfer Protocol.
We also review some fundamental issues any transport protocol must address, such as lost final packets and
packets arriving late enough to be subject to misinterpretation upon arrival. These fundamental issues will
be equally applicable to TCP connections.
The port numbers are what makes UDP into a real transport protocol: with them, an application can now
connect to an individual server process (that is, the process owning the port number in question), rather
than simply to a host.
UDP is unreliable, in that there is no UDP-layer attempt at timeouts, acknowledgment and retransmission; applications written for UDP must implement these. As with TCP, a UDP ⟨host,port⟩ pair is known as a socket (though UDP ports are considered a separate namespace from TCP ports). UDP is also unconnected, or stateless; if an application has opened a port on a host, any other host on the Internet may deliver packets to that ⟨host,port⟩ socket without preliminary negotiation.
UDP packets use the 16-bit Internet checksum (5.4 Error Detection) on the data. While it is seldom done
now, the checksum can be disabled and the field set to the all-0-bits value, which never occurs as an actual
ones-complement sum.
UDP packets can be dropped due to queue overflows either at an intervening router or at the receiving host.
When the latter happens, it means that packets are arriving faster than the receiver can process them. Higher-level protocols that define ACK packets (eg UDP-based RPC, below) typically include some form of flow
control to prevent this.
UDP is popular for local transport, confined to one LAN. In this setting it is common to use UDP as the
transport basis for a Remote Procedure Call, or RPC, protocol. The conceptual idea behind RPC is that
one host invokes a procedure on another host; the parameters and the return value are transported back and
forth by UDP. We will consider RPC in greater detail below, in 11.4 Remote Procedure Call (RPC); for
now, the point of UDP is that on a local LAN we can fall back on rather simple mechanisms for timeout and
retransmission.
UDP is well-suited for request-reply semantics beyond RPC; one can use TCP to send a message and get
a reply, but there is the additional overhead of setting up and tearing down a connection. DNS uses UDP,
largely for this reason. However, if there is any chance that a sequence of request-reply operations will be
performed in short order then TCP may be worth the overhead.
UDP is also popular for real-time transport; the issue here is head-of-line blocking. If a TCP packet is lost,
then the receiving host queues any later data until the lost data is retransmitted successfully, which can take
several RTTs; there is no option for the receiving application to request different behavior. UDP, on the
other hand, gives the receiving application the freedom simply to ignore lost packets. This approach is very
successful for voice and video, where small losses simply degrade the received signal slightly, but where
larger delays are intolerable. This is the reason the Real-time Transport Protocol, or RTP, is built on top
of UDP rather than TCP. It is common for VoIP telephone calls to use RTP and UDP.
11.1.1 QUIC
Sometimes UDP is used simply because it allows new or experimental protocols to run entirely as user-space
applications; no kernel updates are required, as would be the case with TCP changes. Google has created
a protocol named QUIC (Quick UDP Internet Connections, chromium.org/quic) in this category, though QUIC also takes advantage of UDP's freedom from head-of-line blocking. For example, one of QUIC's goals includes supporting multiplexed streams in a single connection (eg for the multiple components of a web page). A lost packet blocks its own stream until it is retransmitted, but the other streams can continue without waiting. Because QUIC supports error-correcting codes (5.4.2 Error-Correcting Codes), a lost packet might not require any waiting at all; this is another feature that would be difficult to add to TCP.
QUIC also eliminates the extra RTT needed for setting up a TCP connection.
QUIC provides support for advanced congestion control, currently (2014) including a UDP analog of TCP CUBIC (15.11 TCP CUBIC). QUIC does this at the application layer, but new congestion-control mechanisms within TCP often require client operating-system changes even when the mechanism lives primarily at the server end. QUIC represents a promising approach to using UDP's flexibility to support innovative or experimental transport-layer features. The downside of QUIC is its nonstandard programming interface,
but note that Google can achieve widespread web utilization of QUIC simply by distributing the client side
in its Chrome browser.
address will form the socket address to which clients connect. Clients must discover that port number or
have it written into their application code. Clients too will have a port number, but it is largely invisible.
On the server side, simplex-talk must do the following:
ask for a designated port number
create a socket, the sending/receiving endpoint
bind the socket to the socket address, if this is not done at the point of socket creation
receive packets sent to the socket
for each packet received, print its sender and its content
The client side has a similar list:
look up the server's IP address, using DNS
create an anonymous socket; we don't care what the client's port number is
read a line from the terminal, and send it to the socket address ⟨server_IP,port⟩
11.1.2.1 The Server
We will start with the server side, presented here in Java. We will use port 5432; the socket-creation and
port-binding operations are combined into the single operation new DatagramSocket(destport).
Once created, this socket will receive packets from any host that addresses a packet to it; there is no need
for preliminary connection. We also need a DatagramPacket object that contains the packet data and source ⟨IP_address,port⟩ for arriving packets. The server application does not acknowledge anything sent to it, or in fact send any response at all.
The server application needs no parameters; it just starts. (That said, we could make the port number a
parameter, to allow easy change. The port we use here, 5432, has also been adopted by PostgreSQL for TCP
connections.) The server accepts both IPv4 and IPv6 connections; we return to this below.
Though it plays no role in the protocol, we will also have the server time out every 15 seconds and display a message, just to show how this is done; implementations of real protocols essentially always must arrange, when attempting to receive a packet, to time out after a certain interval with no response. The file below is at udp_stalks.java.
/* simplex-talk server, UDP version */

import java.net.*;
import java.io.*;

public class stalks {

    static public int destport = 5432;
    static public int bufsize = 512;
    static public final int timeout = 15000; // time in milliseconds

    static public void main(String args[]) {
        DatagramSocket s;             // UDP uses DatagramSockets

        try {
            s = new DatagramSocket(destport);
        }
        catch (SocketException se) {
            System.err.println("cannot create socket with port " + destport);
            return;
        }

        try {
            s.setSoTimeout(timeout);  // set timeout in milliseconds
        } catch (SocketException se) {
            System.err.println("socket exception: timeout not set!");
        }

        // create DatagramPacket object for receiving data:
        DatagramPacket msg = new DatagramPacket(new byte[bufsize], bufsize);

        while(true) { // read loop
            try {
                msg.setLength(bufsize);   // max received packet size
                s.receive(msg);           // the actual receive operation
                System.err.println("message from <" +
                    msg.getAddress().getHostAddress() + "," + msg.getPort() + ">");
            } catch (SocketTimeoutException ste) {   // receive() timed out
                System.err.println("Response timed out!");
                continue;
            } catch (IOException ioe) {              // should never happen!
                System.err.println("Bad receive");
                break;
            }

            String str = new String(msg.getData(), 0, msg.getLength());
            System.out.print(str);        // newline must be part of str
        }
        s.close();
    } // end of main
}
in which case only packets sent to the host and port through the host's specific IP address local_addr would be delivered. It does not matter here whether IP forwarding on the host has been enabled. In the original C socket library, this binding of a port to (usually) a server socket was done with the bind() call. To allow connections via any of the host's IP addresses, the special IP address INADDR_ANY is passed to bind().
When a host has multiple IP addresses, the standard socket library does not provide a way to find out to which of these an arriving UDP packet was actually sent. Normally, however, this is not a major difficulty. If a host has only one interface on an actual network (ie not counting loopback), and only one IP address for that interface, then any remote clients must send to that interface and address. Replies (if any, which there are not with stalk) will also come from that address.
Multiple interfaces do not necessarily create an ambiguity either; the easiest such case to experiment with
involves use of the loopback and Ethernet interfaces (though one would need to use an application that,
unlike stalk, sends replies). If these interfaces have respective IPv4 addresses 127.0.0.1 and 192.168.1.1,
and the client is run on the same machine, then connections to the server application sent to 127.0.0.1 will
be answered from 127.0.0.1, and connections sent to 192.168.1.1 will be answered from 192.168.1.1. The
IP layer sees these as different subnets, and fills in the IP source-address field according to the appropriate
subnet. The same applies if multiple Ethernet interfaces are involved, or if a single Ethernet interface is
assigned IP addresses for two different subnets, eg 192.168.1.1 and 192.168.2.1.
Life is slightly more complicated if a single interface is assigned multiple IP addresses on the same subnet,
eg 192.168.1.1 and 192.168.1.2. Regardless of which address a client sends its request to, the server's reply will generally always come from one designated address for that subnet, eg 192.168.1.1. Thus, it is possible
that a legitimate UDP reply will come from a different IP address than that to which the initial request was
sent.
If this behavior is not desired, one approach is to create multiple server sockets, and to bind each of the host's network IP addresses to a different server socket.
11.1.2.3 The Client
Next is the Java client version udp_stalkc.java. The client (any client) must provide the name of the host to which it wishes to send; as with the port number this can be hard-coded into the application but is more commonly specified by the user. The version here uses host localhost as a default but accepts any other hostname as a command-line argument. The call to InetAddress.getByName(desthost)
invokes the DNS system, which looks up name desthost and, if successful, returns an IP address.
(InetAddress.getByName() also accepts addresses in numeric form, eg 127.0.0.1, in which case
DNS is not necessary.) When we create the socket we do not designate a port in the call to new
DatagramSocket(); this means any port will do for the client. When we create the DatagramPacket
object, the first parameter is a zero-length array as the actual data array will be provided within the loop.
A certain degree of messiness is introduced by the need to create a BufferedReader object to handle
terminal input.
// simplex-talk CLIENT in java, UDP version
import java.net.*;
import java.io.*;
public class stalkc {
static public BufferedReader bin;
try {
s.send(msg);
}
catch (IOException ioe) {
System.err.println("send() failed");
return;
}
} // while
s.close();
}
}
The default value of desthost here is localhost; this is convenient when running the client and the
server on the same machine, in separate terminal windows.
Like the server, the client works with both IPv4 and IPv6. The InetAddress object dest in the client code above can hold either IPv4 or IPv6 addresses; InetAddress is the base class with child classes Inet4Address and Inet6Address. If the client and server can communicate at all via IPv6 and if the value of desthost supplied to the client is an IPv6-only name, then dest will be an Inet6Address object and IPv6 will be used. For example, if the client is invoked from the command line with java stalkc ip6-localhost, and the name ip6-localhost resolves to the IPv6 loopback address ::1, the client will connect to an stalk server on the same host using IPv6 (and the loopback interface). If greater IPv4-versus-IPv6 control is desired, one can replace the getByName() call with getAllByName(), which returns an array of all addresses (InetAddress[]) associated with the given name. One can then find the IPv6 addresses by searching this array for addresses addr for which addr instanceof Inet6Address.
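A minimal sketch of that getAllByName() search, returning the first IPv6 address associated with a name (or null if there is none), might look like this:

import java.net.*;

public class FindV6 {
    // return the first IPv6 address associated with hostname, or null if there is none
    static Inet6Address firstV6(String hostname) throws UnknownHostException {
        for (InetAddress addr : InetAddress.getAllByName(hostname)) {
            if (addr instanceof Inet6Address) return (Inet6Address) addr;
        }
        return null;
    }

    public static void main(String[] args) throws UnknownHostException {
        String host = (args.length > 0) ? args[0] : "localhost";
        System.out.println(firstV6(host));   // prints null if the name has no IPv6 address
    }
}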
Finally, here is a simple python version of the client, udp_stalkc.py.
#!/usr/bin/python3

from socket import *
from sys import argv

portnum = 5432

def talk():
    rhost = "localhost"
    if len(argv) > 1:
        rhost = argv[1]
    print("Looking up address of " + rhost + "...", end="")
    try:
        dest = gethostbyname(rhost)
    except (gaierror, herror) as mesg:   # gaierror: error in gethostbyname()
        errno, errstr = mesg.args
        print("\n   ", errstr)
        return
    print("got it: " + dest)
    addr = (dest, portnum)               # a socket address
    s = socket(AF_INET, SOCK_DGRAM)
    s.settimeout(1.5)                    # we don't actually need to set timeout here
    while True:
        buf = input("> ")
        if len(buf) == 0: return         # an empty line exits
        s.sendto(bytes(buf + "\n", 'ascii'), addr)

talk()
Why not C?
While C is arguably the most popular language for network programming, it does not support IP
addresses and other network objects as first-class types, and so we omit it here. But see 21.2.2 An
Actual Stack-Overflow Example for a TCP-based C version of an stalk-like program.
To experiment with these on a single host, start the server in one window and one or more clients in other
windows. One can then try the following:
have two clients simultaneously running, and sending alternating messages to the same server
invoke the client with the external IP address of the server in dotted-decimal, eg 10.0.0.3 (note that
localhost is 127.0.0.1)
run the java and python clients simultaneously, sending to the same server
run the server on a different host (eg a virtual host or a neighboring machine)
invoke the client with a nonexistent hostname
Note that, depending on the DNS server, the last one may not actually fail. When asked for the DNS name
of a nonexistent host such as zxqzx.org, many ISPs will return the IP address of a host running a web server
hosting an error/search/advertising page (usually their own). This makes some modicum of sense when
attention is restricted to web searches, but is annoying if it is not, as it means non-web applications have no
easy way to identify nonexistent hosts.
Simplex-talk will work if the server is on the public side of a NAT firewall. No server-side packets need to
be delivered to the client! But if the other direction works, something is very wrong with the firewall.
222
11 UDP Transport
The client needs to create the byte array organized as above, and the server needs to extract the values. (The
inclusion of the list length as a short int is not really necessary, as the receiver will be able to infer the
list length from the packet size, but we want to be able to illustrate the encoding of both int and short
int values.)
The protocol also needs to define how the integers themselves are laid out. There are two common ways to represent a 32-bit integer as a sequence of four bytes. Consider the integer 0x01020304 = 1×256³ + 2×256² + 3×256 + 4. This can be encoded as the byte sequence [1,2,3,4], known as big-endian encoding, or as [4,3,2,1], known as little-endian encoding; the former was used by early IBM mainframes and the latter is used by most Intel processors. (We are assuming here that both architectures represent signed integers using two's-complement; this is now universal but was not always.)
To send 32-bit integers over the network, it is certainly possible to tag the data as big-endian or little-endian, or for the endpoints to negotiate the encoding. However, by far the most common approach on the Internet, at least below the application layer, is to follow the convention of RFC 1700 and use big-endian encoding exclusively; big-endian encoding has since come to be known as network byte order.
How one converts from host byte order to network byte order is language-dependent. It must always be
done, even on big-endian architectures, as code may be recompiled on a different architecture later.
In Java the byte-order conversion is generally combined with the process of conversion from int to
byte[]. The client will use a DataOutputStream class to support the writing of the binary values to an output stream, through methods such as writeInt() and writeShort(), together with a
ByteArrayOutputStream class to support the conversion of the output stream to type byte[]. The
code below assumes the list of integers is initially in an ArrayList<Integer> named theNums.
ByteArrayOutputStream baos = new ByteArrayOutputStream();
DataOutputStream dos = new DataOutputStream(baos);

try {
    dos.writeShort(theNums.size());
    for (int n : theNums) {
        dos.writeInt(n);
    }
} catch (IOException ioe) { /* exception handling */ }

byte[] bbuf = baos.toByteArray();
msg.setData(bbuf);
The server then needs to do the reverse; again, msg is the arriving DatagramPacket. The code below simply calculates the sum of the 32-bit integers in msg:

ByteArrayInputStream bais = new ByteArrayInputStream(msg.getData(), 0, msg.getLength());
DataInputStream dis = new DataInputStream(bais);
int sum = 0;
try {
    int count = dis.readShort();
    for (int i=0; i<count; i++) {
        sum += dis.readInt();
    }
} catch (IOException ioe) { /* more exception handling */ }
A version of simplex-talk for lists of integers can be found in client saddc.java and server sadds.java. The
client reads from the command line a list of character-encoded integers (separated by whitespace), constructs
the binary encoding as above, and sends them to the server; the server prints their sum. Port 5434 is used;
this can be changed if necessary.
In the C language, we can simply allocate a char[] of the appropriate size and write the network-byte-order values directly into it. Conversion to network byte order and back is done with the following library calls:
htonl: host-to-network conversion for long (32-bit) integers
ntohl: network-to-host conversion for long integers
htons: host-to-network conversion for short (16-bit) integers
ntohs: network-to-host conversion for short integers
A certain amount of casting between int * and char * is also necessary.
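In Java there is also java.nio.ByteBuffer, not used in the examples above, which writes multi-byte values in big-endian order by default and therefore produces network byte order without any explicit htonl()-style calls. A small illustration:

import java.nio.ByteBuffer;

public class NetOrderDemo {
    public static void main(String[] args) {
        // ByteBuffer defaults to ByteOrder.BIG_ENDIAN, ie network byte order
        ByteBuffer buf = ByteBuffer.allocate(6);
        buf.putShort((short) 1);      // list length, as a 16-bit value
        buf.putInt(0x01020304);       // one 32-bit integer
        byte[] wire = buf.array();
        for (byte b : wire) System.out.printf("%02x ", b);   // prints 00 01 01 02 03 04
        System.out.println();
    }
}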
In general, the designer of a protocol needs to select an unambiguous format for all binary data; protocol-defining RFCs always include such format details. This can be a particular issue for floating-point data, for which two formats can have the same endianness but still differ, eg in normalization or the size of the exponent field. Formats for structured data, such as arrays, must also be spelled out; in the example above the list size was indicated by a length field but other options are possible.
The example above illustrates fixed-field-width encoding. Another possible option, using variable-length
encoding, is ASN.1 using the Basic Encoding Rules (20.6 ASN.1 Syntax and SNMP); fixed-field encoding
sometimes becomes cumbersome as data becomes more hierarchical.
At the application layer, the use of non-binary encodings is common, though binary encodings continue to
remain common as well. Two popular formats using human-readable unicode strings for data encoding are
ASN.1 with its XML Encoding Rules and JSON. While the latter format originated with JavaScript, it is
now widely supported by many other languages.
Both these scenarios assume that the old duplicate was sent earlier, but was somehow delayed in transit
for an extended period of time, while later packets were delivered normally. Exactly how this might occur
remains unclear; perhaps the least far-fetched scenario is the following:
A first copy of the old duplicate was sent
A routing error occurs; the packet is stuck in a routing loop
An alternative path between the original hosts is restored, and the packet is retransmitted successfully
Some time later, the packet stuck in the routing loop is released, and reaches its final destination
Another scenario involves a link in the path that supplies link-layer acknowledgment: the packet was sent
once across the link, the link-layer ACK was lost, and so the packet was sent again. Some mechanism is
still needed to delay one of the copies.
Most solutions to the old-duplicate problem assume some cap on just how late an old duplicate can be. In
practical terms, TCP officially once took this time limit to be 60 seconds, but implementations now usually
take it to be 30 seconds. Other protocols often implicitly adopt the TCP limit. Once upon a time, IP routers were expected to decrement a packet's TTL field by 1 for each second the router held the packet in its queue; in such a world, IP packets cannot be more than 255 seconds old.
It is also possible to prevent external old duplicates by including a connection count parameter in the
transport or application header. For each consecutive connection, the connection count is incremented by (at
least) 1. A separate connection-count value must be maintained by each side; if a connection-count value
is ever lost, a suitable backup mechanism based on delay might be used. As an example, see 12.11 TCP
Faster Opening.
Lost final ACK: Most packets will be acknowledged. The final packet (typically but not necessarily an
ACK) cannot itself be acknowledged, as then it would not be the final packet. Somebody has to go last. This
leaves some uncertainty on the part of the sender: did the last packet make it through, or not?
Duplicated connection request: How do we distinguish between two different connection requests and a
single request that was retransmitted? Does it matter?
Reboots: What if one side reboots while the other side is still sending data? How will the other side detect
this? Are there any scenarios that could lead to corrupted data?
If the server answered all requests from port 69, it would have to distinguish among multiple concurrent
transfers by looking at the client socket address; each client transfer would have its own state information
including block number, open file, and the time of the last successful packet. This considerably complicates
the implementation.
This port-change rule does break TFTP when the server is on the public side of a NAT firewall. When
the client sends an RRQ to port 69, the NAT firewall will now allow the server to respond from port 69. However, the server's response from s_port is generally blocked, and so the client never receives Data[1].
received, recording its source port. The second Data[1] now appears to be from an incorrect port; the TFTP
specification requires that a receiver reply to any packets from an unknown port by sending an ERROR
packet with the code "Unknown Transfer ID" (where "Transfer ID" means port number). Were it not for
this duplicate-RRQ scenario, packets from an unknown port could probably be simply ignored.
What this means in practice is that the first of the two sender processes above will successfully connect to
the receiver, and the second will receive the Unknown Transfer ID message and will exit.
A more unfortunate case related to this is below, example 4 under TFTP Scenarios.
Note that the check for elapsed time is quite separate from the check for the
SocketTimeoutException. It is possible for the receiver to receive a steady stream of wrong
packets, so that it never encounters a SocketTimeoutException, and yet no good packet arrives
and so the receiver must still arrange (as above) for a timeout and retransmission.
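To make the distinction concrete, here is a hedged sketch (not the book's actual receiver code) of a receive loop that enforces its own elapsed-time deadline even when a steady stream of wrong-port packets prevents the SocketTimeoutException from ever firing; the method and variable names are illustrative.

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.SocketTimeoutException;

class ReceiveLoopSketch {
    // returns a candidate packet from the expected port, or null on timeout
    static DatagramPacket awaitPacket(DatagramSocket sock, int expectedPort,
                                      int timeoutMillis) throws Exception {
        sock.setSoTimeout(timeoutMillis);                 // hard timeout for a silent network
        long deadline = System.currentTimeMillis() + timeoutMillis;
        byte[] buf = new byte[516];
        while (true) {
            DatagramPacket p = new DatagramPacket(buf, buf.length);
            try {
                sock.receive(p);
            } catch (SocketTimeoutException ste) {
                return null;                              // nothing at all arrived: retransmit
            }
            if (p.getPort() == expectedPort) return p;    // possibly a good packet
            // a wrong-port packet arrived; each receive() waits the full interval again,
            // so the wall-clock deadline must be checked separately
            if (System.currentTimeMillis() >= deadline) return null;   // retransmit
        }
    }
}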
4. Getting a different file than requested: Suppose the client sends RRQ(foo), but transmission is
delayed. In the meantime, the client reboots or aborts, and then sends RRQ(bar). This second RRQ is
lost, but the server sends Data[1] for foo.
At this point the client believes it is receiving file bar, but is in fact receiving file foo.
In practical terms, this scenario seems to be of limited importance, though diskless workstations often did
use TFTP to request their boot image file when restarting.
If the sender reboots, the transfer simply halts.
5. Malicious flooding: A malicious application aware that client C is about to request a file might send
repeated copies of bad Data[1] to likely ports on C. When C does request a file (eg if it requests a boot
image upon starting up, from port 1024), it may receive the malicious file instead of what it asked for.
This is a consequence of the server handoff from port 69 to a new port. Because the malicious application
must guess the client's port number, this scenario too appears to be of limited importance.
I/O (below). RPC is also quite successful as the mechanism for interprocess communication within CPU
clusters, perhaps its most time-sensitive application.
While TCP can be used for processes like these, this adds the overhead of creating and tearing down a
connection; in many cases, the RPC exchange consists of nothing further beyond the request and reply and
so the TCP overhead would be nontrivial. RPC over UDP is particularly well suited for transactions where
both endpoints are quite likely on the same LAN, or are otherwise situated so that losses due to congestion
are negligible.
The drawback to UDP is that the RPC layer must then supply its own acknowledgment protocol. This is
not terribly difficult; usually the reply serves to acknowledge the request, and all that is needed is another
ACK after that. If the protocol is run over a LAN, it is reasonable to use a static timeout period, perhaps
somewhere in the range of 0.5 to 1.0 seconds.
Nonetheless, there are some niceties that early RPC implementations sometimes ignored, leading to a complicated history; see 11.4.2 Sun RPC below.
It is essential that requests and replies be numbered (or otherwise identified), so that the client can determine
which reply matches which request. This also means that the reply can serve to acknowledge the request;
if reply[N] is not received, the requester retransmits request[N]. This can happen either if request[N] never
arrived, or if it was reply[N] that got lost:
When the server creates reply[N] and sends it to the client, it must also keep a cached copy of the reply, until
such time as ACK[N] is received.
After sending reply[N], the server may receive ACK[N], indicating all is well, or may receive request[N]
again, indicating that reply[N] was lost, or may experience a timeout, indicating that either reply[N] or
ACK[N] was lost. In the latter two cases, the server should retransmit reply[N] and wait again for ACK[N].
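The server-side bookkeeping just described can be sketched in a few lines of Java; this is an illustrative outline under the assumptions above (numbered requests, reply-as-ACK, explicit ACK of the reply), not the API of any real RPC library.

import java.util.HashMap;
import java.util.Map;

class RpcServerSketch {
    private final Map<Integer, byte[]> cachedReplies = new HashMap<>();

    // called when request[N] arrives (possibly a retransmission)
    byte[] handleRequest(int n, byte[] requestBody) {
        byte[] cached = cachedReplies.get(n);
        if (cached != null) return cached;      // duplicate request: reply[N] was lost, resend it
        byte[] reply = execute(requestBody);    // perform the operation exactly once
        cachedReplies.put(n, reply);            // keep a copy until ACK[N] arrives
        return reply;
    }

    // called when ACK[N] arrives: reply[N] is known to have been delivered
    void handleAck(int n) {
        cachedReplies.remove(n);
    }

    private byte[] execute(byte[] requestBody) {
        return requestBody;                     // placeholder for the real operation
    }
}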
Statelessness Inaction
Back when the Loyola CS department used Sun NFS extensively, server crashes would bring people
calmly out of their offices to find out what had happened; client-workstation processes doing I/O would
have locked up. Everyone would mill about in the hall until the server was rebooted, at which point
they would return to their work and were almost always able to pick up where they left off. If the server
had not been stateless, users would have been quite a bit less happy.
It is, of course, also possible to build recovery mechanisms into stateful protocols.
The lack of file-locking and other non-idempotent I/O operations, along with the rise of cheap client-workstation storage (and, for that matter, more-reliable servers), eventually led to the decline of NFS over
RPC, though it has not disappeared. NFS can, if desired, also be run (statefully!) over TCP.
11.4.3 Serialization
In some RPC systems, even those with explicit ACKs, requests are executed serially by the server. Serial
execution is automatic if request[N+1] serves as an implicit ACK[N]. This is a problem for file I/O operations, as physical disk drives are generally most efficient when the I/O operations can be reordered to suit
the geometry of the disk. Disk drives commonly use the elevator algorithm to process requests: the read
head moves from low-numbered tracks outwards to high-numbered tracks, pausing at each track for which
there is an I/O request. Waiting for the Nth read to complete before asking the disk to start the N+1th one is
slow.
The best solution here is to allow multiple outstanding requests and out-of-order replies.
11.4.4 Refinements
One basic network-level improvement to RPC concerns the avoidance of IP-level fragmentation. While
fragmentation is not a major performance problem on a single LAN, it may have difficulties over longer
distances. One possible refinement is an RPC-level large-message protocol, that fragments at the RPC layer
and which supports a mechanism for retransmission, if necessary, only of those fragments that are actually
lost.
Another optimization might address the possibility that the server reboots. If a client really wants to be sure
that its request is executed only once, it needs to be sure that the server did not reboot between the original
request and the clients retransmission following a timeout. One way to achieve this is for the server to
maintain a reboot counter, written to the disk and incremented after each restart, and then to include the
value of the reboot counter in each reply. Requests contain the clients expected value for the server reboot
counter; if at the server end there is not a match, the client is notified of the potential error. Full recovery
from what may have been a partially executed request, however, requires some form of application-layer
journal log like that used for database servers.
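A sketch of the client-side check, with hypothetical names, might look like the following; the real Sun RPC and NFS protocols carry this kind of information differently, so this is only an illustration of the idea.

class RebootCounterCheck {
    // the server boot count remembered from earlier replies
    private long expectedServerBootCount;

    RebootCounterCheck(long initialBootCount) {
        this.expectedServerBootCount = initialBootCount;
    }

    // returns true if the reply comes from the same server incarnation as before
    boolean replyIsFromSameIncarnation(long bootCountInReply) {
        if (bootCountInReply == expectedServerBootCount) return true;
        // the server rebooted between the original request and the retransmission;
        // the request may have been executed zero times, once, or partially
        expectedServerBootCount = bootCountInReply;
        return false;   // the application must be notified of the potential error
    }
}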
11.5 Epilog
UDP does not get as much attention as TCP, but between avoidance of connection-setup overhead, avoidance
of head-of-line blocking and high LAN performance, it holds its own.
We also use UDP here to illustrate fundamental transport issues, both abstractly and for the specific protocol
TFTP. We will revisit these fundamental issues extensively in the next chapter in the context of TCP; these
issues played a major role in TCP's design.
11.6 Exercises
1. Perform the UDP simplex-talk experiments discussed at the end of 11.1.2. Can multiple clients have simultaneous sessions with the same server?
2. What would happen in TFTP if both sides implemented retransmit-on-timeout and neither side implemented retransmit-on-duplicate? Assume the actual transfer time is negligible. Assume Data[3] is sent but
the first instance is lost. Consider these cases:
• sender timeout = receiver timeout = 2 seconds
• sender timeout = 1 second, receiver timeout = 3 seconds
• sender timeout = 3 seconds, receiver timeout = 1 second
3. In the previous problem, how do things change if ACK[3] is the packet that is lost?
4. Spell out plausible responses for a TFTP receiver upon receipt of a Data[N] packet for each of the TFTP
states UNLATCHED, ESTABLISHED, and DALLYING proposed in 11.3.5 TFTP States. Your answer
may depend on N and the packet size. Indicate the events that cause a transition from one state to the next.
Example: upon receipt of an ERROR packet, TFTP would in all three states exit.
5. In the TFTP-receiver code in 11.3.5 TFTP States, explain why we must check thePacket.getLength() before extracting the opcode and block number.
6. Assume both the TFTP sender and the TFTP receiver implement retransmit-on-timeout but not retransmit-on-duplicate. Outline a TFTP scenario in which the TFTP receiver of 11.3.5 TFTP States sets a socket
timeout interval but never encounters a hard timeout (that is, a SocketTimeoutException) and
yet must timeout and retransmit. Hint: the only way to avoid a hard timeout is constantly to receive some
packet before the timeout timer expires.
7. In 11.3.6 TFTP scenarios, under "Old duplicate", we claimed that if either side changed ports, the
old-duplicate problem would not occur.
(a). If the client changes its port number on a subsequent connection, but the server does not, what prevents
the old-duplicate problem?
(b). If the server changes its port number on a subsequent connection, but the client does not, what prevents
the old-duplicate problem?
8. In the simple RPC protocol at the beginning of 11.4 Remote Procedure Call (RPC), suppose that the
server sends reply[N] and experiences a timeout, receiving nothing back from the client. In the text we
suggested that most likely this meant ACK[N] was lost. Give another loss scenario, involving the loss of
two packets. Assume the client and the server have the same timeout interval.
9. Suppose a Sun RPC read() request ends up executing twice. Unfortunately, in between successive
read() operations the block of data is updated by another process, so different data is returned. Is this a
failure of idempotence? Why or why not?
10. Outline an RPC protocol in which multiple requests can be outstanding, and replies can be sent in
any order. Assume that requests are numbered, and that ACK[N] acknowledges reply[N]. Should ACKs be
cumulative? If not, what should happen if an ACK is lost?
12 TCP TRANSPORT
The standard transport protocols riding above the IP layer are TCP and UDP. As we saw in 11 UDP
Transport, UDP provides simple datagram delivery to remote sockets, that is, to ⟨host,port⟩ pairs. TCP
provides a much richer functionality for sending data to (connected) sockets.
TCP is quite different in several dimensions from UDP. TCP is stream-oriented, meaning that the application can write data in very small or very large amounts and the TCP layer will take care of appropriate
packetization. TCP is connection-oriented, meaning that a connection must be established before the beginning of any data transfer. TCP is reliable, in that TCP uses sequence numbers to ensure the correct order
of delivery and a timeout/retransmission mechanism to make sure no data is lost short of massive network
failure. Finally, TCP automatically uses the sliding windows algorithm to achieve throughput relatively
close to the maximum available.
These features mean that TCP is very well suited for the transfer of large files. The two endpoints open a
connection, the file data is written by one end into the connection and read by the other end, and the features
above ensure that the file will be received correctly. TCP also works quite well for interactive applications
where each side is sending and receiving streams of small packets. Examples of this include ssh or telnet,
where packets are exchanged on each keystroke, and database connections that may carry many queries per
second. TCP even works reasonably well for request/reply protocols, where one side sends a message,
the other side responds, and the connection is closed. The drawback here, however, is the overhead of
setting up a new connection for each request; a better application-protocol design might be to allow multiple
request/reply pairs over a single TCP connection.
Note that the connection-orientation and reliability of TCP represent abstract features built on top of the IP
layer which supports neither of them.
The connection-oriented nature of TCP warrants further explanation. With UDP, if a server opens a socket
(the OS object, with corresponding socket address), then any client on the Internet can send to that socket,
via its socket address. Any UDP application, therefore, must be prepared to check the source address of each
packet that arrives. With TCP, all data arriving at a connected socket must come from the other endpoint of
the connection. When a server S initially opens a socket s, that socket is unconnected; it is said to be in
the LISTEN state. While it still has a socket address consisting of its host and port, a LISTENing socket
will never receive data directly. If a client C somewhere on the Internet wishes to send data to s, it must first
establish a connection, which will be defined by the socketpair consisting of the socket addresses at both C
and S. As part of this connection process, a new connected child socket sC will be created; it is sC that will
receive any data sent from C. Usually, S will also create a new thread or process to handle communication
with sC . Typically the server S will have multiple connected children of s, and, for each one, a process
attached to it.
If C1 and C2 both connect to s, two connected sockets at S will be created, s1 and s2, and likely two separate
processes. When a packet arrives at S addressed to the socket address of s, the source socket address will
also be examined to determine whether the data is part of the C1-S or the C2-S connection, and thus whether
a read on s1 or on s2, respectively, will see the data.
If S is acting as an ssh server, the LISTENing socket listens on port 22, and the connected child sockets
correspond to the separate user login connections; the process on each child socket represents the login
process of that user, and may run for hours or days.
In Chapter 1 we likened TCP sockets to telephone connections, with the server like one high-volume phone
number 800-BUY-NOWW. The unconnected socket corresponds to the number everyone dials; the connected sockets correspond to the actual calls. (This analogy breaks down, however, if one looks closely at
the way such multi-operator phone lines are actually configured: each typically does have its own number.)
The sequence and acknowledgment numbers are for numbering the data, at the byte level. This allows
TCP to send 1024-byte blocks of data, incrementing the sequence number by 1024 between successive
packets, or to send 1-byte telnet packets, incrementing the sequence number by 1 each time. There is no
distinction between DATA and ACK packets; all packets carrying data from A to B also carry the most
current acknowledgment of data sent from B to A. Many TCP applications are largely unidirectional, in
which case the sender would include essentially the same acknowledgment number in each packet while the
receiver would include essentially the same sequence number.
It is traditional to refer to the data portion of TCP packets as segments.
The value of the sequence number, in relative terms, is the position of the first byte of the packet in the data
stream, or the position of what would be the first byte in the case that no data was sent. The value of the
acknowledgment number, again in relative terms, represents the byte position for the next byte expected.
Thus, if a packet contains 1024 bytes of data and the first byte is number 1, then that would be the sequence
number. The data bytes would be positions 1-1024, and the ACK returned would have acknowledgment
number 1025.
The sequence and acknowledgment numbers, as sent, represent these relative values plus an Initial Sequence Number, or ISN, that is fixed for the lifetime of the connection. Each direction of a connection has
its own ISN; see below.
TCP acknowledgments are cumulative: when an endpoint sends a packet with an acknowledgment number
of N, it is acknowledging receipt of all data bytes numbered less than N. Standard TCP provides no mechanism for acknowledging receipt of packets 1, 2, 3 and 5; the highest cumulative acknowledgment that could
be sent in that situation would be to acknowledge packet 3.
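As a tiny worked illustration of the cumulative rule (not TCP code, just the arithmetic), the following Java method computes the highest cumulative acknowledgment that could be sent for a given pattern of received blocks:

class CumulativeAck {
    // received[i] is true if block i (1-based) has arrived
    static int highestCumulativeAck(boolean[] received) {
        int acked = 0;
        for (int i = 1; i < received.length && received[i]; i++) acked = i;
        return acked;    // e.g. blocks 1, 2, 3 and 5 received => returns 3
    }
}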
The TCP header defines six important flag bits; the brief definitions here are expanded upon in the sequel:
• SYN: for SYNchronize; marks packets that are part of the new-connection handshake
• ACK: indicates that the header Acknowledgment field is valid; that is, all but the first packet
• FIN: for FINish; marks packets involved in the connection closing
• PSH: for PuSH; marks non-full packets that should be delivered promptly at the far end
• RST: for ReSeT; indicates various error conditions
• URG: for URGent; part of a now-seldom-used mechanism for high-priority data
• CWR and ECE: part of the Explicit Congestion Notification mechanism, 14.8.2 Explicit Congestion Notification (ECN)
Normally, the three-way handshake is triggered by an application's request to connect; data can be sent only
after the handshake completes. This means a one-RTT delay before any data can be sent. The original TCP
standard RFC 793 does allow data to be sent with the first SYN packet, as part of the handshake, but such
data cannot be released to the remote-endpoint application until the handshake completes. Most traditional
TCP programming interfaces offer no support for this early-data option.
There are recurrent calls for TCP to support earlier data in a more useful manner, so as to achieve request/reply turnaround comparable to that with RPC (11.4 Remote Procedure Call (RPC)). We return to
this in 12.11 TCP Faster Opening.
To close the connection, a superficially similar exchange involving FIN packets may occur:
• A sends B a packet with the FIN bit set (a FIN packet), announcing that it has finished sending data
• B sends A an ACK of the FIN
• When B is also ready to cease sending, it sends its own FIN to A
• A sends B an ACK of the FIN; this is the final packet in the exchange
The FIN handshake is really more like two separate two-way FIN/ACK handshakes. If B is ready to close
immediately, it may send its FIN along with its ACK of A's FIN, as is shown in the above diagram at the
left. In theory this is rare, however, as the ACK of A's FIN is generated by the kernel but B's FIN cannot
be sent until B's process is scheduled to run on the CPU. On the other hand, it is possible for B to send a
considerable amount of data back to A after A sends its FIN, as is shown at the right. The FIN is, in effect, a
promise not to send any more, but that side of the connection must still be prepared to receive data. A good
example of this occurs when A is sending a stream of data to B to be sorted; A sends FIN to indicate that it
is done sending, and only then does B sort the data and begin sending it back to A. This can be generated
with the command, on A, cat thefile | ssh B sort.
In the following table, relative sequence numbers are used, which is to say that sequence numbers begin
with 0 on each side. The SEQ numbers in bold on the A side correspond to the ACK numbers in bold on
the B side; they both count data flowing from A to B.
       A sends                                   B sends
1      SYN, seq=0
2                                                SYN+ACK, seq=0, ack=1 (expecting)
3-15   (remaining rows: the data strings "abc", "defg", "foobar", "hello" and
       "goodbye" with their seq and ack numbers, followed by the closing FIN/ACK
       exchange; row 11,12 shows A's FIN crossing B's ACK of "goodbye")
(We will see below that this table is slightly idealized, in that real sequence numbers do not start at 0.)
Here is the ladder diagram corresponding to this connection:
In terms of the sequence and acknowledgment numbers, SYNs count as 1 byte, as do FINs. Thus, the SYN
counts as sequence number 0, and the first byte of data (the "a" of "abc") counts as sequence number 1.
Similarly, the ack=21 sent by the B side is the acknowledgment of "goodbye", while the ack=22 is the
acknowledgment of A's subsequent FIN.
Whenever B sends ACK=n, A follows by sending more data with SEQ=n.
TCP does not in fact transport relative sequence numbers, that is, sequence numbers as transmitted do not
begin at 0. Instead, each side chooses its Initial Sequence Number, or ISN, and sends that in its initial
SYN. The third ACK of the three-way handshake is an acknowledgment that the server side's SYN response
was received correctly. All further sequence numbers sent are the ISN chosen by that side plus the relative
sequence number (that is, the sequence number as if numbering did begin at 0). If A chose ISNA=1000, we
would add 1000 to all the bold entries above: A would send SYN(seq=1000), B would reply with ISNB and
ack=1001, and the last two lines would involve ack=1022 and seq=1022 respectively. Similarly, if B chose
ISNB=7000, then we would add 7000 to all the seq values in the "B sends" column and all the ack values in
the "A sends" column. The table above, up to the point B sends "goodbye", with actual sequence numbers
instead of relative sequence numbers, is below:
       A, ISN=1000                               B, ISN=7000
1      SYN, seq=1000
2                                                SYN+ACK, seq=7000, ack=1001
3-10   (remaining rows: as in the earlier table, with 1000 added to the A-side seq
       and B-side ack values, and 7000 added to the B-side seq and A-side ack values)
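The relative-to-absolute conversion is just 32-bit modular arithmetic; here is a small illustrative Java helper (the names are ours, not part of any TCP implementation):

class SeqNumbers {
    // sequence numbers are unsigned 32-bit values, so arithmetic wraps at 2^32
    static long toAbsolute(long isn, long relativeSeq) {
        return (isn + relativeSeq) & 0xFFFFFFFFL;
    }
    static long toRelative(long isn, long absoluteSeq) {
        return (absoluteSeq - isn) & 0xFFFFFFFFL;
    }
    public static void main(String[] args) {
        // with ISNA=1000, relative seq 22 becomes 1022, matching the text above
        System.out.println(toAbsolute(1000, 22));   // prints 1022
    }
}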
If B had not been LISTENing at the port to which A sent its SYN, its response would have been RST
(reset), meaning in this context "connection refused". Similarly, if A sent data to B before the SYN
packet, the response would have been RST. Finally, either side can abort the connection at any time by
sending RST.
If A sends a series of small packets to B, then B has the option of assembling them into a full-sized I/O buffer
before releasing them to the receiving application. However, if A sets the PSH bit on each packet, then B
should release each packet immediately to the receiving application. In Berkeley Unix and most (if not all)
BSD-derived socket-library implementations, there is in fact no way to set the PSH bit; it is set automatically
for each write. (But even this is not guaranteed as the sender may leave the bit off or consolidate several
PuSHed writes into one packet; this makes using the PSH bit as a record separator difficult. In the program
written to generate the WireShark packet trace, below, most of the time the strings "abc", "defg", etc were
PuSHed separately but occasionally they were consolidated into one packet.)
As for the URG bit, imagine an ssh connection, in which A has sent a large amount of data to B, which
is momentarily stalled processing it. The user at A wishes to abort the connection by sending the interrupt
character CNTL-C. Under normal processing, the application at B would have to finish processing all the
pending data before getting to the CNTL-C; however, the use of the URG bit can enable immediate asynchronous delivery of the CNTL-C. The bit is set, and the TCP headers Urgent Pointer field points to the
CNTL-C far ahead in the normal data stream. The receiver then skips processing the arriving data stream in
first-come-first-served order and processes the urgent data first. For this to work, the receiving process must
have signed up to receive an asynchronous signal when urgent data arrives.
The urgent data does appear in the ordinary TCP data stream, and it is up to the protocol to determine the
length of the urgent data substring, and what to do with the unread, buffered data sent ahead of the urgent
data. For the CNTL-C example, the urgent data consists of a single character, and the earlier data is simply
discarded.
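Socket interfaces expose urgent data in different ways; in Java, for example, a single urgent byte can be sent with Socket.sendUrgentData(). The sketch below shows only the sending side, and the choice of the CNTL-C byte is ours for illustration; a Java receiver can at best read the urgent byte inline via setOOBInline(true), without the asynchronous-signal delivery described above.

import java.io.IOException;
import java.net.Socket;

class UrgentDataSketch {
    // send an interrupt character as TCP urgent data
    static void sendInterrupt(Socket s) throws IOException {
        s.sendUrgentData(0x03);    // 0x03 is the CNTL-C character
    }
}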
The data in the packet can be inferred from the WireShark Len field, as each of the data strings sent has a
different length.
The packets are numbered the same as in the table above up through packet 8, containing the string foobar.
At that point the table shows B replying by a combined ACK plus the string "hello"; in fact, TCP sent the
ACK alone and then the string "hello"; these are WireShark packets 9 and 10 (note packet 10 has Len=5).
WireShark packet 11 is then a standalone ACK from A to B, acknowledging the "hello". WireShark packet
12 (the packet highlighted) then corresponds to table packet 10, and contains "goodbye" (Len=7); this string
can be seen at the right side of the bottom pane.
The table view shows A's FIN (packet 11) crossing with B's ACK of "goodbye" (packet 12). In the WireShark view, A's FIN is packet 13, and is sent about 0.01 seconds after "goodbye"; B then ACKs them both
with packet 14. That is, the table-view packet 12 does not exist in the WireShark view.
Packets 11, 13, 14 and 15 in the table and 13, 14, 15 and 16 in the WireShark screen dump correspond to
the connection closing. The program that generated the exchange at B's side had to include a sleep delay
of 40 ms between detecting the closed connection (that is, reading A's FIN) and closing its own connection
(and sending its own FIN); otherwise the ACK of A's FIN traveled in the same packet with B's FIN.
The ISN for A here was 551144795 and B's ISN was 1366676578. The actual pcap packet-capture file is at
demo_tcp_connection.pcap. The hardware involved used TCP checksum offloading to have the networkinterface card do the actual checksum calculations; as a result, the checksums are wrong in the actual capture
file. WireShark has an option to disable the reporting of this.
            try {
                s = ss.accept();
            } catch (IOException ioe) {
                System.err.println("Can't accept");
                break;
            }

            if (THREADING) {
                Talker talk = new Talker(s);
                (new Thread(talk)).start();
            } else {
                line_talker(s);
            }
        } // accept loop
    } // end of main

        try {istr.close();}
        catch (IOException ioe) {System.err.println("bad stream close"); return;}
        try {s.close();}
        catch (IOException ioe) {System.err.println("bad socket close"); return;}
        System.err.println("socket to port " + port + " closed");
    } // line_talker
Here is the client tcp_stalkc.java. As with the UDP version, the default host to connect to is localhost.
We first call InetAddress.getByName() to perform the DNS lookup. Part of the construction of the
Socket object is the connection to the desired dest and destport.
// TCP simplex-talk CLIENT in java
import java.net.*;
import java.io.*;

public class stalkc {

    static public BufferedReader bin;
    static public int destport = 5431;

    static public void main(String args[]) {
        String desthost = "localhost";
        if (args.length >= 1) desthost = args[0];

        bin = new BufferedReader(new InputStreamReader(System.in));

        InetAddress dest;
        System.err.print("Looking up address of " + desthost + "...");
        try {
            dest = InetAddress.getByName(desthost);
        }
        catch (UnknownHostException uhe) {
            System.err.println("unknown host: " + desthost);
            return;
        }
        System.err.println(" got it!");

        System.err.println("connecting to port " + destport);
        Socket s;
        try {
            s = new Socket(dest, destport);
        }
        catch(IOException ioe) {
            System.err.println("cannot connect to <" + desthost + "," + destport + ">");
            return;
        }

        OutputStream sout;
        try {
            sout = s.getOutputStream();
        }
        catch (IOException ioe) {
            System.err.println("I/O failure!");
            return;
        }

        //============================================================

        while (true) {
            String buf;
            try {
                buf = bin.readLine();
            }
            catch (IOException ioe) {
                System.err.println("readLine() failed");
                return;
            }
            if (buf == null) break;      // user typed EOF character

            buf = buf + "\n";            // protocol requires sender includes \n
            byte[] bbuf = buf.getBytes();

            try {
                sout.write(bbuf);
            }
            catch (IOException ioe) {
                System.err.println("write() failed");
                return;
            }
        } // while
    }
}
Here is the ladder diagram for the 14-packet connection described above, this time labeled with TCP states.
The reader who is implementing TCP is encouraged to consult RFC 793 and updates. For the rest of us,
below are a few general observations about opening and closing connections.
Either side may elect to close the connection (just as either party to a telephone call may elect to hang up).
The first side to send a FIN takes the Active CLOSE path; the other side takes the Passive CLOSE path.
Although it essentially never occurs in practice, it is possible for each side to send the other a SYN, requesting a connection, simultaneously (that is, the SYNs cross on the wire). The telephony analogue occurs
when each party dials the other simultaneously. On traditional land-lines, each party then gets a busy signal.
On cell phones, your mileage may vary. With TCP, a single connection is created. With OSI TP4, two
connections are created. The OSI approach is not possible in TCP, as a connection is determined only by
the socketpair involved; if there is only one socketpair then there can be only one connection.
A simultaneous close (having both sides simultaneously send each other FINs) is a little more likely,
though still not very. Each side would move to state FIN_WAIT_1. Then, upon receiving each other's FIN
packets, each side would move to CLOSING, and then to TIMEWAIT.
A TCP endpoint is half-closed if it has sent its FIN (thus promising not to send any more data) and is
waiting for the other side's FIN. A TCP endpoint is half-open if it is in the ESTABLISHED state, but in the
meantime the other side has rebooted. As soon as the ESTABLISHED side sends a packet, the other side
will respond with RST and the connection will be fully closed.
The normal close is as follows.
In this scenario, A has moved from ESTABLISHED to FIN_WAIT_1 to FIN_WAIT_2 to TIMEWAIT (below) to CLOSED. B moves from ESTABLISHED to CLOSE_WAIT to LAST_ACK to CLOSED. All this
essentially amounts to two separate two-way closure handshakes.
However, it is possible for B's ACK and FIN to be combined. In this case, A moves directly from
FIN_WAIT_1 to TIMEWAIT. In order for this to happen, when A's FIN arrives at B, the socket-owning
process at B must immediately wake up, recognize that A has closed its end, and immediately close its own
end as well. This generates B's FIN; all this must happen before B's TCP layer sends the ACK of A's FIN.
If the TCP layer adopts a policy of immediately sending ACKs upon receipt of any packet, this will never
happen, as the FIN will arrive well before B's process can be scheduled to do anything. However, if B delays
its ACKs slightly, then it is possible for B's ACK and FIN to be sent together.
Although this is not evident from the state diagram, the per-state response rules of TCP require that in the
ESTABLISHED state, if the receiver sends an ACK outside the current sliding window, then the correct
response is to reply with one's own current ACK. This includes the case where the receiver acknowledges
data not yet sent.
It is possible to view connection states under either linux or Windows with netstat -a. Most states are
ephemeral, exceptions being LISTEN, ESTABLISHED, TIMEWAIT, and CLOSE_WAIT. One sometimes
sees large numbers of connections in CLOSE_WAIT, meaning that the remote endpoint has closed the
connection and sent its FIN, but the process at your end has not executed close() on its socket. Often
this represents a programming error; alternatively, the process at the local end is still working on something.
Given a local port number p in state CLOSE_WAIT on a linux system, the (privileged) command lsof -i
:p will identify the process using port p.
For TCP, it is the actual sequence numbers, rather than the relative sequence numbers, that would have to
match up. The diagram above ignores that.
As with TFTP, coming up with a possible scenario accounting for the generation of such a late packet is not
easy. Nonetheless, many of the design details of TCP represent attempts to minimize this risk.
Solutions to the old-duplicates problem generally involve setting an upper bound on the lifetime of any
packet, the MSL, as we shall see in the next section. T/TCP (12.11 TCP Faster Opening) introduced a
connection-count field for this.
TCP is also vulnerable to sequence-number wraparound: arrival of an old duplicate from the same instance
of the connection. However, if we take the MSL to be 60 seconds, sequence-number wrap requires sending
2^32 bytes in 60 seconds, which requires a data-transfer rate in excess of 500 Mbps. TCP offers a fix for this
(Protection Against Wrapped Segments, or PAWS), but it was introduced relatively late; we return to this in
12.10 Anomalous TCP scenarios.
12.8 TIMEWAIT
The TIMEWAIT state is entered by whichever side initiates the connection close; in the event of a simultaneous close, both sides enter TIMEWAIT. It is to last for a time 2×MSL, where MSL = Maximum Segment
Lifetime is an agreed-upon value for the maximum lifetime on the Internet of an IP packet. Traditionally MSL was taken to be 60 seconds, but more modern implementations often assume 30 seconds (for a
TIMEWAIT period of 60 seconds).
One function of TIMEWAIT is to solve the external-old-duplicates problem. TIMEWAIT requires that
between closing and reopening a connection, a long enough interval must pass that any packets from the
first instance will disappear. After the expiration of the TIMEWAIT interval, an old duplicate cannot arrive.
A second function of TIMEWAIT is to address the lost-final-ACK problem (11.2 Fundamental Transport
Issues). If host A sends its final ACK to host B and this is lost, then B will eventually retransmit its final
packet, which will be its FIN. As long as A remains in state TIMEWAIT, it can appropriately reply to a
retransmitted FIN from B with a duplicate final ACK.
TIMEWAIT only blocks reconnections for which both sides reuse the same port they used before. If A
connects to B and closes the connection, A is free to connect again to B using a different port at As end.
Conceptually, a host may have many old connections to the same port simultaneously in TIMEWAIT; the
host must thus maintain for each of its ports a list of all the remote ⟨IP_address,port⟩ sockets currently in
TIMEWAIT for that port. If a host is connecting as a client, this list likely will amount to a list of recently
used ports; no port is likely to have been used twice within the TIMEWAIT interval. If a host is a server,
however, accepting connections on a standardized port, and happens to be the side that initiates the active
close and thus later goes into TIMEWAIT, then its TIMEWAIT list for that port can grow quite long.
Generally, busy servers prefer to be free from these bookkeeping requirements of TIMEWAIT, so many
protocols are designed so that it is the client that initiates the active close. In the original HTTP protocol,
version 1.0, the server sent back the data stream requested by the http GET message, and indicated the end
of this stream by closing the connection. In HTTP 1.1 this was fixed so that the client initiated the close;
this required a new mechanism by which the server could indicate "I am done sending this file". HTTP 1.1
also used this new mechanism to allow the server to send back multiple files over one connection.
In an environment in which many short-lived connections are made from host A to the same port on server
B, port exhaustion (having all ports tied up in TIMEWAIT) is a theoretical possibility. If A makes 1000
connections per second, then after 60 seconds it has gone through 60,000 available ports, and there are
essentially none left. While this rate is high, early Berkeley-Unix TCP implementations often made only
about 4,000 ports available to clients; with a 120-second TIMEWAIT interval, port exhaustion would occur
with only 33 connections per second.
If you use ssh to connect to a server and then issue the netstat -a command on your own host (or, more
conveniently, netstat -a |grep -i tcp), you should see your connection in ESTABLISHED state.
If you close your connection and check again, your connection should be in TIMEWAIT.
of a reopening the same connection, the arrival of a second SYN with a new ISN means that the original
connection cannot proceed, because that ISN is now wrong.
The clock-driven ISN also originally added a second layer of protection against external old duplicates.
Suppose that A opens a connection to B, and chooses a clock-based ISN N1. A then transfers M bytes
of data, closes the connection, and reopens it with ISN N2. If N1 + M < N2, then the old-duplicates
problem cannot occur: all of the absolute sequence numbers used in the first instance of the connection
are less than or equal to N1 + M, and all of the absolute sequence numbers used in the second instance
will be greater than N2 . In fact, early Berkeley-Unix implementations of the socket library often allowed
a second connection meeting this ISN requirement to be reopened before TIMEWAIT would have expired;
this potentially addressed the problem of port exhaustion. Of course, if the first instance of the connection
transferred data faster than the ISN clock rate, that is at more than 250,000 bytes/sec, then N1 + M would
be greater than N2 , and TIMEWAIT would have to be enforced. But in the era in which TCP was first
developed, sustained transfers exceeding 250,000 bytes/sec were not common.
The three-way handshake was extensively analyzed by Dalal and Sunshine in [DS78]. The authors noted
that with a two-way handshake, the second side receives no confirmation that its ISN was correctly received.
The authors also observed that a four-way handshake in which the ACK of ISNA is sent separately from
ISNB, as in the diagram below, could fail if one side restarted.
For this failure to occur, assume that after sending the SYN in line 1, with ISNA1 , A restarts. The ACK
in line 2 is either ignored or not received. B now sends its SYN in line 3, but A interprets this as a new
connection request; it will respond after line 4 by sending a fifth SYN packet, containing a different ISNA2.
For B the connection is now ESTABLISHED, and if B acknowledges this fifth packet but fails to update its
record of A's ISN, the connection will fail, as A and B would have different notions of ISNA.
to be B, and thus get a privileged command invoked. The connection only needs to be started; if the ruse
is discovered after the command is executed, it is too late. M can easily send a SYN packet to A with
B's IP address in the source-IP field; M can probably temporarily disable B too, so that A's SYN-ACK
response, which is sent to B, goes unnoticed. What is harder is for M to figure out how to ACK
ISNA . But if A generates ISNs with a slowly incrementing clock, M can guess the pattern of the clock with
previous connection attempts, and can thus guess ISNA with a considerable degree of accuracy. So M sends
SYN to A with B as source, A sends SYN-ACK to B containing ISNA , and M guesses this value and sends
ACK(ISNA +1) to A, again with B listed in the IP header as source, followed by a single-packet command.
This IP-spoofing technique was first described by Robert T Morris in [RTM85]; Morris went on to launch
the Internet Worm of 1988 using unrelated attacks. The IP-spoofing technique was used in the 1994 Christmas Day attack against UCSD, launched from Loyola's own apollo.it.luc.edu; the attack was associated with
Kevin Mitnick though apparently not actually carried out by him. Mitnick was arrested a few months later.
RFC 1948, in May 1996, introduced a technique for introducing a degree of randomization in ISN selection,
while still ensuring that the same ISN would not be used twice in a row for the same connection. The ISN
is to be the sum of the 4-µs clock, C(t), and a secure hash of the connection information as follows:
ISN = C(t) + hash(local_addr, local_port, remote_addr, remote_port, key)
The key value is a random value chosen by the host on startup. While M, above, can poll A for its current
ISN, and can probably guess the hash function and the first four parameters above, without knowing the key
it cannot determine (or easily guess) the ISN value A would have sent to B. Legitimate connections between
A and B, on the other hand, see the ISN increasing at the 4-µs rate.
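Here is a rough Java sketch of this computation; the 32-bit truncation of the hash and the use of SHA-256 are our own illustrative choices, as RFC 1948 leaves the hash function unspecified.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

class IsnSketch {
    private final byte[] key;                    // random per-boot secret

    IsnSketch(byte[] key) { this.key = key; }

    // ISN = C(t) + hash(local_addr, local_port, remote_addr, remote_port, key), mod 2^32
    long isn(String localAddr, int localPort, String remoteAddr, int remotePort,
             long microseconds) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        md.update((localAddr + ":" + localPort + ":" + remoteAddr + ":" + remotePort)
                  .getBytes(StandardCharsets.UTF_8));
        md.update(key);
        byte[] h = md.digest();
        long hash32 = ((h[0] & 0xFFL) << 24) | ((h[1] & 0xFFL) << 16)
                    | ((h[2] & 0xFFL) << 8) | (h[3] & 0xFFL);
        long clock = (microseconds / 4) & 0xFFFFFFFFL;   // the 4-microsecond clock C(t)
        return (clock + hash32) & 0xFFFFFFFFL;
    }
}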
no new connections are to be accepted for 1*MSL. No known implementations actually do this; instead,
they assume that the restarting process itself will take at least one MSL. This is no longer as certain as it
once was, but serious consequences have not ensued.
If the sender receives a TCP ACK for the packet, on the other hand, indicating that it made it through to the
other end, it might try a larger size. Usually, the size range of 512-1500 bytes is covered by less than a dozen
discrete values; the point is not to find the exact Path MTU but to determine a reasonable approximation
rapidly.
Bandwidth Conservation
Delayed ACKs and the Nagle algorithm both originated in a bygone era, when bandwidth was in much
shorter supply than it is today. In RFC 896, John Nagle writes (in 1984) "In general, we have not been
able to afford the luxury of excess long-haul bandwidth that the ARPANET possesses, and our long-haul links are heavily loaded during peak periods. Transit times of several seconds are thus common
in our network." Today, it is unlikely that extra small packets would cause significant problems.
Each of these polling packets elicits the receivers current ACK; the end result is that the sender will
receive the eventual window-enlargement announcement reliably. These polling packets are regulated by
the so-called persist timer.
12.18 KeepAlive
There is no reason that a TCP connection should not be idle for a long period of time; ssh/telnet connections,
for example, might go unused for days. However, there is the turned-off-at-night problem: a workstation
might telnet into a server, and then be shut off (not shut down gracefully) at the end of the day. The
connection would now be half-open, but the server would not generate any traffic and so might never detect
this; the connection itself would continue to tie up resources.
KeepAlive in action
One evening long ago, when dialed up (yes, that long ago) into the Internet, my phone line disconnected
while I was typing an email message in an ssh window. I dutifully reconnected, expecting to find my
message in the file dead.letter, which is what would have happened had I been disconnected while
using the even-older tty dialup. Alas, nothing was there. I reconstructed my email as best I could and
logged off.
The next morning, there was my lost email in a file dead.letter, dated two hours after the initial crash!
What had happened, apparently, was that the original ssh connection on the server side just hung there,
half-open. Then, after two hours, KeepAlive kicked in, and aborted the connection. At that point ssh
sent my mail program the HangUp signal, and the mail program wrote out what it had in dead.letter.
To avoid this, TCP supports an optional KeepAlive mechanism: each side polls the other with a dataless
packet. The original RFC 1122 KeepAlive timeout was 2 hours, but this could be reduced to 15 minutes. If
a connection failed the KeepAlive test, it would be closed.
Supposedly, some TCP implementations are not exactly RFC 1122-compliant: either KeepAlives are enabled by default, or the KeepAlive interval is much smaller than called for in the specification.
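Applications normally request KeepAlive through the socket API rather than speaking to TCP directly; in Java, for example, it is a one-line option on the Socket (the probe interval itself is a system-wide setting, not chosen here):

import java.net.Socket;
import java.net.SocketException;

class KeepAliveSketch {
    // ask the TCP layer to send periodic KeepAlive probes on an otherwise idle connection
    static void enableKeepAlive(Socket s) throws SocketException {
        s.setKeepAlive(true);
    }
}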
12.20 Epilog
At this point we have covered the basic mechanics of TCP, but have one important topic remaining: how
TCP manages its window size so as to limit congestion. That will be the focus of the next three chapters.
12.21 Exercises
1. Experiment with the TCP version of simplex-talk. How does the server respond differently with
threading enabled and without, if two simultaneous attempts to connect are made?
2. Trace the states visited if nodes A and B attempt to create a TCP connection by simultaneously sending
each other SYN packets, that then cross in the network. Draw the ladder diagram.
3. When two nodes A and B simultaneously attempt to connect to one another using the OSI TP4 protocol,
two bidirectional network connections are created (rather than one, as with TCP). If TCP had instead chosen
the TP4 semantics here, what would have to be added to the TCP header? Hint: if a packet from ⟨A,port1⟩
arrives at ⟨B,port2⟩, how would we tell to which of the two possible connections it belongs?
4. Simultaneous connection initiations are rare, but simultaneous connection termination is relatively common. How do two TCP nodes negotiate the simultaneous sending of FIN packets to one another? Draw the
ladder diagram. Which node goes into TIMEWAIT state?
5. (a) Suppose you see multiple connections on your workstation in state FIN_WAIT_1. What is likely
going on? Whose fault is it?
(b). What might be going on if you see connections languishing in state FIN_WAIT_2?
6. Suppose that, after downloading a file, a user workstation is unplugged from the network. The workstation
may or may not have first sent a FIN to start closing the connection.
(a). Suppose the receiver has not sent the first FIN. According to the transitions shown in the TCP state
diagram, what TCP states could be reached by the server?
(b). Now suppose the receiver has sent its FIN before being unplugged. Again, what TCP states could be
reached by the server?
(Eventually, the server connection should transition to CLOSED due to repeated timeouts, but this is not
shown in the state diagram.)
7. Suppose A and B create a TCP connection with ISNA =20,000 and ISNB =5,000. A sends three 1000-byte
packets (Data1, Data2 and Data3 below), and B ACKs each. Then B sends a 1000-byte packet DataB to A
and terminates the connection with a FIN. In the table below, fill in the SEQ and ACK fields for each packet
shown.
       A sends                                   B sends
       SYN, ISNA=20,000
                                                 SYN, ISNB=5,000, ACK=______
       (remaining rows: Data1, Data2 and Data3 with B's ACK of each, then DataB
       and B's FIN, each with SEQ=______ and ACK=______ to be filled in)
8. Suppose you are watching a video on a site like YouTube, where there is a progress bar showing how
much of the video has been downloaded (and which hopefully stays comfortably ahead of the second progress
bar showing your position in viewing the video). Occasionally, the download-progress bar jumps ahead by
a modest but significant amount, instantaneously fast (much faster than the bandwidth alone could account
for). At the TCP layer, what is going on to cause this jump? Hint: what does a TCP receiver perceive when
a packet is lost and retransmitted?
9. Suppose you are creating software for a streaming-video site. You want to limit the video read-ahead
(the gap between how much has been downloaded and how much the viewer has actually watched) to 100
KB; the server should pause in sending when necessary to enforce this. An ordinary TCP connection will
simply transfer data as fast as possible. What support, if any, do you need the receiving (client) application
layer to provide, in order to enable this? What must the server-side application then do?
10. A user moves the computer mouse and sees the mouse-cursors position updated on the screen. Suppose
the mouse-position updates are being transmitted over a TCP connection with a relatively long RTT. The
user attempts to move the cursor to a specific point. How will the user perceive the mouse's motion?
11. Suppose you have fallen in with a group that wants to add to TCP a feature so that, if A and B1 are
connected, then B1 can hand off its connection to a different host B2; the end result is that A and B2 are
connected and A has received an uninterrupted stream of data. Either A or B1 can initiate the handoff.
(a). Suppose B1 is the host to send the final FIN (or HANDOFF) packet to A. How would you handle
appropriate analogues of the TIMEWAIT state for host B1? Does the fact that A is continuing the
connection, just not with B1, matter?
(b). Now suppose A is the party to send the final FIN/HANDOFF, to B1. What changes to TIMEWAIT
would have to be made at A's end? Note that A may potentially hand off the connection again and again, eg
to B3, B4 and then B5.
13 TCP RENO AND CONGESTION MANAGEMENT
This chapter addresses how TCP manages congestion, both for the connection's own benefit (to improve its
throughput) and for the benefit of other connections as well (which may result in our connection reducing its
own throughput). Early work on congestion culminated in 1990 with the flavor of TCP known as TCP Reno.
The congestion-management mechanisms of TCP Reno remain the dominant approach on the Internet today,
though alternative TCPs are an active area of research and we will consider a few of them in 15 Newer
TCP Implementations.
The central TCP mechanism here is for a connection to adjust its window size. A smaller winsize means
fewer packets are out in the Internet at any one time, and less traffic means less congestion. All TCPs reduce
winsize when congestion is apparent, and increase it when it is not. The trick is in figuring out when and
by how much to make these winsize changes. Many of the improvements to TCP have come from mining
more and more information from the ACK stream.
Recall Chiu and Jains definition from 1.7 Congestion that the knee of congestion occurs when the queue
first starts to grow, and the cliff of congestion occurs when packets start being dropped. Congestion can
be managed at either point, though dropped packets can be a significant waste of resources. Some newer
TCP strategies attempt to take action at the congestion knee, but TCP Reno is a cliff-based strategy: packets
must be lost before the sender reduces the window size.
In 19 Quality of Service we will consider some router-centric alternatives to TCP for Internet congestion
management. However, for the most part these have not been widely adopted, and TCP is all that stands in
the way of Internet congestive collapse.
The first question one might ask about TCP congestion management is just how did it get this job? A TCP
sender is expected to monitor its transmission rate so as to cooperate with other senders to reduce overall
congestion among the routers. While part of the goal of every TCP node is good, stable performance for
its own connections, this emphasis on end-user cooperation introduces the prospect of cheating: a host
might be tempted to maximize the throughput of its own connections at the expense of others. Putting
TCP nodes in charge of congestion among the core routers is to some degree like putting the foxes in
charge of the henhouse. More accurately, such an arrangement has the potential to lead to the Tragedy of
the Commons. Multiple TCP senders share a common resource (the Internet backbone) and while the
backbone is most efficient if every sender cooperates, each individual sender can improve its own situation
by sending faster than allowed. Indeed, one of the arguments used by virtual-circuit routing adherents is that
it provides support for the implementation of a wide range of congestion-management options under control
of a central authority.
Nonetheless, TCP has been quite successful at distributed congestion management. In part this has been
because system vendors do have an incentive to take the big-picture view, and in the past it has been quite
difficult for individual users to replace their TCP stacks with rogue versions. Another factor contributing to
TCPs success here is that most bad TCP behavior requires cooperation at the server end, and most server
managers have an incentive to behave cooperatively. Servers generally want to distribute bandwidth fairly
among their multiple clients, and, theoretically at least, a server's ISP could penalize misbehavior. So far,
at least, the TCP approach has worked remarkably well.
loss, ie the cwnd that at that particular moment completely fills but does not overflow the bottleneck queue.
We have reached the ceiling when the queue is full.
In Chiu and Jain's terminology, the far side of the ceiling is the cliff, at which point packets are lost. TCP
tries to stay above the knee, which is the point when the queue first begins to be persistently utilized, thus
keeping the queue at least partially occupied; whenever it sends too much and falls off the cliff, it retreats.
The ceiling concept is often useful, but not necessarily as precise as it might sound. If we have reached the
ceiling by gradually expanding the sliding-windows window size, then winsize will be as large as possible.
But if the sender suddenly releases a burst of packets, the queue may fill and we will have reached the ceiling
without fully utilizing the transit capacity. Another source of ceiling ambiguity is that the bottleneck link
may be shared with other connections, in which case the ceiling represents our connection's particular share,
which may fluctuate greatly with time. Finally, at the point when the ceiling is reached, the queue is full
and so there are a considerable number of packets waiting in the queue; it is not possible for a sender to pull
back instantaneously.
It is time to acknowledge the existence of different versions of TCP, each incorporating different congestionmanagement algorithms. The two we will start with are TCP Tahoe (1988) and TCP Reno (1990); the
names Tahoe and Reno were originally the codenames of the Berkeley Unix distributions that included
these respective TCP implementations. The ideas behind TCP Tahoe came from a 1988 paper by Jacobson
and Karels [JK88]; TCP Reno then refined this a couple years later. TCP Reno is still in widespread use
over twenty years later, and is still the undisputed TCP reference implementation, although some modest
improvements (NewReno, SACK) have crept in.
A common theme to the development of improved implementations of TCP is for one end of the connection
(usually the sender) to extract greater and greater amounts of information from the packet flow. For example,
TCP Tahoe introduced the idea that duplicate ACKs likely mean a lost packet; TCP Reno introduced the
idea that returning duplicate ACKs are associated with packets that have successfully been transmitted but
follow a loss. TCP Vegas (15.4 TCP Vegas) introduced the fine-grained measurement of RTT, to detect
when RTT > RTTnoLoad .
It is often helpful to think of a TCP sender as having breaks between successive windowfuls; that is, the
sender sends cwnd packets, is briefly idle, and then sends another cwnd packets. The successive windowfuls of packets are often called flights. The existence of any separation between flights is, however, not
guaranteed.
We are informally measuring cwnd in units of full packets; strictly speaking, cwnd is measured in bytes
and is incremented by the maximum TCP segment size.
This strategy here is known as Additive Increase, Multiplicative Decrease, or AIMD; cwnd = cwnd+1 is
the additive increase and cwnd = cwnd/2 is the multiplicative decrease. Typically, setting cwnd=cwnd/2
is a medium-term goal; in fact, TCP Tahoe briefly sets cwnd=1 in the immediate aftermath of an actual
timeout. With no losses, TCP will send successive windowfuls of, say, 20, 21, 22, 23, 24, .... This amounts
to conservative probing of the network and, in particular, of the queue at the bottleneck router. TCP tries
larger cwnd values because the absence of loss means the current cwnd is below the network ceiling; that
is, the queue at the bottleneck router is not yet overfull.
If a loss occurs (including multiple losses in a single windowful), TCP's response is to cut the window size
in half. (As we will see, TCP Tahoe actually handles this in a somewhat roundabout way.) Informally, the
idea is that the sender needs to respond aggressively to congestion. More precisely, lost packets mean the
queue of the bottleneck router has filled, and the sender needs to dial back to a level that will allow the queue
to clear. If we assume that the transit capacity is roughly equal to the queue capacity (say each is equal to
N), then we overflow the queue and drop packets when cwnd = 2N, and so cwnd = cwnd/2 leaves us with
cwnd = N, which just fills the transit capacity and leaves the queue empty. (When the sender sets cwnd=N,
the actual number of packets in transit takes at least one RTT to fall from 2N to N.)
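The additive-increase/multiplicative-decrease rule is simple enough to simulate in a few lines; the toy Java loop below, with an arbitrary fixed ceiling of 40 packets, produces the sawtooth pattern discussed below (real TCP behavior is of course far more involved):

class AimdSketch {
    public static void main(String[] args) {
        int cwnd = 20;                              // illustrative starting window, in packets
        final int ceiling = 40;                     // assumed fixed network ceiling
        for (int rtt = 0; rtt < 100; rtt++) {
            if (cwnd > ceiling) cwnd = cwnd / 2;    // loss: multiplicative decrease
            else cwnd = cwnd + 1;                   // no loss: additive increase
            System.out.println(rtt + " " + cwnd);   // plotting cwnd vs rtt gives the sawtooth
        }
    }
}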
Of course, assuming any relationship between transit capacity and queue capacity is highly speculative. On
a 5,000 km fiber-optic link with a bandwidth of 1 Gbps, the round-trip transit capacity would be about 6
MB. That is much larger than the likely size of any router queue. A more contemporary model of a typical
long-haul high-bandwidth TCP path might be that the queue size is a small fraction of the bandwidth×delay
product; we return to this in 13.7 TCP and Bottleneck Link Utilization.
Note that if TCP experiences a packet loss, and there is an actual timeout, then the sliding-window pipe
has drained. No packets are in flight. No self-clocking can govern new transmissions. Sliding windows
therefore needs to restart from scratch.
The congestion-avoidance algorithm leads to the classic TCP sawtooth graph, where the peaks are at
the points where the slowly rising cwnd crossed above the network ceiling. We emphasize that the
TCP sawtooth is specific to TCP Reno and related TCP implementations that share Reno's additive-increase/multiplicative-decrease mechanism.
During periods of no loss, TCP's cwnd increases linearly; when a loss occurs, TCP sets cwnd = cwnd/2.
This diagram is an idealization: when a loss occurs it takes the sender some time to discover it, perhaps as
much as the TimeOut interval.
The fluctuation shown here in the red ceiling curve is somewhat arbitrary. If there are only one or two
other competing senders, the ceiling variation may be quite dramatic, but with many concurrent senders the
variations may be smoothed out.
For some TCP sawtooth graphs created through actual simulation, see 16.2.1 Graph of cwnd v time and
16.4.1 Some TCP Reno cwnd graphs.
13.1.1.1 A first look at fairness
The transit capacity of the path is more-or-less unvarying, as is the physical capacity of the queue at the
bottleneck router. However, these capacities are also shared with other connections, which may come and
go with time. This is why the ceiling does vary in real terms. If two other connections share a path with
total capacity 60 packets, the fairest allocation might be for each connection to get about 20 packets as its
share. If one of those other connections terminates, the two remaining ones might each rise to 30 packets.
And if instead a fourth connection joins the mix, then after equilibrium is reached each connection might
hope for a fair share of 15 packets.
Will this kind of fair allocation actually happen? Or might we end up with one connection getting 90% of
the bandwidth while two others each get 5%?
Chiu and Jain [CJ89] showed that the additive-increase/multiplicative-decrease algorithm does indeed converge to roughly equal bandwidth sharing when two connections have a common bottleneck link, provided also that
• both connections have the same RTT
• during any given RTT, either both connections experience a packet loss, or neither connection does
To see this, let cwnd1 and cwnd2 be the connections' congestion-window sizes, and consider the quantity
cwnd1 − cwnd2. For any RTT in which there is no loss, cwnd1 and cwnd2 both increment by 1, and so
cwnd1 − cwnd2 stays the same. If there is a loss, then both are cut in half and so cwnd1 − cwnd2 is also
cut in half. Thus, over time, the original value of cwnd1 − cwnd2 is repeatedly cut in half (during each
RTT in which losses occur) until it dwindles to inconsequentiality, at which point cwnd1 ≈ cwnd2.
Graphical and tabular versions of this same argument are in the next chapter, in 14.3 TCP Fairness with
Synchronized Losses.
The second bulleted hypothesis above we may call the synchronized-loss hypothesis. While it is very
reasonable to suppose that the two connections will experience the same number of losses as a long-term
average, it is a much stronger statement to suppose that all loss events are shared by both connections. This
behavior may not occur in real life and has been the subject of some debate; see [GV02]. We return to this
point in 16.3 Two TCP Senders Competing. Fortunately, equal-RTT fairness still holds if each connection
is equally likely to experience a packet loss: both connections will have the same loss rate, and so, as we
shall see in 14.5 TCP Reno loss rate versus cwnd, will have the same cwnd. However, convergence to
fairness may take rather longer. In 14.3 TCP Fairness with Synchronized Losses we also look at some
alternative hypotheses for the unequal-RTT case.
For a different case, with a much smaller RTT, see 13.2.3 Slow-Start Multiple Drop Example.
Eventually the bottleneck queue gets full, and drops a packet. Let us suppose this is after N RTTs, so
cwnd=2^N. Then during the previous RTT, cwnd=2^(N-1) worked successfully, so we go back to that previous
value by setting cwnd = cwnd/2.
actually measures cwnd in bytes, floating-point arithmetic is normally not required; see exercise 13. An
exact equivalent to the per-windowful incrementing strategy is cwnd = cwnd + 1/cwnd0 , where cwnd0 is
the value of cwnd at the start of that particular windowful. Another, simpler, approach is to use cwnd +=
1/cwnd, and to keep the fractional part recorded, but to use floor(cwnd) (the integer part of cwnd) when
actually sending packets.
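As a concrete sketch of this bookkeeping (ours, not taken from any implementation), cwnd can be kept as a floating-point value and floor(cwnd) used when actually sending:

    import math

    # Congestion-avoidance phase: each arriving ACK bumps cwnd by 1/cwnd,
    # so a full windowful of ACKs increases cwnd by just about 1 packet.
    cwnd = 10.0
    for ack in range(int(math.floor(cwnd))):     # one windowful of ACKs
        cwnd += 1.0 / cwnd
    print(cwnd, math.floor(cwnd))                # ~10.95; send floor(cwnd) packets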
Most actual implementations keep track of cwnd in bytes, in which case using integer arithmetic is sufficient
until cwnd becomes quite large.
If delayed ACKs are implemented (12.14 TCP Delayed ACKs), then in bulk transfers one arriving ACK
actually acknowledges two packets. RFC 3465 permits a TCP sender to increment cwnd by 2/cwnd in
that situation, which is the response consistent with incrementing cwnd by 1 for each windowful.
Assume that R has a queue capacity of 100, not including the packet it is currently forwarding to B,
and that ACKs travel instantly from B back to A. In this and later examples we will continue to use the
Data[N]/ACK[N] terminology of 6.2 Sliding Windows, beginning with N=1; TCP numbering is not done
quite this way but the distinction is inconsequential.
When A uses slow-start here, the successive windowfuls will almost immediately begin to overlap. A will
send one packet at T=0; it will be delivered at T=1. The ACK will travel instantly to A, at which point A
will send two packets. From this point on, ACKs will arrive regularly at A at a rate of one per second. Here
is a brief chart:
Time   A receives   A sends            R sends   R's queue
0                   Data[1]            Data[1]
1      ACK[1]       Data[2],Data[3]    Data[2]   Data[3]
2      ACK[2]       4,5                3         4,5
3      ACK[3]       6,7                4         5..7
4      ACK[4]       8,9                5         6..9
5      ACK[5]       10,11              6         7..11
..
N      ACK[N]       2N,2N+1            N+1       N+2 .. 2N+1
At T=N, R's queue contains N packets. At T=100, R's queue is full. Data[200], sent at T=100, will be
delivered and acknowledged at T=200, giving it an RTT of 100. At T=101, R receives Data[202] and
Data[203] and drops the latter one. Unfortunately, A's timeout interval must of course be greater than the
RTT, and so A will not detect the loss until, at an absolute minimum, T=200. At that point, A has sent
packets up through Data[401], and the 100 packets Data[203], Data[205], ..., Data[401] have all been lost.
In other words, at the point when A first receives the news of one lost packet, in fact at least 100 packets
have already been lost.
Fortunately, unbounded slow start generally occurs only once per connection.
The problem TCP often faces, in both slow-start and congestion-avoidance phases, is that when a packet is
lost the sender will not detect this until much later (at least until the bottleneck router's current queue has
been sent); by then, it may be too late to avoid further losses.
Response: ACK[1], ACK[1], ACK[1], ACK[4], ACK[4], ACK[6]
Waiting for the third dupACK is in most cases a successful compromise between responsiveness to lost
packets and reasonable evidence that the data packet in question is actually lost.
However, a router that does more substantial delivery reordering would wreak havoc on connections using
Fast Retransmit. In particular, consider the router R in the diagram below; when sending packets to B it
might in principle wish to alternate on a packet-by-packet basis between the path via R1 and the path via R2.
This would be a mistake; if the R1 and R2 paths had different propagation delays then this strategy would
introduce major packet reordering. R should send all the packets belonging to any one TCP connection via
a single path.
In the real world, routers generally go to considerable lengths to accommodate Fast Retransmit; in particular,
use of multiple paths for a single TCP connection is almost universally frowned upon. Some actual data on
packet reordering can be found in [VP97]; the author suggests that a switch to retransmission on the second
dupACK would be risky.
Data[9] elicits the initial ACK[9], and the nine packets Data[11] through Data[19] each elicit a dupACK[9].
We denote the dupACK[9] elicited by Data[N] by dupACK[9]/N; these are shown along the upper right.
Unless SACK TCP (below) is used, the sender will have no way to determine N or to tell these dupACKs
apart. When dupACK[9]/13 (the third dupACK) arrives at the sender, the sender uses Fast Recovery to infer
that Data[10] was lost and retransmits it. At this point EFS = 7: the sender has sent the original batch of 10
data packets, plus Data[19], and received one ACK and three dupACKs, for a total of 10+1-1-3 = 7. The
sender has also inferred that Data[10] is lost (EFS −= 1) but then retransmitted it (EFS += 1). Six more
dupACK[9]s are on the way.
EFS is decremented for each subsequent dupACK arrival; after we get two more dupACK[9]s, EFS is 5.
The next dupACK[9] (dupACK[9]/16) reduces EFS to 4 and so allows us to transmit Data[20] (which promptly
bumps EFS back up to 5). We have
receive          send
dupACK[9]/16     Data[20]
dupACK[9]/17     Data[21]
dupACK[9]/18     Data[22]
dupACK[9]/19     Data[23]
We emphasize again that the TCP sender does not see the numbers 16 through 19 in the receive column
above; it determines when to begin transmission by counting dupACK[9] arrivals.
Working Backwards
Figuring out when a fast-recovery sender should resume transmissions of new data is error-prone.
Perhaps the simplest approach is to work backwards from the retransmitted lost packet: it should
trigger at the receiver an ACK for the entire original windowful. When Data[10] above was lost, the
stuck window was Data[10]-Data[19]. The retransmitted Data[10] thus triggers ACK[19]; when
ACK[19] arrives, cwnd should be 10/2 = 5 so Data[24] should be sent. That in turn means the four
packets Data[20] through Data[23] must have been sent earlier, via Fast Recovery. There are 10−1 = 9
dupACKs, so to send on the last four we must start with the sixth. The diagram above indeed shows
new Fast Recovery transmissions beginning with the sixth dupACK.
The next thing to arrive at the sender side is the ACK[19] elicited by the retransmitted Data[10]; at the point
Data[10] arrives at the receiver, Data[11] through Data[19] have already arrived and so the cumulative-ACK
response is ACK[19]. The sender responds to ACK[19] with Data[24], and the transition to cwnd=5 is now
complete.
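Here is a small Python sketch of the EFS counting in this example. It is an illustration of the arithmetic above, not an implementation of TCP Reno; the window size of 10 and the single loss of Data[10] are the example's assumptions.

    # EFS bookkeeping for the example: window Data[10]..Data[19], Data[10] lost,
    # so ACK[9] is followed by nine dupACK[9]s.
    efs = 10                      # estimated flightsize after the original windowful
    target = 10 // 2              # transmit new data whenever EFS < cwnd/2 = 5
    sent = []
    for dup in range(1, 10):      # the nine dupACK[9]s, in arrival order
        efs -= 1                  # each dupACK means one more packet has arrived
        if dup == 3:              # third dupACK: infer Data[10] lost, retransmit it
            efs -= 1              # the lost packet no longer counts as in flight
            sent.append("Data[10] retransmit")
            efs += 1              # but its retransmission does
        if efs < target:
            sent.append("Data[%d]" % (19 + len(sent)))
            efs += 1
    print(sent)   # the retransmission, then Data[20]..Data[23] on the 6th-9th dupACKs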
During sliding windows without losses, a sender will send cwnd packets per RTT. If a coarse timeout
occurs, typically it is not discovered until after at least one complete RTT of link idleness; there are additional
underutilized RTTs during the slow-start phase. It is worth examining the Fast Recovery sequence shown
in the illustration from the perspective of underutilized bandwidth. The diagram shows three round-trip
times, as seen from the sender side. During the first RTT, the ten packets Data[9]-Data[18] are sent. The
second RTT begins with the sending of Data[19] and continues through sending Data[22], along with the
retransmitted Data[10]. The third RTT begins with sending Data[23], and continues through Data[27]. In
terms of recovery efficiency, the RTTs send 9, 5 and 5 packets respectively (we have counted Data[10]
twice); this is remarkably close to the ideal of reducing cwnd to 5 instantaneously.
The reason we cannot use cwnd directly in the formulation of Fast Recovery is that, until the lost Data[10]
is acknowledged, the window is frozen at Data[10]-Data[19]. The original RFC 2001 description of Fast
Recovery described retransmission in terms of cwnd inflation and deflation. Inflation would begin at the
point the sender resumed transmitting new packets, at which point cwnd would be incremented for each
dupACK; in the diagram above, cwnd would finish at 15. When the ACK[19] elicited by the lost packet
finally arrived, the recovery phase would end and cwnd would immediately deflate to 5. For a diagram
illustrating cwnd inflation and deflation, see 17.2.1 Running the Script.
In the diagram below, packets 1 and 4 are lost in a window 0..11 of size 12. Initially the sender will get
dupACK[0]s; the first 11 ACKs (dashed lines from right to left) are ACK[0] and 10 dupACK[0]s. When
packet 1 is successfully retransmitted on receipt of the third dupACK[0], the receiver's response will be
ACK[3] (the heavy dashed line). This is the first partial ACK (a full ACK would have been ACK[12]). On
receipt of any partial ACK during the Fast Recovery process, TCP NewReno assumes that the immediately
following data packet was lost and retransmits it immediately; the sender does not wait for three dupACKs
because if the following data packet had not been lost, no instances of the partial ACK would ever have been
generated, even if packet reordering had occurred.
The TCP NewReno sender response here is, in effect, to treat each partial ACK as a dupACK[0], except that
the sender also retransmits the data packet that, based upon receipt of the partial ACK, it is able to infer
is lost. NewReno continues pacing Fast Recovery by whatever ACKs arrive, whether they are the original
dupACKs or later partial ACKs or dupACKs.
When the receiver's first ACK[3] arrives at the sender, NewReno infers that Data[4] was lost and resends it;
this is the second heavy data line. No dupACK[3]s need arrive; as mentioned above, the sender can infer
from the single ACK[3] that Data[4] is lost. The sender also responds as if another dupACK[0] had arrived,
and sends Data[17].
The arrival of ACK[3] signals a reduction in the EFS by 2: one for the inference that Data[4] was lost, and
one as if another dupACK[0] had arrived; the two transmissions in response (of Data[4] and Data[17]) bring
EFS back to where it was. At the point when Data[16] is sent the actual (not estimated) flightsize is 5, not 6,
because there is one less dupACK[0] due to the loss of Data[4]. However, once NewReno resends Data[4]
and then sends Data[17], the actual flightsize is back up to 6.
There are four more dupACK[3]s that arrive. NewReno keeps sending new data on receipt of each of these;
these are Data[18] through Data[21].
The receiver's response to the retransmitted Data[4] is to send ACK[16]; this is the cumulative of all the
data received up to that moment. At the point this ACK arrives back at the sender, it had just sent Data[21]
in response to the fourth dupACK[3]; its response to ACK[16] is to send the next data packet, Data[22].
The sender is now back to normal sliding windows, with a cwnd of 6. Similarly, the Data[17] immediately
following the retransmitted Data[4] elicits an ACK[17] (this is the first Data[N] to elicit an exactly matching ACK[N] since the losses began), and the corresponding response to the ACK[17] is to continue with
Data[23].
As with the previous Fast Recovery example, we consider the number of packets sent per RTT; the diagram
shows four RTTs as seen from the sender side.
RTT      First packet   Packets sent                 count
first    Data[0]        Data[0]-Data[11]             12
second   Data[12]       Data[12]-Data[15], Data[1]   5
third    Data[16]       Data[16]-Data[20], Data[4]   6
fourth   Data[21]       Data[21]-Data[26]            6
Again, after the loss is detected we segue to the new cwnd of 6 with only a single missed packet (in the
second RTT). NewReno is, however, only able to send one retransmitted packet per RTT.
Note that TCP NewReno, like TCPs Tahoe and Reno, is a sender-side innovation; the receiver does not have
to do anything special. The next TCP flavor, SACK TCP, requires receiver-side modification.
In the first diagram, the bottleneck link is always 100% utilized, even at the left edge of the teeth. In
the second the interval between loss events (the left and right margins of the tooth) is divided into a link-unsaturated phase and a queue-filling phase. In the unsaturated phase, the bottleneck link utilization is
less than 100% and the queue is empty; in the later phase, the link is saturated and the queue begins to fill.
Consider again the idealized network below, with an R–B bandwidth of 1 packet/ms.
We first consider the queue ≥ transit case. Assume that the total RTTnoLoad delay is 20 ms, mostly due to
propagation delay; this makes the bandwidth×delay product 20 packets. The question for consideration is
to what extent TCP Reno, once slow-start is over, sometimes leaves the R–B link idle.
The R–B link will be saturated at all times provided A always keeps 20 packets in transit, that is, we always
have cwnd ≥ 20 (6.3.2 RTT Calculations). If cwndmin = 20, then cwndmax = 2×cwndmin = 40. For
this to be the maximum, the queue capacity must be at least 19, so that the path can accommodate 39
packets without loss: 20 packets in transit + 19 packets in the queue. In general, TCP Reno never leaves the
bottleneck link idle as long as the queue capacity in front of that link is at least as large as the path round-trip
transit capacity.
Now suppose instead that the queue size is 9. Packet loss will occur when cwnd reaches 30, and so
cwndmin = 15. Qualitatively this case is represented by the second diagram above, though the queue-to-network-ceiling proportion illustrated there is more like 1:8 than 1:3. There are now periods when the R–B
link is idle. During RTT intervals when cwnd=15, throughput will be 75% of the maximum and the R–B
link will be idle 25% of the time.
However, cwnd will be 15 just for the first RTT following the loss. After 5 RTTs, cwnd will be back to 20,
and the link will be saturated. So we have 5 RTTs with an average cwnd of 17.5, where the link is 17.5/20
= 87.5% saturated, followed by 10 RTTs where the link is 100% saturated. The long-term average here is
95.8% utilization of the bottleneck link. This is not bad at all, given that using 10% of the link bandwidth
on packet headers is almost universally considered reasonable. Furthermore, at the point when cwnd drops
after a loss to cwndmin =15, the queue must have been full. It may take one or two RTTs for the queue to
drain; during this time, link utilization will be even higher.
If most or all of the time the bottleneck link is saturated, as in the first diagram, it may help to consider
the average queue size. Let the queue capacity be Cqueue and the transit capacity be Ctransit , with Cqueue >
Ctransit . Then cwnd will vary from a maximum of Cqueue +Ctransit to a minimum of what works out to be
(Cqueue -Ctransit )/2 + Ctransit . We would expect an average queue size about halfway between these, less the
Ctransit term: 3/4Cqueue - 1/4Ctransit . If Cqueue =Ctransit , the expected average queue size should be about
Cqueue /2.
The second case, with queue capacity less than transit capacity, is arguably the more common situation,
and becoming more so as time goes on and bandwidths increase. This is almost always the case that
applies to high-bandwidth×delay connections, where the queue size is typically much smaller than the
bandwidth×delay product. Cisco routers, for example, generally are configured to use queue sizes considerably less than 100, regardless of bandwidths involved, while the bandwidth×delay product for a 50ms RTT
with 1 Gbps bandwidth is ~6MB, or 4000 packets of 1500 bytes each. In this case the tooth is divided into
a large link-unsaturated phase and a small queue-filling phase.
The worst case for TCP link utilization is if the queue size is close to zero. Using again a bandwidth×delay
product of 20 packets, we can see that cwndmax will be 20 (or 21), and so cwndmin will be 10. Link
utilization therefore ranges from a low of 10/20 = 50% to a high of 100%, over 10 RTTs; the average utilization
is 75%. While this is not ideal, and while some TCP variants have attempted to improve this figure, 75%
link utilization is not all that bad, and can again be compared with the 10% of the bandwidth consumed as
packet headers. (Note that a literally zero-sized queue may not work at all well with slow start, because the
sender will regularly send packets two at a time.)
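These utilization figures can be reproduced with a few lines of arithmetic. The sketch below (ours) sums delivered packets over one idealized tooth; the function name and the convention that loss occurs when cwnd first exceeds transit+queue are assumptions made for illustration.

    # Average bottleneck-link utilization over one TCP Reno tooth.
    # transit: round-trip transit capacity (bandwidth x delay), in packets
    # queue:   bottleneck queue capacity, in packets
    def reno_utilization(transit, queue):
        cwnd_min = (transit + queue + 1) // 2     # half of the cwnd that overflows
        used = total = 0
        for cwnd in range(cwnd_min, transit + queue + 1):
            used  += min(cwnd, transit)           # packets delivered this RTT
            total += transit                      # a saturated link would carry this
        return used / total

    print(reno_utilization(20, 19))   # 1.0  : queue matches the transit capacity
    print(reno_utilization(20, 9))    # 0.95 : close to the 95.8% figure above
    print(reno_utilization(20, 0))    # 0.75 : the zero-queue worst case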
We will return to this point in 16.2.6 Single-sender Throughput Experiments and (for two senders)
16.3.10.2 Higher bandwidth and link utilization, using the ns simulator to get experimental data. See
also exercise 12.
Let T be the bandwidth delay at R, so that packets leaving R are spaced at least time T apart. A will
therefore transmit packets T time units apart, except for those times when cwnd has just been incremented
and A sends a pair of packets back-to-back. Let us call the second packet of such a back-to-back pair the
extra packet. To simplify the argument slightly, we will assume that the two packets of a pair arrive at R
essentially simultaneously.
Only an extra packet can result in an increase in queue utilization; every other packet arrives after an interval
T from the previous packet, giving R enough time to remove a packet from its queue.
A consequence of this is that cwnd will reach the sum of the transit capacity and the queue capacity without
R dropping a packet. (This is not necessarily the case if a cwnd this large were sent as a single burst.)
Let C be this combined capacity, and assume cwnd has reached C. When A executes its next cwnd += 1
additive increase, it will as usual send a pair of back-to-back packets. The second of this pair, the extra
packet, is doomed; it will be dropped when it reaches the bottleneck router.
At this point there are C = cwnd − 1 packets outstanding, all spaced at time intervals of T. Sliding windows
will continue normally until the ACK of the packet just before the lost packet arrives back at A. After this
point, A will receive only dupACKs. A has received C = cwnd−1 ACKs since the last increment to cwnd,
but must receive C+1 = cwnd ACKs in order to increment cwnd again. This will not happen, as no more
new ACKs will arrive until the lost packet is transmitted.
Following this, cwnd is reduced and the next sawtooth begins; the only packet that is lost is the extra
packet of the previous flight.
See 16.2.3 Single Losses for experimental confirmation, and exercise 14.
The second justification for the reduction factor of 1/2 applies directly to the congestion avoidance phase;
written in 1988, it is quite remarkable to the modern reader:
If the connection is steady-state running and a packet is dropped, it's probably because a new
connection started up and took some of your bandwidth.... [I]t's probable that there are now
exactly two conversations sharing the bandwidth. I.e., you should reduce your window by half
because the bandwidth available to you has been reduced by half. [JK88], §D
Today, busy routers may have thousands of simultaneous connections. To be sure, Jacobson and Karels
go on to state, "if there are more than two connections sharing the bandwidth, halving your window is
conservative, and being conservative at high traffic intensities is probably wise". This advice remains apt
today.
But while they do not play a large role in setting cwnd or in avoiding congestive collapse, it turns out that
these increase-increment and decrease-factor values of 1 and 1/2 respectively play a great role in fairness:
making sure competing connections get the bandwidth allocation they should get. We will return to this
in 14.3 TCP Fairness with Synchronized Losses, and also 14.7 AIMD Revisited.
13.11 Epilog
TCP Reno's core congestion algorithm is based on algorithms in Jacobson and Karels' 1988 paper [JK88],
now twenty-five years old, although NewReno and SACK have been almost universally added to the standard
Reno implementation.
There are also broad changes in TCP usage patterns. Twenty years ago, the vast majority of all TCP traffic
represented downloads from major servers. Today, over half of all Internet TCP traffic is peer-to-peer
rather than server-to-client. The rise in online video streaming creates new demands for excellent TCP
real-time performance.
In the next chapter we will examine the dynamic behavior of TCP Reno, focusing in particular on fairness
between competing connections, and on other problems faced by TCP Reno senders. Then, in 15 Newer
TCP Implementations, we will survey some attempts to address these problems.
13.12 Exercises
1. Consider the following network, with each link other than the first having a bandwidth delay of 1
packet/second. Assume ACKs travel instantly from B to R (and thus to A). Assume there are no propagation delays, so the RTTnoLoad is 4; the bandwidth×RTT product is then 4 packets. If A uses sliding
windows with a window size of 6, the queue at R1 will eventually have size 2.
Suppose A uses threshold slow start with ssthresh = 6, and with cwnd initially 1. Complete the table
below until two rows after cwnd = 6; for these final two rows, A will send only one new packet for each
ACK received. How big will the queue at R1 grow?
T   A sends   R1 queues   R1 sends   B receives/ACKs   cwnd
0   1                     1                            1
1
2
3
4   2,3       3           2                            2
5
6
7
8   4,5                              2
Note that if, instead of using slow start, A simply sends the initial windowful of 6 packets all at once, then
the queue at R1 will initially hold 6-1 = 5 packets.
2. Consider the following network from 13.2.3 Slow-Start Multiple Drop Example, with links labeled with
bandwidths in packets/ms. Assume ACKs travel instantly from B to R (and thus to A).
A begins sending to B using slow start, beginning with Data[1] at T=0. Write out all packet deliveries
assuming R's queue size is 5, up until the first dupACK triggered by the arrival at B of a packet that followed
a packet that was lost.
3. Repeat the previous problem, except assume R's queue size is 2. Assume no retransmission mechanism
is used at all (no timeouts, no fast retransmit), and that A sends new data only when it receives new ACKs
(dupACKs, in other words, do not trigger new data transmissions). With these assumptions, new data
transmissions will eventually cease; continue the table until all transmitted data packets are received by B.
4. Suppose a connection starts with cwnd=1 and increments cwnd by 1 each RTT with no loss, and sets
cwnd to cwnd/2, rounding down, on each RTT with at least one loss. Lost packets are not retransmitted,
and propagation delays dominate so each windowful is sent more or less together. Packets 5, 13, 14, 23 and
30 are lost. What is the window size each RTT, up until the first 40 packets are sent? What packets are sent
each RTT? Hint: in the first RTT, Data[1] is sent. There is no loss, so in the second RTT cwnd = 2 and
Data[2] and Data[3] are sent.
5. Suppose TCP Reno is used to transfer a large file over a path with a bandwidth of one packet per 10 µs
and an RTT of 80 ms. Assume the receiver places no limits on window size. Note the bandwidth×delay
product is 8,000 packets.
(a). How many RTTs will it take for the window size to first reach ~8,000 packets, assuming unbounded
slow start is used and there are no packet losses?
(b). Approximately how many packets will have been sent and acknowledged by that point?
(c). What fraction of the total bandwidth will have been used up to that point? Hint: the total bandwidth is
8,000 packets per RTT.
6. (a) Repeat the diagram in 13.4 TCP Reno and Fast Recovery, done there with cwnd=10, for a window
size of 8. Assume as before that the lost packet is Data[10]. There will be seven dupACK[9]s, which it
may be convenient to tag as dupACK[9]/11 through dupACK[9]/17. Be sure to indicate clearly when
sending resumes.
(b). Suppose you try to do this with a window size of 6. Is this window size big enough for Fast Recovery
still to work? If so, at what dupACK[9]/N does new data transmission begin? If not, what goes wrong?
7. Suppose the window size is 100, and Data[1001] is lost. There will be 99 dupACK[1000]s sent, which
we may denote as dupACK[1000]/1002 through dupACK[1000]/1100. TCP Reno is used.
(a). At which dupACK[1000]/N does the sender start sending new data?
(b). When the retransmitted data[1001] arrives at the receiver, what ACK is sent in response?
(c). When the acknowledgment in (b) arrives back at the sender, what data packet is sent?
Hint: express EFS in terms of dupACK[1000]/N, for N>1004.
8. Suppose the window size is 40, and Data[1001] is lost. Packet 1000 will be ACKed normally. Packets
1001-1040 will be sent, and 1002-1040 will each trigger a duplicate ACK[1000].
(a). What actual data packets trigger the first three dupACKs? (The first ACK[1000] is triggered by
Data[1000]; don't count this one as a duplicate.)
(b). After the third dupACK[1000] has been received and the lost data[1001] has been retransmitted, how
many packets/ACKs should the sender estimate as in flight?
When the retransmitted Data[1001] arrives at the receiver, ACK[1040] will be sent back.
(c). What is the first Data[N] sent for which the response is ACK[N], for N>1000?
(d). What is the first N for which Data[N+20] is sent in response to ACK[N] (this represents the point when
the connection is back to normal sliding windows, with a window size of 20)?
9. Suppose slow-start is modified so that, on each arriving ACK, three new packets are sent rather than two;
cwnd will now triple after each RTT.
(a). For each arriving ACK, by how much must cwnd now be incremented?
(b). Suppose a path has mostly propagation delay. Progressively larger windowfuls are sent, until a cwnd
is reached where a packet loss occurs. What window size can the sender be reasonably sure does work,
based on earlier experience?
10. Suppose in the example of 13.5 TCP NewReno, Data[4] had not been lost.
11. Suppose in the example of 13.5 TCP NewReno, Data[1] and Data[2] had been lost, but not Data[4].
12. Suppose two TCP connections have the same RTT and share a bottleneck link, on which there is no other
traffic. The size of the bottleneck queue is negligible when compared to the bandwidth RTTnoLoad product.
Loss events occur at regular intervals, and are completely synchronized. Show that the two connections
together will use 75% of the total bottleneck-link capacity, as in 13.7 TCP and Bottleneck Link Utilization
(there done for a single connection).
See also Exercise 18 of the next chapter.
13. In 13.2.1 Per-ACK Responses we stated that the per-ACK response of a TCP sender was to increment
cwnd as follows:
cwnd = cwnd + 1/cwnd
(a). What is the corresponding formulation if the window size is in fact measured in bytes rather than
packets? Let SMSS denote the sender's maximum segment size, and let bwnd = SMSS×cwnd denote the
congestion window as measured in bytes.
(b). What is the appropriate formulation if delayed ACKs are used (12.14 TCP Delayed ACKs) and we
still want cwnd to be incremented by 1 for each windowful?
14. In 13.8 Single Packet Losses we simplified the argument slightly by assuming that when A sent a pair
of packets, they arrived at R essentially simultaneously.
Give a scenario in which it is not the extra packet (the second of the pair) that is lost, but the packet that
follows it. Hint: see 16.3.4.1 Single-sender phase effects.
In this chapter we introduce, first and foremost, the possibility that there are other TCP connections out
there competing with us for throughput. In 6.3 Linear Bottlenecks (and in 13.7 TCP and Bottleneck Link
Utilization) we looked at the performance of TCP through an uncontested bottleneck; now we allow for
competition.
We also look more carefully at the long-term behavior of TCP Reno (and Reno-like) connections, as the
value of cwnd increases and decreases according to the TCP sawtooth. In particular we analyze the average cwnd; recall that the average cwnd divided by the RTT is the connection's average throughput (we
momentarily ignore here the fact that RTT is not constant, but the error this introduces is usually small).
A few of the ideas presented here apply to non-Reno connections as well. Some non-Reno TCP
alternatives are presented in the following chapter; the section on TCP Friendliness below addresses how to
extend TCP Reno's competitive behavior even to UDP.
We also consider some router-based mechanisms such as RED and ECN that take advantage of TCP Reno's
behavior to provide better overall performance.
The chapter closes with a summary of the central real-world performance problems faced by TCP today;
this then serves to introduce the proposed TCP fixes in the following chapter.
The bottleneck link for A–B traffic is at R2, and the queue will form at R2's outbound interface.
We claimed earlier that if the sender uses sliding windows with a fixed window size, then the network will
converge to a steady state in relatively short order. This is also true if multiple senders are involved; however,
a mathematical proof of convergence may be more difficult.
For a moment, assume R3 uses priority queuing, with the B–C path given priority over A–C. If B's
flow to C is fixed at 3 packets/ms, then A's share of the R3–C link will be 1 packet/ms, and A's bottleneck
will be at R3. However, if B's total flow rate drops to 1 packet/ms, then the R3–C link will have 3 packets/ms
available, and the bottleneck for the A–C path will become the 2 packet/ms R1–R3 link.
Now let us switch to the more-realistic FIFO queuing at R3. If B's flow is 3 packets/ms and A's is 1
packet/ms, then the R3–C link will be saturated, but just barely: if each connection sticks to these rates, no
queue will develop at R3. However, it is no longer accurate to describe the 1 packet/ms as A's share: if A
wishes to send more, it will begin to compete with B. At first, the queue at R3 will grow; eventually, it is
quite possible that B's total flow rate might drop because B is losing to A in the competition for R3's queue.
This latter effect is very real.
In general, if two connections share a bottleneck link, they are competing for the bandwidth of that link.
That bandwidth share, however, is precisely dictated by the queue share as of a short while before. R3's
fixed rate of 4 packets/ms means one packet every 250 µs. If R3 has a queue of 100 packets, and in that
queue there are 37 packets from A and 63 packets from B, then over the next 25 ms (= 100 × 250 µs) R3's
traffic to C will consist of those 37 packets from A and the 63 from B. Thus the competition between A and
B for R3–C bandwidth is first fought as a competition for R3's queue space. This is important enough to
state as a rule:
Queue-Competition Rule: in the steady state, if a connection utilizes fraction α ≤ 1 of a FIFO
router's queue, then that connection has a share of α of the router's total outbound bandwidth.
Below is a picture of R3's queue and outbound link; the queue contains four packets from A and eight from
B. The link, too, contains packets in this same ratio; presumably packets from B are consistently arriving
twice as fast as packets from A.
In the steady state here, A and B will use four and eight packets, respectively, of R3's queue capacity. As
acknowledgments return, each sender will replenish the queue accordingly. However, it is not in A's long-term interest to settle for a queue utilization at R3 of four packets; A may want to take steps that will lead in
this setting to a gradual increase of its queue share.
Although we started the discussion above with fixed packet-sending rates for A and B, in general this leads
to instability. If A's and B's combined rates add up to more than 4 packets/ms, R3's queue will grow without
bound. It is much better to have A and B use sliding windows, and give them each fixed window sizes; in
this case, as we shall see, a stable equilibrium is soon reached. Any combination of window sizes is legal
regardless of the available bandwidth; the queue utilization (and, if necessary, the loss rate) will vary as
necessary to adapt to the actual bandwidth.
If there are several competing flows, then a given connection may have multiple bottlenecks, in the sense
that there are several routers on the path experiencing queue buildups. In the steady state, however, we can
still identify the link (or first link) with minimum bandwidth; we can call this link the bottleneck. Note that
the bottleneck link in this sense can change with the sender's winsize and with competing traffic.
The network layout here, with the shared R–C link as the bottleneck, is sometimes known as the single-bell
topology. A perhaps-more-common alternative is the dumbbell topology of 14.3 TCP Fairness with
Synchronized Losses, though the two are equivalent for our purposes.
Suppose A and B each send to C using sliding windows, each with fixed values of winsize wA and wB .
Suppose further that these winsize values are large enough to saturate the R–C link. How big will the queue
be at R? And how will the bandwidth divide between the A–C and B–C flows?
For the two-competing-connections example above, assume we have reached the steady state. Let α denote
the fraction of the bandwidth that the A–C connection receives, and let β = 1−α denote the fraction that
the B–C connection gets; because of our normalization choice for the R–C bandwidth, α and β are the
respective throughputs. From the Queue-Competition Rule above, these bandwidth proportions must agree
with the queue proportions; if Q denotes the combined queue utilization of both connections, then that queue
will have about αQ packets from the A–C flow and about βQ packets from the B–C flow.
We worked out the queue usage precisely in 6.3.2 RTT Calculations for a single flow; we derived there the
following:
Let us suppose the A–B link is idle, and the C–D connection begins sending with a window size chosen
so as to create a queue of 30 of C's packets at R1 (if propagation delays are such that two packets can be in
transit each direction, we would achieve this with winsize=34).
Now imagine A begins sending. If A sends a single packet, it is not shut out even though the R1–R2 link is
100% busy. A's packet will simply have to wait at R1 behind the 30 packets from C; the waiting time in the
queue will be 30 packets ÷ (5 packets/ms) = 6 ms. If we change the winsize of the C–D connection, the
delay for A's packets will be directly proportional to the number of C's packets in R1's queue.
To most intents and purposes, the C–D flow here has increased the RTT of the A–B flow by 6 ms. As
long as A's contribution to R1's queue is small relative to C's, the delay at R1 for A's packets looks more
like propagation delay than bandwidth delay, because if A sends two back-to-back packets they will likely
be enqueued consecutively at R1 and thus be subject to a single 6 ms queuing delay. By varying the C–D
window size, we can, within limits, increase or decrease the RTT for the A–B flow.
Let us return to the fixed C–D window size wC chosen to yield a queue of 30 of C's packets
at R1. As A increases its own window size from, say, 1 to 5, the C–D throughput will decrease slightly,
but C's contribution to R1's queue will remain dominant.
As in the argument at the end of 14.2.3.3 The fixed-wB case, small propagation delays mean that wC will
not be much larger than 30. As wA climbs from zero to infinity, C's contribution to R1's queue rises from 30
to at most wC, and so the 6 ms delay for A–B packets remains relatively constant even as A's winsize rises
to the point that A's contribution to R1's queue far outweighs C's. (As we will argue in the next paragraphs,
this can actually happen only if the R2–R3 bandwidth is increased.) Each packet from A arriving at R1 will,
on average, face 30 or so of C's packets ahead of it, along with anywhere from many fewer to many more
of A's packets.
If A's window size is 1, its one packet at a time will wait 6 ms in the queue at R1. If A's window size
is greater than 1 but remains small, so that A contributes only a small proportion of R1's queue, then A's
packets will wait only at R1. Initially, as A's winsize increases, the queue at R1 grows but all other queues
remain empty.
However, if A's winsize grows large enough that its packets consume 40% of R1's queue in the steady state,
then this situation changes. At the point when A has 40% of R1's queue, by the Queue Competition Rule it
will also have a 40% share of the R1–R2 link's bandwidth, that is, 40% × 5 = 2 packets/ms. Because the
R2–R3 link has a bandwidth of 2 packets/ms, the A–B throughput can never grow beyond this. If the C–D
contribution to R1's queue is held constant at 30 packets, then this point is reached when A's contribution to
R1's queue is 20 packets.
Because A's proportional contribution to R1's queue cannot increase further, any additional increase to A's
In Exercise 5 we consider some minor changes needed if propagation delay is not inconsequential.
Despite situations like this, we will usually use the term bottleneck link as if it were a precisely defined
concept. In Examples 2, 3 and 4 above, a better term might be competitive link; for Example 5 we perhaps
should say competitive links.
The layout illustrated here, with the shared link somewhere in the middle of each path, is sometimes known
as the dumbbell topology.
For the time being, we will also continue to assume the synchronized-loss hypothesis: that in any one
RTT either both connections experience a loss or neither does. (This assumption is suspect; we explore
it further in 14.3.3 TCP RTT bias and in 16.3 Two TCP Senders Competing). This was the model
reviewed previously in 13.1.1.1 A first look at fairness; we argued there that in any RTT without a loss, the
expression (cwnd1 - cwnd2 ) remained the same (both cwnds incremented by 1), while in any RTT with a
loss the expression (cwnd1 - cwnd2 ) decreased by a factor of 2 (both cwnds decreased by factors of 2).
Here is a graphical version of the same argument, as originally introduced in [CJ89]. We plot cwnd1 on the
x-axis and cwnd2 on the y-axis. An additive increase of both (in equal amounts) moves the point (x,y) =
(cwnd1,cwnd2) along the line parallel to the 45° line y=x; equal multiplicative decreases of both moves the
point (x,y) along a line straight back towards the origin. If the maximum network capacity is Max, then a
loss occurs whenever x+y exceeds Max, that is, the point (x,y) crosses the line x+y=Max.
Beginning at the initial state, additive increase moves the state at a 45° angle up to the line x+y=Max, in
small increments denoted by the small arrowheads. At this point a loss would occur, and the state jumps
back halfway towards the origin. The state then moves at 45° incrementally back to the line x+y=Max, and
continues to zigzag slowly towards the equal-shares line y=x.
Any attempt to increase cwnd faster than linear will mean that the increase phase is not parallel to the line
y=x, but in fact veers away from it. This will slow down the process of convergence to equal shares.
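This zigzag can also be reproduced with a few lines of simulation. The sketch below is ours; the ceiling of 24 packets is chosen to match the timeline that follows, and the loss rule is the synchronized-loss hypothesis above.

    # Chiu-Jain convergence: two equal-RTT AIMD connections, synchronized losses.
    def aimd_pair(c1=20.0, c2=1.0, ceiling=24, rtts=60):
        for t in range(rtts):
            print(t, c1, c2)
            if c1 + c2 > ceiling:        # both connections lose a packet this RTT
                c1, c2 = c1 / 2, c2 / 2  # multiplicative decrease for both
            else:
                c1, c2 = c1 + 1, c2 + 1  # additive increase for both

    aimd_pair()    # the difference c1-c2 is halved at every loss event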
Finally, here is a timeline version of the argument. We will assume that the A–C path capacity, the B–D path
capacity and R's queue size all add up to 24 packets, and that in any RTT in which cwnd1 + cwnd2 > 24,
both connections experience a packet loss. We also assume that, initially, the first connection has cwnd=20,
and the second has cwnd=1.
T    A–C   B–D
0    20    1
1    21    2
2    22    3
(The table continues through T=33: whenever cwnd1+cwnd2 exceeds 24, both cwnds are halved, and by T=33 the two cwnd values have become equal.)
T    A–C   B–D
0    1     1
1    2     3
2    3     5
3    4     7
4    5     9
5    6     11
6    7     13
7    8     15
8    9     17    first loss
9    4     8
10   5     10
11   6     12
12   7     14
13   8     16
14   9     18    second loss
15   4     9     essentially where we were at T=9
The effect here is that the second connection's average cwnd, and thus its throughput, is double that of
the first connection. Thus, changes to the additive-increase increment lead to very significant changes in
fairness.
T    A–C   B–D
0    20    1
1          2
2    21    3
3          4
4    22    5     first loss
5          2
6    11    3
7          4
8    12    5
9          6
10   13    7
11         8
12   14    9
13         10
14   15    11    second loss
T    A–C   B–D
...
36   4     8
37         9
38   5     10
39         11
40   6     12
41         13
42   7     14
43         15
44   8     16
45         17    loss
46   4     8
The interval 36≤T<46 represents the steady state here; the first connection's average cwnd is 6 while the
second connection's average is (8+9+...+16+17)/10 = 12.5. Worse, the first connection sends a windowful
only half as often. In the interval 36≤T<46 the first connection sends 4+5+6+7+8 = 30 packets; the second
connection sends 125. The cost of the first connection's longer RTT is quadratic; in general, as we argue
more formally below, if the first connection has RTT = λ > 1 relative to the second's, then its bandwidth will
be reduced by a factor of 1/λ².
Is this fair?
Early thinking was that there was something to fix here; see [F91] and [FJ92], §3.3, where the Constant-Rate
window-increase algorithm is discussed. A more recent attempt to address this problem is TCP Hybla,
[CF04]; discussed later in 15.8 TCP Hybla.
Alternatively, we may simply define TCP Reno's bandwidth allocation as fair, at least in some contexts.
This approach is particularly common when the issue at hand is making sure other TCP implementations
and non-TCP flows compete for bandwidth in roughly the same way that TCP Reno does. While TCP
Reno's strategy is now understood to be greedy in some respects, fixing it in the Internet at large is
generally recognized as a very difficult option.
we will assume FIFO droptail queuing at the bottleneck router, and also that the network ceiling (and hence
cwnd at the point of loss) is sufficiently large. We will also assume, for simplicity, that the network
ceiling C is constant.
We need one more assumption: that most loss events are experienced by both connections. This is the
synchronized losses hypothesis, and is the most debatable; we will explore it further in the next section.
But first, here is the general argument with this assumption.
Let connection 1 be the faster connection, and assume a steady state has been reached. Both connections
experience loss when cwnd1+cwnd2 ≥ C, because of the synchronized-loss hypothesis. Let c1 and c2
denote the respective window sizes at the point just before the loss. Both cwnd values are then halved.
Let N be the number of RTTs for connection 1 before the network ceiling is reached again. During this
time c1 increases by N; c2 increases by approximately N/λ if N is reasonably large. Each of these increases
represents half the corresponding cwnd; we thus have c1/2 = N and c2/2 = N/λ. Taking ratios of respective
sides, we get c1/c2 = N/(N/λ) = λ, and from that we can solve to get c1 = λC/(1+λ) and c2 = C/(1+λ).
To get the relative bandwidths, we have to count packets sent during the interval between losses. Both
connections have cwnd averaging about 3/4 of the maximum value; that is, the average cwnds are 3/4 c1
and 3/4 c2 respectively. Connection 1 has N RTTs and so sends about 3/4 c1×N packets. Connection 2, with
its slower RTT, has only about N/λ RTTs (again we use the assumption that N is reasonably large), and so
sends about 3/4 c2×N/λ packets. The ratio of these is c1/(c2/λ) = λ². Connection 1 sends fraction λ²/(1+λ²)
of the packets; connection 2 sends fraction 1/(1+λ²).
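The λ² ratio can be checked by direct simulation. Below is a rough sketch (ours, not a reproduction of any published simulation); connection 2's RTT is λ times connection 1's, losses are synchronized at a ceiling C, and λ is assumed to be an integer for simplicity.

    # Two AIMD connections sharing ceiling C; connection 2's RTT is lam times longer.
    def rtt_bias(lam, C=60, steps=100000):
        c1 = c2 = C / 2.0
        sent1 = sent2 = 0.0
        for t in range(steps):
            sent1 += c1                 # connection 1 sends c1 packets per time unit
            sent2 += c2 / lam           # connection 2 sends a windowful every lam units
            c1 += 1                     # additive increase, once per (short) RTT
            if t % lam == 0:
                c2 += 1                 # additive increase, once per (long) RTT
            if c1 + c2 >= C:            # synchronized loss
                c1, c2 = c1 / 2, c2 / 2
        return sent1 / sent2

    print(rtt_bias(2), rtt_bias(3))     # approximately 4 and 9, that is, lambda squared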
in the second paragraph of this section. This results in a larger cwnd than the synchronized-loss hypothesis
would predict.
In the diagram above, the A–C connection wants its cwnd to be about 200 ms × 10 packets/ms = 2,000
packets; it is competing for the R–C link with the B–C connection which is happy with a cwnd of 22. If R's
queue capacity is also about 20, then with most of the bandwidth the B–C connection will experience a loss
about every 20 RTTs, which is to say every 22 ms. If the A–C link shares even a modest fraction of those
losses, it is indeed in trouble.
However, the A–C cwnd cannot fall below 1.0; to test the 10,000-fold hypothesis taking this constraint into
account we would have to scale up the numbers on the B–C link so the transit capacity there was at least
10,000. This would mean a 400 Gbps R–C bandwidth, or else an unrealistically large A–R delay.
As a second issue, realistically the A–C link is much more likely to have its bottleneck somewhere in the
middle of its long path. In a typical real scenario along the lines of that diagrammed above, B, C and R are
all local to a site, and bandwidth of long-haul paths is almost always less than the local LAN bandwidth
within a site. If the A–R path has a 1 packet/ms bottleneck somewhere, then it may be less likely to be as
dramatically affected by B–C traffic.
A few actual simulations using the methods of 16.3 Two TCP Senders Competing resulted in an average
cwnd for the A–C connection of between 1 and 2, versus a B–C cwnd of 20-25, regardless of whether the
two links shared a bottleneck or if the A–C link had its bottleneck somewhere along the A–R path. This may
suggest that the A–C connection was indeed saved by the 1.0 cwnd minimum.
As another example, known as the parking-lot topology, suppose we have the following network:
There are four connections: one from A to D covering all three links, and three single-link connections A–B,
B–C and C–D. Each link has the same bandwidth. If bandwidth allocations are incrementally distributed
among the four connections, then the first point at which any link bandwidth is maxed out occurs when all
four connections each have 50% of the link bandwidth; max-min fairness here means that each connection
has an equal share.
Suppose the A–B and B–C links have bandwidth 1 unit, and we have three connections A–B, B–C and A–C.
Then a proportionally fair solution is to give the A–C connection a bandwidth of 1/3 and each of the A–B and B–C
connections a bandwidth of 2/3 (so each link has a total bandwidth of 1). For any change Δb in the bandwidth for
the A–C connection, the A–B and B–C connections each change by −Δb. Equilibrium is achieved at the point where a 1%
reduction in the A–C bandwidth results in two 0.5% increases, that is, the bandwidths are divided in proportion
1:2. Mathematically, if x is the throughput of the A–C connection, we are maximizing log(x) + 2log(1−x).
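A quick numeric check (a sketch; the function name is ours) confirms that x = 1/3 is the value maximizing this proportional-fairness objective:

    import math

    # Proportional fairness for the A-B / B-C / A-C example: if the A-C connection
    # gets throughput x, each single-link connection gets 1-x, and we maximize the
    # sum of the logs of the three throughputs.
    def objective(x):
        return math.log(x) + 2 * math.log(1 - x)

    best = max((objective(i / 1000.0), i / 1000.0) for i in range(1, 1000))
    print(best[1])    # 0.333: the A-C connection's proportionally fair share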
Proportional fairness partially addresses the problem of TCP Reno's bias against long-RTT connections;
specifically, TCP's bias here is still not proportionally fair, but TCP's response is closer to proportional
fairness than it is to max-min fairness.
The average cwnd in this scenario is 3/2 N (that is, the average of N=cwndmin and 2N=cwndmax). If we
let M = 3/2 N represent the average cwnd, cwndmean, we can express the above loss rate in terms of M: the
number of packets between losses is 2/3 M², and so p = 3/2 M⁻².
Now let us solve this for M=cwndmean in terms of p; we get M² = 3/2 p⁻¹ and thus
M = cwndmean = 1.225/√p
where 1.225 is the square root of 3/2. Seen in this form, a given network loss rate sets the window size; this
loss rate is ultimately tied to the network capacity. If we are interested in the maximum cwnd instead of
the mean, we multiply the above by 4/3.
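In code form this is just the formula above, with the constant √(3/2) ≈ 1.225 computed directly rather than typed in:

    import math

    # Mean cwnd as a function of the per-packet loss rate p: cwnd_mean = sqrt(3/(2p)).
    def cwnd_mean(p):
        return math.sqrt(1.5 / p)

    for p in (1e-2, 1e-4, 1e-6):
        print(p, round(cwnd_mean(p), 1))    # 12.2, 122.5, 1224.7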
From the above, the bandwidth available to a connection is now as follows (though RTT may not be constant):
bandwidth = cwnd/RTT = 1.225/(RTT×√p)
in terms of the bandwidth and RTT ratios, without using the synchronized-loss hypothesis.
Note that we are comparing here the total number of loss events (or loss responses), that is, the total number of TCP
Reno teeth over a large time interval, and not the relative per-packet loss probabilities. One connection
might have numerically more losses than a second connection but, by dint of a smaller RTT, send more
packets between its losses than the other connection and thus have fewer losses per packet.
Let losscount1 and losscount2 be the number of loss responses for each connection over a long time interval
T. The ith connection's per-packet loss probability is pi = losscounti/(bandwidthi×T) = (losscounti×RTTi)/(cwndi×T). But by the result of 14.5 TCP Reno loss rate versus cwnd, we also have cwndi =
that assuming losses are independent events, which is definitely not quite right but which is often Close
Enough in a long-enough time interval, all connections sharing a common bottleneck can be expected to
experience approximately the same packet loss rate.
The point of TCP Friendliness is to regulate the number of the non-Reno connection's outstanding packets
in the presence of competition with TCP Reno, so as to achieve a degree of fairness. In the absence of
competition, the number of any connection's outstanding packets will be bounded by the transit capacity
plus the capacity of the bottleneck queue. Some non-Reno protocols (eg TCP Vegas, 15.4 TCP Vegas, or
constant-rate traffic, 14.6.2 RTP) may in the absence of competition have a loss rate of zero, simply
because they never overflow the queue.
Another way to approach TCP Friendliness is to start by defining Reno Fairness to be the bandwidth
allocations that TCP Reno assigns in the face of competition. TCP Friendliness then simply means that the
given non-Reno connection will get its Reno-Fair share: not more, not less.
We will return to TCP Friendliness in the context of general AIMD in 14.7 AIMD Revisited.
14.6.1 TFRC
TFRC, or TCP-Friendly Rate Control, RFC 3448, uses the loss rate experienced, p, and the formulas above
to calculate a sending rate. It then allows sending at that rate; that is, TFRC is rate-based rather than window-based. As the loss rate increases, the sending rate is adjusted downwards, and so on. However, adjustments
are done more smoothly than with TCP.
From RFC 5348:
TFRC is designed to be reasonably fair when competing for bandwidth with TCP flows, where
we call a flow reasonably fair if its sending rate is generally within a factor of two of the
sending rate of a TCP flow under the same conditions. [emphasis added; a factor of two might
not be considered close enough in some cases.]
The penalty of having smoother throughput than TCP while competing fairly for bandwidth is that TFRC
responds more slowly than TCP to changes in available bandwidth.
TFRC senders include in each packet a sequence number, a timestamp, and an estimated RTT.
The TFRC receiver is charged with sending back feedback packets, which serve as (partial) acknowledgments, and also include a receiver-calculated value for the loss rate over the previous RTT. The response
packets also include information on the current actual RTT, which the sender can use to update its estimated
RTT. The TFRC receiver might send back only one such packet per RTT.
The actual response protocol has several parts, but if the loss rate increases, then the primary feedback
mechanism is to calculate a new (lower) sending rate, using some variant of the cwnd = k/√p formula, and
then shift to that new rate. The rate would be cut in half only if the loss rate p quadrupled.
Newer versions of TFRC have various features for responding more promptly to an unusually sudden
problem, but in normal use the calculated sending rate is used most of the time.
14.6.2 RTP
The Real-Time Protocol, or RTP, is sometimes (though not always) coupled with TFRC. RTP is a UDP-based protocol for streaming time-sensitive data.
Some RTP features include:
• The sender establishes a rate (rather than a window size) for sending packets
• The receiver returns periodic summaries of loss rates
• ACKs are relatively infrequent
• RTP is suitable for multicast use; a very limited ACK rate is important when every packet sent might have hundreds of recipients
• The sender adjusts its cwnd-equivalent up or down based on the loss rate and the TCP-friendly cwnd = k/√p rule
• Usually some sort of stability rule is incorporated to avoid sudden changes in rate
As a common RTP example, a typical VoIP connection using a DS0 (64 Kbps) rate might send one packet
every 20 ms, containing 160 bytes of voice data, plus headers.
For a combination of RTP and TFRC to be useful, the underlying application must be rate-adaptive, so that
the application can still function when the available rate is reduced. This is often not the case for simple
VoIP encodings; see 19.11.4 RTP and VoIP.
We will return to RTP in 19.11 Real-time Transport Protocol (RTP).
Geometrically, the number of packets sent per tooth is the area of the tooth, so two connections with the
same per-packet loss rate will have teeth with the same area. TCP Friendliness means that two connections
will have the same mean cwnd and thus the same tooth height. If the teeth of two connections have the
same area and the same height, they must have the same width (in RTTs), and thus the rates of loss per
unit time must be equal, not just the rates of loss per number of packets.
The diagram below shows a TCP Reno tooth (blue) together with some unfriendly AIMD(α,β) teeth on the
left (red) and two friendly teeth on the right (green); the second friendly tooth is superimposed on the Reno
tooth.
The additional dashed lines within the central Reno tooth demonstrate the Reno 1–1–2 dimensions (width w,
minimum height w, maximum height 2w), and show that the horizontal dashed line, representing cwndmean,
is at height 3/2 w, where w is, as before, the width.
In the rightmost green tooth, superimposed on the Reno tooth, we can see that h = (3/2)×w + (α/2)×w. We
already know h = (α/β)×w; setting these expressions equal, canceling the w and multiplying by 2 we get
(3+α) = 2α/β, or β = 2α/(3+α). Solving for α we get
α = 3β/(2−β)
or α ≈ 1.5β for small β. As the reduction factor 1−β gets closer to 1, the protocol can remain TCP-friendly
by appropriately reducing α; eg AIMD(1/5, 1/8).
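The relationship is easy to tabulate; the sketch below simply evaluates α = 3β/(2−β) for a few values of β, including TCP Reno itself and the AIMD(1/5, 1/8) example.

    # TCP-friendly AIMD(alpha, beta): alpha = 3*beta/(2 - beta)
    def friendly_alpha(beta):
        return 3 * beta / (2 - beta)

    for beta in (0.5, 0.25, 0.125, 0.01):
        print(beta, friendly_alpha(beta))
    # 0.5   -> 1.0    (TCP Reno: AIMD(1, 1/2))
    # 0.25  -> 0.4286 (= 3/7)
    # 0.125 -> 0.2    (the AIMD(1/5, 1/8) example)
    # 0.01  -> 0.0151 (about 1.5*beta for small beta)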
Having a small β means that a connection does not have sudden bandwidth drops when losses occur; this can
be important for applications that rely on a regular rate of data transfer (such as voice). Such applications
are sometimes said to be slowly responsive, in contrast to TCP's cwnd = cwnd/2 rapid response.
D in their cwnd values. The two connections will each increase cwnd by α each RTT, and so when losses are not occurring D will remain constant. At loss events, D will be reduced by a factor of 1-β. If β=1/4, corresponding to α=3/7, then at each loss event D will be reduced only to (3/4)×D, and the half-life of D will be almost twice as large. The two connections will still converge to fairness as D→0, but it will take twice as long.
14.8.1 DECbit
In the congestion-avoidance technique proposed in [RJ90], routers encountering early signs of congestion marked the packets they forwarded; senders used these markings to adjust their window size. The system became known as DECbit in reference to the authors' employer and was implemented in DECnet (closely related to the OSI protocol suite), though apparently there was never a TCP/IP implementation. The idea behind DECbit eventually made it into TCP/IP in the form of ECN, below, but while ECN, like TCP's other congestion responses, applies control near the congestion cliff, DECbit proposed introducing control when congestion was still minimal, just above the congestion knee.
The DECbit mechanism allowed routers to set a designated congestion bit. This would be set in the data
packet being forwarded, but the status of this bit would be echoed back in the corresponding ACK (otherwise
the sender would never hear about the congestion).
DECbit routers defined congestion as an average queue size greater than 1.0; that is, congestion meant that
the connection was just past the knee. Routers would set the congestion bit whenever this average-queue
condition was met.
The target for DECbit senders would then be to have 50% of packets marked as congested. If fewer than
50% of packets were marked, cwnd would be incremented by 1; if more than 50% were marked, then cwnd
would be decreased by a factor of 0.875. Note this is very different from the TCP approach in that DECbit
begins marking packets at the congestion knee while TCP Reno responds only to packet losses which
occur just over the cliff.
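A sketch of the sender-side rule just described (the per-window bookkeeping and names here are our own; the actual DECnet implementation differs in detail):

def decbit_adjust(cwnd, marked, total):
    # Called once per window: 'marked' of the 'total' acknowledged packets
    # came back with the congestion bit set.
    if total > 0 and marked / total > 0.5:
        return cwnd * 0.875    # multiplicative decrease
    return cwnd + 1            # additive increase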
A consequence of this knee-based mechanism is that DECbit shoots for very limited queue utilization, unlike TCP Reno. At a congested router, a DECbit connection would attempt to keep about 1.0 packets in the router's queue, while a TCP Reno connection might fill the remainder of the queue. Thus, DECbit would in principle compete poorly with any connection where the sender ignored the marked packets and simply tried to keep cwnd as large as possible. As we will see in 15.4 TCP Vegas, TCP Vegas also strives for limited queue utilization; in 16.5 TCP Reno versus TCP Vegas we investigate through simulation how fairly TCP Vegas competes with TCP Reno.
[Table: the cwnd values 8.3, 83, 833, 8333 and 83333 needed to sustain successively higher throughputs, together with the correspondingly small packet loss rates; the final row corresponds to 10 Gbps.]
Note the very small loss probability needed to support 10 Gbps; this works out to a bit error rate of less than 2×10⁻¹⁴. For fiber optic data links, alas, a physical bit error rate of 10⁻¹³ is often considered acceptable; there is thus no way to support the window size of the final row above.
Here is a similar table, expressing cwnd in terms of the packet loss rate:
Packet Loss Rate P     cwnd
10⁻²                   12
10⁻³                   38
10⁻⁴                   120
10⁻⁵                   379
10⁻⁶                   1,200
10⁻⁷                   3,795
10⁻⁸                   12,000
10⁻⁹                   37,948
10⁻¹⁰                  120,000
The above two tables indicate that large window sizes require extremely small drop rates. This is the high-bandwidth-TCP problem: how do we maintain a large window when a path has a large bandwidth×delay product? The primary issue is that non-congestive (noise) packet losses bring the window size down, potentially far below where it could be. A secondary issue is that, even if such random drops are not significant, the increase of cwnd to a reasonable level can be quite slow. If the network ceiling were about 2,000 packets, then the normal sawtooth return to the ceiling after a loss would take 1,000 RTTs. This is slow, but the sender would still average 75% throughput, as we saw in 13.7 TCP and Bottleneck Link Utilization. Perhaps more seriously, if the network ceiling were to double to 4,000 packets due to decreases in competing traffic, it would take the sender an additional 2,000 RTTs to reach the point where the link was saturated.
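The second table above can be reproduced, to within rounding of the coefficient, from the cwnd = 1.22/√p relationship of 14.5 TCP Reno loss rate versus cwnd; a quick sketch:

import math

for exp in range(2, 11):
    p = 10.0 ** (-exp)
    # cwnd needed to sustain an average loss rate p
    print(f"p = 10^-{exp}:  cwnd = {1.22 / math.sqrt(p):,.0f}")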
In the following diagram, the network ceiling and the ideal TCP sawtooth are shown in green. The ideal
TCP sawtooth should range between 50% and 100% of the ceiling; in the diagram, noise or non-congestive
losses occur at the red x's, bringing down the throughput to a much lower average level.
bandwidth×delay product is much in excess of 1.22/√p, then the sender will be unable to maintain a cwnd close to the network ceiling.
14.12 Epilog
TCP Reno's core congestion algorithm is based on algorithms in Jacobson and Karels' 1988 paper [JK88],
now twenty-five years old. There are concerns both that TCP Reno uses too much bandwidth (the greediness
issue) and that it does not use enough (the high-bandwidth-TCP problem).
In the next chapter we consider alternative versions of TCP that attempt to solve some of the above problems
associated with TCP Reno.
14.13 Exercises
1. Consider the following network, where the bandwidths marked are all in packets/ms. C is sending to D
using sliding windows and A and B are idle.
A            B
|100         |100
R1-----------R2
|100         |100
C            D
Suppose the propagation delay on the 100 packet/ms links is 1 ms, and the propagation delay on the R1–R2 link is 2 ms. The RTTnoLoad for the C–D path is thus about 8 ms, for a bandwidth×delay product of 40 packets. If C uses winsize = 50, then the queue at R1 will have size 10.
Now suppose A starts sending to B using sliding windows, also with winsize = 50. What will be the size of the queue at R1?
Hint: by symmetry, the queue will be equally divided between A's packets and C's, and A and C will each see a throughput of 2.5 packets/ms. RTTnoLoad, however, does not change.
2. In the previous exercise, give the average number of data packets in transit on each link:
(a). for the original case in which C is the only sender, with winsize = 50 (the only active links here are C–R1, R1–R2 and R2–D).
(b). for the new case in which A is also sending, also with winsize = 50. In this case all links are active.
Each link will also have an equal number of ACK packets in transit in the reverse direction.
3. Consider the C–D path from the diagram of 14.2.4 Example 4: cross traffic and RTT variation:
C----100----R1----------R2----100----D
(a). Give propagation delays for the links C–R1 and R2–D so that there will be an average of 5 packets in transit on the C–R1 and R2–D links, in each direction, if C uses a winsize sufficient to saturate the bottleneck R1–R2 link.
(b). Give propagation delays for all three links so that, when C uses a winsize equal to the round-trip transit capacity, there are 5 packets each way on the C–R1 link, 10 on the R1–R2 link, and 20 on the R2–D link.
4. Suppose we have the network layout below of 14.2.4 Example 4: cross traffic and RTT variation, except that the R1–R2 bandwidth is 6 packets/ms and the R2–R3 bandwidth is 3. Suppose also that A and C have settled upon window sizes so that each contributes 30 packets to R1's queue and thus each has 50% of the bandwidth. R2 will then be sending 3 packets/ms to R3 and so will have no queue.
Now A's winsize is incremented by 10, initially, at least, leading to A contributing more than 50% of R1's queue. When the steady state is reached, how will these extra 10 packets be distributed between R1 and R2?
Hint: as A's winsize increases, A's overall throughput cannot rise due to the bandwidth restriction of the R2–R3 link.
[Diagram: A and C each connect to R1 by 100 packet/ms links; R1 connects to R2, and R2 to R3; D is reached from R2 and B from R3, each by a 100 packet/ms link.]
(a). Suppose that A and C have window sizes such that, with both transmitting, each has 30 packets in the queue at R1. What is C's winsize? Hint: C's bandwidth is now 3 packets/ms.
(b). Now suppose C's winsize, with A idle, is 60. In this case the C–D transit capacity would be 5 ms × 6 packets/ms = 30 packets, and so C would have 60-30 = 30 packets in R1's queue. A then begins sending, with a winsize chosen so that A and C's contributions to R1's queue are equal; C's winsize remains at 60. What will be C's (and thus A's) queue usage at R1? Hint: find the transit capacity for a bandwidth of 3 packets/ms.
(c). Suppose the A–B RTTnoLoad is 10 ms. If C's winsize is 60, find the winsize for A that makes A and C's contributions to R1's queue equal.
6. One way to address the reduced bandwidth TCP Reno gives to long-RTT connections is for all connections to use an increase increment of RTT² instead of 1; that is, everyone uses AIMD(RTT², 1/2) instead of AIMD(1, 1/2) (or AIMD(k×RTT², 1/2), where k is an arbitrary scaling factor that applies to everyone).
(a). Construct a table in the style of 14.3.2 Example 3: Longer RTT above, showing the result of two connections using this strategy, where one connection has RTT = 1 and the other has RTT = 2. Start the connections with cwnd = RTT², and assume a loss occurs when cwnd1 + cwnd2 > 24.
(b). Explain why this strategy might not be desirable if one connection is over a direct LAN with an RTT of 1 ms, while the second connection has a very long path and an RTT of 1.0 sec.
7. For each value of α or β below, find the other value so that AIMD(α,β) is TCP-friendly.
(a). α = 1/5
(b). β = 2/9
(c). β = 1/5
Then pick the pair that has the smallest α, and draw a sawtooth diagram that is approximately proportional: α should be the slope of the linear increase, and β should be the decrease fraction at the end of each tooth.
8. Suppose two TCP flows compete. The first flow uses AIMD(α1, β1) and the second uses AIMD(α2, β2); neither flow is necessarily TCP-Reno-friendly. The two connections, however, compete fairly with one another; that is, they have the same average packet-loss rates. Show that α1/β1 = ((2-β2)/(2-β1)) × α2/β2. Assume regular losses, and use the methods of 14.7 AIMD Revisited.
9. Suppose two 1KB packets are sent as part of a packet-pair probe, and the minimum time measured
between arrivals is 5 ms. What is the estimated bottleneck bandwidth?
10. Consider the following three causes of a 1-second network delay between A and B. In all cases, assume
ACKs travel instantly from B back to A.
(i) An intermediate router with a 1-second-per-packet bandwidth delay; all other bandwidth
delays negligible
(ii) An intermediate link with a 1-second propagation delay; all bandwidth delays negligible
(iii) An intermediate router with a 100-ms-per-packet bandwidth delay, and a steadily
replenished queue of 10 packets, from another source (as in the diagram in 14.2.4 Example 4:
cross traffic and RTT variation).
How might a sender distinguish between these three cases? Hint: consider packet pairs, or some variant.
11. Consider again the three-link parking-lot network from 14.4.1 Max-Min Fairness:
(a). Suppose we have two end-to-end connections, in addition to one single-link connection for each link.
Find the max-min-fair allocation.
(b). Suppose we have a single end-to-end connection, and one B–C and C–D connection, but two A–B connections. Find the max-min-fair allocation.
12. Suppose there are two A–C connections, one A–B connection and one A–C connection. Find the allocation that is proportionally fair.
13. Suppose we use TCP Reno to send K packets over R RTT intervals. The transmission experiences n
not-necessarily-uniform loss events; the TCP cwnd graph thus has n sawtooth peaks of heights N1 through
Nn . At the start of the graph, cwnd = A, and at the end of the graph, cwnd = B. Show that the sum N1 +
... + Nn is 2(R+A-B), and in particular the average tooth height is independent of the distribution of the loss
events.
14. Suppose TCP Reno has regularly spaced sawtooth peaks of the same height, but the packet losses come in pairs, with just enough separation that both losses in a pair are counted separately. N is large enough that the spacing between the two losses is negligible. The net effect is that each large-scale tooth ranges from height N/4 to N. As in 14.5 TCP Reno loss rate versus cwnd, cwndmean = K/√p for some constant K. Find the constant. Note that the loss rate here is p = 2/(number of packets sent in one tooth).
15. As in the previous exercise, suppose a TCP transmission has large-scale teeth of height N. Between each pair of consecutive large teeth, however, there are K-1 additional losses resulting in K-1 additional tiny teeth; N is large enough that these tiny teeth can be ignored. A non-Reno variant of TCP is used, so that between these tiny teeth cwnd is assumed not to be cut in half; during the course of these tiny teeth cwnd does not change much at all. The large-scale tooth has width N/2 and height ranging from N/2 to N, and there are K losses per large-scale tooth. Find the ratio cwnd/(1/√p), in terms of K. When K=1 your answer should reduce to that derived in 14.5 TCP Reno loss rate versus cwnd.
16. Suppose a TCP Reno tooth starts with cwnd = c, and contains N packets. Let w be the width of the tooth, in RTTs as usual. Show that w = (c² + 2N)^1/2 - c. Hint: the maximum height of the tooth will be c+w, and so the average height will be c + w/2. Find an equation relating c, w and N, and solve for w using the quadratic formula.
17. Suppose in a TCP Reno run each packet is equally likely to be lost; the number of packets N in each tooth will therefore be distributed exponentially. That is, N = -k log(X), where X is a uniformly distributed random number in the range 0<X<1 (k, which does not really matter here, is the mean interval between losses). Write a simple program that simulates such a TCP Reno run. At the end of the simulation, output an estimate of the constant C in the formula cwndmean = C/√p. You should get a value of about 1.31, as in the formula in 14.5.1 Irregular teeth.
Hint: There is no need to simulate packet transmissions; we simply create a series of teeth of random size,
and maintain running totals of the number of packets sent, the number of RTT intervals needed to send them,
and the number of loss events (that is, teeth). After each loss event (each tooth), we update:
total_packets += packets sent in this tooth
RTT_intervals += RTT intervals in this tooth
loss_events += 1 (one tooth = one loss event)
If a loss event marking the end of one tooth occurs at a specific value of cwnd, the next tooth begins at height c = cwnd/2. If N is the random value for the number of packets in this tooth, then by the previous exercise the tooth width in RTTs is w = (c² + 2N)^1/2 - c; the next peak (that is, loss event) therefore occurs when cwnd = c+w. Update the totals as above and go on to the next tooth. It should be possible to run this simulation for 1 million teeth in modest time.
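A minimal sketch of the simulation described in the hint (the names are ours; the starting cwnd is arbitrary and its influence disappears after the first few teeth):

import math, random

def estimate_C(teeth=1000000, k=1000.0, seed=1):
    random.seed(seed)
    cwnd = 20.0                  # arbitrary starting peak
    total_packets = 0.0
    total_RTTs = 0.0
    loss_events = 0
    for _ in range(teeth):
        c = cwnd / 2.0                             # tooth starts at half the previous peak
        N = -k * math.log(1.0 - random.random())   # exponentially distributed tooth size
        w = math.sqrt(c*c + 2*N) - c               # tooth width in RTTs, from exercise 16
        total_packets += N
        total_RTTs += w
        loss_events += 1
        cwnd = c + w                               # peak reached at the next loss
    p = loss_events / total_packets                # loss rate
    cwnd_mean = total_packets / total_RTTs         # average packets per RTT
    return cwnd_mean * math.sqrt(p)                # the constant C; about 1.31

print(estimate_C())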
18. Suppose two TCP connections have the same RTT and share a bottleneck link, for which there is no other competition. The size of the bottleneck queue is negligible when compared to the bandwidth × RTTnoLoad product. Loss events occur at regular intervals.
In Exercise 12 of the previous chapter, you were to show that if losses are synchronized then the two connections together will use 75% of the total bottleneck-link capacity.
Now assume the two TCP connections have no losses in common, and, in fact, alternate losses at regular
intervals as in the following diagram.
Both connections have a maximum cwnd of C. When Connection 1 experiences a loss, Connection 2 will
have cwnd = 75% of C, and vice-versa.
(a). What is the combined transit capacity of the paths, in terms of C? (Because the queue size is
negligible, the transit capacity is approximately the sum of the cwnds at the point of loss.)
(b). Find the bottleneck-link utilization. Hint: it should be at least 85%.
Since the rise of TCP Reno, several TCP alternatives to Reno have been developed; each attempts to address some perceived shortcoming of Reno. While many of them are very specific attempts to address the high-bandwidth problem we considered in 14.9 The High-Bandwidth TCP Problem, some focus primarily or entirely on other TCP Reno foibles. One such issue is TCP Reno's greediness in terms of queue utilization; another is the lossy-link problem (14.10 The Lossy-Link TCP Problem) experienced by, say, Wi-Fi users.
Generally speaking, a TCP implementation can respond to congestion at the cliff (that is, it can respond to packet losses) or can respond to congestion at the knee (that is, it can detect the increase in RTT associated with the filling of the queue). These strategies are sometimes referred to as loss-based and delay-based, respectively; the latter term because of the rise in RTT. TCP implementers can tweak both the loss response (the multiplicative decrease of TCP Reno) and also the way TCP increases its cwnd in the absence of loss. There is a rich variety of options available.
The concept of monitoring the RTT to avoid congestion at the knee was first introduced in TCP Vegas
(15.4 TCP Vegas). One striking feature of TCP Vegas is that, in the absence of competition, the queue
may never fill, and thus there may not be any congestive losses. The TCP sawtooth, in other words, is not
inevitable.
When losses do occur, most of the mechanisms reviewed here continue to use the TCP NewReno recovery
strategy. As most of the implementations here are relatively recent, the senders can generally expect that the
receiving end will support SACK TCP, which allows more rapid recovery from multiple losses.
On linux systems, the TCP congestion-control mechanism can be set by writing an appropriate string to
/proc/sys/net/ipv4/tcp_congestion_control; the options on my system as of this writing
are
highspeed
htcp
hybla
illinois
vegas
veno
westwood
bic
cubic
We review several of these below. TCP Cubic is currently (2013) the default linux congestion-control
implementation; TCP Bic was a precursor.
15.2 RTTs
The exact performance of some of the faster TCPs we consider (for that matter, the exact performance of TCP Reno) is influenced by the RTT. This may affect individual TCP performance and also competition between different TCPs. For reference, here are a few typical RTTs from Chicago to various other places:
• US West Coast: 50-100 ms
• Europe: 100-150 ms
• Southeast Asia: 100-200 ms
[Table: Highspeed-TCP's N(cwnd) values, rising from 1.0 for small cwnd through 1.4, 3.6 and 9.2 up to 23.0 for the largest cwnd shown.]
This might be an appropriate time to point out that in TCP Reno, the cwnd-versus-time graph between
losses is actually slightly concave (lying below its tangent). We do get a strictly linear graph if we plot
cwnd as a function of the count of elapsed RTTs, but the RTTs are themselves slowly increasing as a
function of time once the queue starts filling up. At that point, the cwnd-versus-time graph bends slightly
down. If the bottleneck queue capacity matches the total path transit capacity, the RTTs for a full queue are
about double the RTTs for an empty queue.
In general, when Highspeed-TCP competes with a new TCP Reno flow it is N times as aggressive, and grabs
N times the bandwidth, where N = N(cwnd). In this it behaves very much like N separate TCP flows, or,
more precisely, N separate TCP flows that have all their loss events completely synchronized.
Note that if the bottleneck router used Fair Queuing (to be introduced in 18.5 Fair Queuing) on a per-connection basis, then the TCP Reno connection's queue greediness would not be of any benefit, and both connections would get similar shares of bandwidth, with the TCP Vegas connection experiencing lower delay.
Let us next consider how TCP Vegas behaves when there is an increase in RTT due to the kind of cross traffic shown in 14.2.4 Example 4: cross traffic and RTT variation and again in the diagram below. Let A–B be the TCP Vegas connection and assume that its queue-size target is 4 packets (eg α=3, β=5). We will also assume that the RTTnoLoad for the A–B path is about 5 ms and the RTT for the C–D path is also low. As before, the link labels represent bandwidths in packets/ms, meaning that the round-trip A–B transit capacity is 10 packets.
Initially, in the absence of C–D traffic, the A–B connection will send at a rate of 2 packets/ms (the R2–R3 bottleneck), and maintain a queue of four packets at R2. Because the round-trip transit capacity is 10 packets, this will be achieved with a window size of 10+4 = 14.
Now let the C–D traffic start up, with a winsize so as to keep about four times as many packets in R1's queue as A, once the new steady-state is reached. If all four of the A–B connection's queue packets end up now at R1 rather than R2, then C would need to contribute at least 16 packets. These 16 packets will add a delay of about 16/5 ≈ 3 ms; the A–B path will see a more-or-less-fixed 3 ms increase in RTT. A will also see a decrease in bandwidth due to competition; with C consuming 80% of R1's queue, A's share will fall to 20% and thus its bandwidth will fall to 20% of the R1–R2 link bandwidth, that is, 1 packet/ms. Denote this new value by BWEnew. TCP Vegas will attempt to decrease cwnd so that

cwnd ≈ BWEnew × RTTnoLoad + 4

A's estimate of RTTnoLoad, as RTTmin, will not change; the RTT has gotten larger, not smaller. So we have BWEnew × RTTnoLoad ≈ 1 packet/ms × 5 ms = 5 packets; adding the 4 reserved for the queue, the new value of cwnd is now about 9, down from 14.
On the one hand, this new value of cwnd does represent 5 packets now in transit, plus 4 in R1's queue; this is indeed the correct response. But note that this division into transit and queue packets is an average. The actual physical A–B round-trip transit capacity remains about 10 packets, meaning that if the new packets were all appropriately spaced then none of them might be in any queue.
petition with TCP Reno. Details can be found in [JWL04] and [WJLH06]. FAST TCP is patented; see
patent 7,974,195.
As with TCP Vegas, the sender estimates RTTnoLoad as RTTmin. At regular short fixed intervals (eg 20 ms) cwnd is updated via the following weighted average:

cwndnew = (1-γ)×cwnd + γ×((RTTnoLoad/RTT)×cwnd + α)

where γ is a constant between 0 and 1 determining how volatile the cwnd update is (γ=1 is the most volatile) and α is a fixed constant, which, as we will verify shortly, represents the number of packets the sender tries to keep in the bottleneck queue, as in TCP Vegas. Note that the cwnd update frequency is not tied to the RTT.
If RTT is constant for multiple consecutive update intervals, and is larger than RTTnoLoad, the above will converge to a constant cwnd, in which case we can solve for it. Convergence implies cwndnew = cwnd = (1-γ)×cwnd + γ×((RTTnoLoad/RTT)×cwnd + α), and from there we get cwnd×(RTT-RTTnoLoad)/RTT = α. As we saw in 6.3.2 RTT Calculations, cwnd/RTT is the throughput, and so α = throughput × (RTT-RTTnoLoad) is then the number of packets in the queue. In other words, FAST TCP, when it reaches a steady state, leaves α packets in the queue. As long as this is the case, the queue will not overflow (assuming α is less than the queue capacity).
Whenever the queue is not full, though, we have RTT = RTTnoLoad, in which case FAST TCP's cwnd-update strategy reduces to cwndnew = cwnd + γ×α. For γ=0.5 and α=10, this increments cwnd by 5. Furthermore, FAST TCP performs this increment at a specific rate independent of the RTT, eg every 20 ms; for long-haul links this is much less than the RTT. FAST TCP will, in other words, increase cwnd very aggressively until the point when queuing delays occur and RTT begins to increase.
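A sketch of the update rule as a bare function (RTT measurement and the fixed 20 ms timer are assumed to be handled elsewhere):

def fast_update(cwnd, rtt, rtt_noload, gamma=0.5, alpha=10.0):
    # cwnd_new = (1-gamma)*cwnd + gamma*((RTT_noLoad/RTT)*cwnd + alpha),
    # applied at fixed intervals independent of the RTT.
    return (1.0 - gamma) * cwnd + gamma * ((rtt_noload / rtt) * cwnd + alpha)

# With an empty queue (RTT == RTT_noLoad) each update simply adds gamma*alpha = 5;
# once RTT rises, cwnd converges to the value that leaves alpha packets in the queue.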
When FAST TCP is competing with TCP Reno, it does not directly address the queue-utilization competition problem experienced by TCP Vegas. FAST TCP will try to limit its queue utilization to α; TCP Reno, however, will continue to increase its cwnd until the queue is full. Once the queue begins to fill, TCP Reno will pull ahead of FAST TCP just as it did with TCP Vegas. However, FAST TCP does not reduce its cwnd in the face of TCP Reno competition as quickly as TCP Vegas.
Additionally, FAST TCP can often offset this Reno-competition problem in other ways as well. First, the value of α can be increased from the value of around 4 packets originally proposed for TCP Vegas; in [TWHL05] the value α=30 is suggested. Second, for high bandwidth×delay products, the queue-filling phase of a TCP Reno sawtooth (see 13.7 TCP and Bottleneck Link Utilization) becomes relatively smaller. In the earlier link-unsaturated phase of each sawtooth, TCP Reno increases cwnd by 1 each RTT. As noted above, however, FAST TCP is allowed to increase cwnd much more rapidly in this earlier phase, and so FAST TCP can get substantially ahead of TCP Reno. It may fall back somewhat during the queue-filling phase, but overall the FAST and Reno flows may compete reasonably fairly.
The diagram above illustrates a FAST TCP graph of cwnd versus time, in blue; it is superimposed over one
sawtooth of TCP Reno with the same network ceiling. Note that cwnd rises rapidly when it is below the
path transit capacity, and then levels off sharply.
The second problem with late-arriving ACKs is that they can lead to inaccurate or fluctuating measurements
of bandwidth, upon which both TCP Vegas and TCP Westwood depend. For example, if bandwidth is
estimated as cwnd/RTT, late-arriving ACKs can lead to inaccurate calculation of RTT. The original TCP
Westwood strategy was to estimate bandwidth from the spacing between consecutive ACKs, much as is
done with the packet-pairs technique (14.2.6 Packet Pairs) but smoothed with a suitable running average.
This strategy turned out to be particularly vulnerable to ACK-compression errors.
For TCP Vegas, ACK compression means that occasionally the sender's cwnd may fail to be decremented by 1; this does not appear to have a significant impact, perhaps because cwnd is changed by at most 1 each RTT. For Westwood, however, if ACK compression happens to be occurring at the instant of a packet loss, then a resultant transient overestimation of BWE may mean that the new post-loss cwnd is too large; at a point when cwnd was supposed to fall to the transit capacity, it may fail to do so. This means that the sender has essentially taken a congestion loss to be non-congestive, and ignored it. The influence of this ignored loss will persist through the much-too-high value of cwnd until the following loss event.
To fix these problems, TCP Westwood has been amended to Westwood+, by increasing the time interval over
which bandwidth measurements are made and by inclusion of an averaging mechanism in the calculation of
BWE. Too much smoothing, however, will lead to an inaccurate BWE just as surely as too little.
Suitable smoothing mechanisms are given in [FGMPC02] and [GM03]; the latter paper in particular examines several smoothing algorithms in terms of their resistance to aliasing effects: the tendency for intermittent measurement of a periodic signal (the returning ACKs) to lead to much greater inaccuracy than might initially be expected. One smoothing filter suggested by [GM03] is to measure BWE only over entire RTTs, and then to keep a cumulative running average as follows, where BWMk is the measured bandwidth over the kth RTT:

BWEk = α×BWEk-1 + (1-α)×BWMk

A suggested value of α is 0.9.
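As a sketch, the filter is an ordinary weighted running average applied to per-RTT measurements:

def update_bwe(bwe_prev, bwm, alpha=0.9):
    # BWE_k = alpha*BWE_{k-1} + (1-alpha)*BWM_k, where BWM_k is the bandwidth
    # measured over the kth RTT.
    return alpha * bwe_prev + (1.0 - alpha) * bwm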
• if Nqueue < β, the loss is probably not due to congestion; set cwnd = (4/5)×cwnd
• if Nqueue ≥ β, the loss probably is due to congestion; set cwnd = (1/2)×cwnd as usual

The idea here is that most router queues will have a total capacity much larger than β, so a loss with fewer than β packets in the queue likely does not represent a queue overflow. Note that, by comparison, TCP Westwood leaves cwnd unchanged if it thinks the loss is not due to congestion, and its threshold for making that determination is Nqueue = 0.
If TCP Veno encounters a series of non-congestive losses, the above rules make it behave like AIMD(1,0.8).
Per the analysis in 14.7 AIMD Revisited, this is equivalent to AIMD(2,0.5); this means TCP Veno will
be about twice as aggressive as TCP Reno in recovering from non-congestive losses. This would provide
a definite improvement in lossy-link situations with modest bandwidth×delay products, but may not be enough to make a major dent in the high-bandwidth problem.
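A sketch of the Veno loss response (Nqueue here is the sender's Vegas-style estimate of its own packets in the bottleneck queue; computing that estimate, and Veno's modified additive increase, are not shown):

def veno_loss_response(cwnd, n_queue, beta):
    if n_queue < beta:
        return cwnd * 0.8    # likely a random (non-congestive) loss: mild reduction
    return cwnd * 0.5        # likely queue overflow: the usual Reno halving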
Whenever RTT = RTTnoLoad, delay = 0 and so α(delay) = αmax. However, as soon as queuing delay just barely begins, we will have delay > delaythresh and so α(delay) begins to fall rather precipitously to αmin. The value of α(delay) is always positive, though, so cwnd will continue to increase (unlike TCP Vegas) until a congestive loss eventually occurs. However, at that point the change in cwnd is very small, which minimizes the probability that multiple packets will be lost.
Note that, as with FAST TCP, the increase in delay is used to trigger the reduction in α.
TCP Illinois also supports having β be an increasing function of delay, so that β(small_delay) might be 0.2 while β(larger_delay) might match TCP Reno's 0.5. However, the authors of [LBS06] explain that the adaptation of β as a function of average queuing delay is only relevant in networks where there are non-congestion-related losses, such as wireless networks or extremely high speed networks.
15.10 H-TCP
H-TCP, or TCP-Hamilton, is described in [LSL05]. Like Highspeed-TCP it primarily allows for faster
growth of cwnd; unlike Highspeed-TCP, the cwnd increment is determined not by the size of cwnd but
by the elapsed time since the previous loss event. The threshold for accelerated cwnd growth is generally
set to be 1.0 seconds after the most recent loss event. Using an RTT of 50 ms, that amounts to 20 RTTs,
suggesting that when cwndmin is less than 20 then H-TCP behaves very much like TCP Reno.
The specific H-TCP acceleration rule first defines a time threshold tL. If t is the elapsed time in seconds since the previous loss event, then for t ≤ tL the per-RTT window-increment α is 1. However, for t > tL we define

α(t) = 1 + 10(t-tL) + (t-tL)²/4

We then increment cwnd by α(t) after each RTT, or, equivalently, by α(t)/cwnd after each received ACK. At t = tL+1 seconds (nominally 2 seconds), α is 12. The quadratic term dominates the linear term when t-tL > 40. If RTT = 50 ms, that is 800 RTTs.
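A sketch of the increment function (tL defaults to the nominal 1.0-second threshold):

def htcp_alpha(t, tL=1.0):
    # Per-RTT cwnd increment as a function of the elapsed time t (in seconds)
    # since the most recent loss event.
    if t <= tL:
        return 1.0
    d = t - tL
    return 1.0 + 10.0*d + d*d/4.0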
Even if cwnd is very large, growth is at the same rate as for TCP Reno until t>tL ; one consequence of this
is that, at least in the first second after a loss event, H-TCP competes fairly with TCP Reno, in the sense that
both increase cwnd at the same absolute rate. H-TCP starts from scratch after each packet loss, and does
not re-enter its high-speed mode, even if cwnd is large, until after time tL .
A full H-TCP implementation also adjusts the multiplicative factor β as follows (the paper [LSL05] uses β to represent what we denote by 1-β). The RTT is monitored, as with TCP Vegas. However, the RTT increase is not used for per-packet or per-RTT adjustments; instead, these measurements are used after each loss event to update β so as to have

1-β = RTTmin/RTTmax

The value 1-β is capped at a maximum of 0.8, and at a minimum of 0.5. To see where the ratio above comes from, first note that RTTmin is the usual stand-in for RTTnoLoad, and RTTmax is, of course, the RTT when the bottleneck queue is full. Therefore, by the reasoning in 6.3.2 RTT Calculations, equation 5, 1-β is the ratio transit_capacity / (transit_capacity + queue_capacity). At a congestion event involving a single uncontested flow we have cwnd = transit_capacity + queue_capacity, and so after reducing cwnd to (1-β)×cwnd, we have cwndnew = transit_capacity, and hence (as in 13.7 TCP and Bottleneck Link Utilization) the bottleneck link will remain 100% utilized after a loss. The cap on 1-β of 0.8 means that if the queue capacity is smaller than a quarter of the transit capacity then the bottleneck link will experience some idle moments.
When β is changed, H-TCP also adjusts α to α = 2β×α(t), so as to improve fairness with other H-TCP connections with different current values of β.
As mentioned above, TCP Cubic is currently (2013) the default linux congestion-control implementation.
TCP Cubic is documented in [HRX08]. TCP Cubic is not described in an RFC, but there is an Internet Draft
http://tools.ietf.org/id/draft-rhee-tcpm-cubic-02.txt.
TCP Cubic has a number of interrelated features, in an attempt to address several TCP issues:
• Reduction in RTT bias
• TCP Friendliness when most appropriate
• Rapid recovery of cwnd following its decrease due to a loss event, maximizing throughput
• Optimization for an unchanged network ceiling (corresponding to cwndmax)
• Rapid expansion of cwnd when a raised network ceiling is detected
The eponymous cubic polynomial y = x³, appropriately shifted and scaled, is used to determine changes in cwnd. No special algebraic properties of this polynomial are used; the point is that the curve, while steadily increasing, is first concave and then convex; the authors of [HRX08] write "[t]he choice for a cubic function is incidental and out of convenience". This y = x³ polynomial has an inflection point at x=0 where the tangent line is horizontal; this is the point where the graph changes from concave to convex.
We start with the basic outline of TCP Cubic and then consider some of the bells and whistles. We assume a loss has just occurred, and let Wmax denote the value of cwnd at the point when the loss was discovered. TCP Cubic then sets cwnd to 0.8×Wmax; that is, TCP Cubic uses β = 0.2. The corresponding α for TCP-Friendly AIMD(α,β) would be α = 1/3, but TCP Cubic uses this α only in its TCP-Friendly adjustment, below.
We now define a cubic polynomial W(t), a shifted and scaled version of w = t³. The parameter t represents the elapsed time since the most recent loss, in seconds. At time t>0 we set cwnd = W(t). The polynomial W(t), and thus the cwnd rate of increase, as in TCP Hybla, is no longer tied to the connection's RTT; this is done to reduce if not eliminate the RTT bias that is so deeply ingrained in TCP Reno.
We want the function W(t) to pass through the point representing the cwnd just after the loss, that is, (t,W) = (0, 0.8×Wmax). We also want the inflection point to lie on the horizontal line y = Wmax. To fully determine the curve, it is at this point sufficient to specify the value of t at this inflection point; that is, how far horizontally W(t) must be stretched. This horizontal distance from t=0 to the inflection point is represented by the constant K in the following equation; W(t) returns to its pre-loss value Wmax at t=K. C is a second constant.
W(t) = C×(t-K)³ + Wmax
It suffices algebraically to specify either C or K; the two constants are related by the equation obtained by plugging in t=0. K changes with each loss event, but it turns out that the value of C can be constant, not only for any one connection but for all TCP Cubic connections. TCP Cubic specifies for C the ad hoc value 0.4; we can then set t=0 and, with a bit of algebra, solve to obtain

K = (Wmax/2)^1/3 seconds
If Wmax = 250, for example, K=5; if RTT = 100 ms, this is 50 RTTs.
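A sketch of the window calculation, with β and C as above:

def cubic_window(t, w_max, beta=0.2, C=0.4):
    # W(t) = C*(t-K)^3 + W_max, with K chosen so that W(0) = (1-beta)*W_max
    K = (w_max * beta / C) ** (1.0 / 3.0)
    return C * (t - K)**3 + w_max

print(cubic_window(0.0, 250))   # about 200: 0.8*Wmax, just after the loss
print(cubic_window(5.0, 250))   # about 250: Wmax, at the inflection point t = K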
When each ACK arrives, TCP Cubic records the arrival time t, calculates W(t), and sets cwnd = W(t). At
the next packet loss the parameters of W(t) are updated.
If the network ceiling does not change, the next packet loss will occur when cwnd again reaches the same
Wmax ; that is, at time t=K after the previous loss. As t approaches K and the value of cwnd approaches
Wmax , the curve W(t) flattens out, so cwnd increases slowly.
This concavity of the cubic curve, increasing rapidly but flattening near Wmax , achieves two things. First,
throughput is boosted by keeping cwnd close to the available path transit capacity. In 13.7 TCP and
Bottleneck Link Utilization we argued that if the path transit capacity is large compared to the bottleneck
queue capacity (and this is the case for which TCP Cubic was designed), then TCP Reno averages 75%
utilization of the available bandwidth. The bandwidth utilization increases linearly from 50% just after a
loss event to 100% just before the next loss. In TCP Cubic, the initial rapid rise in cwnd following a loss
means that the average will be much closer to 100%. Another important advantage of the flattening is that
when cwnd is finally incremented to the point of loss, it likely is just over the network ceiling; the connection
has an excellent chance that only one or two packets are lost rather than a large burst. This facilitates the
NewReno Fast Recovery algorithm, which TCP Cubic still uses if the receiver does not support SACK TCP.
Once t>K, W(t) becomes convex, and in fact begins to increase rapidly. In this region, cwnd > Wmax , and
so the sender knows that the network ceiling has increased since the previous loss. The TCP Cubic strategy
here is to probe aggressively for additional capacity, increasing cwnd very rapidly until the new network
ceiling is encountered. The cubic increase function is in fact quite aggressive when compared to any of the
other TCP variants discussed here, and time will tell what strategy works best. As an example in which
the TCP Cubic approach seems to pay off, let us suppose the current network ceiling is 2,000 packets, and
then (because competing connections have ended) increases to 3,000. TCP Reno would take 1,000 RTTs for
cwnd to reach the new ceiling, starting from 2,000; if one RTT is 50 ms that is 50 seconds. To find the time
t-K that TCP Cubic will need to increase cwnd from 2,000 to 3,000, we solve 3000 = W(t) = C×(t-K)³ + 2000, which works out to t-K ≈ 13.57 seconds (recall 2000 = W(K) here).
The constant C = 0.4 is determined empirically. The cubic inflection point occurs at t = K = (Wmax×β/C)^1/3.
A larger C reduces the time K between a loss event and the next inflection point, and thus the time between consecutive losses. If Wmax = 2000, we get K = 10 seconds when β=0.2 and C=0.4. If the RTT were 50 ms, 10 seconds would be 200 RTTs.
For TCP Reno, on the other hand, the interval between adjacent losses is Wmax /2 RTTs. If we assume a
specific value for the RTT, we can compare the Reno and Cubic time intervals between losses; for an RTT
of 50 ms we get
Wmax     Reno        Cubic
2000     50 sec      10 sec
250      6.2 sec     5 sec
54       1.35 sec    3 sec
For smaller RTTs, the basic TCP Cubic strategy above runs the risk of being at a competitive disadvantage compared to TCP Reno. For this reason, TCP Cubic makes a TCP-Friendly adjustment in the window-size calculation: on each arriving ACK, cwnd is set to the maximum of W(t) and the window size that TCP Reno would compute. The TCP Reno calculation can be based on an actual count of incoming ACKs, or be based on the formula (1-β)×Wmax + α×t/RTT.
Note that this adjustment is only half-friendly: it guarantees that TCP Cubic will not choose a window size smaller than TCP Reno's, but places no restraints on the choice of a larger window size.
A consequence of the TCP-Friendly adjustment is that, on networks with modest bandwidth×delay products, TCP Cubic behaves exactly like TCP Reno.
TCP Cubic also has a provision to detect if a given Wmax is lower than the previous value, suggesting
increasing congestion; in this situation, cwnd is lowered by an additional factor of 1/2. This is known as
fast convergence, and helps TCP Cubic adapt more quickly to reductions in available bandwidth.
The following graph is taken from [RX05], and shows TCP Cubic connections competing with each other
and with TCP Reno.
The diagram shows four connections, all with the same RTT. Two are TCP Cubic and two are TCP Reno.
The red connection, cubic-1, was established, with a maximum cwnd of about 4000 packets, when the
other three connections started. Over the course of 200 seconds the two TCP Cubic connections reach a fair
equilibrium; the two TCP Reno connections reach a reasonably fair equilibrium with one another, but it is
much lower than that of the TCP Cubic connections.
On the other hand, here is a graph from [LSM07], showing the result of competition between two flows using
an earlier version of TCP Cubic over a low-speed connection. One connection has an RTT of 160ms and the
other has an RTT a tenth that. The bottleneck bandwidth is 1 Mbit/sec, meaning that the bandwidth×delay product for the 160 ms connection is 13-20 packets (depending on the packet size used).
Note that the longer-RTT connection (the solid line) is almost completely starved, once the shorter-RTT
connection starts up at T=100. This is admittedly an extreme case, and there have been more recent fixes to
TCP Cubic, but it does serve as an example of the need for testing a wide variety of competition scenarios.
15.12 Epilog
TCP Reno's core congestion algorithm is based on algorithms in Jacobson and Karels' 1988 paper [JK88],
now twenty-five years old. There are concerns both that TCP Reno uses too much bandwidth (the greediness
issue) and that it does not use enough (the high-bandwidth-TCP problem).
There are also broad changes in TCP usage patterns. Twenty years ago, the vast majority of all TCP traffic
represented downloads from major servers. Today, over half of all Internet TCP traffic is peer-to-peer
rather than server-to-client. The rise in online video streaming creates new demands for excellent TCP
real-time performance.
So which TCP version to use? That depends on circumstances; some of the TCPs above are primarily
intended for relatively specific environments; for example, TCP Hybla for satellite links and TCP Veno
for mobile devices (including wireless laptops). If the sending and receiving hosts are under common
management, and especially if intervening traffic patterns are relatively stable, one can simply make sure
the receiver has what it needs for optimum performance (eg SACK TCP) and run a few simple experiments
to find what works best.
That leaves the question of what TCP to use on a server that is serving up large volumes of data, perhaps to a
range of disparate hosts and with a wide variety of competing-traffic scenarios. Experimentation works here
too, but likely with a much larger number of trials. There is also the possibility that one eventually finds a
solution that works well, only to discover that it succeeds at the expense of other, competing traffic. These
issues suggest a need for continued research into how to update and improve TCP, and Internet congestion-management generally.
Finally, while most new TCPs are designed to hold their own in a Reno world, there is some question that
perhaps we would all be better off with a radical rather than incremental change. Might TCP Vegas be a
better choice, if only the queue-grabbing greediness of TCP Reno could be restrained? Questions like these
are today entirely hypothetical, but it is not impossible to envision an Internet backbone that implemented
non-FIFO queuing mechanisms (18 Queuing and Scheduling) that fundamentally changed the rules of the
game.
15.13 Exercises
1. How would TCP Vegas respond if it estimated RTTnoLoad = 100ms, with a bandwidth of 1 packet/ms,
and then due to a routing change the RTTnoLoad increased to 200ms without changing the bandwidth? What
cwnd would be chosen? Assume no competition from other senders.
2. Suppose a TCP Vegas connection from A to B passes through a bottleneck router R. The RTTnoLoad is 50
ms and the bottleneck bandwidth is 1 packet/ms.
(a). If the connection keeps 4 packets in the queue (eg α=3, β=5), what will RTTactual be? What value of cwnd will the connection choose? What will be the value of BWE?
(b). Now suppose a competing (non-Vegas) connection keeps 6 packets in the queue to the Vegas connection's 4, eventually meaning that the other connection will have 60% of the bandwidth. What will be the Vegas connection's steady-state values for RTTactual, cwnd and BWE?
3. Suppose a TCP Vegas connection has R as its bottleneck router. The transit capacity is M, and the queue
utilization is currently Q>0 (meaning that the transit path is 100% utilized, although not necessarily by the
TCP Vegas packets). The current TCP Vegas cwnd is cwndV . Show that the number of packets TCP Vegas
calculates are in the queue, queue_use, is
queue_use = cwndV×Q/(Q+M)
4. Suppose that at time T=0 a TCP Vegas connection and a TCP Reno connection share the same path, and each has 100 packets in the bottleneck queue, exactly filling the transit capacity of 200. TCP Vegas uses α=1, β=2. By the previous exercise, in any RTT with cwndV TCP Vegas packets and cwndR TCP Reno packets in flight and cwndV+cwndR > 200, Nqueue is cwndV/(cwndV+cwndR) multiplied by the total queue utilization cwndV+cwndR-200.
Continue the following table, where T is measured in RTTs, up through the next two RTTs where cwndV
is not decremented; that is, find the next two rows where the TCP Vegas queue share is less than 2. (After
each of these RTTs, cwndV is not decremented.) This can be done either with a spreadsheet or by simple
algebra. Note that the TCP Reno cwndR will always increment.
T    cwndV    cwndR
0    100      100
1    101      101
2    102      102
3    101      103
4    101      104
5    100      105
6    99       106
This exercise attempts to explain the linear decrease in the TCP Vegas graph in the diagram in 16.5 TCP
Reno versus TCP Vegas. Competition with TCP Reno means not only that cwndV stops increasing, but in
fact it decreases by 1 most RTTs.
5. Suppose that, as in the previous exercise, a FAST TCP connection and a TCP Reno connection share the same path, and at T=0 each has 100 packets in the bottleneck queue, exactly filling the transit capacity of 200. The FAST TCP parameter γ is 0.5. The FAST TCP and TCP Reno connections have respective cwnds of cwndF and cwndR. You may use the fact that, as long as the queue is nonempty, RTT/RTTnoLoad = (cwndF+cwndR)/200.
Find the value of cwndF at T=40, where T is counted in units of 20 ms until T = 40, using α=4, α=10 and α=30. Assume RTT ≈ 20 ms as well. Use of a spreadsheet is recommended. The table here uses α=10.
T    cwndF     cwndR
0    100       100
1    105       101
2    108.47    102
3    110.77    103
4    112.20    104
6. Suppose A sends to B as in the layout below. The packet size is 1 KB and the bandwidth of the bottleneck R–B link is 1 packet / 10 ms; returning ACKs are thus normally spaced 10 ms apart. The RTTnoLoad for the A–B path is 200 ms.
[Diagram: A───R───B, with C also attached to R]
However, large amounts of traffic are also being sent from C to A; the bottleneck link for that path is R–A with bandwidth 1 KB / 5 ms. The queue at R for the R–A link has a capacity of 40 KB. ACKs are 50 bytes.
(a). What is the maximum possible arrival time difference on the A–B path for ACK[0] and ACK[20], if there are no queuing delays at R in the A→B direction? ACK[0] should be forwarded immediately by R; ACK[20] should have to wait for 40 KB at R.
(b). What is the minimum possible arrival time difference for the same ACK[0] and ACK[20]?
7. Suppose a TCP Veno and a TCP Reno connection compete along the same path; there is no other traffic.
Both start at the same time with cwnds of 50; the total transit capacity is 160. Both share the next loss
event. The bottleneck router's queue capacity is 60 packets; sometimes the queue fills and at other times it
is empty. TCP Veno's parameter β is zero, meaning that it shifts to a slower cwnd increment as soon as the queue just begins filling.
8. Suppose two connections use TCP Hybla. They do not compete. The first connection has an RTT of 100
ms, and the second has an RTT of 1000 ms. Both start with cwndmin = 0 (literally meaning that nothing is
sent the first RTT).
(a). How many packets are sent by each connection in four RTTs (involving three cwnd increases)?
(b). How many packets are sent by each connection in four seconds? Recall 1+2+3+...+N = N(N+1)/2.
9. Suppose that at time T=0 a TCP Illinois connection and a TCP Reno connection share the same path,
and each has 100 packets in the bottleneck queue, exactly filling the transit capacity of 200. The respective
cwnds are cwndI and cwndR . The bottleneck queue capacity is 100.
Find the value of cwndI at T=50, where T is the number of elapsed RTTs. At this point cwndR is, of course,
150.
T    cwndI    cwndR
0    100      100
1    101      101
2    ?        102
You may assume that the delay, RTT - RTTnoLoad, is proportional to queue_utilization = cwndI+cwndR-200. Using this expression to represent delay, delaymax = 100 and so delaythresh = 1. When calculating α(delay), assume αmax = 10 and αmin = 0.1.
10. Assume that a TCP connection has an RTT of 50 ms, and the time between loss events is 10 seconds.
11. For each of the values of Wmax below, find the change in TCP Cubic's cwnd over one 100 ms RTT at each of the following points:
i. Immediately after the previous loss event, when t = 0.
ii. At the midpoint of the tooth, when t=K/2
iii. At the point when cwnd has returned to Wmax , at t=K
12. Suppose a TCP Reno connection is competing with a TCP Cubic connection. There is no other traffic.
All losses are synchronized. In this setting, once the steady state is reached, the cwnd graphs for one tooth
will look like this:
Let c be the maximum cwnd of the TCP Cubic connection (c = Wmax) and let r be the maximum of the TCP Reno connection. Let M be the network ceiling, so a loss occurs when c+r reaches M. The width of the tooth for TCP Reno is (r/2)×RTT, where RTT is measured in seconds; the width of the TCP Cubic tooth is (c/2)^1/3. For the examples here, ignore the TCP-Friendly feature of TCP Cubic.
(a). If M = 200 and RTT = 50 ms = 0.05 sec, show that at the steady state r ≈ 130.4 and c = M-r ≈ 69.6.
(b). Find equilibrium r and c (to the nearest integer) for M=1000 and RTT = 50 ms. Hint: use of a
spreadsheet or scripting language makes trial-and-error quite practical.
(c). Find equilibrium r and c for M = 1000 and RTT = 100 ms.
(a). Suppose A responds to the loss using the original BWE of 1 packet/ms. How will A update its cwnd?
(b). How would A update its cwnd in response to the loss if it used a BWE of 1 packet / 2 ms?
(c). Suppose A calculates BWE as cwnd/RTT, where RTT is measured by monitoring one packet at a time;
A waits until the packet returns before measuring a new RTT. The current RTT-measuring packet happens
to be the one sent just before the packet that was eventually lost. What value for BWE will A have at the
time the loss is detected? Assume A discovers the loss through Fast Retransmit, just over one RTT later.
Between the idea
And the reality
Between the motion
And the act
Falls the Shadow
TS Eliot, The Hollow Men
Try to leave out the part that readers tend to skip.
Elmore Leonard, 10 Rules for Writing
In previous chapters, especially 14 Dynamics of TCP Reno, we have at times made simplifying assumptions
about TCP Reno traffic. In the present chapter we will look at actual TCP behavior, through simulation,
enabling us to explore the accuracy of some of these assumptions. The primary goal is to provide comparison
between idealized TCP behavior and the often-messier real behavior; a secondary goal is perhaps to shed
light on some of the issues involved in simulation. For example, in the discussion in 16.3 Two TCP Senders
Competing of competition between TCP Reno connections with different RTTs, we address several technical
issues that affect the relevance of simulation results.
Parts of this chapter may serve as a primer on using the ns-2 simulator, though a primer focused on the goal
of illuminating some of the basic operation and theory of TCP through experimentation. However, some of
the outcomes described may be of interest even to those not planning on designing their own simulations.
Simulation is almost universally used in the literature when comparing different TCP flavors for effective
throughput (for example, the graphs excerpted in 15.11 TCP CUBIC were created through simulation).
We begin this chapter by looking at a single connection and analyzing how well the TCP sawtooth utilizes
the bottleneck link. We then turn to competition between two TCP senders. The primary competition
example here is between TCP Reno connections with different RTTs. This example allows us to explore
the synchronized-loss hypothesis (14.3.4 Synchronized-Loss Hypothesis) and to introduce phase effects,
transient queue overflows, and other unanticipated TCP behavior. We also introduce some elements of
designing simulation experiments. The second example compares TCP Reno and TCP Vegas. We close
with a straightforward example of a wireless simulation.
The native environment for ns-2 (and ns-3) is linux. Perhaps the simplest approach for Windows users is to
install a linux virtual machine, and then install ns-2 under that. It is also possible to compile ns-2 under the
Cygwin system; an older version of ns-2 may still be available as a Cygwin binary.
To create an ns-2 simulation, we need to do the following (in addition to a modest amount of standard
housekeeping).
• define the network topology, including all nodes, links and router queuing rules
• create some TCP (or UDP) connections, called Agents, and attach them to nodes
• create some Applications (usually FTP for bulk transfer or telnet for intermittent random packet generation) and attach them to the agents
• start the simulation
Once started, the simulation runs for the designated amount of time, driven by the packets generated by
the Application objects. As the simulated applications generate packets for transmission, the ns-2 system
calculates when these packets arrive and depart from each node, and generates simulated acknowledgment
packets as appropriate. Unless delays are explicitly introduced, node responses such as forwarding a
packet or sending an ACK are instantaneous. That is, if a node begins sending a simulated packet from
node N1 to N2 at time T=1.000 over a link with bandwidth 60 ms per packet and with propagation delay
200 ms, then at time T=1.260 N2 will have received the packet. N2 will then respond at that same instant, if
a response is indicated, eg by enqueuing the packet or by forwarding it if the queue is empty.
Ns-2 does not necessarily require assigning IP addresses to every node, though this is possible in more
elaborate simulations.
Advanced use of ns-2 (and ns-3) often involves the introduction of randomization; for example, we will in
16.3 Two TCP Senders Competing introduce both random sliding-windows delays and traffic generators
that release packets at random times. While it is possible to seed the random-number generator so that
different runs of the same experiment yield different outcomes, we will not do this here, so the random-number generator will always produce the same sequence. A consequence is that the same ns-2 script
should yield exactly the same result each time it is run.
] are evaluated.
As in unix-style shell scripting, the value of a variable X is $X; the name X (without the $) is used when
setting the value (in Perl and PHP, on the other hand, many variable names begin with $, which is included
both when evaluating and setting the variable). Comments are on lines beginning with the # character.
Comments can not be appended to a line that contains a statement (although it is possible first to start a new
logical line with ;).
Objects in the simulation are generally created using built-in constructors; the constructor in the line below
is the part in the square brackets (recall that the brackets must enclose an expression to be evaluated):
set tcp0 [new Agent/TCP/Reno]
Object attributes can then be assigned values; for example, the following sets the data portion of the packets
in TCP connection tcp0 to 960 bytes:
$tcp0 set packetSize_ 960
Object attributes are retrieved using set without a value; the following assigns variable ack0 the current
value of the ack_ field of tcp0:
set ack0 [$tcp0 set ack_]
The goodput of a TCP connection is, properly, the number of application bytes received. This differs from
the throughput (the total bytes sent) in two ways: the latter includes both packet headers and retransmitted
packets. The ack0 value above includes no retransmissions; we will occasionally refer to it as goodput
in this sense.
The smaller bandwidth on the R–B link makes it the bottleneck. The default TCP packet size in ns-2 is 1000 bytes, so the bottleneck bandwidth is nominally 100 packets/sec or 0.1 packets/ms. The bandwidth×RTTnoLoad product is 0.1 packets/ms × 120 ms = 12 packets. Actually, the default size of 1000 bytes refers to the data segment, and there are an additional 40 bytes of TCP and IP header. We therefore set packetSize_ to 960 so the actual transmitted size is 1000 bytes; this makes the bottleneck bandwidth exactly 100 packets/sec.
We want the router R to have a queue capacity of 6 packets, plus the one currently being transmitted; we
set queue-limit = 7 for this. We create a TCP connection between A and B, create an ftp sender on top
that, and run the simulation for 20 seconds. The nodes A, B and R are named; the links are not.
The ns-2 default maximum window size is 20; we increase that to 100 with $tcp0 set window_ 100;
otherwise we will see an artificial cap on the cwnd growth (in the next section we will increase this to
65000).
The script itself is in a file basic1.tcl, with the 1 here signifying a single sender.
# basic1.tcl simulation: A---R---B
#Create a simulator object
set ns [new Simulator]
#Open the nam file basic1.nam and the variable-trace file basic1.tr
set namfile [open basic1.nam w]
$ns namtrace-all $namfile
set tracefile [open basic1.tr w]
$ns trace-all $tracefile
#Define a finish procedure
proc finish {} {
    global ns namfile tracefile
    $ns flush-trace
    close $namfile
    close $tracefile
    exit 0
}
#Create the network nodes
set A [$ns node]
set R [$ns node]
set B [$ns node]
#Create a duplex link between the nodes
$ns duplex-link $A $R 10Mb 10ms DropTail
$ns duplex-link $R $B 800Kb 50ms DropTail
# The queue size at $R is to be 7, including the packet being sent
$ns queue-limit $R $B 7
# some hints for nam
# color packets of flow 0 red
$ns color 0 Red
$ns duplex-link-op $A $R orient right
$ns duplex-link-op $R $B orient right
$ns duplex-link-op $R $B queuePos 0.5
# Create a TCP sending agent and attach it to A
set tcp0 [new Agent/TCP/Reno]
# We make our one-and-only flow be flow 0
$tcp0 set class_ 0
$tcp0 set window_ 100
$tcp0 set packetSize_ 960
$ns attach-agent $A $tcp0
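The remainder of the script does not appear above; a minimal completion consistent with the description earlier in this section (a TCP sink at B, an ftp sender on top of tcp0, and a 20-second run) would look like the following, where the names end0 and ftp0 are illustrative:
# Create a TCP sink and attach it to B, then connect sender and sink
set end0 [new Agent/TCPSink]
$ns attach-agent $B $end0
$ns connect $tcp0 $end0
# Create the ftp sender on top of the TCP connection
set ftp0 [new Application/FTP]
$ftp0 attach-agent $tcp0
# Run for 20 seconds
$ns at 0.0 "$ftp0 start"
$ns at 20.0 "finish"
$ns run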
After running this script, there is no command-line output (because we did not ask for any); however, the
files basic1.tr and basic1.nam are created. Perhaps the simplest thing to do at this point is to view the
animation with nam, using the command nam basic1.nam.
In the animation we can see slow start at the beginning, as first one, then two, then four and then eight
packets are sent. A little past T=0.7, we can see a string of packet losses. This is visible in the animation
as a tumbling series of red squares from the top of R's queue. After that, the TCP sawtooth takes over;
we alternate between the cwnd linear-increase phase (congestion avoidance), packet loss, and threshold
slow start. During the linear-increase phase the bottleneck link is at first incompletely utilized; once the
bottleneck link is saturated the router queue begins to build.
Slow start is at the left edge. Unbounded slow start runs until about T=0.75, when a timeout occurs; bounded
slow start runs from about T=1.2 to T=1.8. After that, all losses have been handled with fast recovery (we
can tell this because cwnd does not drop below half its previous peak). The first three teeth following slow
start have heights (cwnd peak values) of 20.931, 20.934 and 20.934 respectively; when the simulation is
extended to 1000 seconds all subsequent peaks have exactly the same height, cwnd = 20.935. The spacing
between the peaks is also constant, 1.946 seconds.
Because cwnd is incremented by ns-2 after each arriving ACK as described in 13.2.1 Per-ACK Responses,
during the linear-increase phase there are a great many data points jammed together; the bunching effect is
made stronger by the choice here of a large-enough dot size to make the slow-start points clearly visible.
This gives the appearance of continuous line segments. Upon close examination, these line segments are
slightly concave, as discussed in 15.3 Highspeed TCP, due to the increase in RTT as the queue fills.
Individual flights of packets can just be made out at the lower-left end of each tooth, especially the first.
1. r for received, d for dropped, + for enqueued, - for dequeued. Every arriving packet is enqueued,
even if it is immediately dequeued. The third packet above was the first dropped packet in the entire
simulation.
2. the time, in seconds.
3. the number of the sending node, in the order of node definition and starting at 0. If the first field was
+, - or d, this is the number of the node doing the enqueuing, dequeuing or dropping. Events
beginning with - represent this node sending the packet.
4. the number of the destination node. If the first field was r, this record represents the packet's arrival at this node.
5. the protocol.
6. the packet size, 960 bytes of data (as we requested) plus 20 of TCP header and 20 of IP header.
7. some TCP flags, here represented as ------- because none of the flags are set. Flags include E
and N for ECN and A for reduction in the advertised winsize.
8. the flow ID. Here we have only one: flow 0. This value can be set via the fid_ variable in the Tcl
source file; an example appears in the two-sender version below. The same flow ID is used for both
directions of a TCP connection.
9. the source node (0.0), in form (node . connectionID). ConnectionID numbers here are simply an
abstraction for connection endpoints; while they superficially resemble port numbers, the node in
question need not even simulate IP, and each connection has a unique connectionID at each end.
ConnectionID numbers start at 0.
10. the destination node (2.0), again with connectionID.
11. the packet sequence number as a TCP packet, starting from 0.
12. a packet identifier uniquely identifying this packet throughout the simulation; when a packet is forwarded on a new link it keeps its old sequence number but gets a new packet identifier.
The three trace lines above represent the arrival of packet 28 at R, the enqueuing of packet 28, and then the
dropping of the packet. All these happen at the same instant.
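For reference, such an r/+/d sequence for packet 28 at R would look something like the following; the timestamp and the packet identifier (43) here are illustrative of the format only, not copied from the actual tracefile:
r 0.71568 0 1 tcp 1000 ------- 0 0.0 2.0 28 43
+ 0.71568 1 2 tcp 1000 ------- 0 0.0 2.0 28 43
d 0.71568 1 2 tcp 1000 ------- 0 0.0 2.0 28 43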
Mixed in with the event records are variable-trace records, indicating a particular variable has been
changed. Here are two examples from t=0.3833:
0.38333  0  0  2  0  ack_ 3
0.38333  0  0  2  0  cwnd_ 5.000
This is a slightly better estimate of goodput. In very long simulations, however, this (or any other) byte
count will wrap around long before any of the packet counters wrap around.
In the example above every packet event was traced, a consequence of the line
$ns trace-all $trace
We could instead have asked only to trace particular links. For example, the following line would request
tracing for the bottleneck (RB) link:
$ns trace-queue $R $B $trace
This is often useful when the overall volume of tracing is large, and we are interested in the bottleneck link
only. In long simulations, full tracing can increase the runtime 10-fold; limiting tracing only to what is
actually needed can be quite important.
Like most real implementations, the ns-2 implementation of TCP increments cwnd (cwnd_ in the tracefile) by 1/cwnd on each new ACK (13.2.1 Per-ACK Responses). An additional packet is sent by A whenever cwnd is increased this way past another whole number; that is, whenever floor(cwnd) increases. At
T=3.95181, cwnd_ was incremented to 20.001, triggering the double transmission of Data[253] and the
doomed Data[254]. At this point the RTT is around 190 ms.
The loss of Data[254] is discovered by Fast Retransmit when the third dupACK[253] arrives. The first
ACK[253] arrives at A at T=4.141808, and the dupACKs arrive every 10 ms, clocked by the 10 ms/packet
transmission rate of R. Thus, A detects the loss at T=4.171808; at this time we see cwnd_ reduced by half
to 10.465; the tracefile times for variables are only to 5 decimal places, so this is recorded as
4.17181  cwnd_ 10.465
That represents an elapsed time from when Data[254] was dropped of 207.7 ms, more than one RTT. As
described in 13.8 Single Packet Losses, however, A stopped incrementing cwnd_ when the first ACK[253]
arrived at T=4.141808. The value of cwnd_ at that point is only 20.931, not quite large enough to trigger
transmission of another back-to-back pair and thus eventually a second packet drop.
The bottleneck link here is 800 Kbps, or 100 KB/sec, or 10 ms/packet, so these propagation-delay changes mean
a round-trip transit capacity of 30 packets (31 if we include the bandwidth delay at R). In the table below,
we run the simulation while varying the queue-limit parameter from 3 to 30. The simulations run for
1000 seconds, to minimize the effect of slow start. Tracing is disabled to reduce runtimes. The received
column gives the number of distinct packets received by B; if the link utilization were 100% then in 1,000
seconds B would receive 100,000 packets.
queue_limit    received    utilization %, R–B
     3           79767          79.8
     4           80903          80.9
     5           83313          83.3
     8           87169          87.2
    10           89320          89.3
    12           91382          91.4
    16           94570          94.6
    20           97261          97.3
    22           98028          98.0
    26           99041          99.0
    30           99567          99.6
In ns-2, every arriving packet is first enqueued, even if it is immediately dequeued, and so queue-limit
cannot actually be zero. A queue-limit of 1 or 2 gives very poor results, probably because of problems
with slow start. The run here with queue-limit = 3 is not too far out of line with the 75% predicted by
theory for a queue-limit close to zero. When queue-limit is 10, then cwnd will range from 20 to
40, and the link-unsaturated and queue-filling phases should be of equal length. This leads to a theoretical
link utilization of about (75%+100%)/2 = 87.5%; our measurement here of 89.3% is in good agreement. As
queue-limit continues to increase, the link utilization rapidly approaches 100%, again as expected.
16.2.6.1 Link utilization measurement
In the experiment above we estimated the utilization of the R–B link by the number of distinct packets arriving at B. But packet duplicate transmissions sometimes occur as well (see 16.2.6.4 Packets that are delivered twice); these are part of the R–B link utilization but are hard to estimate (nominally, most packets retransmitted by A are dropped by R, but not all).
If desired, we can get an exact value of the R–B link utilization through analysis of the ns-2 trace file. In this file R is node 1 and B is node 2 and our flow is flow 0; we look for all dequeue ("-") event lines sent from node 1 to node 2 with size at least 1000 bytes and flow ID 0, and count them with a short Python script:
#!/usr/bin/python3
import nstrace
import sys

def link_count(filename):
    SEND_NODE = 1
    DEST_NODE = 2
    FLOW = 0
    count = 0
    nstrace.nsopen(filename)
    while not nstrace.isEOF():
        if nstrace.isEvent():
            (event, time, sendnode, dest, dummy, size, dummy, flow, dummy, dummy, dummy, dummy) = nstrace.getEvent()
            if (event == "-" and sendnode == SEND_NODE and dest == DEST_NODE and size >= 1000 and flow == FLOW):
                count += 1
        else:
            nstrace.skipline()
    print ("packet count:", count)

link_count(sys.argv[1])
For completeness, here is the same program implemented in the Awk scripting language.
BEGIN   {count=0; SEND_NODE=1; DEST_NODE=2; FLOW=0}
$1 == "-" { if ($3 == SEND_NODE && $4 == DEST_NODE && $6 >= 1000 && $8 == FLOW) {
        count++;
    }
}
END     {print count;}
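Assuming the Awk version is saved in a file named, say, linkcount.awk (the filename is arbitrary), it can be run with awk -f linkcount.awk basic1.tr.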
If we want to run this simulation with the queue-limit parameter ranging from 0 to 10, a simple shell script is
queue=0
while [ $queue -le 10 ]
do
    ns basic1.tcl $queue
    queue=$(expr $queue + 1)
done
If we want to pass multiple parameters on the command line, we use lindex to separate out arguments
from the $argv string; the first argument is at position 0 (in bash and awk scripts, by comparison, the
first argument is $1). For two optional parameters, the first representing queuesize and the second
representing endtime, we would use
if { $argc >= 1 } {
    set queuesize [expr [lindex $argv 0]]
}
if { $argc >= 2 } {
    set endtime [expr [lindex $argv 1]]
}
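The script can then be invoked as, for example, ns basic1.tcl 10 500, setting queuesize to 10 and endtime to 500; these particular values are illustrative. The next script computes the time-averaged queue size at R (node 1) from the tracefile.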
#!/usr/bin/python3
import nstrace
import sys

def queuesize(filename):
    QUEUE_NODE = 1
    nstrace.nsopen(filename)
    sum = 0.0
    size = 0
    prevtime = 0
    while not nstrace.isEOF():
        if nstrace.isEvent():           # counting regular trace lines
            (event, time, sendnode, dnode, proto, dummy, dummy, flow, dummy, dummy, seqno, pktid) = nstrace.getEvent()
            if (sendnode != QUEUE_NODE): continue
            if (event == "r"): continue
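            # The rest of the loop body does not appear above; this is a sketch of the
            # missing time-weighted accumulation (the original's exact statements may differ)
            sum += size * (time - prevtime)      # queue held 'size' packets over this interval
            prevtime = time
            if event == "+":   size += 1         # enqueue at R
            elif event == "-": size -= 1         # dequeue (transmission begins)
            elif event == "d": size -= 1         # drop (the packet was enqueued first)
        else:
            nstrace.skipline()
    print("average queue size:", sum / time)

queuesize(sys.argv[1])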
The answer we get for the average queue size is about 23.76, which is in good agreement with our theoretical
value of 22.5.
16.2.6.4 Packets that are delivered twice
Every dropped TCP packet is ultimately transmitted twice, but classical TCP theory suggests that relatively
few packets are actually delivered twice. This is pretty much true once the TCP sawtooth phase is reached,
but can fail rather badly during slow start.
The following Python script will count packets delivered two or more times. It uses a dictionary, COUNTS,
which is indexed by sequence numbers.
#!/usr/bin/python3
import nstrace
import sys

def dup_counter(filename):
    SEND_NODE = 1
    DEST_NODE = 2
    FLOW = 0
    count = 0
    COUNTS = {}
    nstrace.nsopen(filename)
    while not nstrace.isEOF():
        if nstrace.isEvent():
            (event, time, sendnode, dest, dummy, size, dummy, flow, dummy, dummy, seqno, dummy) = nstrace.getEvent()
            if (event == "r" and dest == DEST_NODE and size >= 1000 and flow == FLOW):
                if (seqno in COUNTS):
                    COUNTS[seqno] += 1
                else:
                    COUNTS[seqno] = 1
        else:
            nstrace.skipline()
    for seqno in sorted(COUNTS.keys()):
        if (COUNTS[seqno] > 1): print(seqno, COUNTS[seqno])

dup_counter(sys.argv[1])
When run on the basic1.tr file above, it finds 13 packets delivered twice, with TCP sequence numbers 43,
45, 47, 49, 51, 53, 55, 57, 58, 59, 60, 61 and 62. These are sent the second time between T=1.437824 and
T=1.952752; the first transmissions are at times between T=0.83536 and T=1.046592. If we look at our
cwnd-v-time graph above, we see that these first transmissions occurred during the gap between the end of
the unbounded slow-start phase and the beginning of threshold-slow-start leading up to the TCP sawtooth.
Slow start, in other words, is messy.
16.2.6.5 Loss rate versus cwnd: part 1
If we run the basic1.tcl simulation above until time 1000, there are 94915 packets acknowledged and 512 loss events. This yields a loss rate of p = 512/94915 = 0.00539, and so by the formula of 14.5 TCP Reno loss rate versus cwnd we should expect the average cwnd to be about 1.225/√p ≈ 16.7. The true average cwnd is the number of packets sent divided by the elapsed time in RTTs, but as RTTs are not constant here (they get significantly longer as the queue fills), we turn to an approximation. From 16.2.1 Graph of cwnd v time we saw that the peak cwnd was 20.935; the mean cwnd should thus be about 3/4 of this, or 15.7. While not perfect, agreement here is quite reasonable.
See also 16.4.3 Loss rate versus cwnd: part 2.
Broadly speaking, the simulations here will demonstrate that the longer-delay BD connection receives less
bandwidth than the AD connection, but not quite so much less as was predicted in 14.3 TCP Fairness with
Synchronized Losses. The synchronized-loss hypothesis increasingly fails as the BR delay increases, in that
the BD connection begins to escape some of the packet-loss events experienced by the AD connection.
We admit at the outset that we will not, however, obtain a quantitative answer to the question of bandwidth
allocation. In fact, as we shall see, we run into some difficulties even formulating the proper question. In
the course of developing the simulation, we encounter several potential problems:
# Define a finish procedure that prints out progress for each connection
proc finish {} {
    global ns tcp0 tcp1 end0 end1 queuesize trace delayB overhead RTTNL
    set ack0 [$tcp0 set ack_]
    set ack1 [$tcp1 set ack_]
    # counts of packets *received*
    set recv0 [expr round ( [$end0 set bytes_] / 1000.0)]
    set recv1 [expr round ( [$end1 set bytes_] / 1000.0)]
    # see numbers below in topology-creation section
    set rttratio [expr (2.0*$delayB+$RTTNL)/$RTTNL]
    # actual ratio of throughputs fast/slow; the 1.0 forces floating point
    set actualratio [expr 1.0*$recv0/$recv1]
    # theoretical ratio fast/slow with squaring; see text for discussion of ratio1 and ratio2
    set rttratio2 [expr $rttratio*$rttratio]
    set ratio1 [expr $actualratio/$rttratio]
    set ratio2 [expr $actualratio/$rttratio2]
    set outstr [format "%f %f %d %d %f %f %f %f %f" $delayB $overhead $recv0 $recv1 $rttratio $rttratio2 $actualratio $ratio1 $ratio2]
    puts stdout $outstr
    $ns flush-trace
    close $trace
    exit 0
}
variable        meaning
delayB          Additional B–R propagation delay, compared to A–R delay
overhead        overhead; a value of 0 means this is effectively disabled
recv0           Count of cumulative A–D packets received at D (that is, goodput)
recv1           Count of cumulative B–D packets received at D (again, goodput)
rttratio        RTT_ratio: B–D/A–D (long/short)
rttratio2       The square of the previous value
actualratio     Actual ratio of A–D/B–D goodput, that is, 14863/14771 (note change in order versus RTT_ratio)
ratio1          actual_ratio/RTT_ratio (here 1.006228)
ratio2          actual_ratio/RTT_ratio2 (here 1.006228)
The one-way A–D propagation delay is 110 ms; the bandwidth delays as noted above amount to 11.44 ms, 10 ms of which is on the R–D link. This makes the A–D RTTnoLoad about 230 ms. The B–D delay is, for the time being, the same, as delayB = 0. We set RTTNL = 220, and calculate the RTT ratio (within Tcl, in the finish() procedure) as (2×delayB + RTTNL)/RTTNL. We really should use RTTNL = 230 instead of 220 here, but 220 will be closer when we later change bottleneckBW to 8.0 Mbit/sec rather than 0.8, below. Either way, the difference is modest.
Note that the above RTT calculations are for when the queue at R is empty; when the queue contains 20 packets this adds another 200 ms to the A–D and B–D times (the reverse direction is unchanged). This may make a rather large difference to the RTT ratio, and we will address it below, but it does not matter yet.
delayB   RTT ratio   A–D goodput   B–D goodput   (RTT ratio)²   goodput ratio
   0       1.000        14863         14771         1.000          1.006
   5       1.045         4229         24545         1.093          0.172
  23       1.209        22142          6879         1.462          3.219
  24       1.218        17683          9842         1.484          1.797
  25       1.227        14958         13754         1.506          1.088
  26       1.236        24034          5137         1.529          4.679
  35       1.318        16932         11395         1.738          1.486
  36       1.327        25790          3603         1.762          7.158
  40       1.364        20005          8580         1.860          2.332
  41       1.373        24977          4215         1.884          5.926
  45       1.409        18437         10211         1.986          1.806
  60       1.545        18891          9891         2.388          1.910
  61       1.555        25834          3135         2.417          8.241
  85       1.773        20463          8206         3.143          2.494
 110       2.000        22624          5941         4.000          3.808
For a few rows, such as the first and the last, agreement between the last two columns is quite good. However,
there are some decidedly anomalous cases in between (particularly the numbers in bold). As delayB
changes from 35 to 36, the goodput ratio jumps from 1.486 to 7.158. Similar dramatic changes in goodput
appear as delayB ranges through the sets {23, 24, 25, 26}, {40, 41, 45}, and {60, 61}. These values were,
admittedly, specially chosen by trial and error to illustrate relatively discontinuous behavior of the goodput
ratio, but, still, what is going on?
The orange curve represents (RTT_ratio)²; according to 14.3 TCP Fairness with Synchronized Losses we would expect the blue and orange curves to be about the same. When the blue curve is high, the slower B–D connection is proportionately at an unexpected disadvantage. Seldom do phase effects work in favor of the B–D connection, because A's phase here is quite small (0.144, based on A's exact RTTnoLoad of 231.44 ms). (If we change the A–R propagation delay (basedelay) to 12 ms, making A's phase 0.544, the blue curve oscillates somewhat more evenly both above and below the orange curve, but still with approximately the same amplitude.)
Recall that a 5 ms change in delayB corresponds to a 10 ms change in the B–D connection's RTT, equal to router R's transmission time. What is happening here is that as the B–D connection's RTT increases through a range of 10 ms, it cycles through from phase-effect neutrality to phase-effect deficit and back.
The dark-blue curve for overhead = 0 is wildly erratic due to phase effects; the light-blue curve for
overhead = 0.005 has about half the oscillation. Even the light-green overhead = 0.01 curve exhibits
some wiggling; it is not until overhead = 0.02 for the darker green curve that the graph really settles
down. We conclude that the latter two values for overhead are quite effective at mitigating phase effects.
One crude way to quantify the degree of graph oscillation is by calculating the mean deviation; the respective
deviation values for the curves above are 1.286, 0.638, 0.136 and 0.090.
Recall that the time to send one packet on the bottleneck link is 0.01 seconds, and that the average delay
introduced by overhead d is d/2; thus, when overhead is 0.02 each connection would, if acting alone,
have an average sender delay equal to the bottleneck-link delay (though overhead delay is like propagation
delay, and so a high overhead will not prevent queue buildup).
Compared to the 10-ms-per-packet R–D transmission time, average delays of 5 and 10 ms per flow (overhead of 0.01 and 0.02 respectively) may not seem disproportionate. They are, however, quite large when compared to the 1.0 ms bandwidth delay on the A–R and B–R legs. Generally, if the goal is to reduce phase effects then overhead should be comparable to the bottleneck-router per-packet transmission time. Using overhead > 0 does increase the RTT, but in this case not considerably.
We conclude that using overhead to break the synchronization that leads to phase effects appears to have worked, at least in the sense that with the value of overhead = 0.02 the goodput ratio increases more-or-less monotonically with increasing delayB.
The problem with using overhead this way is that it does not correspond to any physical network delay
or other phenomenon. Its use here represents a decidedly ad hoc strategy to introduce enough randomization
that phase effects disappear.
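In the scripts used here, overhead is an attribute of the TCP sending agents; assuming the standard ns-2 attribute name overhead_, enabling it for both senders would look like
$tcp0 set overhead_ 0.02
$tcp1 set overhead_ 0.02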
in the corresponding bulk (ftp) traffic. The size of the telnet packets sent is determined by the TCP agent's usual packetSize_ attribute.
For each telnet connection we create an Application/Telnet object and set its attribute interval_; in
the script fragment below this is set to tninterval. This represents the average packet spacing in seconds;
transmissions are then scheduled according to an exponential random distribution with interval_ as its
mean.
Actual (simulated) transmissions, however, are also constrained by the telnet connection's sliding window. It is quite possible that the telnet application releases a new packet for transmission, but it cannot yet be sent because the telnet TCP connection's sliding window is momentarily frozen, waiting for the next ACK. If the telnet packets encounter congestion and the interval_ is small enough, then the sender may have a backlog of telnet packets in its outbound queue that are waiting for the sliding window to advance enough to permit their departure.
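A sketch of such a fragment, assuming a TCP agent $tcp10 has already been created and attached for the A–D telnet connection (the names tn0 and tcp10 are illustrative), is
set tn0 [new Application/Telnet]
$tn0 attach-agent $tcp10
$tn0 set interval_ $tninterval
$ns at 0.0 "$tn0 start"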
Real telnet packets are most often quite small; in the simulations here we use an uncharacteristically large size of 210 bytes, leading to a total packet size of 250 bytes after the 40-byte simulated TCP/IP header is attached. We denote the latter number by actualSize. See exercise 9.
The bandwidth theoretically consumed by the telnet connection is simply actualSize/$tninterval; the actual bandwidth may be lower if the telnet packets are encountering congestion as noted above. It is convenient to define an attribute tndensity that denotes the fraction of the R–D link's bandwidth that the Telnet application will be allowed to use, eg 2%. In this case we have
$tninterval = actualSize / ($tndensity * $bottleneckBW)
For example, if actualSize = 250 bytes, and $bottleneckBW corresponds to 1000 bytes every 10 ms, then the telnet connection could saturate the link if its packets were spaced 2.5 ms apart. However, if we want the telnet connection to use up to 5% of the bottleneck link, then $tninterval should be 2.5/0.05 = 50 ms.
As we hoped, the oscillation does indeed substantially flatten out as the telnet density increases from 0 (dark
blue) to 1% (bright green); the mean deviation falls from 1.36 to 0.084. So far the use of telnet is very
promising.
Unfortunately, if we continue to increase the telnet density past 1%, the oscillation increases again, as can
be seen in the next graph:
The first two curves, plotted in green, correspond to the good densities of the previous graph, with the
vertical axis stretched by a factor of two. As densities increase, however, phase-effect oscillation returns,
and the curves converge towards the heavier red curve at the top.
What appears to be happening is that, beyond a density of 1% or so, the limiting factor in telnet transmission
becomes the telnet sliding window rather than the random traffic generation, as mentioned in the third
paragraph of this section. Once the sliding window becomes the limiting factor on telnet packet transmission,
the telnet connections behave much like and become synchronized with their corresponding bulk-traffic
ftp connections. At that point their ability to moderate phase effects is greatly diminished, as actual packet
departures no longer have anything to do with the exponential random distribution that generates the packets.
Despite this last issue, the fact that small levels of random traffic can lead to large reductions in phase effects
can be taken as evidence that, in the real world, where other traffic sources are ubiquitous, phase effects will
seldom be a problem.
The four reddish-hued curves represent the result of using telnet with a packet size of 250, at densities
ranging from 0.5% to 5%. These may be compared with the three green curves representing the use of
overhead, with values 0.01, 0.015 and 0.02. While telnet with a density of 1% is in excellent agreement
with the use of overhead, it is also apparent that smaller telnet densities give a larger ratio1 while larger
densities give a smaller one. This raises the awkward possibility that the exact mechanism by which we introduce
randomization may have a material effect on the fairness ratios that we ultimately observe. There is no
right answer here; different randomization sources or levels may simply lead to differing fairness results.
The advantage of using telnet for randomization is that it represents an actual network phenomenon, unlike
overhead. The drawback to using telnet is that the effect on the bulk-traffic goodput ratio is, as the graph
above shows, somewhat sensitive to the exact value chosen for the telnet density.
In the remainder of this chapter, we will continue to use the overhead model, for simplicity, though we
do not claim this is a universally appropriate approach.
The first observation to make is that ratio1 is generally too large and ratio2 is generally too small, when
compared to 1.0. In other words, neither is an especially good fit. This appears to be a fairly general
phenomenon, in both simulation and the real world: TCP Reno throughput ratios tend to be somewhere
between the corresponding RTT ratio and the square of the RTT ratio.
The synchronized-loss hypothesis led to the prediction (14.3 TCP Fairness with Synchronized Losses) that the goodput ratio would be close to RTT_ratio². As this conclusion appears to fail, the hypothesis too must fail, at least to a degree: it must be the case that not all losses are shared.
Throughout the graph we can observe a fair amount of noise variation. Most of this variation appears
unrelated to the 5 ms period we would expect for phase effects (as in the graph at 16.3.4.2 Two-sender
phase effects). However, it is important to increment delayB in amounts much smaller than 5 ms in order
to rule this out, hence the increment of 1.0 here.
There are strong peaks at delayB = 110, 220 and 330. These delayB values correspond to increasing
the RTT by integral multiples 2, 3 and 4 respectively, and the peaks are presumably related to some kind of
higher-level phase effect.
16.3.10.1 Possible models
If the synchronized-loss fairness model fails, with what do we replace it? Here are two ad hoc options. First,
we can try to fit a curve of the form
goodput_ratio = K×(RTT_ratio)^α
to the above data. If we do this, the value for the exponent α comes out to about 1.58, sort of a compromise between ratio2 (α=2) and ratio1 (α=1), although the value of the exponent here is somewhat sensitive to the details of the simulation.
An entirely different curve to fit to the data, based on the appearance in the graph that ratio2 ≈ 0.5 for delayB past 120, is
goodput_ratio = 0.5 × (RTT_ratio)²
We do not, however, possess for either of these formulas a model for the relative losses in the two primary
TCP connections that is precise enough to offer an explanation of the formula (though see the final paragraph
of 16.4.2.2 Relative loss rates).
16.3.10.2 Higher bandwidth and link utilization
One consequence of raising the bottleneck bandwidth is that total link utilization drops, for delayB = 0, to
80% of the bandwidth of the bottleneck link, from 98%; this is in keeping with the analysis of 13.7 TCP
and Bottleneck Link Utilization. The transit capacity is 220 packets and another 20 can be in the queue at
R; thus an ideal sawtooth would oscillate between 120 and 240 packets. We do have two senders here, but
when delayB = 0 most losses are synchronized, meaning the two together behave like one sender with an
additive-increase value of 2. As cwnd varies linearly from 120 to 240, it spends 5/6 of the time below the transit capacity of 220 packets, during which period the average cwnd is (120+220)/2 = 170, and 1/6 of the time with the path 100% utilized; the weighted average estimating total goodput is thus (5/6)×(170/220) + (1/6)×1 ≈ 81%.
When delayB = 400, combined TCP Reno goodput falls to about 51% of the bandwidth of the bottleneck link. This low utilization, however, is indeed related to loss and timeouts; the corresponding combined goodput percentage for SACK TCP (which as we shall see in 16.4.2 SACK TCP and Avoiding Loss Anomalies is much better behaved) is 68%.
sec to over 1.0 sec as delayB increases from 0 to 400 ms. In order to cover coarse timeouts, a granularity
of from two to three seconds often seems to work well for packet drops.
If we are trying to count losses to estimate the loss rate as in the formula cwnd = 1.225/√p of 14.5 TCP Reno loss rate versus cwnd, then we should count every loss response separately; the argument in 14.5 TCP Reno loss rate versus cwnd depended on counting all loss responses. The difference between one fast-recovery response and two in rapid succession is that in the latter case cwnd is halved twice, to about a quarter of its original value.
However, if we are interested in whether or not losses between two connections are synchronized, we need
again to make use of granularity to make sure two close losses are counted as one. In this setting, a
granularity of one to two seconds is often sufficient.
16.4.1.1 delayB = 0
We start with the equal-RTTs graph, that is, delayB = 0. In this figure the teeth (loss events) are almost
completely synchronized; the only unsynchronized loss events are the green flow's losses at T=100 and
at about T=205. The two cwnd graphs, though, do not exactly move in lockstep. The red flow has three
coarse timeouts (where cwnd drops to 0), at about T=30, T=125 and T=145; the green flow has seven coarse
timeouts.
The red graph gets a little ahead of the green in the interval 50-100, despite synchronized losses there. Just
before T=50 the green graph has a fast-recovery response followed by a coarse timeout; the next three green
losses also involve coarse timeouts. Despite perfect loss synchronization in the range from T=40 to T=90,
the green graph ends up set back for a while because its three loss events all involve coarse timeouts while none of the red graph's do.
16.4.1.2 delayB = 25
In this delayB = 25 graph, respective packet losses for the red and green flows are marked along the top,
and the cwnd graphs are superimposed over a graph of the averaged queue utilization in gold. The time
scale for queue-utilization averaging is about one RTT here. The queue graph is scaled vertically (by a
factor of 8) so the queue values (maximum 20) are numerically comparable to the cwnd values (the transit
capacity is about 230).
There is one large red tooth from about T=40 to T=90 that corresponds to three green teeth; from about
T=220 to T=275 there is one red tooth corresponding to two green teeth. Aside from these two points,
representing three isolated green-only losses, the red and green teeth appear quite well synchronized.
We also see evidence of loss-response clusters. At around T=143 the large green tooth peaks; halfway
down there is a little notch ending at T=145 that represents a fast-recovery loss event interpreted by TCP
as distinct. There is actually a third event, as well, representing a coarse timeout shortly after T=145, and
then a fourth fast-recovery event at about T=152 which we will examine shortly. At T≈80, the red tooth has
a fast-recovery loss event followed very shortly by a coarse timeout; this happens for both flows at about
T=220.
Overall, the red path has 11 teeth if we use a tooth-counting granularity of 3 seconds, and 8 teeth if the
granularity is 5 seconds. We do not count anything that happens in the slow-start phase, generally before
T=3.0, nor do we count the tooth at T=300 when the graph ends.
The slope of the green teeth is slightly greater than the slope of the red teeth, representing the longer RTT
for the red connection.
As for the queue graph, there is perhaps more noise than expected, but generally the right edges of the teeth (the TCP loss responses) are very well aligned with peaks representing the queue filling up. Recall that the transit capacity for flow1 here is about 230 packets and the queue capacity is about 20; we therefore in general expect the sum of the two cwnds to range between 125 and 250, and that the queue should mostly remain empty until the sum reaches 230. We can indeed confirm visually that in most cases the tooth peaks do indeed add up to about 250.
16.4.1.3 Transient queue peaks
The loss in the graph above at around T=152 is a little peculiar; this is fully 10 seconds after the primary
tooth peak that preceded it. Let us zoom in on it a little more closely, this time without any averaging of the
queue-utilization graph.
There are two very sharp, narrow peaks in the queue graph, just before T=146 and just after T=152, each
causing packet drops for both flows. Neither peak, especially the latter one, appears to have much to do with
the gradual queue-filling and loss event at T=143. Neither one of these transient queue peaks is associated
with the sum of the cwnds approaching the maximum of about 250; at the T≈146 peak the sum of the cwnds is about 22+97 = 119 and at T≈152 the sum is about 21+42 = 63. ACKs for either sender cannot return from the destination D faster than the bottleneck-link rate of 1 packet/ms; this might suggest new packets from senders A and B cannot arrive at R faster than a combined rate of 1 packet/ms. What is going
on?
What is happening is that each flow sends a flight of about 20 packets, spaced 1 ms apart, but coincidentally
timed so they begin arriving at R at the same moment. The runup in the queue near T=152 occurs from
T=152.100 to the first drop at T=152.121. During this 21 ms interval, a flight of 20 packets arrive from
node A (flow 0), and a flight of 19 packets arrive from node B (flow 1). These 39 packets in 21 ms means
the queue utilization at R must increase by 39-21 = 18, which is sufficient to overflow the queue as it was
not quite empty beforehand. The ACK flights that triggered these data-packet flights were indeed spaced 1
ms apart, consistent with the bottleneck link, but the ACKs (and the data-packet flights that triggered those
ACKs) passed through R at quite different times, because of the 25-ms difference in propagation delay on
the AR and BR links.
Transient queue peaks like this complicate any theory of relative throughput based on gradually filling the
queue. Fortunately, while transient peaks themselves are quite common (as can be seen from the zoomed
graph above), only occasionally do they amount to enough to overflow the queue. And, in the examples
created here, the majority of transient overflow peaks (including the one analyzed above) are within a few
seconds of a TCP coarse timeout response and may have some relationship to that timeout.
16.4.1.4 delayB = 61
For this graph, note that the red and green loss events at T=130 are not quite aligned; the same happens at
T=45 and T=100.
We also have several multiple-loss-response clusters, at T=40, T=130, T=190 and T=230.
The queue graph gives the appearance of much more solid yellow; this is due to a large number of transient
queue spikes. Under greater magnification it becomes clear these spikes are still relatively sparse, however.
There are transient queue overflows at T=46 and T=48, following a normal overflow at T=44, at T=101
following a normal overflow at T=99, at T=129 and T=131 following a normal overflow at T=126, and at
T=191 following a normal overflow at T=188.
Here we have (without the queue data) a good example of highly unsynchronized teeth: the green graph has
12 teeth, after the start, while the red graph has five. But note that the red-graph loss events are a subset of
the green loss events.
The SACK TCP sender is specified in ns-2 via Agent/TCP/Sack1. The receiver must also be SACK-aware; this is done by creating the receiver with Agent/TCPSink/Sack1.
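A minimal Tcl sketch (the node names $A and $D follow the two-sender topology of this chapter; the agent names are illustrative):
set tcp0 [new Agent/TCP/Sack1]
set end0 [new Agent/TCPSink/Sack1]
$ns attach-agent $A $tcp0
$ns attach-agent $D $end0
$ns connect $tcp0 $end0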
When we switch to SACK TCP, the underlying fairness situation does not change much. Here is a graph
similar to that above in 16.3.10 Raising the Bandwidth. The run time is 3000 seconds, bottleneckBW
is 8 Mbps, and delayB runs from 0 to 400 in increments of 5. (We did not use an increment of 1, as in the
similar graph in 16.3.10 Raising the Bandwidth, because we are by now confident that phase effects have
been taken care of.)
Ratio1 is shown in blue and ratio2 in green. We again try to fit curves as in 16.3.10.1 Possible models above. For the exponential model, goodput_ratio = K×(RTT_ratio)^α, the value for the exponent α comes out this time to about 1.31, noticeably below TCP Reno's value of 1.57. There remains, however, considerable noise despite the less-frequent sampling interval, and it is not clear the difference is significant. The second model, goodput_ratio ≈ 0.5×(RTT_ratio)², still also appears to remain a good fit.
16.4.2.1 Counting teeth in Python
Using our earlier Python nstrace.py module, we can easily count the times cwnd_ is reduced; these events
correspond to loss responses. Anticipating more complex scenarios, we define a Python class flowstats
to hold the per-flow information, though in this particular case we know there are exactly two flows.
We count every time cwnd_ gets smaller; we also keep a count of coarse timeouts. As a practical matter,
two closely spaced reductions in cwnd_ are always due to a fast-recovery event followed about a second
later by a coarse timeout. If we want a count of each separate TCP response, we count them both.
We do not count anything until after STARTPOINT. This is to avoid counting losses related to slow start.
#!/usr/bin/python3
import nstrace
import sys
# counts all points where cwnd_ drops.
STARTPOINT = 3.0

class flowstats:
    def __init__(self):
        self.toothcount = 0
        self.prevcwnd = 0
        self.CTOcount = 0

def countpeaks(filename):
    global STARTPOINT
    nstrace.nsopen(filename)
    flow0 = flowstats()
    flow1 = flowstats()
    while not nstrace.isEOF():
        if nstrace.isVar():             # counting cwnd_ trace lines
            (time, snode, dummy, dummy, dummy, varname, cwnd) = nstrace.getVar()
            if (time < STARTPOINT): continue
            if varname != "cwnd_": continue
            if snode == 0: flow = flow0
            else: flow = flow1
            if cwnd < flow.prevcwnd:                    # count this as a tooth
                flow.toothcount += 1
                if cwnd == 1.0: flow.CTOcount += 1      # coarse timeout
            flow.prevcwnd = cwnd
        else:
            nstrace.skipline()
A more elaborate version of this script, set up to count clusters of teeth and clusters of drops, is available in
teeth.py. If a new cluster starts at time T, all teeth/drops by time T + granularity are part of the same cluster;
the next tooth/drop after T + granularity starts the next new cluster.
16.4.2.2 Relative loss rates
At this point we accept that the AD/BD throughput ratio is generally smaller than the value predicted by
the synchronized-loss hypothesis, and so the AD flow must have additional losses to account for this. Our
next experiment is to count the loss events in each flow, and to identify the loss events common to both
flows.
The following graph demonstrates the rise in AD losses as a percentage of the total, using SACK TCP. We
use the tooth-counting script from the previous section, with a granularity of 1.0 sec. With SACK TCP it is
rare for two drops in the same flow to occur close to one another; the granularity here is primarily to decide
when two teeth in the two different flows are to be counted as shared.
The blue region at the bottom represents the percentage of all loss-response events (teeth) that are shared
between both flows. The green region in the middle represents the percentage of all loss-response events
that apply only to the faster AD connection (flow 0); these rise to 30% very quickly and remain at about
50% as delayB ranges from just over 100 up to 400. The uppermost yellow region represents the loss
events that affected only the slower BD connection (flow 1); these are usually under 10%.
As a rough approximation, if we assume a 50/50 division between shared (blue) and flow-0-only (green) loss events, we have losscount0/losscount1 = 2. Applying the formula of 14.5.2 Unsynchronized TCP Losses, we get bandwidth0/bandwidth1 = (RTT_ratio)²/2, or ratio2 = 1/2. While it is not at all clear this will continue to hold as delayB continues to increase beyond 400, it does suggest a possible underlying explanation for the second formula of 16.3.10.1 Possible models.
In 14.5 TCP Reno loss rate versus cwnd we argued that the average value of cwnd was about K/√p, where the constant K was somewhere between 1.225 and 1.309. In 16.2.6.5 Loss rate versus cwnd: part 1 above we tested this for a single connection with very regular teeth; in this section we will test this hypothesis in the two-connections simulation. In order to avoid coarse timeouts, which were not included in the original model, we will use SACK TCP.
Over the lifetime of a connection, the average cwnd is the number of packets sent divided by the number
of RTTs. This simple relationship is slightly complicated by the fact that the RTT varies with the fullness
of the bottleneck queue, but, as before, this effect can be minimized if we choose models where RTTnoLoad
is large compared with the queuing delay. We will as before set bottleneckBW = 8.0; at this bandwidth queuing delays add less than 10% to RTTnoLoad. For the longer-path flow1, RTTnoLoad ≈ 220 + 2×delayB ms. We will for both flows use the appropriate RTTnoLoad as an approximation for RTT, and will use the number of packets acknowledged as an approximation for the number transmitted. These calculations give the true average cwnd values in the table below. The estimated average cwnd values are from the formula K/√p, where the loss ratio p is the total number of teeth, as counted earlier, divided by the number of that flow's packets.
With the relatively high value for bottleneckBW it takes a long simulation to get a reasonable number
of losses. The simulations used here ran 3000 seconds, long enough that each connection ended up with
100-200 losses.
Here is the data, for each flow, using K=1.225. The error columns represent the extent by which the ratio of the estimated average cwnd to the true average cwnd differs from unity; these error values are reasonably close to zero.
delayB   true avg cwnd0   est avg cwnd0   error0   true avg cwnd1   est avg cwnd1   error1
   0          89.2             93.5        4.8%         87.7             92.2        5.1%
  10          92.5             95.9        3.6%         95.6             99.2        3.8%
  20          95.6             99.2        3.7%         98.9            101.9        3.0%
  40         104.9            108.9        3.8%        103.3            105.6        2.2%
  70         117.0            121.6        3.9%        101.5            102.8        1.3%
 100         121.9            126.9        4.1%        104.1            104.1        0%
 150         125.2            129.8        3.7%        112.7            109.7        2.6%
 200         133.0            137.5        3.4%         95.9             93.3        2.7%
 250         133.4            138.2        3.6%         81.0             78.6        3.0%
 300         134.6            138.8        3.1%         74.2             70.9        4.4%
 350         135.2            139.1        2.9%         69.8             68.4        2.0%
 400         137.2            140.6        2.5%         60.4             58.5        3.2%
The table also clearly shows the rise in flow 0's cwnd, cwnd0, and also the fall in cwnd1.
A related observation is that the absolute number of losses for either flow slowly declines as delayB increases, at least in the delayB ≤ 400 range we have been considering. For flow 1, with the longer RTT, this is because the number of packets transmitted drops precipitously (almost sevenfold) while cwnd and therefore the loss-event probability p stays relatively constant. For flow 0, the total number of packets sent rises as flow 1 drops to insignificance, but only by about 50%, and the average cwnd0 (above) rises sufficiently fast (due to flow 1's decline) that
total_losses = total_sent × p = total_sent × (K/cwnd)²
generally does not increase.
flow, in each cluster. Each packet loss is indicated by an ns-2 event-trace line beginning with d. In our
simple arrangement, all losses are at the router R and all losses are of data (rather than ACK) packets.
At each drop cluster (loss event), each flow loses zero or more packets. For each pair of integers (X,Y)
we will count the number of drop clusters for which flow 0 experienced X losses and flow 1 experienced
Y losses. In the simplest model of synchronized losses, we might expect the majority of drop clusters to
correspond to the pair (1,1).
Whenever we identify a new drop cluster, we mark its start time (dcstart below); one cluster then consists
of all drop events from that time until one granularity interval of length DC_GRANULARITY has elapsed,
and the next loss after the end of that interval marks the start of the next drop cluster. The granularity interval
must be larger than the coarse-grained timeout interval and smaller than the tooth-to-tooth interval. Because
the timeout interval is often 1.0 sec, it is convenient to set DC_GRANULARITY to 1.5 or 2.0; in that range
the clusters we get are not very sensitive to the exact value.
We will represent a drop cluster as a Python tuple (pair) of integers (d0,d1) where d0 represents the
number of packet losses for flow 0 and d1 the losses for flow 1. As we start each new cluster, we add the
old one to a Python dictionary clusterdict, mapping each cluster to the count of how many times it has
occurred.
The code looks something like this, where dropcluster is initialized to the pair (0,0) and
addtocluster(c,f) adds to cluster c a drop for flow f. (The portion of the script here fails to include the final cluster in clusterdict.)
if nstrace.istrace():       # counting regular trace lines for DROP CLUSTERS
    (event, time, sendnode, dnode, proto, dummy, dummy, flow, dummy, dummy, seqno, pktid) = nstrace.getEvent()
    if (event == "d"):                              # ignore all others
        if (time > dcstart + DC_GRANULARITY):       # save old cluster, start new one
            dcstart = time
            if (dropcluster != (0,0)):              # dropcluster==(0,0) means it is still empty
                inc_cluster(dropcluster, clusterdict)   # add dropcluster to clusterdict
            dropcluster = addtocluster((0,0), flow)     # start new cluster
        else:                                       # add to dropcluster
            dropcluster = addtocluster(dropcluster, flow)
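The helper functions addtocluster() and inc_cluster() are not shown in the fragment; minimal versions consistent with their use above might be
def addtocluster(cluster, flow):
    # record one more drop for the given flow (0 or 1) in the cluster pair
    (d0, d1) = cluster
    if flow == 0: return (d0+1, d1)
    else:         return (d0, d1+1)

def inc_cluster(cluster, clusterdict):
    # bump the occurrence count for this (d0,d1) cluster
    clusterdict[cluster] = clusterdict.get(cluster, 0) + 1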
At the end of the tracefile, we can print out clusterdict; if (f0,f1) had an occurrence count of N then
there were N drop events where flow 0 had f0 losses and flow 1 had f1. We can also count the number of
loss events that involved only flow 0 by adding up the occurrence counts of all drop clusters of the form
(x,0), that is, with 0 as the number of flow-1 losses in the cluster; similarly for flow 1.
A full Python script for this is at teeth.py.
If we do this for our earlier 16.3 Two TCP Senders Competing scenario, with bottleneckBW = 0.8 and
with a runtime of 3000 seconds, we find out that for equal paths, the two connections have completely
synchronized loss events (that is, no flow-0-only or flow-1-only losses) for overhead values of 0, 0.0012,
0.003, 0.0055 and 0.008 sec (recall the bottleneck-link packet send time is 0.01 sec). For all these overhead
values but 0.008, each loss event does indeed consist of exactly one loss for flow 0 and one loss for flow 1.
For overhead = 0.008, the final clusterdict is {(1,2):1, (1,1):674, (2,1):1}, meaning there were 674 loss
events with exactly one loss from each flow, one loss event with one loss from flow 0 and two from flow 1,
and one loss event with two losses from flow 0 and one from flow 1.
As overhead increases further, however, the final clusterdicts become more varied; below is a diagram
representing the dictionary for overhead = 0.02.
The numbers are the occurrence counts for each dropcluster pair; for example, the entry of 91 in row 1
column 2 means that there were 91 times flow 0 had two losses and flow 1 had a single loss.
The vertical axis plots the Reno-to-Vegas goodput ratio; the horizontal axis represents the queue capacity. The performance of TCP Vegas falls off quite quickly as the queue size increases. One reason the larger α and β may help is that this slightly increases the range in which TCP Vegas behaves like TCP Reno.
To create an ns-2 TCP Vegas connection and to set α and β one uses
set tcp1 [new Agent/TCP/Vegas]
$tcp1 set v_alpha_ 3
$tcp1 set v_beta_ 6
In prior simulations we have also made the following setting, in order to make the total TCP packetsize
including headers be 1000:
Agent/TCP set packetSize_ 960
It turns out that for TCP Vegas objects in ns-2, the packetSize_ includes the headers, as can be verified
by looking at the tracefile, and so we need
Agent/TCP/Vegas set packetSize_ 1000
Here is a cwnd-v-time graph comparing TCP Reno and TCP Vegas; the queuesize is 20, bottleneckBW is 8 Mbps, overhead is 0.002, α=3 and β=6. The first 300 seconds are shown. During this period the bandwidth ratio is about 1.1; it rises to close to 1.3 (all in TCP Reno's favor) when T=1000.
The red plot represents TCP Reno and the green represents TCP Vegas. The green plot shows some spikes
that probably represent implementation artifacts.
Five to ten seconds before each sharp TCP Reno peak, TCP Vegas has its own softer peak. The RTT has
begun to rise, and TCP Vegas recognizes this and begins decrementing cwnd by 1 each RTT. At the point of
packet loss, TCP Vegas begins incrementing cwnd again. During the cwnd-decrement phase, TCP Vegas
falls behind relative to TCP Reno, but it may catch up during the subsequent increment phase because TCP
Vegas often avoids the cwnd = cwnd/2 multiplicative decrease and so often continues after a loss event
with a larger cwnd than TCP Reno.
We conclude that, for smaller bottleneck-queue sizes, TCP Vegas does indeed hold its own. Unfortunately,
in the scenario here the bottleneck-queue size has to be quite small for this to work; TCP Vegas suffers in
competition with TCP Reno even for moderate queue sizes. That said, queue capacities out there in the real
Internet tend to increase much more slowly than bandwidth, and there may be real-world situations where TCP Vegas performs quite well when compared to TCP Reno.
In the ns-2 source file wireless-phy.cc, the variable Pt_ for transmitter power is declared; the default value of 0.28183815 translates to a physical range of 250 meters using the appropriate radio-attenuation model.
We create a simulation here in which one node (mover) moves horizontally above a sequence of fixed-position nodes (stored in the Tcl array rownode). The leftmost fixed-position node transmits continuously to the mover node; as the mover node progresses, packets must be routed through other fixed-position nodes. The fixed-position nodes here are 200 m apart, and the mover node is 150 m above their line; this means that the mover reaches the edge of the range of the ith rownode when it is directly above the (i+1)th rownode.
We use Ad hoc On-demand Distance Vector (AODV) as the routing protocol. When the mover moves out
of range of one fixed-position node, AODV finds a new route (which will be via the next fixed-position
node) quite quickly; we return to this below. DSDV (9.4.1 DSDV) is much slower, which leads to many
packet losses until the new route is discovered. Of course, whether a given routing mechanism is fast enough
depends very much on the speed of the mover; the simulation here does not perform nearly as well if the
time is set to 10 seconds rather than 100 as the mover moves too fast even for AODV to keep up.
Because there are so many configuration parameters, to keep them together we adopt the common convention of making them all attributes of a single Tcl object, named opt.
We list the simulation file itself in pieces, with annotation; the complete file is at wireless.tcl. We begin with
the options.
# ======================================================================
# Define options
# ======================================================================
set opt(chan)           Channel/WirelessChannel     ;# channel type
set opt(prop)           Propagation/TwoRayGround    ;# radio-propagation model
set opt(netif)          Phy/WirelessPhy             ;# network interface type
set opt(mac)            Mac/802_11                  ;# MAC type
set opt(ifq)            Queue/DropTail/PriQueue     ;# interface queue type
set opt(ll)             LL                          ;# link layer type
set opt(ant)            Antenna/OmniAntenna         ;# antenna model
set opt(ifqlen)         50                          ;# max packet in ifq

set opt(bottomrow)      5                           ;# number of bottom-row nodes
set opt(spacing)        200                         ;# spacing between bottom-row nodes
set opt(mheight)        150                         ;# height of moving node above bottom-row
set opt(brheight)       50                          ;# height of bottom-row nodes from bottom
set opt(x)              [expr ($opt(bottomrow)-1)*$opt(spacing)+1]   ;# x coordinate of topology
set opt(y)              300                         ;# y coordinate of topology

set opt(adhocRouting)   AODV                        ;# routing protocol
set opt(finish)         100                         ;# time to stop simulation

# the next value is the speed in meters/sec to move across the field
set opt(speed)          [expr 1.0*$opt(x)/$opt(finish)]
The Channel/WirelessChannel class represents the physical terrestrial wireless medium; there is
also a Channel/Sat class for satellite radio. The Propagation/TwoRayGround is a particular radio-propagation model. The TwoRayGround model takes into account ground reflection; for larger inter-node distances d, the received power level is proportional to 1/d⁴. Other models are the free-space model (in which received power at distance d is proportional to 1/d²) and the shadowing model, which takes into account other types of interference. Further details can be found in the Radio Propagation Models chapter
of the ns-2 manual.
The Phy/WirelessPhy class specifies the standard wireless-node interface to the network; alternatives include Phy/WirelessPhyExt with additional options and a satellite-specific Phy/Sat. The
Mac/802_11 class specifies IEEE 802.11 (that is, Wi-Fi) behavior; other options cover things like generic
CSMA/CA, Aloha, and satellite. The Queue/DropTail/PriQueue class specifies the queuing behavior of each node; the opt(ifqlen) value determines the maximum queue length and so corresponds to
the queue-limit value for wired links. The LL class, for Link Layer, defines things like the behavior of
ARP on the network.
The Antenna/OmniAntenna class defines a standard omnidirectional antenna. There are many kinds of directional antennas in the real world (eg parabolic dishes and waveguide cantennas) and a few have been implemented as ns-2 add-ons.
The next values are specific to our particular layout. The opt(bottomrow) value determines the
number of fixed-position nodes in the simulation. The spacing between adjacent bottom-row nodes is
opt(spacing) meters. The moving node mover moves at height 150 meters above this fixed row. When mover is directly above a fixed node, it is thus at distance √(200² + 150²) = 250 from the previous fixed node, at which point the previous node is out of range. The fixed row itself is 50 meters above the
bottom of the topology. The opt(x) and opt(y) values are the dimensions of the simulation, in meters;
the number of bottom-row nodes and their spacing determine opt(x).
As mentioned earlier, we use the AODV routing mechanism. When the mover node moves out of range
of the bottom-row node that it is currently in contact with, AODV receives notice of the failed transmission
from the Wi-Fi link layer (ultimately this news originates from the absence of the Wi-Fi link-layer ACK).
This triggers an immediate search for a new route, which typically takes less than 50 ms to complete. The
earlier DSDV (9.4.1 DSDV) mechanism does not use Wi-Fi link-layer feedback and so does not look for
a new route until the next regularly scheduled round of distance-vector announcements, which might be
several seconds away. Other routing mechanisms include TORA, PUMA, and OLSR.
The finishing time opt(finish) also represents the time the moving node takes to move across all the
bottom-row nodes; the necessary speed is calculated in opt(speed). If the finishing time is reduced, the
mover speed increases, and so the routing mechanism has less time to find updated routes.
The next section of Tcl code sets up general bookkeeping:
# create the simulator object
set ns [new Simulator]
# set up tracing
$ns use-newtrace
set tracefd [open wireless.tr w]
set namtrace [open wireless.nam w]
$ns trace-all $tracefd
$ns namtrace-all-wireless $namtrace $opt(x) $opt(y)
# create and define the topography object and layout
set topo [new Topography]
$topo load_flatgrid $opt(x) $opt(y)
# create an instance of General Operations Director, which keeps track of nodes and
# node-to-node reachability. The parameter is the total number of nodes in the simulation.
create-god [expr $opt(bottomrow) + 1]
The use-newtrace option enables a different tracing mechanism, in which each attribute except the first
is prefixed by an identifying tag, so that parsing is no longer position-dependent. We look at an example
below.
Note the special option namtrace-all-wireless for tracing for nam, and the dimension parameters
opt(x) and opt(y). The next step is to create a Topography object to hold the layout (still to be
determined). Finally, we create a General Operations Director, which holds information about the layout
not necessarily available to any node.
The next step is to call node-config, which passes many of the opt() parameters to ns and which
influences future node creation:
# general node configuration
set chan1 [new $opt(chan)]
$ns node-config -adhocRouting $opt(adhocRouting) \
-llType $opt(ll) \
-macType $opt(mac) \
-ifqType $opt(ifq) \
-ifqLen $opt(ifqlen) \
-antType $opt(ant) \
-propType $opt(prop) \
-phyType $opt(netif) \
-channel $chan1 \
-topoInstance $topo \
-wiredRouting OFF \
-agentTrace ON \
-routerTrace ON \
-macTrace OFF
Finally we create our nodes. The bottom-row nodes are created within a Tcl for-loop, and are stored in a
Tcl array rownode(). For each node we set its coordinates (X_, Y_ and Z_); it is at this point that the
rownode() nodes are given positions along the horizontal line y=50 and spaced opt(spacing) apart.
# create the bottom-row nodes as a node array $rownode(), and the moving node as $mover
for {set i 0} {$i < $opt(bottomrow)} {incr i} {
set rownode($i) [$ns node]
$rownode($i) set X_ [expr $i * $opt(spacing)]
$rownode($i) set Y_ $opt(brheight)
$rownode($i) set Z_ 0
}
We now make the mover node move, using setdest. If the node reaches the destination supplied in
setdest, it stops, but it is also possible to change its direction at later times using additional setdest
calls, if a zig-zag path is desired. Various external utilities are available to create a file of Tcl commands
to create a large number of nodes each with a designated motion; such a file can then be imported into the
main Tcl file.
set moverdestX [expr $opt(x) - 1]
$ns at 0 "$mover setdest $moverdestX [$mover set Y_] $opt(speed)"
Next we create a UDP agent and a CBR (Constant Bit Rate) application, and set up a connection from
rownode(0) to mover. CBR traffic does not use sliding windows.
# setup UDP connection, using CBR traffic
set udp [new Agent/UDP]
set null [new Agent/Null]
$ns attach-agent $rownode(0) $udp
$ns attach-agent $mover $null
$ns connect $udp $null
set cbr1 [new Application/Traffic/CBR]
$cbr1 set packetSize_ 512
$cbr1 set rate_ 200Kb
$cbr1 attach-agent $udp
$ns at 0 "$cbr1 start"
$ns at $opt(finish) "$cbr1 stop"
The remainder of the Tcl file includes additional bookkeeping for nam, a finish{} procedure, and the
startup of the simulation.
# tell nam the initial node position (taken from node attributes)
# and size (supplied as a parameter)
for {set i 0} {$i < $opt(bottomrow)} {incr i} {
$ns initial_node_pos $rownode($i) 10
}
$ns initial_node_pos $mover 20
# set the color of the mover node in nam
$mover color blue
$ns at 0.0 "$mover color blue"
$ns at $opt(finish) "finish"
proc finish {} {
global ns tracefd namtrace
$ns flush-trace
close $tracefd
close $namtrace
exit 0
}
# begin simulation
$ns run
The simulation can be viewed from the nam file, available at wireless.nam. In the simulation, the mover
node moves across the topography, over the bottom-row nodes. The CBR traffic reaches mover from
rownode(0) first directly, then via rownode(1), then via rownode(1) and rownode(2), etc. The
motion of the mover node is best seen by speeding up the animation frame rate using the nam control for
this, though doing this means that aliasing effects often make the CBR traffic appear to be moving in the
opposite direction.
Above is one frame from the animation, with the mover node almost (but not quite) directly over
rownode(3), and so is close to losing contact with rownode(2). Two CBR packets can be seen en
route; one has almost reached rownode(2) and one is about a third of the way from rownode(2) up to
the blue mover node. The packets are not shown to scale; see exercise 17.
The tracefile is specific to wireless networking, and even without the use of use-newtrace has a rather
different format from the link-based simulations earlier. The newtrace format begins with a letter for
send/receive/drop/forward; after that, each logged attribute is identified with a prefixed tag rather than by
position. Full details can be found in the ns-2 manual. Here is an edited record of the first packet drop (the
initial d indicates a drop-event record):
d -t 22.586212333 -Hs 0 -Hd 5 ... -Nl RTR -Nw CBK ... -Ii 1100 ... -Pn cbr -Pi 1100 ...
The -t tag indicates the time. The -Hs and -Hd tags indicate the source and destination, respectively.
The -Nl tag indicates the level (RouTeR) at which the loss was logged, and the -Nw tag indicates the
cause: CBK, for CallBacK, means that the packet loss was detected at the link layer but the information
was passed up to the routing layer. The -Ii tag is the packet's unique serial number, and the -P tags supply
information about the constant-bit-rate agent.
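Because every attribute is tagged, records can be parsed without regard to field position. The following short Python sketch (ours, not part of ns-2; it assumes the strict "-tag value" alternation just described, and the wireless.tr filename from the script above) extracts the times of routing-layer drops whose cause was a link-layer callback:

# minimal newtrace parser: each record is an action letter followed by "-tag value" pairs
def parse_newtrace_line(line):
    fields = line.split()
    if not fields:
        return None, {}
    # first field is the action letter; the rest alternate strictly as "-tag value"
    return fields[0], dict(zip(fields[1::2], fields[2::2]))

droptimes = []
with open("wireless.tr") as tracefile:
    for line in tracefile:
        action, attrs = parse_newtrace_line(line)
        # routing-layer drops whose cause was a link-layer callback (CBK)
        if action == 'd' and attrs.get('-Nl') == 'RTR' and attrs.get('-Nw') == 'CBK':
            droptimes.append(float(attrs['-t']))
print(droptimes)

Running something along these lines is one way to locate the clusters of drops discussed next.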
We can use the tracefile to find clusters of drops beginning at times 22.586, 47.575, 72.707 and 97.540,
corresponding roughly to the times when the route used to reach the $mover node shifts to passing through
one more bottom-row node. Between t=72.707 and t=97.540 there are several other somewhat more mysterious clusters of drops; some of these clusters may be related to ordinary queue overflow but others may
reflect decreasing reliability of the forwarding mechanism as the path grows longer.
16.7 Epilog
Simulations using ns (either ns-2 or ns-3) are a central part of networks research. Most scientific papers
addressing comparisons between TCP flavors refer to ns simulations to at least some degree; ns is also
widely used for non-TCP research (especially wireless).
But simulations are seldom a matter of a small number of runs. New protocols must be tested in a wide range
of conditions, with varying bandwidths, delays and levels of background traffic. Head-to-head comparisons
in isolation, such as our first runs in 16.3.3 Unequal Delays, can be very misleading. Good simulation
design, in other words, is not easy.
Our simulations here involved extremely simple networks. A great deal of effort has been expended by the
ns community in the past decade to create simulations involving much larger sets of nodes; the ultimate goal
is to create realistic simulations of the Internet itself. We refer the interested reader to [FP01].
16.8 Exercises
1. In the graph in 16.2.1 Graph of cwnd v time, examine the trace file to see what accounts for the dot at
the start of each tooth (at times approximately 4.1, 6.1 and 8.0). Note that the solid parts of the graph are
as solid as they are because fractional cwnd values are used; cwnd is incremented by 1/cwnd on receipt of
each ACK.
2. A problem with the single-sender link-utilization experiment at 16.2.6 Single-sender Throughput Experiments was that the smallest practical value for queue-limit was 3, which is 10% of the path transit
capacity. Repeat the experiment but arrange for the path transit capacity to be at least 100 packets, making a
queue-limit of 3 much smaller proportionally. Be sure the simulation runs long enough that it includes
multiple teeth. What link utilization do you get? Also try queue-limit values of 4 and 5.
3. Create a single-sender simulation in which the path transit capacity is 90 packets, and the bottleneck
queue-limit is 30. This should mean cwnd varies between 60 and 120. Be sure the simulation runs
long enough that it includes many teeth.
1. What link utilization do you observe?
2. What queue utilization do you observe?
4. Use the basic2 model with equal propagation delays (delayB = 0), but delay the starting time for the
first connection. Let this delay time be startdelay0; at the end of the basic2.tcl file, you will have
$ns at $startdelay0 "$ftp0 start"
Try this for startdelay0 ranging from 0 to 40 ms, in increments of 1.0 ms. Graph the ratio1 or ratio2
values as a function of startdelay0. Do you get a graph like the one in 16.3.4.2 Two-sender phase
effects?
5. If the bottleneck link forwards at 10 ms/packet (bottleneckBW = 0.8), then in 300 seconds we can
send 30,000 packets. What percentage of this, total, are sent by two competing senders as in 16.3 Two TCP
Senders Competing, for delayB = 0, 50, 100, 200 and 400?
6. Repeat the previous exercise for bottleneckBW = 8.0, that is, a bottleneck rate of 1 ms/packet.
7. Pick a case above where the total is less than 100%. Write a script that keeps track of cwnd0 + cwnd1 ,
and measure how much time this quantity is less than the transit capacity. What is the average of cwnd0 +
cwnd1 over those periods when it is less than the transit capacity?
8. In the model of 16.3 Two TCP Senders Competing, it is often the case that for small delayB, eg delayB
< 5, the longer-path BD connection has greater throughput. Demonstrate this. Use an appropriate value of
overhead (eg 0.02 or 0.002).
9. In generating the first graph in 16.3.7 Phase Effects and telnet traffic, we used a packetSize_ of
210 bytes, for an actualSize of 250 bytes. Generate a similar graph using in the simulations a much
smaller value of packetSize_, eg 10 or 20 bytes. Note that for a given value of tndensity, as the
actualSize shrinks so should the tninterval.
10. In 16.3.7 Phase Effects and telnet traffic the telnet connections ran from A and B to D, so the telnet
traffic competed with the bulk ftp traffic on the bottleneck link. Change the simulation so the telnet traffic
runs only as far as R. The variable tndensity should now represent the fraction of the AR and BR
bandwidth that is used for telnet. Try values for tndensity of from 5% to 50% (note these densities
are quite high). Generate a graph like the first graph in 16.3.7 Phase Effects and telnet traffic, illustrating
whether this form of telnet traffic is effective at reducing phase effects.
11. Again using the telnet simulation of the previous exercise, in which the telnet traffic runs only as far as
R, generate a graph comparing ratio1 for the bulk ftp traffic when randomization comes from:
• overhead values in the range discussed in the text (eg 0.01 or 0.02 for bottleneckBW = 0.8 Mbps)
• AR and BR telnet traffic with tndensity in the range 5% to 50%.
The goal should be a graph comparable to that of 16.3.8 overhead versus telnet. Do the two randomization
mechanisms overhead and telnet still yield comparable values for ratio1?
12. Repeat the same experiment as in exercise 8, but using telnet instead of overhead. Try it with AD/BD telnet traffic and tndensity around 1%, and also AR/BR traffic as in exercise 10 with a tndensity
around 10%. The effect may be even more marked.
13. Analyze the packet drops for each flow for the Reno-versus-Vegas competition shown in the second
graph (red v green) of 16.5 TCP Reno versus TCP Vegas. In that simulation, bottleneckBW = 8.0
Mbps, delayB = 0, queuesize = 20, α=3 and β=6; the full simulation ran for 1000 seconds.
(a). How many drop clusters are for the Reno flow only?
(b). How many drop clusters are for the Vegas flow only?
(c). How many shared drop clusters are there?
at least 1000 seconds. Use the default α and β (usually α=1 and β=3). Is ratio1 ≃ 1 or ratio2 ≃ 1 a better fit? How does the overall fairness compare with what ratio1 ≃ 1 or ratio2 ≃ 1 would predict? Does either ratio appear roughly constant? If so, what is its value?
16. Repeat the previous exercise for a much larger value of α, say α=10. Set β=α+2, and be sure queuesize is larger than 2β (recall that, in ns, α is v_alpha_ and β is v_beta_). If both connections manage to keep the same number of packets in the bottleneck queue at R, then both should get about
the same goodput, by the queue-competition rule of 14.2.2 Example 2: router competition. Do they? Is the
fairness situation better or worse than with the default α and β?
17. In nam animations involving point-to-point links, packet lengths are displayed proportionally: if a link
has a propagation delay of 10 ms and a bandwidth of 1 packet/ms, then each packet will be displayed with
length 1/10 the link length. Is this true of wireless as well? Consider the animation (and single displayed
frame) of 16.6 Wireless Simulation. Assume the signal propagation speed is c ≃ 300 m/µsec, that the nodes
are 300 m apart, and that the bandwidth is 1 Mbps.
(a). How long is a single bit? (That is, how far does the signal travel in the time needed to send a single
bit?)
(b). If a sender transmits continuously, how many bits will it send before its first bit reaches its destination?
(c). In the nam animation of 16.6 Wireless Simulation, is it plausible that what is rendered for the CBR
packets represents just the first bit of the packet? Is the scale about right for a single bit?
(d). What might be a more accurate animated representation of wireless packets?
17 The ns-3 Network Simulator
In this chapter we take a somewhat cursory look at the ns-3 simulator, intended as a replacement for ns-2.
The project is managed by the NS-3 Consortium, and all materials are available at nsnam.org.
Ns-3 represents a rather sharp break from ns-2. Gone is the Tcl programming interface; instead, ns-3 simulation programs are written in the C++ language, with extensive calls to the ns-3 library, although they are often
still referred to as simulation scripts. As the simulator core itself is also written in C++, this in some cases
allows improved interaction between configuration and execution. However, configuration and execution
are still in most cases quite separate: at the end of the simulation script comes a call Simulator::Run()
akin to ns-2's $ns run, at which point the user-written C++ has done its job and the library takes over.
To configure a simple simulation, an ns-2 Tcl script had to create nodes and links, create network-connection
agents attached to nodes, and create traffic-generating applications attached to agents. Much the same
applies to ns-3, but in addition each node must be configured with its network interfaces, and each network
interface must be assigned an IP address.
to
self.warnings_flags = [['-Wall'], ['-Wextra']]
static void
CwndChange (Ptr<OutputStreamWrapper> stream, uint32_t oldCwnd, uint32_t newCwnd)
{
    *stream->GetStream () << Simulator::Now ().GetSeconds () << " " << newCwnd << std::endl;
}

static void
TraceCwnd ()        // Trace changes to the congestion window
{
    AsciiTraceHelper ascii;
    Ptr<OutputStreamWrapper> stream = ascii.CreateFileStream (fileNameRoot + ".cwnd");
    Config::ConnectWithoutContext ("/NodeList/0/$ns3::TcpL4Protocol/SocketList/0/CongestionWindow", MakeBoundCallback (&CwndChange, stream));
}
The function TraceCwnd() arranges for tracing of cwnd; the function CwndChange is a callback, invoked by the ns-3 system whenever cwnd changes. Such callbacks are common in ns-3.
The parameter string beginning /NodeList/0/... is an example of the configuration namespace. Each
ns-3 attribute can be accessed this way. See 17.2.2 The Ascii Tracefile below.
int main (int argc, char *argv[])
{
int tcpSegmentSize = 1000;
Config::SetDefault ("ns3::TcpSocket::SegmentSize", UintegerValue (tcpSegmentSize));
Config::SetDefault ("ns3::TcpSocket::DelAckCount", UintegerValue (0));
Config::SetDefault ("ns3::TcpL4Protocol::SocketType", StringValue ("ns3::TcpReno"));
Config::SetDefault ("ns3::RttEstimator::MinRTO", TimeValue (MilliSeconds (500)));
The use of Config::SetDefault() allows us to configure objects that will not exist until some later point, perhaps not until the ns-3 simulator is running.
The first parameter is an attribute string, of the form ns3::class::attribute. A partial list of attributes is at https://www.nsnam.org/docs/release/3.19/doxygen/group___attribute_list.html. Attributes of a class can
also be determined by a command such as the following:
./waf --run "basic1 --PrintAttributes=ns3::TcpSocket"
The advantage of the Config::SetDefault mechanism is that often objects are created indirectly,
perhaps by helper classes, and so direct setting of class properties can be problematic.
It is perfectly acceptable to issue some Config::SetDefault calls, then create some objects (perhaps
implicitly), and then change the defaults (again with Config::SetDefault) for creation of additional
objects.
We pick the TCP congestion-control algorithm by setting ns3::TcpL4Protocol::SocketType.
Options are TcpRfc793 (no congestion control), TcpTahoe, TcpReno, TcpNewReno and
TcpWestwood. TCP Cubic and SACK TCP are not supported natively (though they are available if the
Network Simulation Cradle is installed).
Setting the DelAckCount attribute to 0 disables delayed ACKs. Setting the MinRTO value to 500 ms
avoids some unexpected hard timeouts. We will return to both of these below in 17.2.3 Unexpected
Timeouts and Other Phenomena.
Next come our local variables and command-line-option processing. In ns-3 the latter is handled via the CommandLine object, which also recognizes the --PrintAttributes option above. Using the
--PrintHelp option gives a list of variables that can be set via command-line arguments.
// (declarations of the local variables go here: runtime, delayAR, delayRB, fastBW, bottleneckBW,
// queuesize and maxBytes, with units in seconds, ms and Mbps; maxBytes = 0 means "unlimited")
CommandLine cmd;
// Here, we define command line options overriding some of the above.
cmd.AddValue ("runtime", "How long the applications should send data", runtime);
cmd.AddValue ("delayRB", "Delay on the R--B link, in ms", delayRB);
cmd.AddValue ("queuesize", "queue size at R", queuesize);
cmd.AddValue ("tcpSegmentSize", "TCP segment size", tcpSegmentSize);
cmd.Parse (argc, argv);
std::cout << "queuesize=" << queuesize << ", delayRB=" << delayRB << std::endl;
Next we create three nodes, illustrating the use of smart pointers and CreateObject().
Ptr<Node> A = CreateObject<Node> ();
Ptr<Node> R = CreateObject<Node> ();
Ptr<Node> B = CreateObject<Node> ();
Class Ptr is a smart pointer that manages memory through reference counting. The template function
CreateObject acts as the ns-3 preferred alternative to operator new. Parameters for objects created
this way can be supplied via Config::SetDefault, or by some later method call applied to the Ptr
object. For Node objects, for example, we might call A -> AddDevice(...).
A convenient alternative to creating nodes individually is to create a container of nodes all at once:
NodeContainer allNodes;
allNodes.Create(3);
Ptr<Node> A = allNodes.Get(0);
...
After the nodes are in place we create our point-to-point links, using the PointToPointHelper
class. We also create NetDeviceContainer objects; we don't use these here (we could simply call
AR.Install(A,R)), but will need them below when assigning IPv4 addresses.
// use PointToPointChannel and PointToPointNetDevice
NetDeviceContainer devAR, devRB;
PointToPointHelper AR, RB;
// create point-to-point link from A to R
AR.SetDeviceAttribute ("DataRate", DataRateValue (DataRate (fastBW * 1000 * 1000)));
AR.SetChannelAttribute ("Delay", TimeValue (MilliSeconds (delayAR)));
devAR = AR.Install(A, R);
// create point-to-point link from R to B
RB.SetDeviceAttribute ("DataRate", DataRateValue (DataRate (bottleneckBW * 1000 * 1000)));
RB.SetChannelAttribute ("Delay", TimeValue (MilliSeconds (delayRB)));
devRB = RB.Install(R, B);
Next we hand out IPv4 addresses. The Ipv4AddressHelper class can help us with individual LANs (eg
AR and RB), but it is up to us to make sure our two LANs are on different subnets. If we attempt to put
A and B on the same subnet, routing will simply fail, just as it would if we were to do this with real network
nodes.
InternetStackHelper internet;
internet.Install (A);
internet.Install (R);
internet.Install (B);
// Assign IP addresses
Ipv4AddressHelper ipv4;
ipv4.SetBase ("10.0.0.0", "255.255.255.0");
Ipv4InterfaceContainer ipv4Interfaces;
ipv4Interfaces.Add (ipv4.Assign (devAR));
ipv4.SetBase ("10.0.1.0", "255.255.255.0");
ipv4Interfaces.Add (ipv4.Assign(devRB));
Ipv4GlobalRoutingHelper::PopulateRoutingTables ();
Next we print out the addresses assigned. This gives us a peek at the GetObject template and the ns-3
object-aggregation model. The original Node objects we created earlier were quite generic; they gained their
Ipv4 component in the code above. Now we retrieve that component with the GetObject<Ipv4>()
calls below.
Ptr<Ipv4> A4 = A->GetObject<Ipv4>(); // gets node A's IPv4 subsystem
Ptr<Ipv4> B4 = B->GetObject<Ipv4>();
Ptr<Ipv4> R4 = R->GetObject<Ipv4>();
Ipv4Address Aaddr = A4->GetAddress(1,0).GetLocal();
Ipv4Address Baddr = B4->GetAddress(1,0).GetLocal();
Ipv4Address Raddr = R4->GetAddress(1,0).GetLocal();
std::cout << "A's address: " << Aaddr << std::endl;
std::cout << "B's address: " << Baddr << std::endl;
std::cout << "R's #1 address: " << Raddr << std::endl;
std::cout << "R's #2 address: " << R4->GetAddress(2,0).GetLocal() << std::endl;
In general, A->GetObject<T> returns the component of type T that has been aggregated to
Ptr<Object> A; often this aggregation is invisible to the script programmer but an understanding of
how it works is sometimes useful. The aggregation is handled by the ns-3 Object class, which contains
an internal list m_aggregates of aggregated companion objects. At most one object of a given type
can be aggregated to another, making GetObject<T> unambiguous. Given a Ptr<Object> A, we
can obtain an iterator over the aggregated companions via A->GetAggregateIterator(), of type
Object::AggregateIterator. From each Ptr<const Object> B returned by this iterator, we
can call B->GetInstanceTypeId().GetName() to get the class name of B.
The GetAddress() calls take two parameters; the first specifies the interface (a value of 0
gives the loopback interface) and the second distinguishes between multiple addresses assigned to
the same interface (which is not happening here). The call A4->GetAddress(1,0) returns an
Ipv4InterfaceAddress object containing, among other things, an IP address, a broadcast address
and a netmask; GetLocal() returns the first of these.
Next we create the receiver on B, using a PacketSinkHelper. A receiver is, in essence, a read-only
form of an application server.
// create a sink on B
uint16_t Bport = 80;
Address sinkAaddr(InetSocketAddress (Ipv4Address::GetAny (), Bport));
PacketSinkHelper sinkA ("ns3::TcpSocketFactory", sinkAaddr);
ApplicationContainer sinkAppA = sinkA.Install (B);
sinkAppA.Start (Seconds (0.01));
// the following means the receiver will run 1 min longer than the sender app.
sinkAppA.Stop (Seconds (runtime + 60.0));
Address sinkAddr(InetSocketAddress(Baddr, Bport));
Now comes the sending application, on A. We must configure and create a BulkSendApplication,
attach it to A, and arrange for a connection to be created to B. The BulkSendHelper class simplifies this.
BulkSendHelper sourceAhelper ("ns3::TcpSocketFactory", sinkAddr);
sourceAhelper.SetAttribute ("MaxBytes", UintegerValue (maxBytes));
sourceAhelper.SetAttribute ("SendSize", UintegerValue (tcpSegmentSize));
ApplicationContainer sourceAppsA = sourceAhelper.Install (A);
sourceAppsA.Start (Seconds (0.0));
sourceAppsA.Stop (Seconds (runtime));
If we did not want to use the helper class here, the easiest way to create the BulkSendApplication
is with an ObjectFactory. We configure the factory with the type we want to create and the
relevant configuration parameters, and then call factory.Create(). (We could have used the
Config::SetDefault() mechanism and CreateObject() as well.)
ObjectFactory factory;
factory.SetTypeId ("ns3::BulkSendApplication");
factory.Set ("Protocol", StringValue ("ns3::TcpSocketFactory"));
factory.Set ("MaxBytes", UintegerValue (maxBytes));
factory.Set ("SendSize", UintegerValue (tcpSegmentSize));
factory.Set ("Remote", AddressValue (sinkAddr));
Ptr<Object> bulkSendAppObj = factory.Create();
Ptr<Application> bulkSendApp = bulkSendAppObj -> GetObject<Application>();
bulkSendApp->SetStartTime(Seconds(0.0));
bulkSendApp->SetStopTime(Seconds(runtime));
A->AddApplication(bulkSendApp);
The above gives us no direct access to the actual TCP connection. Yet another alternative is to start by
creating the TCP socket and connecting it:
Ptr<Socket> tcpsock = Socket::CreateSocket (A, TcpSocketFactory::GetTypeId ());
tcpsock->Bind();
tcpsock->Connect(sinkAddr);
However, there is then no mechanism for creating a BulkSendApplication that uses a pre-existing
socket. (For a workaround, see the tutorial example fifth.cc.)
Before beginning execution, we set up tracing; we will look at the tracefile format later. We use the AR PointToPointHelper object here, but both ascii and pcap tracing apply to the entire A–R–B network.
// Set up tracing
AsciiTraceHelper ascii;
std::string tfname = fileNameRoot + ".tr";
AR.EnableAsciiAll (ascii.CreateFileStream (tfname));
// Setup tracing for cwnd
Simulator::Schedule(Seconds(0.01),&TraceCwnd);       // this Time cannot be 0.0
// This tells ns-3 to generate pcap traces, including "-node#-dev#-" in filename
AR.EnablePcapAll (fileNameRoot);                      // ".pcap" suffix is added automatically
Compare this graph to that in 16.2.1 Graph of cwnd v time produced by ns-2. The slow-start phase earlier
ended at around 2.0 and now ends closer to 3.0. There are several modest differences, including the halving
of cwnd just before T=1 and the peak around T=2.6; these were not apparent in the ns-2 graph.
After slow-start is over, the graphs are quite similar; cwnd ranges from 10 to 21. The period before was
1.946 seconds; here it is 2.0548; the difference is likely due to a more accurate implementation of the
recovery algorithm.
One striking difference is the presence of the near-vertical line of dots just after each peak. What is happening here is that ns-3 implements the cwnd inflation/deflation algorithm outlined at the tail end of 13.4 TCP
Reno and Fast Recovery. When three dupACKs are received, cwnd is set to cwnd/2 + 3, and is then allowed
to increase to 1.5×cwnd. See the end of 13.4 TCP Reno and Fast Recovery.
d 4.9823   /NodeList/1/DeviceList/1/$ns3::PointToPointNetDevice/TxQueue/Drop ns3::PppHeader ...
r 4.98312  /NodeList/2/DeviceList/0/$ns3::PointToPointNetDevice/MacRx ns3::Ipv4Header (tos 0 ...
+ 4.98312  /NodeList/2/DeviceList/0/$ns3::PointToPointNetDevice/TxQueue/Enqueue ns3::PppHeader ...
- 4.98312  /NodeList/2/DeviceList/0/$ns3::PointToPointNetDevice/TxQueue/Dequeue ns3::PppHeader ...
As with ns-2, the first letter indicates the action: r for received, d for dropped, + for enqueued, - for
dequeued. For Wi-Fi tracefiles, t is for transmitted. The second field represents the time.
The third field represents the name of the event in the configuration namespace, sometimes called the configuration path name. The NodeList value represents the node (A=0, etc), the DeviceList represents
the interface, and the final part of the name repeats the action: Drop, MacRx, Enqueue, Dequeue.
After that come a series of class names (eg ns3::Ipv4Header, ns3::TcpHeader), from the ns-3
attribute system, followed in each case by a parenthesized list of class-specific trace information.
In the output above, the final three records all refer to node B (/NodeList/2/). Packet 258 has just
arrived (Seq=258001), and ACK 259001 is then enqueued and sent.
We can enable ns-3's internal logging in the TcpReno class by entering the commands below, before
running the script. (In some cases, as with WifiHelper::EnableLogComponents(), logging output
can be enabled from within the script.) Once enabled, logging output is written to stderr.
NS_LOG=TcpReno=level_info
export NS_LOG
But then, despite Fast Recovery proceeding normally, we get a hard timeout:
8.71463 [node 0] RTO. Reset cwnd to 960, ssthresh to 14400, restart from seqnum 510721
What is happening here is that the RTO interval was just a little too short, probably due to the use of the
awkward segment size of 960.
After the timeout, there is another triple-dupACK!
8.90344 [node 0] Triple dupack. Reset cwnd to 6240, ssthresh to 3360
Shortly thereafter, at T=8.98, cwnd is reset to 3360, in accordance with the Fast Recovery rules.
The overall effect is that cwnd is reset, not to 10, but to about 3.4 (in packets). This significantly slows
down throughput.
In recovering from the hard timeout, the sequence number is reset to Seq=510721 (packet 532), as this was
the last packet acknowledged. Unfortunately, several later packets had in fact made it through to B. By
looking at the tracefile, we can see that at T=8.7818, B received Seq=538561, or packet 561. Thus, when
A begins retransmitting packets 533, 534, etc after the timeout, B's response is to ACK the highest
packet it has received, packet 561 (Ack=539521).
This scenario is not what the designers of Fast Recovery had in mind; it is likely triggered by a too-conservative timeout estimate. Still, exactly how to fix it is an interesting question; one approach might
be to ignore, in Fast Recovery, triple dupACKs of packets now beyond what the sender is currently sending.
17.3 Wireless
We next present the wireless simulation of 16.6 Wireless Simulation. The full script is at wireless.cc; the
animation output for the netanim player is at wireless.xml. As before, we have one mover node moving
horizontally 150 meters above a row of five fixed nodes spaced 200 meters apart. The limit of transmission is
set to be 250 meters, meaning that a fixed node goes out of range of the mover node just as the latter passes
over directly above the next fixed node. As before, we use Ad hoc On-demand Distance Vector (AODV) as
the routing protocol. When the mover passes over fixed node N, it goes out of range of fixed node N-1, at
which point AODV finds a new route to mover through fixed node N.
As in ns-2, wireless simulations tend to require considerably more configuration than point-to-point simulations. We now review the source code line-by-line. We start with two callback functions and the global
variables they will need to access.
The phyMode string represents the Wi-Fi data rate (and modulation technique). DSSS rates are
DsssRate1Mbps, DsssRate2Mbps, DsssRate5_5Mbps and DsssRate11Mbps. Also available
are ErpOfdmRate constants to 54 Mbps and OfdmRate constants to 150 Mbps with a 40 MHz bandwidth (GetOfdmRate150MbpsBW40MHz). All these are defined in src/wifi/model/wifi-phy.cc.
Next are the variables that determine the layout and network behavior. The factor variable allows slowing
down the speed of the mover node but correspondingly extending the runtime (though the new-route-discovery time is not scaled):
int bottomrow = 5;            // number of fixed bottom-row nodes
int spacing = 200;            // spacing between bottom-row nodes, in meters
int mheight = 150;            // height of the mover node above the bottom row, in meters
int brheight = 50;            // height of the bottom row, in meters

int X = (bottomrow-1)*spacing+1;     // X is the horizontal dimension of the field
int packetsize = 500;
double factor = 1.0;          // allows slowing down rate and extending runtime; same total # of packets
int endtime = (int)100*factor;
double speed = (X-1.0)/endtime;
double bitrate = 80*1000.0/factor;   // *average* transmission rate, in bits/sec
uint32_t interval = 1000*packetsize*8/bitrate*1000;    // in microsec
uint32_t packetcount = 1000000*endtime/interval;
std::cout << "interval = " << interval << ", rate=" << bitrate << ", packetcount=" << packetcount << std::endl;
There are some niceties in calculating the packet transmission interval above; if we do it instead as
1000000*packetsize*8/bitrate then we sometimes run into 32-bit overflow problems or integer-division roundoff problems.
Now we configure some Wi-Fi settings.
Here we create the mover node with CreateObject<Node>(), but the fixed nodes are created via a
NodeContainer, as is more typical with larger simulations:
// Create nodes
NodeContainer fixedpos;
fixedpos.Create(bottomrow);
Ptr<Node> lowerleft = fixedpos.Get(0);
Ptr<Node> mover = CreateObject<Node>();
Now we put together a set of helper objects for more Wi-Fi configuration. We must configure both the
PHY (physical) and MAC layers.
// The below set of helpers will help us to put together the desired Wi-Fi behavior
WifiHelper wifi;
wifi.SetStandard (WIFI_PHY_STANDARD_80211b);
wifi.SetRemoteStationManager ("ns3::AarfWifiManager"); // Use AARF rate control
The AARF rate changes can be viewed by enabling the appropriate logging with, at the shell level before
./waf, NS_LOG=AarfWifiManager=level_debug. We are not otherwise interested in rate scaling
(3.3.3 Dynamic Rate Scaling) here, though.
The PHY layer helper is YansWifiPhyHelper. The YANS project (Yet Another Network Simulator)
was an influential precursor to ns-3; see [LH06]. Note the AddPropagationLoss configuration, where
we set the Wi-Fi range to 250 meters. The MAC layer helper is NqosWifiMacHelper; the nqos means
no quality-of-service, ie no use of Wi-Fi PCF (3.3.7 Wi-Fi Polling Mode).
// The PHY layer here is "yans"
YansWifiPhyHelper wifiPhyHelper = YansWifiPhyHelper::Default ();
// for .pcap tracing
// wifiPhyHelper.SetPcapDataLinkType (YansWifiPhyHelper::DLT_IEEE802_11_RADIO);
YansWifiChannelHelper wifiChannelHelper;              // *not* ::Default() !
wifiChannelHelper.SetPropagationDelay ("ns3::ConstantSpeedPropagationDelayModel"); // pld:
// the following has an absolute cutoff at distance > 250
wifiChannelHelper.AddPropagationLoss ("ns3::RangePropagationLossModel", "MaxRange", DoubleValue(250));
Ptr<YansWifiChannel> pchan = wifiChannelHelper.Create ();
wifiPhyHelper.SetChannel (pchan);
// Add a non-QoS upper-MAC layer "AdhocWifiMac", and set rate control
NqosWifiMacHelper wifiMacHelper = NqosWifiMacHelper::Default ();
wifiMacHelper.SetType ("ns3::AdhocWifiMac");
NetDeviceContainer devices = wifi.Install (wifiPhyHelper, wifiMacHelper, fixedpos);
devices.Add (wifi.Install (wifiPhyHelper, wifiMacHelper, mover));
At this point the basic Wi-Fi configuration is done! The next step is to work on the positions and motion.
First we establish the positions of the fixed nodes.
MobilityHelper sessile;                               // for fixed nodes
Ptr<ListPositionAllocator> positionAlloc = CreateObject<ListPositionAllocator> ();
int Xpos = 0;
for (int i=0; i<bottomrow; i++) {
positionAlloc->Add(Vector(Xpos, brheight, 0.0));
Xpos += spacing;
}
sessile.SetPositionAllocator (positionAlloc);
sessile.SetMobilityModel ("ns3::ConstantPositionMobilityModel");
sessile.Install (fixedpos);
Uncommenting the olsr line (and commenting out the last line) is all that is necessary to change to OLSR
routing. OLSR is slower to find new routes, but sends less traffic.
Now we set up the IP addresses. This is straightforward as all the nodes are on a single subnet.
InternetStackHelper internet;
internet.SetRoutingHelper(listrouting);
internet.Install (fixedpos);
internet.Install (mover);
Ipv4AddressHelper ipv4;
NS_LOG_INFO ("Assign IP Addresses.");
ipv4.SetBase ("10.1.1.0", "255.255.255.0");
Ipv4InterfaceContainer i = ipv4.Assign (devices);
Now we create a receiving application UdpServer on node mover, and a sending application
UdpClient on the lower-left node. These applications generate their own sequence numbers, which show
up in the ns-3 tracefiles marked with ns3::SeqTsHeader. As in 17.2 A Single TCP Sender, we use
Config::SetDefault() and CreateObject<>() to construct the applications.
We now set up tracing. The first, commented-out line enables pcap-format tracing, which we do not need
here. The YansWifiPhyHelper object supports tracing only of receive (r) and transmit (t) records;
the PointToPointHelper of 17.2 A Single TCP Sender also traced enqueue and drop records.
//wifiPhyHelper.EnablePcap (tracebase, devices);
AsciiTraceHelper ascii;
wifiPhyHelper.EnableAsciiAll (ascii.CreateFileStream (tracebase + ".tr"));
// create animation file, to be run with netanim
AnimationInterface anim (tracebase + ".xml");
anim.SetMobilityPollInterval(Seconds(0.1));
If we view the animation with netanim, the moving node's motion is clear. The mover node, however, sometimes appears to transmit back to both the fixed-row node below left and the fixed-row node below right. These transmissions represent the Wi-Fi link-layer ACKs; they appear to be sent to two fixed-row nodes because what netanim is actually displaying with its blue links is transmission to every other node in range.
We can also view the motion in text format by uncommenting the first line below.
//Simulator::Schedule(Seconds(position_interval), &printPosition);
Simulator::Schedule(Seconds(endtime), &stopMover);
Finally it is time to run the simulator, and print some final output.
Simulator::Stop(Seconds (endtime+60));
Simulator::Run ();
Simulator::Destroy ();
int pktsRecd = UdpRecvApp->GetReceived();
std::cout << "packets received: " << pktsRecd << std::endl;
std::cout << "packets recorded as lost: " << (UdpRecvApp->GetLost()) << std::endl;
std::cout << "packets actually lost: " << (packetcount - pktsRecd) << std::endl;

return 0;
}
That is, packets 0-499 were transmitted only by node 0. Packet 500 was never received by mover, but
there were seven transmission attempts; these seven attempts follow the rules described in 3.3.2 Wi-Fi and
Collisions. Packets starting at 501 were transmitted by node 0 and then later by node 1. Similarly, packet
1000 was lost, and after that each packet arriving at mover was first transmitted by nodes 0, 1 and 2, in that
order. In other words, packets are indeed being forwarded rightward along the line of fixed-row nodes until
a node is reached that is in range of mover.
If we change the routing from AODV to DSDV, say by changing the line listrouting.Add(aodv, 10); to
listrouting.Add(dsdv, 10);
we find that the loss count goes from 4 packets out of 2000 to 398 out of 2000; for OLSR routing the loss
count is 426. As we discussed in 16.6 Wireless Simulation, the loss of one data packet triggers the AODV
implementation to look for a new route. The DSDV and OLSR implementations, on the other hand, only
look for new routes at regularly spaced intervals.
17.4 Exercises
In preparation.
18 Queuing and Scheduling
Is giving all control of congestion to the TCP layer really the only option? As the Internet has evolved,
so have situations in which we may not want routers handling all traffic on a first-come, first-served basis.
Traffic with delay bounds, so-called real-time traffic, often involving either voice or video, is likely to perform much better with preferential service, for example; we will turn to this in 19 Quality of Service.
But even without real-time traffic, we might be interested in guaranteeing that each of several customers gets
an agreed-upon fraction of bandwidth, regardless of what the other customers are receiving. If we trust only
to TCP Reno's bandwidth-allocation mechanisms, and if one customer has one connection and another has
ten, then the bandwidth received may also be in the ratio of 1:10. This may make the first customer quite
unhappy.
The fundamental mechanism for achieving these kinds of traffic-management goals in a shared network is
through queuing; that is, in deciding how the routers prioritize traffic. In this chapter we will take a look at
what router-based strategies are available; in the following chapter we will study how some of these ideas
have been applied to develop distributed quality-of-service options.
Previously, in 14.1 A First Look At Queuing, we looked at FIFO queuing, both tail-drop and random-drop variants, and (briefly) at priority queuing. These are examples of queuing disciplines, a catchall term
for anything that supports a way to accept and release packets. The RED gateway strategy (14.8.3 RED
gateways) could qualify as a separate queuing discipline, too, although one closely tied to FIFO.
Queuing disciplines provide tools for meeting administratively imposed constraints on traffic. Two senders,
for example, might be required to share an outbound link equally, or in the proportion 60%-40%, even if
one participant would prefer to use 100% of the bandwidth. Alternatively, a sender might be required not to
send in bursts of more than 10 packets at a time.
Closely allied to the idea of queuing is scheduling: deciding what packets get sent when. Scheduling may
take the form of sending someone elses packets right now, or it may take the form of delaying packets that
are arriving too fast.
While priority queuing is one practical alternative to FIFO queuing, we will also look at so-called fair
queuing, in both flat and hierarchical forms. Fair queuing provides a straightforward strategy for dividing
bandwidth among multiple senders according to preset percentages.
Also introduced here is the token-bucket mechanism, which can be used for traffic scheduling but also for
traffic description.
Some of the material here, in particular that involving fair queuing and the Parekh-Gallager theorem, may
give this chapter a more mathematical feel than earlier chapters. Mostly, however, this is confined to the
proofs; the claims themselves are more straightforward.
In its original conception, the Internet was arguably intended for non-time-critical transport. If you wanted
to place a digital phone call where every (or almost every) byte was guaranteed to arrive within 50 ms, your
best bet might be to use the (separate) telephone network instead.
And, indeed, having an entirely separate network for real-time transport is definitely a workable solution.
It is, however, expensive; there are many economies of scale to having just a single network. There is,
therefore, a great deal of interest in figuring out how to get the Internet to support real-time traffic directly.
The central strategy for mixing real-time and bulk traffic is to use queuing disciplines to give the real-time
traffic the service it requires. Priority queuing is the simplest mechanism, though the fair-queuing approach
below offers perhaps greater flexibility.
We round out the chapter with the Parekh-Gallager theorem, which provides a precise delay bound for real-time traffic that shares a network with bulk traffic. All that is needed is that the real-time traffic satisfies a
token-bucket specification and is assigned bandwidth guarantees through fair queuing; the volume of bulk
traffic does not matter. This is exactly what is needed for real-time support.
While this chapter contains some rather elegant theory, it is not at all clear how much it is put into practice
today, at least for real-time traffic at the ISP level. We will return to this issue in the following chapter, but
for now we acknowledge that real-time traffic management using the queuing mechanisms described here
has seen limited acceptance in the Internet marketplace.
the higher-priority queue; if it is nonempty then a packet is dequeued from there. Only if the higher-priority
queue is empty is the lower-priority queue served.
Note that priority queuing is nonpreemptive: if a high-priority packet arrives while a low-priority packet is
being sent, the latter is not interrupted. Only when the low-priority packet has finished transmission does
the router again check its high-priority subqueue(s).
Priority queuing can lead to complete starvation of low-priority traffic, but only if the high-priority traffic
consumes 100% of the outbound bandwidth. Often we are able to guarantee (for example, through admission
control) that the high-priority traffic is limited to a designated fraction of the total outbound bandwidth.
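The dequeue logic itself is tiny. Here is a Python sketch (ours; two priority levels, names hypothetical) of the nonpreemptive rule: the subqueues are consulted only when the outbound interface next becomes free, so an already-started low-priority packet is never interrupted.

from collections import deque

class TwoLevelPriorityQueue:
    """Minimal sketch of nonpreemptive two-level priority queuing."""
    def __init__(self):
        self.high = deque()
        self.low = deque()

    def enqueue(self, packet, high_priority=False):
        (self.high if high_priority else self.low).append(packet)

    def dequeue(self):
        # called only when the interface finishes its current transmission
        if self.high:
            return self.high.popleft()    # high-priority subqueue served first
        if self.low:
            return self.low.popleft()     # served only if the high-priority queue is empty
        return None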
this might be fair queuing (below), which often supports a configuration in which a separate input class is
created on the fly for each separate TCP connection.
FIFO and priority queuing are both work-conserving, meaning that the associated outbound interface is not
idle unless the queue is empty. A non-work-conserving queuing discipline might, for example, artificially
delay some packets in order to enforce an administratively imposed bandwidth cap. Non-work-conserving
queuing disciplines are often called traffic shapers or policers; see 18.9 Token Bucket Filters below for an
example.
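As a preview of 18.9, the following Python sketch (ours; names and parameters hypothetical) shows the non-work-conserving idea in its simplest form: a token-bucket shaper may hold a packet back even when the outbound link is idle, until enough tokens have accumulated.

class TokenBucketShaper:
    """Minimal sketch: tokens accrue at rate bytes/sec, up to burst bytes."""
    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.last = 0.0

    def may_send(self, packetsize, now):
        # add tokens for the elapsed time, capped at the bucket (burst) size
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= packetsize:
            self.tokens -= packetsize
            return True      # release the packet
        return False         # hold it, even if the link is idle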
allocation is to be in the ratio 1:2, the first sender might always send 1 packet while the second might send 2.
The fluid-based GPS model approach to fair queuing, 18.5.4 The GPS Model, does provide an algorithm
that has direct, natural support for weighted fair queuing.
In the diagram above, transmission in nondecreasing order of CP means transmission in left-to-right order
of the vertical lines marking packet divisions, eg Q1, P1, R1, Q2, Q3, R2, .... This ensures that, in the long
run, each subqueue gets an equal share of bandwidth.
A completely equivalent strategy, better suited for generalization to the case where not all subqueues are
always active, is to send each packet in nondecreasing order of virtual finishing times, calculated for each
packet with the assumption that only that packet's subqueue is active. The virtual finishing time FP of packet P is equal to CP (the cumulative number of bytes sent from that subqueue, up through and including P, under that assumption) divided by the output bandwidth. We use finishing times rather than starting times because
if one packet is very large, shorter packets in other subqueues that would finish sooner should be sent first.
18.5.2.1 A first virtual-finish example
As an example, suppose there are two subqueues, P and Q. Suppose further that a stream of 1001-byte
packets P1 , P2 , P3 , ... arrives for P, and a stream of 400-byte packets Q1 , Q2 , Q3 , ... arrives for Q; each
stream is steady enough that each subqueue is always active. Finally, assume the output bandwidth is 1 byte
per unit time, and let T=0 be the starting point.
For the P subqueue, the virtual finishing times calculated as above would be P1 at 1001, P2 at 2002, P3
at 3003, etc; for Q the finishing times would be Q1 at 400, Q2 at 800, Q3 at 1200, etc. So the order of
transmission of all the packets together, in increasing order of virtual finish, will be as follows:
Packet    virtual finish    actual finish
Q1        400               400
Q2        800               800
P1        1001              1801
Q3        1200              2201
Q4        1600              2601
Q5        2000              3001
P2        2002              4002
For each packet we have calculated in the table above its virtual finishing time, and then its actual wallclock
finishing time assuming packets are transmitted in nondecreasing order of virtual finishing time (as shown).
Because both subqueues are always active, and because the virtual finishing times assumed each subqueue
received 100% of the output bandwidth, in the long run the actual finishing times will be about double the
virtual times. This, however, is irrelevant; all that matters is the relative virtual finishing times.
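The table can be generated mechanically. Here is a short Python sketch (ours, not from the text) that computes the virtual finishing times as cumulative per-subqueue byte counts and then the actual finishing times, assuming as above an output bandwidth of 1 byte per unit time:

P = [('P%d' % (i+1), 1001) for i in range(2)]    # 1001-byte packets P1, P2
Q = [('Q%d' % (i+1), 400) for i in range(5)]     # 400-byte packets Q1..Q5

packets = []
for subqueue in (P, Q):
    cumulative = 0
    for name, size in subqueue:
        cumulative += size                       # virtual finish = cumulative bytes,
        packets.append((cumulative, name, size)) # as if this subqueue had 100% of the link

packets.sort()            # send in nondecreasing order of virtual finishing time
clock = 0
for vfinish, name, size in packets:
    clock += size         # actual (wallclock) finish, at 1 byte per unit time
    print(name, vfinish, clock)

The output reproduces the three columns of the table above.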
18.5.2.2 A second virtual-finish example
For the next example, however, we allow a subqueue to be idle for a while and then become active; in this
situation virtual finishing times do not work quite so well. We return to our initial simplification that all
packets are the same size, which we take to be 1 unit; this allows us to apply the round-robin mechanism to
determine the transmission order and compare this to the virtual-finish order. Assume there are three queues
P, Q and R, and P is empty until wallclock time 20. Q is constantly busy; its Kth packet QK has virtual
finishing time FK = K.
For the first case, assume R is completely idle. When Ps first packet P1 arrives at time 20, its virtual
finishing time will be 21. At time 20 the head packet in Q will be Q21 ; the two packets therefore have
identical virtual finishing times. And, encouragingly, under round-robin queue service P1 and Q21 will be
sent in the same round.
For the second case, however, suppose R is also constantly busy. Up until time 20, Q and R have each sent
10 packets; their next packets are Q11 and R11 , with virtual finishing times of T=11. When Ps first packet
arrives at T=20, again with virtual finishing time 21, under round-robin service it should be sent in the same
round as Q11 and R11 . Yet their virtual finishing times are off by a factor of about two; queue P's stretch of
inactivity has left it far behind. Virtual finishing times, as we have been calculating them so far, simply do
not work.
The trick, as it turns out, is to measure elapsed time not in terms of packet-transmission times (ie wallclock
time), but rather in terms of rounds of round-robin transmission. This amounts to scaling the clock used
for measuring arrival times; counting in rounds rather than packets means that we run this clock at rate 1/N
when N subqueues are active. If we do this in case 1, with N=1, then the finishing times are unchanged.
However, in case 2, with N=2, packet P1 arrives after 20 time units but only 10 rounds; the clock runs at half
rate. Its calculated finishing time is thus 11, exactly matching the finishing times of the two long-queued
packets Q11 and R11 with which P1 shares a round-robin transmission round.
We formalize this in the next section, extending the idea to include both variable-sized packets and
sometimes-idle subqueues. Note that only the clock that measures arrival times is scaled; we do not scale
the calculated transmission times.
Queue    Size    Arrival time, t
P        1000    0
P        1000    0
Q        600     800
Q        400     800
Q        400     800
R        200     1200
R        200     2100
At t=0, we have R(t)=0 and we assign finishing R-values F(P1 )=1000 to P1 and F(P2 ) = F(P1 )+1000 = 2000
to P2 . Transmission of P1 begins.
When the three Q packets arrive at t=800, we have R(t)=800 as well, as only one subqueue has been active.
We assign finishing R-values for the newly arriving Q1 , Q2 and Q3 of F(Q1 ) = 800+600 = 1400, F(Q2 ) =
1400+400 = 1800, and F(Q3 ) = 1800+400 = 2200. At this point, BBRR begins serving two subqueues, so
the R(t) rate is cut in half.
At t=1000, transmission of packet P1 is completed; R(t) is 800 + 200/2 = 900. The smallest finishing R-value
on the books is F(Q1 ), at 1400, so Q1 is the second packet transmitted. Q1 s real finishing time will be t =
1000+600 = 1600.
At t=1200, R1 arrives; transmission of Q1 is still in progress. R(t) is 800 + 400/2 = 1000; we calculate F(R1 )
= 1000 + 200 = 1200. Note this is less than the finishing R-value for Q1 , which is currently transmitting, but
Q1 is not preempted. At this point (t=1200, R(t)=1000), the R(t) rate falls to 1/3.
At t=1600, Q1 has finished transmission. We have R(t) = 1000 + 400/3 = 1133. The next smallest finishing
R-value is F(R1 ) = 1200 so transmission of R1 commences.
At t=1800, R1 finishes. We have R(1800) = R(1200) + 600/3 = 1000 + 200 = 1200 (3 subqueues have been
busy since t=1200). Queue R is now empty, so the R(t) rate rises from 1/3 to 1/2. The next smallest finishing
R-value is F(Q2 )=1800, so transmission of Q2 begins. It will finish at t=2200.
At t=2100, we have R(t) = R(1800) + 300/2 = 1200 + 150 = 1350. R2 arrives and is assigned a finishing time
of F(R2 ) = 1350 + 200 = 1550. Again, transmission of Q2 is not preempted even though F(R2 ) < F(Q2 ). The
R(t) rate again falls to 1/3.
At t=2200, Q2 finishes. R(t) = 1350 + 100/3 = 1383. The next smallest finishing R-value is F(R2 )=1550, so
transmission of R2 begins.
At t=2400, transmission of R2 ends. R(t) is now 1350 + 300/3 = 1450. The next smallest finishing R-value
is F(P2 ) = 2000, so transmission of P2 begins. The R(t) rate rises to 1/2, as queue R is again empty.
At t=3400, transmission of P2 ends. R(t) is 1450 + 1000/2 = 1950. The only remaining unsent packet is Q3 ,
with F(Q3 )=2200. We send it.
At t=3800, transmission of Q3 ends. R(t) is 1950 + 400/1 = 2350.
To summarize:

Packet    calculated finish R-value    R-value when sent    R-value at finish
P1        1000                         0                    900
Q1        1400                         900                  1133
R1        1200*                        1133                 1200
Q2        1800                         1200                 1383
R2        1550*                        1383                 1450
P2        2000                         1450                 1950
Q3        2200                         1950                 2350
Packets arrive, begin transmission and finish in real time. However, the number of queues active in real
time affects the rate of the rounds-counter R(t); this value is then attached to each packet as it arrives as its
virtual finishing time, and determines the order of packet transmission.
The change in R-value from start to finish exactly matches the packet size when the packet is virtually sent
via BBRR. When the packet is sent as an indivisible unit, as in the table above, the change in R-value is
usually much smaller, as the R-clock runs slower whenever at least two subqueues are in use.
The calculated-finish R-values are not in fact increasing, as can be seen at the starred (*) values. This is
because, for example, R1 was not yet available when it was time to send Q1 .
Computationally, maintaining the R-value counter is inconsequential. The primary performance issue with
BBRR simulation is the need to find the smallest R-value whenever a new packet is to be sent. If n is the
number of packets waiting to be sent, then we can do this in time O(log(n)) by keeping the R-values sorted
in an appropriate data structure.
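As a concrete illustration of the bookkeeping, here is a Python sketch (ours, and not from ns; all names are hypothetical) of one-packet-at-a-time BBRR. The rounds counter advances at rate 1/N while N subqueues are nonempty, where, as in the walkthrough above, a subqueue counts as nonempty while one of its packets is waiting or in actual transmission; each arriving packet is stamped with max(R(now), Fprev) + size; and the link always sends the waiting packet with the smallest finishing R-value.

def bbrr(arrivals, rate=1.0):
    # arrivals: list of (subqueue, size, arrival_time), e.g. ("P", 1000, 0)
    pending = sorted(arrivals, key=lambda a: a[2])    # not yet arrived
    waiting = []               # (finish_R, subqueue, size): arrived, not yet sent
    current = None             # packet now being transmitted
    last_F = {}                # largest finishing R-value so far, per subqueue
    t = R = 0.0
    busy_until = 0.0
    order = []

    while pending or waiting or current:
        times = []
        if pending: times.append(pending[0][2])
        if current: times.append(busy_until)
        nxt = min(times)
        # subqueues nonempty during (t, nxt): those with waiting or in-flight packets
        active = {q for (_, q, _) in waiting}
        if current: active.add(current[1])
        if active:
            R += (nxt - t) * rate / len(active)       # rounds counter runs at rate 1/N
        t = nxt
        if current and busy_until == t:               # transmission completes
            order.append((current[1], current[0]))    # (subqueue, finishing R-value)
            current = None
        while pending and pending[0][2] <= t:         # arrivals at time t
            q, size, _ = pending.pop(0)
            F = max(R, last_F.get(q, 0.0)) + size
            last_F[q] = F
            waiting.append((F, q, size))
        if current is None and waiting:               # start the smallest finish_R
            waiting.sort()
            current = waiting.pop(0)
            busy_until = t + current[2] / rate

    return order

print(bbrr([("P",1000,0), ("P",1000,0), ("Q",600,800), ("Q",400,800),
            ("Q",400,800), ("R",200,1200), ("R",200,2100)]))

Run on the seven packets of the example, this reports the transmission order P1, Q1, R1, Q2, R2, P2, Q3 along with the finishing R-values of the table above.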
The BBRR approach assumes equal weights for each subqueue; this does not generalize completely straightforwardly to weighted fair queuing as the number of subqueues cannot be fractional. If there are two queues,
one which is to have weight 40% and the other 60%, we could use BBRR with five subqueues, two of which
(2/5) are assigned to the 40%-subqueue and the other three (3/5) to the 60% subqueue. But this becomes
increasingly awkward as the fractions become less simple; the GPS model, next, is a better option.
18.5.4 The GPS Model
The GPS (Generalized Processor Sharing) model is a fluid-based model of fair queuing that can be thought of as an infinitesimal variant of BBRR. The GPS model has an advantage of generalizing straightforwardly to
weighted fair queuing.
Other fluid models have also been used in the analysis of networks, eg for the study of TCP, though we do
not consider these here. See [MW00] for one example.
For the GPS model, assume there are N input subqueues, and the ith subqueue, 0≤i<N, is to receive fraction αi > 0, where α0+α1+ ... +αN−1 = 1. If at some point a set A of input subqueues is active, say A = {0,2,4}, then subqueue 0 will receive fraction α0/(α0+α2+α4), and subqueues 2 and 4 similarly. The router forwards packets from each active subqueue simultaneously, each at its designated rate.
The GPS model (and the BBRR model) provides an ideal degree of isolation between input flows: each flow
is insulated from any delay due to packets on competing flows. The ith flow receives bandwidth of at least αi and packets wait only for other packets belonging to the same flow. When a packet arrives for an inactive
subqueue, forwarding begins immediately, interleaved with any other work the router is doing. Traffic on
other flows can reduce the real rate of a flow, but not its virtual rate.
While GPS is convenient as a model, it is even less implementable, literally, than BBRR. As with BBRR,
though, we can use the GPS model to determine the order of one-packet-at-a-time transmission. As each
real packet arrives, we calculate the time it would finish, if we were using GPS. Packets are then transmitted
under WFQ one at a time, in order of increasing GPS finishing time.
In lieu of the BBRR rounds counter R(t), a virtual clock VC(t) is used that runs at an increased rate 1/α ≥ 1, where α is the sum of the αi for the active subqueues. That is, if subqueues 0, 2 and 4 are active, then the VC(t) clock runs at a rate of 1/(α0+α2+α4). If all the αi are equal, each to 1/N, then VC(t) always runs N times faster than R(t), and so VC(t) = N×R(t); the VC clock runs at wallclock speed when all input subqueues are active and speeds up as subqueues become idle.
For any one active subqueue i, the GPS rate of transmission relative to the virtual clock (that is, in units of bits per virtual-second) is always equal to fraction αi of the full output-interface rate. That is, if the output rate is 10 Mbps and an active flow has fraction αi = 0.4, then it will always transmit at 4 bits per virtual microsecond. When all the subqueues are active, and the VC clock runs at wallclock speed, the flow's actual rate will be 4 bits/µsec. When the subqueue is active alone, its speed measured by a real clock will be 10 bits/µsec, but the virtual clock will run 2.5 times faster, so 10 bits/µsec is 10 bits per 2.5 virtual microseconds, or 4 bits per virtual microsecond.
To make this claim more precise, let A be the set of active queues, and let α again be the sum of the αj for j in A. Then VC(t) runs at rate 1/α and active subqueue i's data is sent at rate αi/α relative to wallclock time. Subqueue i's transmission rate relative to virtual time is thus (αi/α)/(1/α) = αi.
As other subqueues become inactive or become active, the VC(t) rate and the actual transmission rate move
in lockstep. Therefore, as with BBRR, a packet P of size S on subqueue i that starts transmission at virtual
time T will finish at T + S/(r×αi) by the VC clock, where r is the actual output rate of the router, regardless
of what is happening in the other subqueues. In other words, VC-calculated finishing times are invariant.
To round out the calculation of finishing times, suppose packet P of size S arrives on an active GPS subqueue i. The packet's virtual finishing time FP is then
FP = max(VC(now), Fprev) + S/(r×αi)
where Fprev is the virtual finishing time of that subqueue's previous packet.
In 18.8.1.1 WFQ with non-FIFO subqueues below, we will consider WFQ routers that, as part of a hierarchy, are in effect only allowed to transmit intermittently. In such a case, the virtual clock should be
suspended whenever output is blocked. This is perhaps easiest to see for the BBRR scheduler: the rounds-counter R(t) is to increment by 1 for each bit sent by each active subqueue. When no bits may be sent, the clock should not increase.
As an example of what happens if this is not done, suppose R has two subqueues A and B; the first is empty and the second has a long backlog. R normally processes one packet per second. At T=0/VC=0, R's output is suspended. Packets in the second subqueue b1 , b2 , b3 , ... have virtual finishing times 1, 2, 3, .... At T=10, R resumes transmission, and packet a1 arrives on the A subqueue. If R's virtual clock had been suspended for the interval 0≤T≤10, a1 would be assigned finishing time T=1 and would have priority comparable to b1 . If R's virtual clock had continued to run, a1 would be assigned finishing time T=11 and would not be sent until b11 reached the head of the B queue.
18.5.4.1 The WFQ scheduler
To schedule actual packet transmission under weighted fair queuing, we calculate upon arrival each packet's
virtual-clock finishing time assuming it were to be sent using GPS. Whenever the sender is ready to start
transmission of a new packet, it selects from the available packets the one with the smallest GPS-finishing-time value. By the argument above, a packet's GPS finishing time does not depend on any later arrivals or
idle periods on other subqueues. As with BBRR, small but later-arriving packets might have smaller virtual
finishing times, but a packet currently being transmitted will not be interrupted.
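To make the bookkeeping concrete, here is a minimal sketch, in Python, of the calculation and selection just described; all class and method names are illustrative rather than from any standard library. The virtual clock advances at rate 1/α (α being the sum of the αi of the active subqueues), each arriving packet is stamped with max(VC(now), Fprev) + S/(r×αi), and dequeue() always returns the pending packet with the smallest stamp. One simplification: a subqueue is treated as active whenever it still holds packets not yet sent under WFQ, rather than by tracking the exact GPS backlog.

    import heapq

    class WFQScheduler:
        """Sketch of a WFQ scheduler: virtual-clock finishing times, smallest-first transmission."""
        def __init__(self, rate, alphas):
            self.r = rate                          # output rate (eg bytes per unit time)
            self.alphas = alphas                   # bandwidth fraction alpha_i per subqueue
            self.vc = 0.0                          # virtual clock VC
            self.vc_wall = 0.0                     # wallclock time of the last VC update
            self.active = set()                    # subqueues with packets still pending
            self.last_finish = [0.0]*len(alphas)   # F_prev per subqueue
            self.heap = []                         # (finish time, seq, subqueue, size)
            self.seq = 0

        def _update_vc(self, now):
            # VC runs at rate 1/alpha, alpha = sum of alpha_i over the active subqueues
            a = sum(self.alphas[i] for i in self.active)
            if a > 0:
                self.vc += (now - self.vc_wall)/a
            self.vc_wall = now

        def arrive(self, now, subq, size):
            self._update_vc(now)
            self.active.add(subq)
            # F_P = max(VC(now), F_prev) + S/(r*alpha_i)
            f = max(self.vc, self.last_finish[subq]) + size/(self.r*self.alphas[subq])
            self.last_finish[subq] = f
            heapq.heappush(self.heap, (f, self.seq, subq, size))
            self.seq += 1

        def dequeue(self, now):
            # called whenever the output link is free; smallest GPS finishing time goes first
            self._update_vc(now)
            if not self.heap:
                return None
            f, _, subq, size = heapq.heappop(self.heap)
            if not any(q == subq for (_, _, q, _) in self.heap):
                self.active.discard(subq)          # that subqueue has gone idle
            return (subq, size)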
18.5.4.2 Finishing Order under GPS and WFQ
We now look at the order in which packets finish transmission under GPS versus WFQ. The goal is to
provide in 18.5.4.7 Finishing-Order Bound a tight bound on how long packets may have to wait under
WFQ compared to GPS. We emphasize again:
GPS finishing time: the theoretical finishing time based on parallel multi-packet transmissions under
the GPS model
WFQ finishing time: the real finishing time assuming packets are sent sequentially in increasing order
of calculated GPS finishing time
One way to view this is as a quantification of the informal idea that WFQ provides a natural priority for
smaller packets, at least smaller packets sent on previously idle subqueues. This is quite separate from
the bandwidth guarantee that a given small-packet input class might receive; it means that small packets
are likely to leapfrog larger packets waiting in other subqueues. The quantum algorithm, below, provides
long-term WFQ bandwidth guarantees but does not provide the same delay assurances.
First, if all subqueues are always active (or if a fixed subset of subqueues is always active), then packets finish
under WFQ in the same order as they do under GPS. This is because under WFQ packets are transmitted
in the order of GPS finishing times according to the virtual clock, and if all subqueues are always active the
virtual clock runs at a rate identical to wallclock time (or, if a fixed subset of subqueues is always active, at
a rate proportional to wallclock time).
If all subqueues are always active, we can assume that all packets were in their subqueues as of time T=0;
the finishing order is the same as long as each packet arrived before its subqueue went inactive.
Finally, if all subqueues are always active then each packet finishes at least as early under WFQ as under
GPS. To see this, let Pj be the jth packet to finish, under either GPS or WFQ. At the time when Pj finishes
under WFQ, the router R will have devoted 100% of its output bandwidth exclusively to P1 through Pj .
When Pj finishes under GPS, R will also have transmitted P1 through Pj , and may have transmitted fractions
of later packets as well. Therefore, the Pj finishing time under GPS cannot be earlier.
The finishing order and the relative GPS/WFQ finishing times may change, however, if, as will usually
be the case, some subqueues are sometimes idle; that is, if packets sometimes arrive late for some
subqueues.
18.5.4.3 GPS Example 1
As a first example we return to the scenario of 18.5.2.1 A first virtual-finish example. The router's two
subqueues are always active; each has an allocation of α=50%. Packets P1, P2, P3, ..., all of size 1001, wait
in the first queue; packets Q1, Q2, Q3, ..., all of size 400, wait in the second queue. Output bandwidth is 1
byte per unit time, and T=0 is the starting point.

The router's virtual clock runs at wallclock speed, as both subqueues are always active. If Fi represents the
virtual finishing time of Qi, then we now calculate Fi as Fi-1 + 400/α = Fi-1 + 800. The virtual finishing times
of P1, P2, etc are similarly at multiples of 2002.
Packet    virtual finish
Q1        800
Q2        1600
P1        2002
Q3        2400
Q4        3200
Q5        4000
P2        4004
In the table above, the virtual finish column is simply double that of the BBRR version, reflecting the fact
that the virtual finishing times are now scaled by a factor of 1/α = 2. The actual finish times are identical to
what we calculated before.
Note that, in every case, the actual WFQ finish time is always less than or equal to the virtual GPS finish
time.
18.5.4.4 GPS Example 2
If the router has only a single active subqueue, with share α and packets P1, P2, P3, ..., then the calculated
virtual-clock packet finishing times will be equal to the time on the virtual clock at the point of actual finish,
at least if this has been the case since the virtual clock last restarted at T=VC=0. Let r be the output rate of
the router, let S1, S2, S3 be the sizes of the packets and let F1, F2, F3 be their virtual finishing times with
F0=0. Then

Fi = Fi-1 + Si/(rα) = S1/(rα) + ... + Si/(rα)

The ith packet's actual finishing time Ai is (S1 + ... + Si)/r, which is α×Fi. But the virtual clock runs fast by
a factor of 1/α, so the actual finishing time on the virtual clock is Ai/α = Fi.
18.5.4.5 GPS Example 3

Suppose the router has three subqueues, each with bandwidth fraction α=1/3, and an output rate of 1 size unit per unit time, with the following arrivals:

subqueue 1
P1: T=0, L=1
P2: T=0, L=10
subqueue 2
Q1: T=0, L=2
Q2: T=4, L=6
subqueue 3
R1: T=10, L=5
Under WFQ, we send P1 and then Q1 ; Q1 is second because its finishing time is later. When Q1 finishes the
wallclock time is T=3. At this point, P2 is the only packet available to send; it finishes at T=13.
Up until T=10, we have two packets in progress under GPS (because Q1 finishes under GPS at T=4 and
Q2 arrives at T=4), and so the GPS clock runs at rate 3/2 of wallclock time and the BBRR clock runs at
rate 1/2 of wallclock time. At T=4, when Q2 arrives, the BBRR clock is at 2 and the VC clock is at 6 and
we calculate the BBRR finishing time as 2+6=8 and the GPS finishing time as 6+6/(1/3) = 24. At T=10,
the BBRR clock is at 5 and the GPS clock is 15. R1 arrives then; we calculate its BBRR finishing time as
5+5=10 and its GPS finishing time as 15+5/(1/3) = 30.
Because Q2 has the earlier virtual-clock finishing time, WFQ sends it next after P2 , followed by R1 .
Here is a diagram of transmission under GPS. The chart itself is scaled to wallclock times. The BBRR clock
is on the scale below; the VC clock always runs three times faster.
The circled numbers represent the size of the portion of the packet sent in the intervals separated by the
dotted vertical lines; for each packet, these add up to the packet's total size.
Note that, while the transmission order under WFQ is P1 , Q1 , P2 , Q2 , R1 , the finishing order under GPS is
P1 , Q1 , Q2 , R1 , P2 . That is, P2 managed to leapfrog Q2 and R1 under WFQ by the simple expedient of being
the only packet available for transmission at T=3.
18.5.4.6 GPS Example 4
As a second example of leapfrogging, suppose we have the following arrivals; in this scenario, the smaller
but later-arriving R1 finishes ahead of P1 and Q2 under GPS, but not under WFQ.
subqueue 1
P1 : T=0, L=1000
subqueue 2
Q1 : T=0, L=200
Q2 : T=0, L=300
subqueue 3
R1 : T=600, L=100
The following diagram shows how the packets shared the link under GPS over time. As can be seen, the
GPS finishing order is Q1 , R1 , Q2 , P1 .
Under WFQ, the transmission order is Q1 , Q2 , P1 , R1 , because when Q2 finishes at T=500, R1 has not yet
arrived.
18.5.4.7 Finishing-Order Bound
These examples bring us to the following delay-bound claim, due to Parekh and Gallager [PG93]; we will
make use of it below in 18.13.3 Parekh-Gallager Theorem. It is arguably the deepest part of the Parekh-Gallager theorem.
Claim: For any packet P, the wallclock finishing time of P at a router R under WFQ cannot be later than the
wallclock finishing time of P at R under GPS by more than the time R needs to transmit the maximum-sized
packet that can appear.
Expressed symbolically, if FWFQ and FGPS are the finishing times for P under WFQ and GPS, R's outbound
transmission rate is r, and Lmax is the maximum packet size that can appear at R, then

FWFQ ≤ FGPS + Lmax/r
This is the best possible bound; Lmax /r is the time packet P must wait if it has arrived an instant too late and
another packet of size Lmax has started instead.
Note that, if a packets subqueue is inactive, the packet starts transmitting immediately upon arrival under
GPS; however, GPS may send the packet relatively slowly.
To prove this claim, let us number the packets P1 through Pk in order of WFQ transmission, starting from
the most recent point when at least one subqueue of the router became active. (Note that these packets may
be spread over multiple input subqueues.) For each i, let Fi be the finishing time of Pi under WFQ, let Gi be
the finishing time of Pi under GPS, and let Li be the length of Pi ; note that, for each i, Fi+1 = Li+1 /r + Fi .
If Pk finishes after P1 through Pk-1 under GPS, then the argument above (18.5.4.2 Finishing Order under
GPS and WFQ) for the all-subqueues-active case still applies to show Pk cannot finish earlier under GPS
than it does under WFQ; that is, we have Fk ≤ Gk.
Otherwise, some packet Pi with i<k must finish after Pk under GPS; Pi has leapfrogged Pk under WFQ,
presumably because Pk was late in arriving. Let Pm be the most recent (largest m<k) such leapfrogger
packet, so that Pm finishes after Pk under GPS but Pm+1 through Pk-1 finish earlier (or are tied); this was
illustrated above in 18.5.4.5 GPS Example 3 for k=5 and m=3.
We next claim that none of the packets Pm+1 through Pk could have yet arrived at R at the time Tstart when
Pm began transmission under WFQ. If some Pi with i>m were present at time Tstart , then the fact that it is
transmitted after Pm under WFQ would imply that the calculated GPS finishing time came after that of Pm .
But, as we argued earlier, calculated virtual-clock GPS finishing times are always the actual virtual-clock
GPS finishing times, and we cannot have Pi finishing both ahead of Pm and behind it.
Recalling that Fm is the finishing time of Pm under WFQ, the time Tstart above is simply Fm - Lm /r. Between
Tstart and Gk , all the packets Pm+1 through Pk must arrive and then, under GPS, depart. The absolute minimum time to send these packets under GPS is the WFQ time for end-to-end transmission, which is (Lm+1 +
.. + Lk )/r = Fk - Fm . Therefore we have
Gk − Tstart ≥ Fk − Fm

Gk − (Fm − Lm/r) ≥ Fk − Fm

Fk ≤ Gk + Lm/r ≤ Gk + Lmax/r
The last line represents the desired conclusion.
packets sent    new deficit
1               400
2,3             200
4,5             0
In three rounds the subqueue has been allowed to send 5×600 = 3000 bytes, which is exactly 3 quantums. We
will refer to this strategy, with provision for deficits as above, as the quantum algorithm for fair queuing.
If a subqueue is ever empty, its deficit should immediately expire.
We can implement weighted fair queuing using the quantum algorithm by adjusting the quantum sizes in
proportion to the weights; that is, if the weights are 40% and 60% then the respective quanta might be 1000
bytes and 1500 bytes. The quantum should be at least as large as the largest packet (eg the MTU), so that
if one input class is to have 10% weight and the other 90%, then the second class will have a quantum of 9
times the MTU. Of course, if the smaller-weighted class happens to be a VoIP stream with smaller packets
as well, this is less of an issue.
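The quantum algorithm is easy to express in code. The following is a minimal Python sketch (function and variable names are illustrative); per-subqueue quanta give the weighted version, and an emptied subqueue's deficit expires immediately, as above.

    from collections import deque

    def quantum_round_robin(queues, quanta):
        """Sketch of the quantum (deficit) algorithm: queues is a list of deques of packet
        sizes in bytes, quanta the per-subqueue quantum; yields packets in transmission order."""
        deficits = [0]*len(queues)
        while any(queues):
            for i, q in enumerate(queues):             # one round
                if not q:
                    deficits[i] = 0                    # an empty subqueue's deficit expires
                    continue
                deficits[i] += quanta[i]               # add one quantum per round
                while q and q[0] <= deficits[i]:       # send while the head packet fits
                    size = q.popleft()
                    deficits[i] -= size
                    yield (i, size)
                if not q:
                    deficits[i] = 0

    # One subqueue of five 600-byte packets with a 1000-byte quantum reproduces the table
    # above: packets 1; 2,3; 4,5 go out in successive rounds, with deficits 400, 200, 0.
    for subq, size in quantum_round_robin([deque([600]*5)], [1000]):
        print(subq, size)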
Unfortunately, this strategy does not work quite as well with three receivers. Suppose now A, B and C are
three pools of users (eg three departments), and we want to give them each equal shares of an inbound 6
Mbps link:
Inbound fair queuing could be used at the ISP (upstream) end of the connection, where the three flows
would be identified by their destinations. But the ISP router is likely not ours to play with. Once the three
flows arrive at R, it is too late to allocate shares; there is nothing to do but deliver the traffic to A, B and C
respectively. We will return to this scenario below in 18.10 Applications of Token Bucket.
Note that for these kinds of applications we need to be able to designate administratively how packets get
classified into different input classes. The automatic classification of stochastic fair queuing, above, will not
help here.
separate WFQ nodes into a tree with real links, then either packets simply pile up at the root, or else the
interior links become the bottleneck. For a case where the internal-storage construction does work, see the
end of 18.12 Hierarchical Token Bucket.
We now need a classifier, as with any multi-subqueue queuing discipline, to place input packets into one of
the four leaf nodes labeled A, B, C and D above; these likely represent FIFO queues. Once everything is set
up, the dequeuing rules are as follows:
at the root node, we first see if packets are available in the left (high-priority) subqueue, QA. If so, we
dequeue the packet from there; if not, we go on to the right subqueue QB.
at either QA or QB, we first see if packets are available in the left input, labeled 0; if so, we dequeue
from there. Otherwise we see if packets are available from the right input, labeled 1. If neither, then
the subqueue is empty.
The catch here is that hierarchical priority queuing collapses to a single layer, with four priority levels 00,
01, 10, and 11 (from high to low):
For many other queuing disciplines, however, the hierarchy does not collapse. One example of this might
be a root queuing discipline PQR using priority queuing, with two leaf nodes using fair queuing, FQA and
FQB. (Fair queuing does not behave well as an interior node without modification, as mentioned above, but
it can participate in generic hierarchical queuing just fine as a leaf node.)
Senders A1 and A2 receive all the bandwidth, if either is active; when this is the case, B1 and B2 get
nothing. If both A1 and A2 are active, they get equal shares. Only when neither is active do B1 and B2
receive anything, in which case they divide the bandwidth fairly between themselves. The root node will
check on each dequeue() operation whether FQA is nonempty; if it is, it dequeues a packet from FQA
and otherwise dequeues from FQB. Because the root does not dequeue a packet from FQA or FQB unless it
is about to return it via its own dequeue() operation, those two child nodes do not have to decide which
of their internal per-class queues to service until a packet is actually needed.
If, however, node FQD becomes inactive, then FQA will assign 100% to FQC, in which case L1 will receive
70%×100%×50% = 35% of the bandwidth.
Alternatively, we can stick to real packets, but simplify by assuming all are the same size and that child
allocations are all equal; in this situation we can implement hierarchical fair queuing via generic hierarchical
queuing. Each parent node simply dequeues from its child nodes in round-robin fashion, skipping over any
empty children.
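In this simplified setting, generic hierarchical dequeuing is just a recursive walk down the tree: the root asks the child chosen by its own discipline for a packet, and that child does the same in turn. Here is a minimal Python sketch (names illustrative), with a priority node for the PQR example above and a round-robin node for the equal-size-packet fair-queuing case; note that a child is only asked to dequeue when the parent is about to return that packet.

    from collections import deque

    class FIFO:
        def __init__(self): self.q = deque()
        def enqueue(self, pkt): self.q.append(pkt)
        def empty(self): return not self.q
        def dequeue(self): return self.q.popleft()

    class PriorityNode:
        """Dequeues from the first (highest-priority) nonempty child."""
        def __init__(self, children): self.children = children
        def empty(self): return all(c.empty() for c in self.children)
        def dequeue(self):
            for c in self.children:
                if not c.empty():
                    return c.dequeue()
            return None

    class RoundRobinNode:
        """Equal-size-packet fair queuing: round-robin over nonempty children."""
        def __init__(self, children): self.children = children; self.turn = 0
        def empty(self): return all(c.empty() for c in self.children)
        def dequeue(self):
            for _ in range(len(self.children)):
                c = self.children[self.turn]
                self.turn = (self.turn + 1) % len(self.children)
                if not c.empty():
                    return c.dequeue()
            return None

    # The PQR example: a priority root over two fair-queuing children FQA and FQB
    A1, A2, B1, B2 = FIFO(), FIFO(), FIFO(), FIFO()
    FQA, FQB = RoundRobinNode([A1, A2]), RoundRobinNode([B1, B2])
    PQR = PriorityNode([FQA, FQB])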
Hierarchical fair queuing, unlike priority queuing, does not collapse; the hierarchical queuing discipline
is not equivalent to any one-level queuing discipline. Here is an example illustrating this. We form a tree of
three fair queuing nodes, labeled FQR, FQA and FQB in the diagram. Each grants 50% of the bandwidth to
each of its two input classes.
If all four input classes A,B,C,D are active in the hierarchical structure, then each gets 25% of the total
bandwidth, just as for a flat four-input-class fair queuing structure. However, consider what happens when
only A, B and C are active, and D is idle. In the flat model, A, B and C each get 33%. In the hierarchical
model, however, as long as FQA and FQB are both active then each gets 50%, meaning that A and B split
If A is idle, but B and D are active, then B should receive four times the bandwidth of D. Assume for
a moment that a large number of packets have arrived for each of B and D. Then B should receive the
entire 80% of C's share. If transmission order can be determined by packet arrival, then the order will look
something like
d1 , b1 , b2 , b3 , b4 ,
d2 , b5 , b6 , b7 , b8 , ...
Now suppose A wakes up and becomes active. At that point, B goes from receiving 80% of the total
bandwidth to 5% = 80% × 1/16, or from four times D's bandwidth to a quarter of it. The new b/d relative
rate, not showing A's packets, must be
b1 , d1 , d2 , d3 , d4 ,
b2 , d5 , d6 , d7 , d8 , ...
The relative frequencies of B and D packets have been reversed by the arrival of later A packets. The
packet order for B and D is thus dependent on later arrivals on another queue entirely, and thus cannot be
determined at the point the packets arrive.
After presenting the actual hierarchical-WFQ virtual-clock algorithm, we will return to this example in
18.8.1.2 A Hierarchical-WFQ Example.
calls the peek() operation on each of its subqueues and then calculates the finishing time FP for the packet
P currently at the head of each subqueue as NextStart + size(P)/α, where α is the bandwidth fraction
assigned to the subqueue.
If a formerly inactive subqueue becomes active, it by hypothesis notifies the parent WFQ node. The parent
node records the time on its virtual clock as the NextStart value for that subqueue. Whenever a subqueue
is called upon to dequeue packet P, its NextStart value is updated to NextStart + size(P)/α, the
virtual-clock finishing time for P.
The active-subqueue notification is also exactly what is necessary for the WFQ node to maintain its virtual
clock. If A is the set of active subqueues, and the ith subqueue has bandwidth share αi, then the clock is to
run at rate equal to the reciprocal of the sum of the αi for i in A. This rate needs to be updated whenever
a subqueue becomes active or inactive. In the first case, the parent node is notified by hypothesis, and the
second case happens only after a dequeue() operation.
There is one more modification needed for non-root WFQ nodes: we must suspend their virtual clocks when
they are not transmitting, following the argument at the end of 18.5.4 The GPS Model. Non-root nodes
do not have real interfaces and do not physically transmit. However, we can say that an interior node N is
logically transmitting when the root node R is currently transmitting a packet from leaf node L, and N lies
on the path from R to L. Note that all interior nodes on the path from R to L will be logically transmitting
simultaneously. For a specific non-root node N, whenever it is called upon at time T to dequeue a packet P,
its virtual clock should run during the wallclock interval from T to T + size(P)/r, where r is the root node's
physical bandwidth. The virtual finishing time of P need not have any direct correspondence to this actual
finishing time T + size(P)/r. The rate of N's virtual clock in the interval from T to T + size(P)/r will depend,
of course, on the number of N's active child nodes.
We remark that the dequeue() operation described above is relatively inefficient; each dequeue()
operation by the root results in recursive traversal of the entire tree. There have been several attempts to
improve the algorithm performance. Other algorithms have also been used; the mechanism here has been
taken from [BZ97].
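As a concrete illustration of this per-node bookkeeping, here is a minimal Python sketch of a single WFQ node in such a hierarchy (names illustrative, following the mechanism described above rather than any particular implementation). Each child supplies empty(), peek_size() and dequeue(); the node keeps a lazily advanced virtual clock and a NextStart value per child, with None marking an inactive child. The interior-node refinement of suspending the clock while the node is not logically transmitting is omitted here for brevity.

    from collections import deque

    class FIFOLeaf:
        """Leaf FIFO; packets are represented just by their sizes."""
        def __init__(self): self.q = deque()
        def enqueue(self, size): self.q.append(size)
        def empty(self): return not self.q
        def peek_size(self): return self.q[0]
        def dequeue(self, now): return self.q.popleft()

    class WFQNode:
        def __init__(self, children, alphas):
            self.children, self.alphas = children, alphas
            self.vc, self.last_wall = 0.0, 0.0
            self.next_start = [None]*len(children)       # NextStart per child; None = inactive

        def _advance_clock(self, now):
            # clock rate is the reciprocal of the sum of the active children's alphas
            a = sum(self.alphas[i] for i, s in enumerate(self.next_start) if s is not None)
            if a > 0:
                self.vc += (now - self.last_wall)/a
            self.last_wall = now

        def child_became_active(self, i, now):
            # the classifier (or the child) notifies the parent on an empty-to-nonempty transition
            self._advance_clock(now)
            self.next_start[i] = self.vc

        def _best(self):
            # child whose head packet has the smallest finishing time NextStart + size/alpha
            best, best_f = None, None
            for i, c in enumerate(self.children):
                if self.next_start[i] is None or c.empty():
                    continue
                f = self.next_start[i] + c.peek_size()/self.alphas[i]
                if best_f is None or f < best_f:
                    best, best_f = i, f
            return best

        def empty(self): return self._best() is None
        def peek_size(self): return self.children[self._best()].peek_size()

        def dequeue(self, now):
            self._advance_clock(now)
            i = self._best()
            if i is None:
                return None
            size = self.children[i].peek_size()
            pkt = self.children[i].dequeue(now)
            self.next_start[i] += size/self.alphas[i]    # advance NextStart by size/alpha
            if self.children[i].empty():
                self.next_start[i] = None                # the child has gone inactive
            return pkt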
18.8.1.2 A Hierarchical-WFQ Example
Let us consider again the example at the end of 18.8 Hierarchical Weighted Fair Queuing:
Assume that all packets are of size 1 and R transmits at rate 1 packet per second. Initially, suppose FIFO
leaf nodes B and D have long backlogs (eg b1, b2, b3, ...) but A is idle. Both of R's subqueues are active, so
R's virtual clock runs at the wall-clock rate. C's virtual clock is running 16× fast, though. R's NextStart
values for both its subqueues are 0.
The finishing time assigned by R to di will be 5×i. Whenever packet di reaches the head of the D queue,
R's NextStart for D will be 5×(i-1). (Although we claimed in the previous section that hierarchical WFQ
nodes shouldn't need to assign finishing times beyond that for the current head packet, for FIFO subqueues
this is safe.)
At least during the initial A-idle period, whenever R checks C's subqueue, if bi is the head packet then R's
NextStart for C will be 1.25×(i-1) and the calculated virtual finishing time will be 1.25×i. If ties are decided
in B's favor then in the first ten seconds R will send
b1 , b2 , b3 , b4 , d1 , b5 , b6 , b7 , b8 , d2
During the ten seconds needed to send the ten packets above, all of the packets dequeued by C come from B.
Having only one active subqueue puts C in the situation of 18.5.4.4 GPS Example 2, and so its packets'
calculated finishing times will exactly match C's virtual-clock value at the point of actual finish. C dequeues
eight packets, so its virtual clock runs for only those 8 of the 10 seconds during which one of the bi is being
transmitted. As a result, packet bi finishes at time 16×i by C's virtual clock. At T=10, C's virtual clock is
8×16 = 128.
Now, at T=10, as the last of the ten packets above completes transmission, let subqueue A become backlogged with a1, a2, a3, etc. C will assign a finishing time of 128 + 1.0667×i to ai (1.0667 ≈ 16/15); C has
already assigned a virtual finishing time of 9×16=144 to b9. None of the virtual finishing times assigned by
C to the remaining bi will change.
At this point the virtual finishing times for C's packets are as follows:
packet    C finishing time           R finishing time
a1        128 + 1.0667               10 + 1.25
a2        128 + 2×1.0667             10 + 2.50
a3        128 + 3×1.0667             10 + 3.75
a4        128 + 4×1.0667             10 + 5
...       ...                        ...
a15       128 + 15×1.0667 = 144      10 + 15×1.25
b9        144                        10 + 16×1.25 = 30
During the time the 16 packets in the table above are sent from C, R will also send four of D's packets, for
a total of 20.
The virtual finishing times assigned by C to b9 and b10 have not changed, but note that the virtual finishing
times assigned to these packets by R are now very different from what they would have been had A remained
idle. With A idle, these finishing times would have been F9 = 11.25 and F10 = 12.50, etc. Now, with A active,
it is a1 and a2 that finish at 11.25 and 12.50; b9 will now be assigned by R a finishing time of 30 and b10
will be assigned a finishing time of 50. R is still assigning successive finishing times at increments of 1.25
to packets dequeued by C, but B's contributions to this stream of packets have been bumped far back.

R's assignments of virtual finishing times to the di are immutable, as are C's assignments of virtual finishing
times, but R cannot assign a final virtual finishing time to any of C's packets (that is, A's or B's) until the
packet has reached the head of C's queue. R assigns finishing times to C's packets in the order they are
is non-compliant and must suffer special treatment as above. If the bucket is full, however, then the sender
may send a burst of packets corresponding to the bucket capacity (at which point the bucket will be empty).
A common variation is requiring one token per byte rather than per packet, with the fill rate correspondingly
scaled; this allows packet size to be taken into account.
More precisely, a token-bucket specification TB(r,Bmax) includes a token fill rate of r tokens/sec, representing the rate at which the bucket fills with tokens, and also a bucket capacity (or depth) Bmax>0. The bucket
fills at the rate specified, subject to a maximum of Bmax; we will denote the current capacity by B, or by
B(t) if we need to specify the time. In order for a packet of size S (possibly S=1 for counting size in units of
whole packets) to be within the specification, the bucket must have at least S tokens; that is, B≥S. Otherwise
the packet is non-compliant. When the packet is sent, S tokens are removed from the bucket, that is, B=B−S.
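In code, such a filter needs only a few lines; here is a minimal Python sketch (names illustrative), filling the bucket lazily at each check and leaving it untouched for noncompliant packets.

    class TokenBucket:
        """Sketch of a TB(r, Bmax) token-bucket filter; check times must be nondecreasing."""
        def __init__(self, r, Bmax):
            self.r = r               # fill rate, tokens per unit time
            self.Bmax = Bmax         # bucket capacity
            self.B = Bmax            # current contents; the bucket starts full
            self.last = 0.0          # time of the last update

        def compliant(self, t, S=1):
            """True if a packet of size S arriving at time t is compliant; if so, remove S tokens."""
            self.B = min(self.Bmax, self.B + self.r*(t - self.last))   # fill since last check
            self.last = t
            if self.B >= S:
                self.B -= S
                return True
            return False             # noncompliant: bucket unchanged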
It is possible for the packets of a given flow all to be compliant with a given token-bucket specification at
one point (eg one router) in the network but not at another point; this can happen, for example, if more than
Bmax packets pile up at a downstream router due to momentary congestion.
The following graph is a visual representation of a token-bucket constraint. The black and purple curves
plotted are of cumulative bits sent as a function of time, that is, bits(t). When bits(t) is horizontal, the sender
is idle.
The blue line represents a sender sending linearly at the rate r, with no burstiness. At vertical distance Bmax
above the blue line is the red line. Graphs for compliant senders cannot cross this, because that would entail
a burst of more than Bmax above the blue line; we give a more formal argument below. As a sender's graph
approaches the red line, the sender's current bucket contents decreases; the instantaneous bucket contents
for the black sender is shown at one point as B(t).
The purple sender has fallen below the blue line at one point; as a result, it can never catch up. In fact, after
passing through the vertex at point A the purple graph can never cross the dashed red line. A proof is in
18.11 Token Bucket Queue Utilization, following some numeric token-bucket examples that illustrate how
a token-bucket filter works.
Satellite Token Bucket
When I first got satellite Internet, my service was limited by a token-bucket filter with rate 56 Kbps
and bucket 300 megabytes. When the bucket emptied, it took 12 hours to refill. The idea was that
someone could use the Internet intensely but relatively briefly; satellite access is expensive. Within a
year, the provider switched to a flat 300 MB cap per day; the token-bucket rule was apparently not well
understood by customers.
If a packet arrives when there are not enough tokens in the bucket to send it, then as indicated above there are
three options. The sender can engage in shaping, making the packet wait until sufficient tokens accumulate.
The sender can engage in policing, dropping the packet. Or the sender can send the packet immediately but
mark it as noncompliant.
One common strategy is to send noncompliant packets, as marked in the third option above, with lower
priority. Alternatively, marked packets may face a greater chance of being dropped by some downstream
router. In ATM networks (3.8 Asynchronous Transfer Mode: ATM) the cell-loss priority bit is often used
to mark noncompliant packets.
Token-bucket specifications supply a framework for making decisions about admission control: a router
can decide whether to accept a new connection (or whether to accept the connection's quality-of-service
request) based on the requested rate and bucket (queue) requirements.
Token-bucket specifications are the mirror-image equivalent to leaky-bucket specifications, in which the
fluid leaks out of the leaky bucket at rate r and to send a packet we must add S units without overflowing.
The two forms are completely equivalent.
So far we have been using token-bucket specifications to describe traffic; eg traffic arriving at a router. It is
also possible to use token buckets to describe the router itself; in this setting, the leaky-bucket formulation
may be clearer. The router's queue represents the bucket, and the router's packet transmissions represent
tokens leaking out of the bucket. Arriving packets are added to the bucket; a bucket overflow represents lost
packets. We will not pursue this interpretation further.
bucket volume has reached 1 and the fifth packet can be sent. The bucket is now empty, but fortunately the
remaining packets arrive at 3-ms intervals and can all be sent.
In the next set of packet arrival times, again with TB(1/3,4), we have three bursts of four packets each.
0, 0, 0, 0, 12, 12, 12, 12, 24, 24, 24, 24
Each burst empties the bucket, which then takes 12 ms to refill. All packets are compliant.
In the following set of packet arrival times, still with TB(1/3,4), the burst of four packets at T=0 drains the
bucket. At T=3 the bucket size has increased back to 1, allowing the packet that arrives then to be sent but
also draining the bucket again.
0, 0, 0, 0, 3, 6, 12, 12
At T=6 the same thing happens. From T=6 to T=12 the bucket contents rise from 0 to 2, allowing the two
packets arriving at T=12 to be sent.
Finally, suppose packets arrive at the following times at our TB(1/3,4) filter.
0, 1, 2, 3, 4, 5
Just after T=0 the bucket size is 3; just before T=1 it is 3 1/3.
Just after T=1 the bucket size is 2 1/3; just before T=2 it is 2 2/3
Just after T=2 the bucket size is 1 2/3; just before T=3 it is 2
Just after T=3 the bucket size is 1; just before T=4 it is 1 1/3
Just after T=4 the bucket size is 1/3; just before T=5 it is 2/3
At T=5 the bucket size is 2/3 and the arriving packet is noncompliant.
We can also represent this in tabular form as follows; note that for the noncompliant packet the bucket is not
decremented.
packet arrival    bucket just before    bucket just after
0                 4                     3
1                 3 1/3                 2 1/3
2                 2 2/3                 1 2/3
3                 2                     1
4                 1 1/3                 1/3
5                 2/3                   2/3
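The table can be checked mechanically; the following standalone Python fragment steps a TB(1/3,4) bucket through the arrivals 0, 1, 2, 3, 4, 5 (one unit-sized packet each) and prints the before/after contents, reproducing the rows above (as decimals rather than fractions).

    r, Bmax = 1/3, 4                  # TB(1/3, 4)
    B, last = Bmax, 0                 # bucket starts full
    for t in [0, 1, 2, 3, 4, 5]:
        B = min(Bmax, B + r*(t - last))          # refill since the previous arrival
        last, before = t, B
        if B >= 1:                               # one token per packet
            B -= 1
            status = "compliant"
        else:
            status = "noncompliant"              # bucket is not decremented
        print(f"T={t}: before {before:.3f}, after {B:.3f}, {status}")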
over the longer term, states that on average there will be 5 ms between packets, subject to a burst of 6. The
following is compliant, assuming both buckets are initially full.
0, 1, 2.5, 3, 4, 5, 6, 10, 15, 20
The first seven packets arrive at 1 ms intervals (the rate of the first filter) except for the packet that arrived at
T=2.5 instead of T=2. The sender was allowed to send again at T=3 instead of waiting until T=3.5 because
the bucket size in the first filter was 1.5 instead of 1.0. Here are the packet arrivals with the current size
of each bucket at the time of packet arrival, just before the bucket is decremented. At T=2.0, the filter2
bucket would be 4.4.
arrival:     T=0    T=1    T=2.5    T=3    T=4    T=5    T=6    T=10    T=15    T=20
Filter 1:    1.5    1.5    1.5      1.0    1.0    1.0    1.0    1.5     1.5     1.5
Filter 2:    6      5.2    4.5      3.6    2.8    2      1.2    1.0     1.0     1.0
If the packet arriving at T=2.5 had instead arrived at T=2, we would have the fastest compliant sequence for
this pair of filters. This is the sequence generated by a token-bucket shaper when there is a steady backlog
of packets and each is sent as soon as the bucket capacity (or capacities, when applicable) is full enough to
allow sending.
18.9.4 GCRA
Another formulation of the token-bucket specifications is the Generalized Cell Rate Algorithm, or GCRA;
this formulation is frequently used in classification of ATM traffic. A GCRA specification takes two parameters, a mean packet spacing time T, and an early-arrival allowance τ. For each packet we compute a
theoretical arrival time, tat, initially zero. A packet may arrive earlier by amount at most τ. Specifically, if
t is the time of actual arrival, we have two cases:

1. t ≥ tat−τ: the packet is compliant, and we update tat to max(t,tat) + T

2. t < tat−τ: the packet is too early and is noncompliant; tat is unchanged.

A flow satisfying GCRA(T,τ) is equivalent to a token-bucket specification with rate 1/T packets/unit time,
and bucket size (τ+1)/T; tat represents the time the bucket would next be full. The time to fill an empty
bucket is τ+1; if the bucket becomes full at time tat then, working backwards, it would contain enough to
send one packet at time tat−τ.

For traffic flows with a more-or-less constant rate, τ represents the time by which one packet can be late
without permanently falling behind its regular 1/T rate. The GCRA formulation is sometimes more convenient than the token-bucket formulation, particularly when τ<T.
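The GCRA test itself is only a few lines; here is a minimal Python sketch of the two rules above (names illustrative), with tat starting at zero.

    def gcra_compliant(arrivals, T, tau):
        """Sketch of GCRA(T, tau): returns a True/False compliance flag for each arrival time."""
        tat = 0.0                          # theoretical arrival time
        flags = []
        for t in arrivals:
            if t >= tat - tau:             # on time, or early by at most tau: compliant
                tat = max(t, tat) + T
                flags.append(True)
            else:                          # too early: noncompliant, tat unchanged
                flags.append(False)
        return flags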
While fair queuing cannot be applied at R to enforce equal shares to A, B and C, we can implement a
token-bucket filter at R that limits each of A, B and C to 2 Mbps.
There are two drawbacks. First, the filter is not work-conserving: if A is idle, B and C will still only receive
2 Mbps. Second, in the absence of feedback there is no guarantee that limiting the traffic at R will eventually
result in reduced utilization of the ISP→R link. While this is true for TCP traffic, due to the self-clocking
property, it is conceivable that a sender D somewhere is trying to send 8 Mbps of real-time UDP traffic to
A, via ISP and R. Three-quarters of the traffic would then fail to be compliant, and might be dropped by R,
but unless D gets feedback from A that not much of the traffic is getting through, and that it should reduce
its sending rate, the token-bucket filter at R will not achieve what we want. Most protocols do provide this
kind of feedback, but not all.
can continuously transmit at rate r and the net number of bits held within the router is bits(t) − rt. By the
argument above, this is bounded by Bmax. If bits(t) falls below the blue line, the router's queue is empty and
the router can transmit incoming data at least as fast as it is arriving.
While R can never be holding more than Bmax bytes, at the instant just before a packet finishes transmission
it can have Bmax bytes in the queue, plus the currently transmitting packet still taking up an entire buffer. As
a practical matter, then, we may need space equal to Bmax plus one packet.
While a token-bucket specification does not include a delay bound specifically, we can compute an upper
bound to the queuing delay at a router R as Bmax/r; this is the time it takes for one full bucket's worth of
packets to be transmitted.
If we have N flows each individually satisfying TB(r,B), then the collective traffic will satisfy TB(Nr, NB)
(see exercise 12). However, a bucket size of NB will be needed only when all N individual flows have
their bursts gang up at a particular instant. Often it is possible to take advantage of theoretical or empirical statistical information and conclude that the collective traffic most of the time meets a token-bucket
specification TB(Nr, BN ) for BN significantly less than NB.
Example 1: suppose the traffic specification is TB(1/3, 10), where the rate is in (equal-sized) packets/sec,
and D is 40 sec. Then B/D is 1/4 packets/sec, and the necessary outbound bandwidth s is simply r=1/3.
Example 2: now suppose in the previous example that the delay limit D is 20 sec. In this case, we need s
= B/D = 1/2 packets/sec.
If there is other traffic, the delay constraint still holds, provided s represents the bandwidth allocated by R
to the flow, and the flows packets receive priority service at R, and we first subtract the largest-packet delay
as in 5.3.2 Packet Size and Real-Time Traffic.
Calculations of this sort often play a role in a routers decision on whether to accept a reservation for an
additional TB(r,B) flow with associated delay constraint.
If TB1's rate is, as here, less than the sum of its child rates, then as long as its children always have packets
ready to send, the children will receive bandwidth in proportion to their token bucket rates. In the example
above, TB1's rate is 4 packets/ms and yet the sum of the rates of its children is 5 packets/ms. Each child
will therefore receive 4/5 of its promised rate: TB2 will send at a rate of 2×(4/5) packets/ms while TB3 will
send at a rate of 3×(4/5) packets/ms.
To see this, assume FIFO2 and FIFO3 remain nonempty for a period long enough for their buckets to
empty. TB2 and TB3 will then each release packets to TB1 at their respective rates of 2 packets/ms and 3
packets/ms. In the following sequence of release times to TB1, we assume TB3 starts at T=0 and TB2 at
T=0.01, to avoid ties. Packets from A released by TB2 are shown in italic:
0, 0.01, 0.33, 0.51, 0.67, 1.0, 1.01, 1.33, 1.51, 1.67, 2.0, 2.01

They will be dequeued by TB1 at 4 packets/ms, once TB1's bucket is empty. In the long run, TB3 has
released three packets into this sequence for every two of TB2's, so sender B will receive 3/5 of the dequeuings, and thus 3/5 of the 4 packet/ms root bandwidth.
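The 3:2 interleaving can be verified directly by generating and merging the two release streams; here is a short standalone Python check (times in ms, labels in place of the italics).

    # TB3 releases B's packets at 3/ms from T=0; TB2 releases A's at 2/ms from T=0.01
    releases = sorted([(i/3, "B") for i in range(9)] + [(0.01 + i/2, "A") for i in range(6)])
    print(", ".join(f"{t:.2f}({s})" for t, s in releases[:12]))
    # 0.00(B), 0.01(A), 0.33(B), 0.51(A), 0.67(B), 1.00(B), 1.01(A), 1.33(B), ...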
We can also have each token-bucket node physically forward released packets to FIFO queues attached
to each parent node; we called this an internal-storage hierarchy in 18.7 Hierarchical Queuing. In this
particular case, the leaf-storage and internal-storage mechanisms function identically, provided the internal
links are infinitely fast and the internal queues infinitely large. See exercise 18.
There is no point in having a node with a bucket larger than the sum of its child buckets and also a rate larger
than the sum of its child rates. In the example above, in which the sum of the child rates exceeds the parent
rate, A would be able to send at a sustained rate of 2 packets/ms provided B sends at only 2 packets/ms as
well; reducing the child rates to 2×(4/5) and 3×(4/5) packets/ms respectively is not equivalent. If a node's
rate is larger than the sum of the child rates, then it will be able to handle the child traffic without delay once
the child buckets have emptied. Before that, though, the parent bucket may be the limiting factor.
18.13.1 CBQ
CBQ was introduced in [CJ91] and analyzed in [FJ95]. It did not actually use the token-bucket mechanism,
but instead implemented shaping by keeping track of the average idle time (more precisely, non-transmitting
time) for a given input class. Input classes that tried to send too much were restricted, unless the node was
permitted to borrow bandwidth from a sibling. When an input class sent less than it was allowed, its
average utilization would fall; if a burst arrived then it would take some time for the average to catch up
and thus the node could briefly send faster than its assigned rate. However, the size of the bucket could be
controlled only indirectly.
Packets are marked green, yellow or red depending on their situation. Red packets are those that must wait;
eventually they will turn yellow and then green.
Packets are considered green if they are now compliant (perhaps after waiting earlier) for one of the leaf
token-bucket nodes; green packets are sent as soon as possible.
After L1, L2 and L3 have each emptied their buckets, they will not exhaust node N's rate. Similarly, after
N and M have emptied their buckets they will use only half of R's rate. Nodes are allowed to borrow
bandwidth, without payback, from their parents' rates; packets benefiting from such borrowed bandwidth
are marked yellow, and may also be sent immediately if no green packets are waiting. Borrowing is always
in proportion to a node's guaranteed rate, in the manner of fair queuing. That is, the guaranteed rates of the
child nodes are treated as unnormalized fair-queuing weights; normalized weight fractions are obtained by
dividing by their total. N above would have normalized weight fraction 200/(200+100) = 2/3.
If L1, L2 and L3 engage in borrowing from N, and each has traffic to send, then each gets a total bandwidth
of 50, 50 and 100 Kbps, respectively. If L3 is idle, then L1 and L2 each would get 100 Kbps. If N and M
borrow in turn from R, they each can send at 400 and 200 Kbps respectively, in which case L1, L2 and L3
(again assuming all are active) get 100, 100 and 200 Kbps. If M elects not to do any borrowing, because it
has nothing to send, then N will get 600 Kbps and L1, L2 and L3 will get 150, 150 and 300.
If fair-queuing behavior is not desired, we can set rceil = r so that a node can never send faster than its
guaranteed rate. This allows htb to model the token-bucket-only hierarchy of 18.12 Hierarchical Token
Bucket.
The output rate of the ith router Ri is ri, of which our flow is guaranteed rate fi ≤ ri. Let f = min{fi | i<N}.
Suppose the maximum packet size for packets in our flow is S, and the maximum packet size including
competing traffic is Smax. Then the total delay encountered by the flow's packets is bounded by the sum of
the following:
1. propagation delay (total single-bit delay along all N+1 links)
2. B/f
3. The sum from 1 to N of S/fi
4. The sum from 1 to N of Smax /ri
The second term B/f represents the queuing delay introduced by a single burst of size B; we showed in
18.11.2 Token Bucket Through Multiple Routers that this delay bound applied regardless of the number of
routers.
The third term represents the total store-and-forward delay at each router for packets belonging to our flow
under GPS; the delay at Ri is S/fi .
The final term represents the degree to which fair-queuing may delay a packet beyond the theoretical GPS
time expressed in the third term. If the routers were to use GPS, then the first three terms above would
bound the packet delay; we established in 18.5.4.7 Finishing-Order Bound that router Ri may introduce an
additional delay above and beyond the GPS delay of at most Smax /ri .
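Putting the four components together, and writing Dprop for the total propagation delay of item 1 (a symbol introduced here just for convenience), the bound can be restated compactly as:

    \[
    \text{total delay} \;\le\; D_{\mathrm{prop}} \;+\; \frac{B}{f}
        \;+\; \sum_{i=1}^{N}\frac{S}{f_i} \;+\; \sum_{i=1}^{N}\frac{S_{\max}}{r_i}
    \]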
18.14 Epilog
If we want to use 100% of the outbound bandwidth, but divide it among several senders according to a predetermined ratio, fair queuing is the tool to use. If we want to impose an absolute rather than a relative cap on
traffic, token bucket is appropriate.
Fair queuing has applications to the routing of ordinary packets; for example, if routers implement fair
queuing on a per-connection basis, then TCP senders will have no incentive to maximize queue utilization
and TCP Reno will lose its competitive advantage.
It is for real-time traffic, however, that queuing disciplines such as fair queuing, token bucket and even
priority queuing come into their own as fundamental building blocks. These tools allow us to guarantee a
bandwidth fraction to VoIP traffic, or to allow such traffic to be sent with minimal delay. In the next chapter
19 Quality of Service we will encounter fair queuing and token-bucket specifications repeatedly.
18.15 Exercises
1. Suppose a router uses fair queuing with three input classes, and uses the quantum algorithm of 18.5.2 Different Packet Sizes. The first class sends packets of size 900 bytes, the second sends packets of 400 bytes,
and the third sends packets of 200 bytes. List what would be sent by each flow in each of the first five
rounds.
2. Suppose we attempt to simulate BBRR with the following strategy, which we will call SBBRR. Each
subqueue has a bit-position marker that advances by one bit for each bit we have available to send from that
queue. If three queues are active, then in the time it takes us to send three bits, each marker would advance
by one bit. When all the bits of a packet have been ticked through, the packet is sent.
(a). Explain why this is not the same as BBRR fair queuing (even with equal-sized packets).
(b). Is it the same as BBRR if all input queues are active?
3. Suppose we modify the SBBRR strategy of the previous exercise so that, if the output link is ever idle,
and no packet has yet had all its bits ticked through by the bit-position marker, then we immediately send
the packet with the fewest bits remaining to be ticked through by the bit-position marker.
Suppose packets P1 of size 1000 and P2 of size 100 arrive on subqueue 1. Just after P1 begins transmission,
packet Q1 of size 400 arrives on subqueue 2. Fair queuing should send the packets in the order P1, Q1, P2;
show that the mechanism described here does not do that.
4. Suppose we attempt to implement fair queuing by calculating the finishing time Fj for Pj , the jth packet
in subqueue i, as follows.
Startj+1 = max(Fj , now) (now by wallclock time)
Fj = Startj + N×Lj
where N is the total number of subqueues, active or not.
(a). Suppose a router has three subqueues; ie N=3. The outbound bandwidth is 1 size unit / 1 time unit. At
T=0, packets P1, P2, P3, P4 and P5 arrive for subqueue 1, each of size 1 unit. At T=2 (by which point P1
and P2 will have finished transmission), packets Q1 and Q2 arrive on subqueue 2, also of size 1. What
finishing times will all the packets be assigned? In what order will they be transmitted?
(b). Is this strategy approximately equivalent to fair queuing if we are given that all subqueues of the router
are always active?
5. Suppose we modify the strategy of the previous exercise by letting N be the number of active subqueues
at the time of arrival of packet Pj . What happens if we have three input subqueues, and at T=0 five packets
arrive for subqueue 1, and at T=1 five packets arrive for subqueue 2. Assume all ten packets are of size 1,
and the output bandwidth is again 1 size unit per time unit.
6. The following packets all arrive at time T=0 at a router with an output rate of one size unit per time unit.
Subqueue 1: P1 of size 100, P2 of size 500, P3 of size 400
Subqueue 2: Q1 of size 300, Q2 of size 200, Q3 of size 600
Subqueue 3: R1 of size 400, R2 of size
6.5. Is byte-by-byte round-robin the same as bit-by-bit round-robin, if someone found a way to implement
these literally? Does byte-by-byte round-robin lead to the same transmission order as BBRR?
7. Calculate GPS finishing times for the following packets, all present at T=0. There are two subqueues,
and their bandwidth fractions are α and β, where α = (√5−1)/2 ≈ 0.618 and β = α² = 1−α. The packet sizes
for the two subqueues are as follows (they follow the Fibonacci sequence, except 2 appears twice):

α: 2, 3, 8, 21, 55, 144, 377, 987

β: 1, 2, 5, 13, 34, 89, 233, 610

Hint: you will have to evaluate α to more decimal places than is shown here.
8. Suppose a WFQ router has two subqueues, each with a bandwidth fraction of α=50%. The router
transmits 1 byte per ms. Initially, the subqueues are empty and T=0 and the GPS virtual clock is 0. At that
moment a packet P1 of size 1000 bytes arrives at the first subqueue. At T=500, a similarly sized packet P2
arrives at the second subqueue. Give, for each of P1 and P2,
9. Suppose a router has three subqueues, and an outbound bandwidth of 1 packet per unit time. Twelve
packets arrive at or after T=0, timed so that the router remains busy until finishing the packets at T=12.
(a). What packet arrival schedule leads to the minimum final BBRR clock value?
(b). What schedule leads to the maximum final BBRR clock value?
Hint: the rate of the BBRR clock depends only on the number of active subqueues.
10. Suppose packets from three subqueues are sent using the quantum algorithm of 18.5.5 The Quantum
Algorithm. The packets are listed below in order of arrival for each subqueue, along with their lengths L;
the packets are all available at time T=0. The quantum is 1000 bytes. Give the order of transmission.
Subqueue 1
P1, L=700
P2, L=700
P3, L=700
P4, L=700
Subqueue 2
Q1, L=400
Q2, L=500
Q3, L=1000
Q4, L=200
Subqueue 3
R1, L=500
R2, L=600
R3, L=200
R4, L=900
11. At Disneyland, patrons often wait in a queue that winds slowly through one large waiting room, only to
feed into another queue in another room. Is this an example of hierarchical queuing, eg of one FIFO queue
feeding another, without classes?
12. If two traffic streams meet token-bucket specifications of TB(r1,b1) and TB(r2,b2) respectively, show
their commingled traffic must meet TB(r1+r2,b1+b2). Hint: imagine a common bucket of size b1+b2, filled
at rate r1 with red tokens and at rate r2 with blue tokens.
13. For each sequence of arrival times, indicate which packets are compliant for the given token-bucket
specification. If a packet is noncompliant, go on to the next arrival without decrementing the bucket.
14. Find the fastest sequence (see the end of 18.9.3 Multiple Token Buckets) for the following flows. Both
start at T=0, and all buckets are initially full.
(a). TB(1/4, 4); packets can depart at a minimum of 1 time unit apart. Continue the sequence to at least
T=10
(b). TB(1/2, 4) and TB(1/8, 8); multiple packets can depart at the same instant. Continue to at least T=25.
15. Give the fastest sequence of packets compliant for all three of the following token-bucket specifications.
Continue the sequence at least until T=60.
TB(1/2, 1)
TB(1/6, 4)
TB(1/12, 8)
Hint: the first specification means arrival times must always be separated by at least 2. The middle specification should kick in by T=12.
16. Show that if a GPS traffic flow satisfies a token-bucket specification TB(r,B), then in any interval of time
t1≤t≤t2 the amount of traffic is at most B + r×(t2−t1). Hint: during the interval t1≤t≤t2 the amount of fluid
added to the bucket is exactly r×(t2−t1).
17. Show that a generic hierarchy of FIFO queuing disciplines, described in 18.7.1 Generic Hierarchical
Queuing, collapses to a single FIFO queue.
18. Show that the token-bucket leaf-storage hierarchy of 18.12 Hierarchical Token Bucket produces the
same result as an internal-storage hierarchy in which each intermediate token-bucket node contained a
real, infinite-capacity FIFO queue, and each node instantaneously transmitted each packet to the parent's
FIFO queue as soon as it was released. Show that packets are transmitted by each hierarchy at the same
times. Hint: show that each node in the leaf-storage hierarchy releases a packet at the same time the
corresponding internal-storage hierarchy forwards the packet upwards.
19. The following linux htb hierarchies are labeled with their guaranteed rates. Is there any difference in
terms of the bandwidth allocations that would be received by senders A and B?
(a)

         R
        100
       /    \
     L1      L2
     60      40
      |       |
      A       B

(b)

         R
        100
       /    \
     L1      L2
     30      20
      |       |
      A       B
20. Suppose we know that the real-time traffic through a given router R uses at most 1 Mbps of the total 10
Mbps bandwidth. Consider the following two ways of giving the real-time traffic special treatment:
i. Using priority queuing, and giving the real-time traffic higher priority.
ii. Using weighted fair queuing, and giving the real-time traffic a 10% share
(a). Show that, if the real-time traffic meets a token-bucket specification with rate 1 Mbps and negligible
bucket size, then the two mechanisms are equivalent, in the sense that if the real-time and non-real-time
traffic flows are sending at fractions α and β, respectively, of the 10 Mbps outbound rate, with α+β=1 (and
with α≤10%), then the two methods above will actually send at the same rates.
(b). What differences can be expected if the bucket size is not negligible? Which approach will favor the
real-time fraction?
21. In the previous exercise, now suppose we have two separate real-time flows, each guaranteed by a
token-bucket specification not to exceed 1 Mbps. Is there a material difference between any pair of the
following?
i. Sending the two real-time flows at priority 1, and the remaining traffic at priority 2.
ii. Sending the first real-time flow at priority 1, the second at priority 2, and the remaining
traffic at priority 3.
iii. Giving each real-time flow a WFQ share of 10%, and the rest a WFQ share of 80%
22. Suppose a router uses priority queuing. There is one low-priority and one high-priority input. The
outbound bandwidth is r.
(a). If the high-priority queue is currently empty, what is the maximum time that an arriving high-priority
packet must wait?
(b). If the high-priority traffic follows a token-bucket description TB(rp,B), with rp < r, what is the
maximum time an arriving high-priority packet must wait? Hint: use 18.11 Token Bucket Queue
Utilization.
Your answer may include symbolic representations of any necessary additional parameters.
19 QUALITY OF SERVICE
So far, the Internet has been presented as a place where all traffic is sent on a best-effort basis and routers
handle all traffic on an equal footing; indeed, this is often seen as a fundamental aspect of the IP layer. Delays
and losses due to congestion are nearly universal. For bulk file-transfers this is usually quite sufficient; one
way to look at TCP congestive losses, after all, is as part of a mechanism to ensure optimum utilization of
the available bandwidth.
Sometimes, however, at least some senders may wish to arrange in advance for a certain minimum level of
network services. Such arrangements are known as quality of service (QoS) assurances, and may involve
bandwidth, delay, loss rates, or any combination of these. Even bulk senders, for example, might sometimes
wish to negotiate ahead of time for a specified amount of bandwidth.
While any sender might be interested in quality-of-service levels, they are an especially common concern
for those sending and receiving real-time traffic such as voice-over-IP or videoconferencing. Real-time
senders are likely to have not only bandwidth constraints, but constraints on delay and on loss rates as well.
Furthermore, real-time applications may simply fail, at least temporarily, if these bandwidth, delay and
loss constraints are not met.
In any network, large or small, in which bulk traffic may sometimes create queue backlogs large enough to
cause unacceptable delay, quality-of-service assurances must involve the cooperation of the routers. These
routers will then use the queuing and scheduling mechanisms of 18 Queuing and Scheduling to set aside
bandwidth for designated traffic. This is a major departure from the classic Internet model of stateless
routers that have no information about specific connections or flows, though it is a natural feature of virtual-circuit routing.
In this chapter, we introduce some quality-of-service mechanisms for the Internet. We introduce the theory,
anyway; some of these mechanisms have not exactly been adopted with warm arms by the ISP industry.
Sometimes this is simply the chicken-and-egg problem: ISPs do not like to implement features nobody is
using, but often nobody is using them because their ISPs don't support them. However, often an ISP's
problem with a QoS feature comes down to costs: routers will have more state to manage and more work to
do, and this will require upgrades. There are two specific cost issues:
it is not clear how to charge endpoints for their QoS requests (as with RSVP), particularly when the
endpoints are not direct customers
it is not clear how to compare special traffic like multicast, for the purpose of pricing, with standard
unicast traffic
Nonetheless, VoIP at least is here to stay on a medium scale. Streaming video is here to stay on a large scale.
Quality-of-service issues are no longer quite being ignored, or, at least, not blithely.
Note that a fundamental part of quality-of-service requests on the Internet is sharing among multiple traffic classes; that is, the transmission of best-effort and various grades of premium traffic on the same
network. One can, after all, lease a SONET line and construct a private, premium-only network; such an
approach is, however, expensive. Support for multiple traffic classes on the same network is sometimes
referred to generically as integrated services, not to be confused with the specific IETF protocol suite of
that name (19.4 Integrated Services / RSVP). There are two separate issues in an integrated network:
how to make sure premium traffic gets the service it requires
If queuing delays (and losses) occur only at R1 and at R3, then there is no need to involve R2 in any
bandwidth-reservation scheme; the same is true of R1 and R3 if the delays occur only at R2. Unfortunately,
it is not always easy to determine the location of Internet congestion, and an abundance of bandwidth today
may become a dearth tomorrow: bandwidth usage ineluctably grows to consume supply. That said, the early
models for quality-of-service requests assumed that all (or at least most) routers would need to participate;
it is quite possible that this is practically speaking no longer true. Another possible consequence is that
adequate QoS levels might be attainable with only the participation of one's immediate ISP.
packet    sent    expected    recd    (recd − expected)
1         0       50          50      0
2         20      70          55      -5
3         40      90          120     30
4         60      110         125     15
5         80      130         130     0
The first and the last packet have arrived on time, the second is early, and the third and fourth are late by 30
and 15 ms respectively. Setting the playback buffer to 25 ms means that the third packet is not received in
time to be played back, and so must be discarded; setting the buffer to a value at least 30 ms means that all
the packets are received.
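A playback-buffer check is a single comparison per packet. The following standalone Python fragment runs the arrival data from the table above through buffers of 25 ms and 30 ms, using the 50 ms transit time implied by the expected column; with the 25 ms buffer only packet 3 misses its deadline, as stated.

    sent = [0, 20, 40, 60, 80]
    recd = [50, 55, 120, 125, 130]
    transit = 50                                  # expected arrival = sent + transit
    for buf in (25, 30):
        played = [i+1 for i, (s, r) in enumerate(zip(sent, recd)) if r <= s + transit + buf]
        print(f"playback buffer {buf} ms: packets played = {played}")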
For non-interactive voice and video, there is essentially no penalty to making the playback buffer quite
long. But for telephony, if one speaker stops and the other starts, they will perceive a gap of length equal to
the RTT including playback-buffer delay. For voice, this becomes increasingly annoying if the RTT delay
reaches about 200-400 ms.
do share something fundamental in common: both require the participation of intermediate routers; neither
can be effectively implemented solely in end systems.
Then R1 will receive packet 1 from A and will forward it to both B1 and to R2. R2 will receive packet 1
from R1 and forward it to B2 and R3. R3 will forward only to B3; R4 does not see the traffic at all. The
set of paths used by the multicast traffic, from the sender to all destinations, is known as the multicast tree;
these are shown in bold.
We should acknowledge upfront that, while IP multicast is potentially a very useful technology, as with
IntServ there is not much support for it within the mainstream ISP industry. The central issues are the need
for routers to maintain complex new information and the difficulty in figuring out how to charge for this
kind of traffic. At larger scales ISPs normally charge by total traffic carried; now suppose an ISP's portion
of a multicast tree is a single path all across the continent, but the tree branches into ten different paths near
the point of egress. Should the ISP consider this to be like one unicast connection, or more like ten?
Once Upon A Time, an ideal candidate for multicast might have been the large-scale delivery of broadcast
television. Ironically, the expansion of the Internet backbone has meant that large-scale video delivery is
now achieved with an individual unicast connection for every viewer. This is the model of YouTube.com,
Netflix.com and Hulu.com, and almost every other site delivering video. Online education also once upon a
time might have been a natural candidate for multicast, and here again separate unicast connections are now
entirely affordable. The bottom line for the future of multicast, then, is whether there is an application out
there that really needs it.
Note that IP multicast is potentially straightforward to implement within a single large (or small) organization. In this setting, though, the organization is free to set its own budget rules for multicast use.
Multicast traffic will consist of UDP packets; there is no provision in the TCP specification for multicast
connections. For large groups, acknowledgment by every receiver of the multicast UDP packets is impractical; the returning ACKs could consume more bandwidth than the outbound data. Fortunately, complete
acknowledgments are often unnecessary; the archetypal example of multicast traffic is loss-tolerant voice
and video traffic. The RTP protocol (19.11 Real-time Transport Protocol (RTP)) includes a response mechanism from receivers to senders; the RTP response packets are intended to at least give the sender some idea
of the loss rate. Some effort is expended in the RTP protocol (more precisely, in the companion protocol
RTCP) to make sure that these response packets, from multiple recipients, do not end up amounting to more
traffic than the data traffic.
In the original Ethernet model for LAN-level multicast, nodes agree on a physical multicast address, and
then receivers subscribe to this address, by instructing their network-interface cards to forward on up to the
host system all packets with this address as destination. Switches, in turn, were expected to treat packets
addressed to multicast addresses the same as broadcast, forwarding on all interfaces other than the arrival
interface.
Global broadcast, however, is not an option for the Internet as a whole. Routers must receive specific
instructions about forwarding packets. Even on large switched Ethernets, newer switches generally try to
avoid broadcasting every multicast packet, preferring instead to attempt to figure out where the subscribers
to the multicast group are actually located.
In principle, IP multicast routing can be thought of as an extension of IP unicast routing. Under this
model, IP forwarding tables would have the usual ⟨udest,next_hop⟩ entries for each unicast destination,
and ⟨mdest,set_of_next_hops⟩ entries for each multicast destination. In the diagram above, if G represents
the multicast group of receivers {B1,B2,B3}, then R1 would have an entry ⟨G,{B1,R2}⟩. All that is needed
to get multicast routing to work are extensions to distance-vector, link-state and BGP router-update algorithms to accommodate multicast destinations. (We are using G here to denote both the actual multicast
group and also the multicast address for the group; we are also for the time being ignoring exactly how a
group would be assigned an actual multicast address.)
These routing-protocol extensions can be done (and in fact it is quite straightforward in the link-state case,
as each node can use its local network map to figure out the optimal multicast tree), but there are some
problems. First off, if any Internet host might potentially join any multicast group, then each router must
maintain a separate entry for each multicast group; there are no opportunities for consolidation or for hierarchical routing. For that matter, there is no way even to support for multicast the basic unicast-IP separation
of addresses into network and host portions that was a crucial part of the continued scalability of IP routing. The Class-D multicast address block contains 2^28 (about 270 million) entries, far too many to support a
routing-table entry for each.
The second problem is that multicast groups, unlike unicast destinations, may be ephemeral; this would
place an additional burden on routers trying to keep track of routing to such groups. An example of an
ephemeral group would be one used only for a specific video-conference speaker.
Finally, multicast groups also tend to be of interest only to their members, in that hosts far and wide on the
Internet generally do not send to multicast groups to which they do not have a close relationship. In the
diagram above, the sender A might not actually be a member of the group {B1,B2,B3}, but there is a strong
tie. There may be no reason for R4 to know anything about the multicast group.
So we need another way to think about multicast routing. Perhaps the most successful approach has been the
subscription model, where senders and receivers join and leave a multicast group dynamically, and there is
no route to the group except for those subscribed to it. Routers update the multicast tree on each join/leave
event. The optimal multicast tree is determined not only by the receiving group, {B1,B2,B3}, but also by
the sender, A; if a different sender wanted to send to the group, a different tree might be constructed. In
practice, sender-specific trees may be constructed only for senders transmitting large volumes of data; less
important senders put up with a modicum of inefficiency.
The multicast protocol most suited for these purposes is known as PIM-SM, defined in RFC 2362. PIM
stands for Protocol-Independent Multicast, where protocol-independent means that it is not tied to a
specific routing protocol such as distance-vector or link-state. SM here stands for Sparse Mode, meaning
that the set of members may be widely scattered on the Internet. We acknowledge again that, while PIM-SM
is a reasonable starting point for a realistic multicast implementation, it may be difficult to find an ISP that
implements it.
The first step for PIM-SM, given a multicast group G as destination, is for the designation of a router to
serve as the rendezvous point, RP, for G. If the multicast group is being set up by a particular sender, RP
might be a router near that sender. The RP will serve as the default root of the multicast tree, up until such
time as sender-specific multicast trees are created. Senders will send packets to the RP, which will forward
them out the multicast tree to all group members.
At the bottom level, the Internet Group Management Protocol, IGMP, is used for hosts to inform one of
their local routers (their designated router, or DR) of the groups they wish to join. When such a designated
router learns that a host B wishes to join group G, it forwards a join request to the RP.
The join request travels from the DR to the RP via the usual IP-unicast path from DR to RP. Suppose that
path for the diagram below is ⟨DR,R1,R2,R3,RP⟩. Then every router in this chain will create (or update)
an entry for the group G; each router will record in this entry that traffic to G will need to be sent to the
previous router in the list (starting from DR), and that traffic from G must come from the next router in the
list (ultimately from the RP).
In the above diagram, suppose B1 is first to join the group. B1's designated router is R5, and the join packet
is sent R5→R2→R3→RP. R5, R2 and R3 now have entries for G; R2's entry, for example, specifies that
packets addressed to G are to be sent to {R5} and must come from R3. These entries are all tagged with
⟨*,G⟩, to use RFC 2362's notation, where the * means "any sender"; we will return to this below when
we get to sender-specific trees.
Now B2 wishes to join, and its designated router DR sends its join request along the path
DR→R1→R2→R3→RP. R2 updates its entry for G to reflect that packets addressed to G are to be forwarded to the set {R5,R1}. In fact, R2 does not need to forward the join packet to R3 at all.
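The following Python sketch illustrates the ⟨*,G⟩ state building just described; the router names and join paths are those of the example, and the data structure is only a teaching aid, not the RFC 2362 encoding.

# Illustrative (*,G) join processing along a DR -> ... -> RP path.  Each router
# keeps, per group, the set of downstream neighbors (toward receivers) and its
# upstream neighbor (toward the RP).
state = {}   # router name -> { group : {"downstream": set(), "upstream": name} }

def process_join(path_to_rp, group):
    # path_to_rp lists the routers from the joining DR up to the RP
    for prev, cur, nxt in zip([None] + path_to_rp, path_to_rp,
                              path_to_rp[1:] + [None]):
        already = group in state.get(cur, {})
        entry = state.setdefault(cur, {}).setdefault(
            group, {"downstream": set(), "upstream": nxt})
        if prev is not None:
            entry["downstream"].add(prev)
        if already:
            break        # an existing entry: no need to forward the join further

process_join(["R5", "R2", "R3", "RP"], "G")        # B1's join via R5
process_join(["DR", "R1", "R2", "R3", "RP"], "G")  # B2's join via its DR
print(state["R2"]["G"])   # downstream {'R5','R1'} (in some order), upstream 'R3'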
At this point, a sender S can send to the group G by creating a multicast-addressed IP packet and encapsulating it in a unicast IP packet addressed to RP. RP opens the encapsulation and forwards the packet down
the tree, represented by the bold links above.
Note that the data packets sent by RP to DR will follow the path RP→R3→R2→R1, as set up above, even if
the normal unicast path from R3 to R1 were R3→R4→R1. The multicast path was based on R1's preferred
next_hop to RP, which was assumed to be R2. Traffic here from sender to a specific receiver takes the
exact reverse of the path that a unicast packet would take from that receiver to the sender; as we saw in
10.4.3 Provider-Based Hierarchical Routing, it is common for unicast IP traffic to take a different path
each direction.
The next step is to introduce source-specific trees for high-volume senders. Suppose that sender S above is
sending a considerable amount of traffic to G, and that there is also an R6–R2 link (in blue below) that can
serve as a shortcut from S to {B1,B2}:
We will still suppose that traffic from R2 reaches RP via R3 and not R6. However, we would like to allow
S to send to G via the more-direct path R6→R2. RP would initiate this through special join messages sent
to R6; a message would then be sent from RP to the group G announcing the option of creating a source-specific tree for S (or, more properly, for S's designated router R6). For R1 and R5, there is no change; these
routers reach RP and R6 through the same next_hop router R2.
However, R2 receives this message and notes that it can reach R6 directly (or, in general, at least via a
different path than it uses to reach RP), and so R2 will send a join message to R6. R2 and R6 will now each
have general entries for ⟨*,G⟩ but also a source-specific entry ⟨S,G⟩, meaning that R6 will forward traffic
addressed to G and coming from S to R2, and R2 will accept it. R6 may still also forward these packets
to RP (as RP does belong to group G), but RP might also by then have an ⟨S,G⟩ entry that says (unless the
diagram above is extended) not to forward any further.
The tags ⟨*,G⟩ and ⟨S,G⟩ thus mark two different trees, one rooted at RP and the other rooted at R6. Routers
each use an implicit closest-match strategy, using a source-specific entry if one is available and the wildcard
⟨*,G⟩ entry otherwise.
As mentioned repeatedly above, the necessary ISP cooperation with all this has not been forthcoming. As
a stopgap measure, the multicast backbone or Mbone was created as a modest subset of multicast-aware
routers. Most of the routers were actually within Internet leaf-customer domains rather than ISPs, let alone
backbone ISPs. To join a multicast group on the Mbone, one first identified the nearest Mbone router and
then connected to it using tunneling. The Mbone gradually faded away after the year 2000.
We have not discussed at all how a multicast address would be allocated to a specific set of hosts wishing to
form a multicast group. There are several large blocks of class-D addresses assigned by the IANA. Some of
these are assigned to specific protocols; for example, the multicast address for the Network Time Protocol is
224.0.1.1 (though you can use NTP quite happily without using multicast). The 232.0.0.0/8 block is reserved
for source-specific multicast, and the 233.0.0.0/8 block is allocated by the GLOP standard; if a site has a
16-bit Autonomous System number with bytes x and y, then that site automatically gets the multicast block
233.x.y.0/24. A fuller allocation scheme waits for the adoption and development of a broader IP-multicast
strategy.
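As a small illustration of the GLOP rule just described, the following Python sketch computes a site's 233.x.y.0/24 block from its 16-bit Autonomous System number; the AS number used here is made up.

# Compute the GLOP multicast block for a 16-bit AS number: x and y are the
# high and low bytes of the AS number.
def glop_block(asn):
    if not 0 <= asn <= 0xFFFF:
        raise ValueError("GLOP applies only to 16-bit AS numbers")
    x, y = asn >> 8, asn & 0xFF
    return f"233.{x}.{y}.0/24"

print(glop_block(65001))   # 233.253.233.0/24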
19.6 RSVP
We next turn to the RSVP (ReSerVation) protocol, which forms the core of IntServ.
The original model for RSVP was to support multicast, so as to support teleconferencing. For this reason,
reservations are requested not by senders but by receivers, as a multicast sender may not even know who all
the receivers are. Reservations are also for one direction; bidirectional traffic needs to make two reservations.
Like multicast, RSVP generally requires participation of intermediate routers.
Reservations include both a flowspec describing the traffic flow (eg a unicast or multicast destination) and
also a filterspec describing how to identify the packets of the flow. We will assume that filterspecs simply
define unidirectional unicast flows, eg by specifying source and destination sockets, but more general filterspecs are possible. A component of the flowspec is the Tspec, or traffic spec; this is where the token-bucket
specification for the flow appears. Tspecs do not in fact include a bound on total delay; however, the degree
of queuing delay at each router can be computed from the TB(r,Bmax) token-bucket parameters as Bmax/r.
The two main messages used by RSVP are PATH packets, which move from sender to receiver, and the
subsequent RESV packets, which move from receiver to sender.
Initially, the sender (or senders) sends a PATH message to the receiver (or receivers), either via a single
unicast connection or to a multicast group. The PATH message contains the sender's Tspec, which the
receivers need to know to make their reservations. But the PATH messages are not just for the ultimate
recipients: every router on the path examines these packets and learns the identity of the next_hop RSVP
router in the upstream direction. The PATH messages inform each router along the path of the path state
for the sender.
As an example, imagine that sender A sends a PATH message to receiver B, using normal unicast delivery.
Suppose the route taken is A→R1→R2→R3→B. Suppose also that if B simply sends a unicast message to
A, however, then the route is the rather different B→R3→R4→R1→A.
As A's PATH message heads to B, R2 must record that R1 is the next hop back to A along this particular
PATH, and R3 must record that R2 is the next reverse-path hop back to A, and even B needs to note R3 is
the next hop back to A (R1 presumably already knows this, as it is directly connected to A). To convey this
reverse-path information, each router inserts its own IP address at a specific location in the PATH packet,
so that the next router to receive the PATH packet will know the reverse-path next hop. All this path state
stored at each router includes not only the address of the previous router, but also the sender's Tspec. All
these path-state records are for this particular PATH only.
The PATH packet, in other words, tells the receiver what the Tspec is, and prepares the routers along the
way for future reservations.
Each receiver now responds with its RESV message, requesting its reservation. The RESV packets are
passed back to the sender not by the default unicast route, but along the reverse path created by the PATH
message. In the example above, the RESV packet would travel B→R3→R2→R1→A. Each router (and also
B) must look at the RESV message and look up the corresponding PATH record in order to figure out how
to pass the reservation message back up the chain. If the RESV message were sent using normal unicast,
via B→R3→R4→R1→A, then R2 would not see it.
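Here is a minimal Python sketch of the reverse-path mechanism: PATH messages leave a previous-hop record at each router, and RESV messages follow those records back. The hop names are from the example; real RSVP messages of course carry much more (the Tspec, flowspec, etc.).

# PATH messages leave "previous RSVP hop" state at each router; RESV messages
# then follow that state back toward the sender.
path_state = {}          # router -> previous hop on the sender->receiver path

def send_path(route):
    """route is the sender-to-receiver list, eg ['A','R1','R2','R3','B']."""
    for prev, cur in zip(route, route[1:]):
        path_state[cur] = prev          # cur records the reverse-path next hop

def send_resv(receiver, sender):
    """Trace the RESV message back along the recorded path state."""
    hops, cur = [receiver], receiver
    while cur != sender:
        cur = path_state[cur]           # each hop looks up its PATH record
        hops.append(cur)
    return hops

send_path(["A", "R1", "R2", "R3", "B"])
print(send_resv("B", "A"))    # ['B', 'R3', 'R2', 'R1', 'A']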
Each router seeing the RESV path must also make a decision as to whether it is able to grant the reservation.
This is the admission control decision. RSVP-compliant routers will typically set aside some fraction of
their total bandwidth for their reservations, and will likely use priority queuing to give preferred service to
this fraction. However, as long as this fraction is well under 100%, bulk unreserved traffic will not be shut
out. Fair queuing can also be used.
Reservations must be resent every so often (eg every ~30 seconds) or they will time out and go away; this
means that a receiver that is shut down without canceling its reservation will not continue to tie up resources.
If the RESV messages are moving up a multicast tree, rather than backwards along a unicast path, then they
are likely to reach a router that already has granted a reservation of equal or greater capacity. In the diagram
below, router R has granted a reservation for traffic from A to receiver B1, reached via R's interface 1, and
now has a similar reservation from receiver B2 reached via R's interface 2.
Assuming R is able to grant B2's reservation, it does not have to send the RESV packet upstream any further
(at least not as requests for a new reservation); B2's reservation can be merged with B1's. R simply will
receive packets from A and now forward them out both of its interfaces 1 and 2, to the two receivers B1 and
B2 respectively.
It is not necessary that every router along the path be RSVP-aware. Suppose A sends its PATH messages to
B via A→R1→R2a→R2b→R3→B, where every router is listed but R2a and R2b are part of a non-RSVP
cloud. Then R2a and R2b will not store any path state, but also will not mark the PATH packets with
their IP addresses. So when a PATH packet arrives at R3, it still bears R1's IP address, and R3 records
R1 as the reverse-path next hop. It is now possible that when R3 sends RESV packets back to R1, they
will take a different path R3→R2c→R1, but this does not matter as R2a, R2b and R2c are not accepting
reservations anyway. Best-effort delivery will be used instead for these routers, but at least part of the path
will be covered by reservations. As we outlined in 19.2 Where the Wild Queues Are, it is quite possible
that we do not need the participation of the various R2s to get the quality of service wanted; perhaps only
R1 and R3 contribute to delays.
In the multicast multiple-sender/multiple-receiver model, not every receiver must make a reservation for
all senders; some receivers may have no interest in some senders. Also, if rate-adaptive data transmission
protocols are used, some receivers may make a reservation for a sender at a lower rate than that at which
the sender is sending. For this to work, some node between sender and receiver must be willing to decode
and re-encode the data at a lower rate; the RTP protocol provides some support for this in the form of RTP
mixers (19.11.1 RTP Mixers). This allows different members of the multicast receiver group to receive the
same audio/video stream but at different resolutions.
From an ISP's perspective, the problems with RSVP are that there are likely to be a lot of reservations,
and the ISP must figure out how to decide who gets to reserve what. One model is simply to charge for
reservations, but this is complicated when the ISP doing the charging is not the ISP providing service to the
receivers involved. Another model is to allow anyone to ask for a reservation, but to maintain a cap on the
number of reservations from any one site.
These sorts of problems have largely prevented RSVP from being implemented in the Internet backbone.
That said, RSVP is apparently quite successful within some corporate intranets, where it can be used to
support voice traffic on the same LANs as data.
Assured Forwarding, or AF (19.7.2 Assured Forwarding), is really four separate PHBs, corresponding
to four classes 1 through 4. It is meant for providing (soft) service guarantees that are contingent on the
sender's staying within a certain rate specification. Each AF class has its own rate specification; these rate
specifications are entirely at the discretion of the implementing DS domain. AF uses the first three bits to
denote the AF class, and the second three bits to denote the drop precedence.
Here are the six-bit patterns for the above PHBs; the AF drop-precedence bits are denoted ddd.
000 000: default PHB (best-effort delivery)
001 ddd: AF class 1 (the lowest priority)
010 ddd: AF class 2
011 ddd: AF class 3
100 ddd: AF class 4 (the best)
101 110: Expedited Forwarding
101 100: Voice Admit
11x 000: Network control traffic (RFC 2597)
xxx 000: Class Selector (traditional IPv4 Type-of-Service)
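The following Python sketch decodes the six-bit patterns above into the corresponding PHB; it covers only the codepoints listed in this section and is not a complete DSCP registry.

# Decode a six-bit DSCP value into one of the PHBs listed above.
def classify_dscp(dscp):
    """dscp is the six-bit field as an integer, 0..63."""
    hi, lo = dscp >> 3, dscp & 0x7      # first three bits, last three bits
    if dscp == 0b000000:
        return "default (best-effort)"
    if dscp == 0b101110:
        return "Expedited Forwarding"
    if dscp == 0b101100:
        return "Voice Admit"
    if 1 <= hi <= 4 and lo != 0:
        return f"AF class {hi}, drop precedence bits {lo:03b}"
    if lo == 0:
        return f"Class Selector {hi}"   # includes network-control 11x 000
    return "unrecognized"

print(classify_dscp(0b001010))   # AF class 1, drop precedence bits 010
print(classify_dscp(0b101110))   # Expedited Forwarding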
The goal behind premium PHBs such as EF and AF is for the DS domain to set some rules on admitting
premium packets, and hope that their total numbers to any given destination are small enough that high-level
service targets can be met. This is not exactly the same as a guarantee, and to a significant degree depends on
statistics. The actual specifications are written as per-hop behaviors (hence the PHB name); with appropriate
admission control these per-hop behaviors will translate into the desired larger-scale behavior.
One possible Internet arrangement is that a leaf domain or ISP would support RSVP, but hand traffic off
to some larger-scale carrier that runs DiffServ instead. Traffic with RSVP reservations would be marked on
entry to the second carrier with the appropriate DS class.
RFC 3246 goes on to specify how this apparent service should work. Roughly, if EF packets have length
L then they should be sent at intervals L/R. If an EF packet arrives when no other EF traffic is waiting, it
can be held in a queue, but it should be sent soon enough so that, when physical transmission has ended,
no more than L/R time has elapsed in total. That is, if R and L are such that L/R is 10 µs, but the physical
bandwidth delay in sending is only 2 µs, then the packet can be held up to 8 µs for other traffic.
Note that this does not mean that EF traffic is given strict priority over all other traffic (though implementation of EF-traffic processing via priority queuing is a reasonable strategy); however, the sending interface
must provide service to the EF queue at intervals of no more than L/R; the EF rate R must be in effect at
per-packet time scales. Queuing ten EF packets and then sending the lot of them after time 10L/R is not
allowed. Fair queuing can be used instead of priority queuing, but if quantum fair queuing is used then the
quantum must be small.
An EF router's committed rate R means simply that the router has promised to reserve bandwidth R for EF
traffic; if EF traffic arrives at a router faster than rate R, then a queue of EF packets may build up (though
the router may be in a position to use some of its additional bandwidth to avoid this, at least to a degree).
Queuing delays for EF traffic may mean that someone's application somewhere fails rather badly, but the
router cannot be held to account. As long as the total EF traffic arriving at a given router is limited to that
router's EF rate R, then at least that router will be able to offer good service. If the arriving EF traffic meets
a token-bucket specification TB(R,B), then the maximum number of EF packets in the queue will be B and
the maximum time an EF packet should be held will be B/R.
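Here is a small worked example of these bounds, with made-up numbers; R and B are taken in packets/sec and packets respectively, so that the B and B/R bounds above apply directly.

# Worked example of the EF bounds; the values of R and B are illustrative only.
R = 5000       # committed EF rate: 5000 packets/sec (assumed)
B = 50         # token-bucket depth: 50 packets (assumed)

max_queued_packets = B            # worst-case EF backlog, in packets
max_hold_time = B / R             # 0.01 sec = 10 ms worst-case EF queuing delay
per_packet_interval = 1 / R       # the EF queue must be served at least this often

print(max_queued_packets, max_hold_time, per_packet_interval)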
So far we have been looking at individual routers. A DS domain controls EF traffic only at its border; how
does it arrange things so none of its routers receives EF traffic at more than its committed rate?
One very conservative approach is to limit the total EF traffic entering the DS domain to the common
committed rate R. This will likely mean that individual routers will not see EF traffic loads anywhere close
to R.
As a less-conservative, more statistical, approach, suppose a DS domain has four border routers R1, R2, R3
and R4 through which all traffic must enter and exit, as in the diagram below:
Suppose in addition the domain knows from experience that exiting EF traffic generally divides equally
between R1-R4, and also that these border routers are the bottlenecks. Then it might allow an EF-traffic
entry rate of R at each router R1-R4, meaning a total entering EF traffic volume of 4R. Of course, if on
some occasion all the EF traffic entering through R1, R2 and R3 happened to be addressed so as to exit via
R4, then R4 would see an EF rate of 3R, but hopefully this would not happen often.
If an individual ISP wanted to provide end-user DiffServ-based VoIP service, it might mark VoIP packets
for EF service as they entered (or might let the customer mark them, subject to the ISP's policing). The
rate of marked packets would be subject to some ceiling, which might be negotiated with the customer as
a certain number of voice lines. These marked VoIP packets would receive EF service as they were routed
within the ISP.
For calls also terminating within that ISP (or switching over to the traditional telephone network at an
interface within that ISP), this would be all that was necessary, but some calls will likely be to customers of
other ISPs. To address this, the original ISP might negotiate with its ISP neighbors for continued preferential
service; such service might be at some other DS service class (eg AF). Packets would likely need to be re-marked as they left the original ISP and entered another.
The original ISP may have one larger ISP in particular with which it has a customer-provider relationship.
The larger ISP might feel that with its high-volume internal network it has no need to support preferential
service, but might still agree to carry along the original ISP's EF marking for use by a third ISP down the
road.
ISP their average rate and also their committed and excess burst (bucket) capacities. As the customers
traffic entered the network, it would encounter two token-bucket filters, with rate equal to the agreed-upon
rate and with the two different bucket sizes: TB(r,Bcommitted) and TB(r,Bexcess). Traffic compliant with the
first token-bucket specification would be marked "do not drop"; traffic noncompliant with the first but compliant with the second would be marked "medium", and traffic noncompliant with either specification would be
marked with a drop precedence of "high". (The use of Bexcess here corresponds to the sum of the committed
burst size and excess burst size in the Appendix to RFC 2597.)
Customers would thus be free to send faster than their agreed-upon rate, subject to the excess traffic being
marked with a lower drop precedence.
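Below is a hedged Python sketch of this two-bucket marking; the TokenBucket class and the rate and bucket values are illustrative only, and a production three-color marker differs in detail.

import time

# Two token-bucket filters with the same rate r but different bucket sizes.
class TokenBucket:
    def __init__(self, rate, bucket):
        self.rate, self.bucket = rate, bucket     # bytes/sec, bytes
        self.tokens, self.last = bucket, time.monotonic()

    def conforms(self, size):
        now = time.monotonic()
        self.tokens = min(self.bucket, self.tokens + self.rate*(now - self.last))
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False

def make_marker(rate, b_committed, b_excess):
    committed = TokenBucket(rate, b_committed)
    excess = TokenBucket(rate, b_excess)
    def mark(packet_size):
        ok_committed = committed.conforms(packet_size)   # both buckets are
        ok_excess = excess.conforms(packet_size)         # always drained
        if ok_committed:
            return "do-not-drop"
        return "medium" if ok_excess else "high"
    return mark

mark = make_marker(rate=125_000, b_committed=10_000, b_excess=40_000)
print(mark(1500))    # "do-not-drop" while within the committed burst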
This process means that an ISP can have many different Gold customers, each with widely varying rate
agreements, all lumped together in the same AF-4 class. The individual customer rates, and even the sum of
the rates, may have only a tenuous relationship to the actual internal capacity of the ISP, although it is not in
the ISP's long-term interest to oversubscribe any AF level.
If the ISP keeps the AF class 4 traffic sparse enough, it might outperform EF traffic in terms of delay. The
EF PHB rules, however, explicitly address delay issues while the AF rules do not.
19.9 NSIS
The practical problem with RSVP is the need for routers to participate. One approach to gaining ISP cooperation might be a lighter-weight version of RSVP, though that is speculative; Differentiated Services was
supposed to be just that and it too has not been widely adopted within the commercial Internet backbone.
That said, work has been underway for quite some time now on a replacement protocol suite. One candidate
is Next Steps In Signaling, or NSIS, documented in RFC 4080 and RFC 5974.
NSIS breaks the RSVP design into two separate layers: the signal transport layer, charged with figuring out
how to reach the intermediate routers, and the higher signaling application layer, charged with requesting
actual reservations. One advantage of this two-layer approach is that NSIS can be adapted for other kinds
of signaling, although most NSIS signaling can be expected to be related to a specific network flow. For
example, NSIS can be used to tell NAT routers to open up access to a given inside port, in order to allow a
VoIP (or other) connection to proceed.
Generally speaking, NSIS also abandons the multicast-centric approach of RSVP. Signaling traffic travels
hop-by-hop from one NSIS Element, or NE, to the next NE on the path. In cases when the signaling traffic
follows the same path as the data (the path-coupled case), the signaling packet would likely be addressed
to the ultimate destination, but recognized by each NE router on the path. NE routers would then add
something to the packet, and perhaps update their own internal state. Nonparticipating (non-NE) routers
would simply forward the signaling packet, like any other IP packet, further towards its ultimate destination.
A subscriber leaves the EHCS state when the subscribers bandwidth drops below 50% of the subscription
rate for 15 minutes, where presumably the 50% rate is measured over some very short timescale. Note that
this means that a user with a TCP sawtooth ranging from 30% to 60% of the maximum might remain in the
EHCS state indefinitely.
Also note that all the subscribers traffic will be marked as BE traffic, not just the overage. The intent is to
provide a mild disincentive for sustained utilization within 70% of the subscriber maximum rate.
Token bucket specifications are not used, except possibly over very small timescales to define utilizations
of 50% and 70%. The (larger-scale) token-bucket alternative might be to create a token-bucket specification
for each customer TB(r,B) where r is some fraction of the subscription rate and B is a bucket of modest
size. All compliant traffic would then be marked PBE and noncompliant traffic would be marked BE. Such
a mechanism might behave quite differently, as only traffic actually over the ceiling would be marked.
For two-way voice calls, symmetric RTP is often used (RFC 4961). This means that each party uses the
same port for sending and receiving. This is done only for convenience and to handle NAT routers; the two
separate directions are completely independent and do not serve as acknowledgments for one another, as
would be the case for bidirectional TCP traffic. Indeed, one can block one direction of a VoIP symmetric-RTP stream and the call continues on indefinitely, transmitting voice only in the other direction. When the
block is removed, the blocked voice flow simply resumes.
The Ver field holds the version, currently 2. The P bit indicates that the RTP packet includes at least one
padding byte at the end; the final padding byte contains the exact count.
The X bit is set if an extension header follows the basic header; we do not consider this further.
The CC field contains the number of contributing source (CSRC) identifiers (below). The total header size,
in 32-bit words, is 3+CC.
The M bit is set to allow the marking, for example, of the first packet of each new video frame. Of course,
the actual video encoding will also contain this information.
The Payload Type field allows (or, more precisely, originally allowed) for specification of the audio/visual
encoding mechanism (eg µ-law/G.711), as described in RFC 3551. Of course, there are more than 2^7 possible encodings now, and so these are typically specified via some other mechanism, eg using an extension
header or the separate Session Description Protocol (RFC 4566) or as part of the RTP stream's announcement. RFC 3551 put it this way:
During the early stages of RTP development, it was necessary to use statically assigned
payload types because no other mechanism had been specified to bind encodings to payload
types.... Now, mechanisms for defining dynamic payload type bindings have been specified in
the Session Description Protocol (SDP) and in other protocols....
The sequence-number field allows for detection of lost packets and for correct reassembly of reordered
packets. Typically, if packets are lost then the receiver is expected to manage without them; there is no
timeout/retransmission mechanism in RTP.
The timestamp is for synchronizing playback. The timestamp should be sufficiently fine-grained to support
not only smooth playback but also the measurement of jitter, that is, the degree of variation in packet arrival.
Each encoding mechanism chooses its own timestamp granularity. For most telephone-grade voice encodings, for example, the timestamp clock increments at the canonical sampling rate of 8,000 times a second,
corresponding to one DS0 channel (4.2 Time-Division Multiplexing). RFC 3551 suggests a timestamp
clock rate of 90,000 times a second for most video encodings.
Many VoIP applications that use RTP send 20 ms of voice per packet, meaning that the timestamp is incremented by 160 for each packet. The actual amount of data needed to send 20 ms of voice can vary from 160
bytes down to 20 bytes, depending on the encoding used, but the timestamp clock always increments at the
8,000/sec, 160/packet rate.
The SSRC identifier identifies the primary data source (the synchronization source) of the stream. In
most cases this is either the identifier for the originating camera/microphone, or the identifier for the mixer
that repackaged the stream.
If the stream was processed by a mixer, the SSRC field identifies the mixer, and the SSRC identifiers of the
original sources are now listed in the CSRC (contributing sources) section of the header. If there was no
mixer involved, the CSRC section is empty.
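As a concrete illustration of the header layout described above, here is a Python sketch that unpacks the fixed RTP header fields; the sample header bytes at the end are fabricated for the example, and extension headers and padding handling are omitted.

import struct

# Parse the fixed 12-byte RTP header plus any CSRC identifiers.
def parse_rtp_header(data):
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", data[:12])
    version = b0 >> 6
    padding = bool(b0 & 0x20)           # the P bit
    extension = bool(b0 & 0x10)         # the X bit
    cc = b0 & 0x0F                      # number of CSRC identifiers
    marker = bool(b1 & 0x80)            # the M bit
    payload_type = b1 & 0x7F
    csrcs = struct.unpack("!%dI" % cc, data[12:12 + 4*cc])
    return dict(version=version, padding=padding, extension=extension,
                marker=marker, payload_type=payload_type, seq=seq,
                timestamp=timestamp, ssrc=ssrc, csrcs=csrcs)

# A made-up header: version 2, payload type 0 (µ-law), seq 1, timestamp 160,
# SSRC 0x1234, no CSRCs.
hdr = struct.pack("!BBHII", 0x80, 0x00, 1, 160, 0x1234)
print(parse_rtp_header(hdr))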
([GP11]), but as yet no RFC. TFRC also allows for a gradual adjustment in rate to accommodate the new
conditions.
Of course, TFRC can only be used to adjust the sending rate if the sending rate is in fact adaptive, at the
application level. This, of course, requires a rate-adaptive encoding.
RFC 3550 specifies a mechanism to make sure that RTCP packets do not consume more than 5% of the total
RTP bandwidth, and, of that 5%, RTCP RR packets do not account for more than 75%. This is achieved by
each receiver learning from RTCP SR packets how many other receivers there are, and what the total RTP
bandwidth is. From this, and from some related information, each RTP receiver can calculate an acceptable
RTCP reporting interval. This interval can easily be in the hundreds of seconds.
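As a rough back-of-the-envelope Python sketch of this calculation (the real RFC 3550 algorithm also randomizes the interval and handles senders separately), assume an average RR packet size of 100 bytes; the example numbers are made up.

# Approximate per-receiver RR interval implied by the 5% / 75% rules above.
def rr_interval(n_receivers, rtp_bandwidth_bps, rr_packet_bits=800):
    rtcp_bw = 0.05 * rtp_bandwidth_bps          # RTCP share of the session
    rr_bw = 0.75 * rtcp_bw                      # portion available to receivers
    per_receiver_bw = rr_bw / n_receivers
    return rr_packet_bits / per_receiver_bw     # seconds between RR packets

print(rr_interval(n_receivers=10_000, rtp_bandwidth_bps=1_000_000))  # ~213 sec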
Mixers are allowed to consolidate RTCP information from their direct subscribers, and forward it on to the
originating sources.
priority traffic essentially took the same route as everything else. MPLS allows priority traffic to take an
entirely different route.
MPLS started out as a way of routing IP and non-IP traffic together, and later became a way to avoid the
then-inefficient lookup of an IP address at every router. It has, however, now evolved into a way to support
the following:
creation of explicit routes (virtual circuits) for IP traffic, either in bulk or by designated class.
large-scale virtual private networks (VPNs)
We are mostly interested in MPLS for the first of these; as such, it provides an alternative way for ISPs to
handle real-time traffic.
In effect, virtual-circuit tags are added to each packet, on entrance to the routing domain, allowing packets
to be routed along predefined virtual circuits. The virtual circuits need not follow the same route that normal
IP routing would use, though note that link-state routing protocols such as OSPF already allow different
routes for different classes of service.
We note also that the MPLS-labeled traffic might very well use the same internal routers as bulk traffic;
priority or fair queuing would then still be needed at those routers to make sure the MPLS traffic gets the
desired level of service. However, the use of MPLS at least makes the classification problem easier: internal
routers need only look at the tag to determine the priority or fair-queuing class, and deep-packet inspection
can be avoided. MPLS would also allow the option that a high-priority flow would travel on a special path
through its own set of routers that do not also service low-priority traffic.
Generally MPLS is used only within one routing domain or administrative system; that is, within the scope
of one ISP. Traffic enters and leaves looking like ordinary IP traffic, and the use of MPLS internally is
completely invisible. This local scope of MPLS, however, has meant that it has seen relatively widespread
adoption, at least compared to RSVP and IP multicast: no coordination with other ISPs is necessary.
To implement MPLS, we start with a set of participating routers, called label-switching routers or LSRs.
(The LSRs can comprise an entire ISP, or just a subset.) Edge routers partition (or classify) traffic into
large flow classes; one distinct flow (which might, for example, correspond to all VoIP traffic) is called a
forwarding equivalence class or FEC. Different FECs can have different quality-of-service targets. Bulk
traffic not receiving special MPLS treatment is not considered to be part of any FEC.
A one-way virtual circuit is then created for each FEC. An MPLS header is prepended to each IP packet,
using for the VCI a label value related to the FEC. The MPLS label is a 32-bit field, but only the first 20
bits are part of the VCI itself. The last 12 bits may carry supplemental connection information, for example
ATM virtual-channel identifiers and virtual-path identifiers (3.8 Asynchronous Transfer Mode: ATM).
It is likely that some traffic (perhaps even a majority) does not get put into any FEC; such traffic is then
delivered via the normal IP-routing mechanism.
MPLS-aware routers then add to their forwarding tables an MPLS table that consists of ⟨labelin, interfacein,
labelout, interfaceout⟩ quadruples, just as in any virtual-circuit routing. A packet arriving on interface
interfacein with label labelin is forwarded on interface interfaceout after the label is altered to labelout.
Routers can also build their MPLS tables incrementally, although if this is done then the MPLS-routed traffic
will follow the same path as the IP-routed traffic. For example, downstream router R1 might connect to two
customers 200.0.1/24 and 200.0.2/24. R1 might assign these customers labels 37 and 38 respectively.
R1 might then tell its upstream neighbors (eg R2 above) that any arriving traffic for either of these customers
should be labeled with the corresponding label. R2 now becomes the ingress router for the MPLS domain
consisting of R1 and R2.
R2 can push this further upstream (eg to R3) by selecting its own labels, eg 17 and 18, and asking R3 to
label 200.0.1/24 traffic with label 17 and 200.0.2/24 with 18. R2 would then rewrite VCIs 17 and 18 with 37
and 38, respectively, before forwarding on to R1, as usual in virtual-circuit routing. R2 might not be able to
continue with labels 37 and 38 because it might already be using those for inbound traffic from somewhere
else. At this point R2 would have an MPLS virtual-circuit forwarding table like the following:
interfacein    labelin    interfaceout    labelout
R3             17         R1              37
R3             18         R1              38
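In code, the label-swap step amounts to a simple table lookup; the following Python sketch uses R2's table above, with the router names standing in for actual interfaces.

# R2's MPLS table as a dictionary keyed by (interface_in, label_in) and
# yielding (interface_out, label_out).
mpls_table_r2 = {
    ("R3", 17): ("R1", 37),
    ("R3", 18): ("R1", 38),
}

def forward(table, interface_in, label_in):
    interface_out, label_out = table[(interface_in, label_in)]
    # rewrite the label and send the packet out the chosen interface
    return interface_out, label_out

print(forward(mpls_table_r2, "R3", 17))   # ('R1', 37)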
One advantage here of MPLS is that labels live in a flat address space and thus are easy and simple to look
up, eg with a big array of 65,000 entries for 16-bit labels.
MPLS can be adapted to multicast, in which case there might be two or more ⟨labelout, interfaceout⟩ combinations for a single input.
Sometimes, packets that already have one MPLS label might have a second (or more) label pushed on the
front, as the packet enters some designated subdomain of the original routing domain.
When MPLS is used throughout a domain, ingress routers attach the initial label; egress routers strip it
off. A label information base, or LIB, is maintained at each node to hold any necessary packet-handling
information (eg queue priority). The LIB is indexed by the labels, and thus involves a simpler lookup than
examination of the IP header itself.
MPLS has a natural fit with Differentiated Services (19.7 Differentiated Services): the ingress routers could
assign the DS class and then attach an MPLS label; the interior routers would then need to examine only the
MPLS label. Priority traffic could be routed along different paths from bulk traffic.
MPLS also allows an ISP to create multiple, mutually isolated VPNs; all that is needed to ensure isolation is
that there are no virtual circuits crossing over from one VPN to another. If the ISP has multi-site customers
A and B, then virtual circuits are created connecting each pair of A's sites and each pair of B's sites. A and
B each probably have at least one gateway to the whole Internet, but A and B can communicate with each
other only through those gateways.
19.13 Epilog
Quality-of-service guarantees for real-time and other classes of traffic have been an area of active research
on the Internet for over 20 years, but have not yet come into the mainstream. The tools, however, are there.
Protocols requiring global ISP coordination such as RSVP and IP Multicast may come slowly if at all,
but other protocols such as Differentiated Services and MPLS can be effective when rolled out within a
single ISP.
Still, after twenty years it is not unreasonable to ask whether integrated networks are in fact the correct
approach. One school of thought says that real-time applications (such as VoIP) are only just beginning to
come into the mainstream, and integrated networks are sure to follow, or else that video streaming will take
over the niche once intended for real-time traffic. Perhaps IntServ was just ahead of its time. But the other
perspective is that the marketplace has had plenty of opportunity to make up its mind and has answered with
a resounding no, and it is time to move on. Perhaps having separate networks for bulk traffic and for voice
is not unreasonable or inefficient after all. Or perhaps the Internet will evolve into a network where all traffic
is handled by real-time mechanisms. Time, as usual, may tell, but not, perhaps, quickly.
19.14 Exercises
1. Suppose someone proposes TCP over multicast, in which each router collects the ACKs returning from
the group members reached through it, and consolidates them into a single ACK. This now means that, like
the multicast traffic itself, no ACK is duplicated on any single link.
What problems do you foresee with this proposal? (Hint: who will send retransmissions? How long will
packets need to be buffered for potential retransmission?)
2. In the following network, suppose traffic from RP to R3-R5 is always routed first right and then down,
while traffic from R3-R5 to RP is always routed first left and then up. What is the multicast tree for the
group G = {B1,B2,B3}?
[Diagram: a grid of nodes with RP, R1, R2 across the top row, R3, R4, R5 across the middle row, and B1, B2, B3 across the bottom row.]
3. What should an RSVP router do if it sees a PATH packet but there is no subsequent RESV packet?
4. In 19.7.1 Expedited Forwarding there is an example of an EF router with committed rate R for packets
with length L. If R and L are such that L/R is 10 µs, but the physical bandwidth delay in sending is only 2 µs,
then the packet can be held up to 8 µs for other traffic.
How large a bulk-traffic packet can this router send, in between EF packets? Your answer will involve L.
5. Suppose, in the diagram in 19.7.1 Expedited Forwarding, EF was used for voice telephony and at some
point calls entering through R1, R2 and R3 were indeed all directed to R4.
Note that the ISP has no control over who calls whom.
20 Network Management and SNMP
Network management, broadly construed, consists of all the administrative actions taken to keep a network
running efficiently. This may include a number of non-technical considerations, eg staffing the help desk
and negotiating contracts with vendors, but we will restrict attention exclusively to the technical aspects of
network management.
The ISO and the International Telecommunications Union have defined a formal model for telecommunications and network management. The original model defined five areas of concern, and was sometimes
known as FCAPS after the first letter of each area:
fault
configuration
accounting
performance
security
Most non-ISP organizations have little interest in network accounting (the A in FCAPS is often replaced
with administration for that reason, but that is a rather vague category). Network security is arguably its
own subject entirely. As for the others, we can identify some important subcategories:
fault:
device management: monitoring of all switches, routers, servers and other network hardware to make
sure they are running properly.
server management: monitoring of the network's application layer, that is, all network-based software services; these include login authentication, email, web servers, business applications and file
servers.
link management: monitoring of long-haul links to ensure they are working.
configuration:
network architecture: the overall design, including topology, switching vs routing and subnet layout.
configuration management: arranging for the consistent configuration of large numbers of network
devices.
change management: how does a site roll out new changes, from new IP addresses to software
updates and patches?
performance:
traffic management: using the techniques of 18 Queuing and Scheduling to allocate bandwidth
shares (and perhaps bucket sizes) among varying internal or external clients or customers.
service-level management: making sure that agreed-upon service targets (eg bandwidth) are met
(depending on the focus, this could also be placed in the fault category).
While all these aspects are important, technical network management narrowly construed often devolves to
an emphasis on fault management and its companion, reliability management: making sure nothing goes
wrong, and, when it does, fixing it promptly. It is through fault management that some network providers
achieve the elusive availability goal of 99.999% uptime.
SNMP versus Management
While SNMP is a very important tool for network management, it is just a tool. Network management
is the process of making decisions to achieve the goals outlined above, subject to resource constraints.
SNMP simply provides some input for those decisions.
By far the most common device-monitoring protocol, and the primary focus for this chapter, is the Simple
Network Management Protocol or SNMP (20.2 SNMP Basics). This protocol allows a device to report
information about its current operational state; for example, a switch or router may report the configuration
of each interface and the total numbers of bytes and packets sent via each interface.
Implicit in any device-monitoring strategy is initial device discovery: the process by which the monitor
learns of new devices. The ping protocol (7.9 Internet Control Message Protocol) is common here, though
there are other options as well; for example, it is possible to probe the UDP port on a node used for SNMP,
usually 161. As was the case with router configuration (9 Routing-Update Algorithms), manual entry is
simply not a realistic alternative.
SNMP and the Application Layer
SNMP can be studied entirely from a network-management perspective, but it also makes an excellent
self-contained case study of the application layer. Like essentially all applications, SNMP defines
rules for client and server roles and for the format of requests and responses. SNMP also contains its
own authentication mechanisms (20.11 SNMPv1 communities and security and 20.15 SNMPv3),
generally unrelated to any operating-system-based login authentication.
It is a practical necessity, for networks of even modest size, to automate the job of checking whether everything is working properly. Waiting for complaints is not an option. Such a monitoring system is known as a
network management system or NMS; there are a wide range of both proprietary and open-source NMSs
available. At its most basic, an NMS consists of a library of scripts to discover new network devices and
then to poll each device (possibly but not necessarily using SNMP) at regular intervals. Generally the data
received is recorded and analyzed, and alarms are sounded if a failure is detected.
When SNMP was first established, there was a common belief that it would soon be replaced by the OSI's
Common Management Interface Protocol. CMIP is defined in the International Telecommunication Union's
X.711 protocol and companion protocols. CMIP uses the same ASN.1 syntax as SNMP, but has a richer
operations set. It remains the network management protocol of choice for OSI networks, and was once upon
a time believed to be destined for adoption in the TCP/IP world as well.
But it was not to be. TCP/IP-based network-equipment vendors never supported CMIP widely, and so any
network manager had to support SNMP as well. With SNMP support essentially universal, there was never
a need for a site to bother with CMIP.
CMIP, unlike SNMP, is supported over TCP, using the CMIP Over TCP, or CMOT, protocol. One advantage
of using TCP is the elimination of uncertainty as to whether a request or a reply was received.
until a response is received. If a response (or even a request) is too big, the usual strategy is to use IP-layer
fragmentation (7.4 Fragmentation and 8.4.4 Fragment Header). This is not ideal, but as most SNMP data
stays within local and organizational networks it is at least workable.
Another consequence of the use of UDP is that every SNMP device needs an IP address. For devices acting
at the IP layer and above, this is not an issue, but an Ethernet switch or hub has no need of an IP address for
its basic function. Devices like switches and hubs that are to use SNMP must be provided with a suitable IP
interface and address. Sometimes, for security, these IP addresses for SNMP management are placed on a
private, more-or-less hidden subnet.
SNMP also supports the writing of attributes to devices, thus implementing a remote-configuration mechanism. As writing to devices has rather more security implications than reading from them, this mechanism
must be used with care. Even for read-only SNMP, however, security is an important issue; see 20.11 SNMPv1 communities and security and 20.15 SNMPv3.
Writing to agents may be done either to configure the network behavior of the device (eg to bring an
interface up or down, or to set an IP address) or specifically to configure the SNMP behavior of the agent.
Finally, SNMP supports asynchronous notification through its traps mechanism: a device can be configured
to report an error immediately rather than wait until asked. While traps are quite important at many sites,
we will largely ignore them here.
SNMP is sometimes used for server monitoring as well as device monitoring; alternatively, simple scripts
can initiate connections to services and discern at least something about their readiness. It is one thing,
however, to verify that an email server is listening on TCP port 25 and responds with the correct RFC 5321
(originally RFC 821) EHLO message; it is another to verify that messages are accepted and then actually
delivered.
Starting in 1985, the International Standards Organization (ISO) and the standardization sector of the International Telecommunications Union (ITU-T, then known as CCITT) developed the Object Identifier scheme
for naming anything in the world that needed naming. Names, or OIDs, consist of strings of non-negative integers. In textual representation these component integers are often written separated by periods, eg 1.3.6.1;
the notation { 1 3 6 1 } is also used. The scheme is standardized in ITU-T X.660.
All OIDs used by SNMP begin with the prefix 1.3.6.1. For example, the sysName attribute corresponds to
OID 1.3.6.1.2.1.1.5. The prefix 1.3.6.1.2.1 is known as mib-2, and the one-step-longer prefix 1.3.6.1.2.1.1
is system. Occasionally we will abuse notation and act as if names referred to single additional levels
rather than full prefixes, eg the latter two names in mib-2.system.sysName.
The basic SNMP read operation is Get() (sent via the GetRequest protocol message), which takes an
OID as parameter. The agent receiving the Get request will, if authentication checks out and if the OID
corresponds to a valid attribute, return a pair consisting of the OID and the attribute value. We will return to
this in 20.8 SNMP Operations, and see how multiple attribute values can be requested in a single operation.
OIDs form a tree or hierarchy; the immediate child nodes of, say, 1.3.6.1 of length 4 are all nodes 1.3.6.1.N
of length 5. The root node, with this understanding, is anonymous; OIDs are sometimes rendered with a
leading . to emphasize this: .1.3.6.1. Alternatively, the numbers can be thought of as labels on the arcs of
the tree rather than the nodes.
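A small Python sketch may help make the tree structure concrete: treating OIDs as tuples of integers, the parent/descendant relationship is just a prefix test. The OIDs used here are the mib-2 and sysName prefixes mentioned below.

# OIDs as integer tuples, with a prefix test for the parent/descendant relation.
def parse_oid(text):
    return tuple(int(part) for part in text.strip(".").split("."))

def is_prefix(prefix, oid):
    return oid[:len(prefix)] == prefix

mib2    = parse_oid("1.3.6.1.2.1")
sysname = parse_oid("1.3.6.1.2.1.1.5")
print(is_prefix(mib2, sysname))     # True: sysName lies under mib-2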
There is no intrinsic way to distinguish internal OID nodes (prefixes) from leaf OID nodes that correspond
to actual named objects. Context is essential here.
It is common to give the numeric labels at any specific level human-readable string equivalents. The three
nodes immediately below the root are, with their standard string equivalents:
itu-t(0)
iso(1)
joint-iso-itu-t(2)
The string equivalents can be thought of as external data; when OIDs are encoded only the numeric values
are included.
OID naming has been adopted in several non-SNMP contexts as well. ISO and ITU-T both use OIDs
to identify official standards documents. The X.509 standard uses OID naming in encryption certificates;
see RFC 5280. The X.500 directory standard also uses OIDs; the related RFC 4524 defines several OID
classes prefixed by 0.9.2342.19200300.100.1. As a non-computing example, Health Level Seven names US
healthcare information beginning with the OID prefix 2.16.840.1.113883.
As mentioned above, SNMP uses OIDs beginning with 1.3.6.1; in the OID tree these levels correspond to
iso(1)
org(3): organizations
dod(6): the US Department of Defense, the original sponsor of the Internet
internet(1)
This is spelled out in RFC 1155 via the syntax
internet      OBJECT IDENTIFIER ::= { iso org(3) dod(6) 1 }
which means that the string internet has the type OBJECT IDENTIFIER and its actual value is the list
to the right of the "::=": 1.3.6.1. The use of iso here represents the OID prefix .1, conceptually different from
just the level iso(1). The formal syntax here is ASN.1, as defined by the ITU-T standard X.680. We will
expand on this below in 20.6 ASN.1 Syntax and SNMP, though the presentation will mostly be informal;
further details can be found at http://www.itu.int/en/ITU-T/asn1/Pages/introduction.aspx.
After the above, RFC 1155 then defines the OID prefixes for management information, mgmt, and for
private-vendor use, private, as
mgmt          OBJECT IDENTIFIER ::= { internet 2 }
private       OBJECT IDENTIFIER ::= { internet 4 }
Again, internet represents the newly defined prefix 1.3.6.1. Most general-purpose SNMP OIDs begin
with the mgmt prefix. The private prefix is for OIDs representing vendor-specific information; for
example, the prefix 1.3.6.1.4.1.9 has been delegated to Cisco Systems, Inc. RFC 1155 states that both the
mgmt and private prefixes are delegated by the IAB to the IANA.
Responsibility for assigning names that begin with a given OID prefix can easily be delegated. The Internet
Activities Board, for example, is in charge of 1.3.6.1. (According to RFC 1155, this delegation by the
Department of Defense to the IAB was never made officially; the IAB just began using it.)
20.4 MIBs
A MIB, or Management Information Base, is a set of definitions that associate OIDs with specific attributes, and, along the way, associates each numeric level of the OIDs with a symbolic name. For example,
below is the set of so-called System definitions in the core RFC 1213 MIB known as MIB-2 (20.10 MIB-2). The mib-2 and system prefixes are first defined as
mib-2         OBJECT IDENTIFIER ::= { mgmt 1 }
system        OBJECT IDENTIFIER ::= { mib-2 1 }
that is, 1.3.6.1.2.1 and 1.3.6.1.2.1.1 respectively. The system definitions, which represent actual attributes
that can be retrieved, are as follows
sysDescr        { system 1 }
sysObjectID     { system 2 }
sysUpTime       { system 3 }
sysContact      { system 4 }
sysName         { system 5 }
sysLocation     { system 6 }
sysServices     { system 7 }
Most of these attributes represent string values that need to be administratively defined; we will return to
these in 20.10 MIB-2.
The MIB is not the actual attributes themselves; the set of all ⟨OID,value⟩ pairs stored by an SNMP manager
(perhaps the result of a single set of queries, perhaps the result of a sequence of queries over time) is
sometimes known as the management database, or MDB, though that term is not as universal.
Colloquially, a MIB can be either the abstract set of OID definitions or a particular ASCII file that defines the
set. The latter must be run through a MIB compiler to be used by other software; users of a MIB browser
generally find the information much more useful if the appropriate MIB file is loaded before making SNMP
queries. Some RFCs can be read directly by the MIB compiler, which knows to edit out the non-MIB
discussion.
The Net-SNMP package contains SNMP agents for Linux and Macintosh computers (Microsoft has their
own SNMP-agent software), and command-line tools for making SNMP queries; these latter tools are also
available for Windows. The sysName value, for example, can be retrieved with the snmpget command
snmpget -v1 -c public localhost 1.3.6.1.2.1.1.5.0
The OID here corresponds to { system 5 0 }; the final 0 represents an SNMP convention that indicates this is
a scalar value and is not part of a table. See 20.9 MIB Browsing for further examples.
MIBs serve as both documentation and data. As documentation, they tell the implementer of a given SNMP
agent what information is to be returned for each OID request. As data, they serve to translate numeric OIDs
into human-readable data. For example, if the above snmpget command understands the appropriate MIB
(eg because the MIB file is in the right place), we can type
snmpget -v1 -c public localhost sysName.0
It is not uncommon for an installation to involve several hundred different (possibly overlapping) MIBs, each
defining a different portion of the OID tree. This is particularly true of the private subtree 1.3.6.1.4.1,
for which one might have a separate MIB file for each manufacturer and device.
20.5 SNMPv1 Data Types

The principal SNMPv1 data types, from the RFC 1155 SMI, are the following:

Type                 Description
INTEGER              A 32-bit integer
OCTET STRING         Either a text string or a network address
Counter              A non-negative INTEGER that increases from 0, and can wrap around
Gauge                An INTEGER that can rise and fall but that never wraps around
TimeTicks            An INTEGER used to measure time in 1/100ths of a second
OBJECT IDENTIFIER    An OID appearing as data
IpAddress            An IPv4 address, as an OCTET STRING of length 4
NetworkAddress       An IpAddress
Opaque               Any other data, ultimately an OCTET STRING
SEQUENCE OF type: This defines a homogeneous list (an array or table) of objects of type type. In most cases, type represents a SEQUENCE type, and so the list is a list of records, or a table in the database sense.
Note the importance of the keyword OF here to distinguish records from lists.
The OBJECT-TYPE macro is heavily used in MIB definitions. It has the format
objname OBJECT-TYPE
SYNTAX type
ACCESS read-write or read-only or write-only or not-accessible
STATUS mandatory or optional or obsolete
DESCRIPTION descriptive string
INDEX OID used for table indexing
::= OID assigned to objname
The objname, sometimes called the OBJECT DESCRIPTOR, is intended to be globally unique. The value
for SYNTAX must be a valid ASN.1 syntax specification; see 20.10 MIB-2 for examples. The values for ACCESS and STATUS are, in each case, one of the literal strings shown above. The value for
DESCRIPTION is optional.
Many of the objects so defined will represent tables or other higher-level structures; in this case the Object
ID of the last line will represent an internal node of the OID tree rather than a leaf node, and the ACCESS
will be not-accessible. It is not that the table is actually inaccessible, but rather that the attributes
must be retrieved one by one rather than collectively.
Here is a concrete simple example; other examples appear in the following section. The OID value of
ifEntry here is 1.3.6.1.2.1.2.2.1.
ifInOctets OBJECT-TYPE
SYNTAX Counter
ACCESS read-only
STATUS mandatory
DESCRIPTION
"The total number of octets received on the
interface, including framing characters."
::= { ifEntry 10 }
The data type definitions and the ASN.1 syntax above used to define them are collectively known as the structure of management information, or SMI. The SMI in effect defines all the types and structures needed by the
various MIBs. For SNMPv1, the SMI is defined in RFC 1155.
20.7 SNMP Tables

Here is some sample data from the interfaces table; the first column, ifIndex, is the index.

ifIndex   name   MTU     bitrate       inOctets     inPackets
1         lo     16436   10,000,000    171          3
2         eth0   1500    100,000,000   37155014     1833455677
3         eth1   1500    100,000,000   0            0
4         ppp0   1420    0             2906687015   2821825
Routes
Although the data here is mostly a mash-up of records from different sources, the 10.38.0.0 route below
was a VPN route using the ppp0 tunnel above. Otherwise, workstation forwarding tables usually have
just two entries, for the local subnet and the default route.
The SNMP routing table, ipRouteTable, based on the IP forwarding table of 1.10 IP - Internet
Protocol, has a column representing the destination network, a variety of status and information columns
(eg for RouteAge), and a NextHop column; the index is the destination-network attribute. Again, here is some sample data, with the index column (the destination) first. The index here (an IP address) is a compound object: a list of four integers (yes, we can represent an IPv4 address as a single 32-bit integer, but usually do not).
dest        mask             metric   next_hop      type
0.0.0.0     0.0.0.0          1        192.168.1.1   indirect(4)
10.0.0.0    255.255.255.0    0        0.0.0.0       direct(3)
10.38.0.0   255.255.0.0      1        192.168.4.1   indirect(4)
(The type column indicates whether the destination is local or nonlocal; the host is on subnet 10.0.0.0/24
and so the middle entry involves local delivery. The use of indirect(4) and direct(3) is an example
of an SNMP enumerated type.)
The indexing of the interfaces table is (usually) dense: the values for ifIndex are typically consecutive
numbers. The routing table, on the other hand, has a sparse index; the index values are likely nowhere near
one another.
The TCP Connections table tcpConnTable lists every TCP connection together with its connection
state as in 12.6 TCP state diagram; one can obtain this on most linux, windows or macintosh systems
with the command netstat -a. The index in this case consists of the four attributes of the connection-defining socketpair: the local address, the local port, the remote address and the remote port. In this table,
the only attribute not part of the index is an integer representing the connection state (again represented by
an SNMP enumerated type).
localAddr   localPort   remoteAddr      remotePort   state
10.0.0.3    31895       147.126.1.209   993          established(5)
10.0.0.3    40113       74.125.225.98   80           timeWait(11)
10.0.0.3    20459       10.38.2.42      22           established(5)
SNMP has adopted conventions for how tabular data such as the above is to be encoded in the OID tree.
The basic strategy is first to define, statically, an OID prefix for the subtree representing the table; in this
section we will denote this by T (later, in 20.10.2 Table definitions and the interfaces Group, we will see
that T often represents two OID levels of the form Table.Entry, or T.E). Each attribute of the table (that is,
each column) is then assigned a non-leaf OID by appending successive integers to the table prefix, starting
at 1. For example, the root prefix for the interfaces table is T = 1.3.6.1.2.1.2.2.1, known as ifEntry. The
columns shown in the fraction of the interfaces table above have the following OIDs:
OID                       table column
{ ifEntry 1 }  or T.1     The interface number, ifIndex
{ ifEntry 2 }  or T.2     The interface name or description
{ ifEntry 4 }  or T.4     The interface MTU
{ ifEntry 5 }  or T.5     The interface bitrate
{ ifEntry 10 } or T.10    The number of inbound octets (bytes)
{ ifEntry 11 } or T.11    The number of inbound unicast packets
The second step is to establish a convention for converting the index attributes to a list of integers that can
be used as an OID suffix. This index OID suffix is then appended to the attribute OID prefix to obtain the
full OID (with no trailing .0 this time) that represents the name of the table data value. The prefix specifies
the table and the column, and the suffix specifies the row.
For the case of the interfaces table, the interface number is used as a single-level suffix. The eth1 value
for inOctets thus has OID name T.10.3: 10 is the number assigned to the inOctets column and 3 is
the ifIndex value for the eth1 row. Written in full, this is 1.3.6.1.2.1.2.2.1.10.3. Note this numbering
has the form T.col.row, the transposition of the more common programming-language row-first convention
T[row][col].
For the routing table, a destination IP address, eg 10.38.0.0, is converted to a four-level OID suffix, eg
10.38.0.0. (Note that 10.38.0.0 has exactly the same textual representation as an IPv4 address and as a four-level OID suffix, but logically they are entirely different things). An IP address might have been encoded
as a single OID level using the address as a single 32-bit integer, but this option was not chosen. In the
table above, the nextHop column is assigned the number 7, so the nextHop for 10.38.0.0 thus has the OID
T.7.10.38.0.0.
For the TCP connections table, the two IP addresses involved each translate to four-level OID chunks, as in
the routing table, and the two port numbers each translate to one-level chunks; the resultant OID suffix for
the state of the first connection in the table above would be
10.0.0.3.31895.147.126.1.209.993
where the port numbers, 31895 and 993, each occupy a single level. The state column is assigned the identifier 1,
so this would all be appended to T.1 (it is common, but, as we see here, not universal, for column-number
assignment to begin with the index columns).
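These suffix conventions are mechanical enough to capture in a few lines of code. The following Python sketch (an illustration only, not taken from any SNMP library) builds the full OIDs for the three examples just given; the ipRouteTable and tcpConnTable entry prefixes used here are their MIB-2 locations, mib-2.ip.21.1 and mib-2.tcp.13.1.

IF_ENTRY = "1.3.6.1.2.1.2.2.1"     # interfaces table prefix T (ifEntry)
IP_ROUTE = "1.3.6.1.2.1.4.21.1"    # ipRouteTable entry prefix (ipRouteEntry)
TCP_CONN = "1.3.6.1.2.1.6.13.1"    # tcpConnTable entry prefix (tcpConnEntry)

def full_oid(table_prefix, column, *index_parts):
    # prefix.column followed by the index attributes, each contributing OID levels
    return ".".join([table_prefix, str(column)] + [str(p) for p in index_parts])

# interfaces table: suffix is the single-level ifIndex; inOctets is column 10
print(full_oid(IF_ENTRY, 10, 3))                 # 1.3.6.1.2.1.2.2.1.10.3 (eth1)

# routing table: suffix is the four-level destination address; nextHop is column 7
print(full_oid(IP_ROUTE, 7, "10.38.0.0"))        # 1.3.6.1.2.1.4.21.1.7.10.38.0.0

# TCP table: suffix is localAddr.localPort.remoteAddr.remotePort; state is column 1
print(full_oid(TCP_CONN, 1, "10.0.0.3", 31895, "147.126.1.209", 993))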
Here is a column-oriented diagram of the interfaces-table fragment from above. As we have been doing, the
table prefix is denoted T and nodes are labeled T.col.row. The topmost T and the first row (with the T.col
OIDs) are internal nodes; these are drawn with the heavier, blue boxes. The lower four rows, with black
boxes, are leaf nodes.
The diagram above emphasizes the arrangement into columns. The actual OID tree structure, however, is as
in the diagram below; all the leaf nodes in one column above are actually sibling nodes.
In all three tables here, the index columns are part of the table data. This is unnecessary; all the index
columns are of necessity encoded in the full OID of any table entry. In SNMPv2 it is required (though not
necessarily enforced) to in effect omit the index columns from the table as presented by the agent (though
they are still declared); see 20.13.4 SNMPv2 Indexes.
We also note that, in all three tables here (and in most SNMP tables that serve purely as sources of information), new rows cannot be added via SNMP. Some SNMP tables, however, do support row creation; see
20.14 Table Row Creation.
OID strictly following oid in lexicographical order, which we will call oid′. The agent then includes the pair ⟨oid′,value′⟩ in its response, where value′ is the value corresponding to oid′. Note that in the GetNext() case oid′ is not previously known to the manager. Note also that SNMP always returns, with each value, the corresponding OID, which for tabular data encodes the full index for the value.
The GetNext() operation allows a manager to walk through any subtree of the OID tree. If the root of the subtree is prefix T, then the first call is to GetNext(T). This returns ⟨oid1,value1⟩. The next call to GetNext has parameter oid1; the agent returns ⟨oid2,value2⟩. The manager continues with the series of GetNext calls until, finally, the subtree is exhausted and the agent returns either an error or else (more likely) an ⟨oidN,valueN⟩ for which oidN is no longer an extension of the original prefix T.
As an example, let us start with the prefix 1.3.6.1.2.1.1, the start of the system group. The first call to
GetNext() will return the pair
⟨1.3.6.1.2.1.1.1.0, sysDescr_value⟩
The OID 1.3.6.1.2.1.1.1.0 is the first leaf node below the interior node 1.3.6.1.2.1.1.
The second call to GetNext will take OID 1.3.6.1.2.1.1.1.0 as parameter and GetNext() will return
⟨1.3.6.1.2.1.1.2.0, sysObjectID_value⟩
Here, the OID returned is the next leaf node strictly following 1.3.6.1.2.1.1.1.0. The process will continue on through {system 3 0}, {system 4 0}, etc, until an OID is found that is not part of the system
group. As we have seen, this will likely be the first entry of the interfaces group, ifNumber.0, with OID
1.3.6.1.2.1.2.1.0.
The action of GetNext() is particularly useful when retrieving a table, such as the interfaces table presented in the preceding section. In that example the index values are consecutive integers, and so an SNMP
manager could likely guess them. However, guessing the index values for the other two tables (the IP forwarding table and the TCP connections table) would be well-nigh impossible. And without the index
values, we do not have a complete OID.
Consider again the partial-column version of the interfaces table as diagrammed in this format:
Let us initially call GetNext(T). The next leaf node is the upper left black box, with OID T.1.1. The call to GetNext(T) returns the pair ⟨T.1.1,1⟩. We now continue with a call to GetNext(T.1.1); this returns ⟨T.1.2,2⟩ and represents the black box immediately below the first one. The next two calls to GetNext() return, successively, ⟨T.1.3,3⟩ and ⟨T.1.4,4⟩.

Now we call GetNext(T.1.4). The next leaf node following is the first leaf node in the second column, T.2.1; the value returned is ⟨T.2.1,lo⟩. The next three calls to GetNext(), each time supplied with the OID returned by the previous GetNext(), return the next three values making up the second, ifDescr column of the table.

At the bottom of the second column, the call to GetNext(T.2.4) returns the ⟨OID,value⟩ pair that is the first leaf node of the third column, ⟨T.3.1,16436⟩. At the bottom of the third column, GetNext() jumps to
the top of the fourth column, and so on. In this manner GetNext() iterates through the entire table, one
column at a time.
We finally get to the last leaf node of the table, shown here as the lower-right T.11.4 (though the actual ifTable has additional columns). The call to GetNext(T.11.4) returns something outside the table. It will
either return an error (in the event that there is no next OID that the manager is authorized to receive) or the next leaf node up and to the right of T. (In the normal MIB-2 collection, this would be the first entry
of the atTable table.)
Either way, the manager receiving the data can tell that its request for GetNext(T.11.4) has gone past the
end of table T, and so T has been completely traversed.
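The traversal strategy is easy to express in code. Here is a minimal Python sketch of such a walk; get_next() is a hypothetical stand-in for an actual SNMP Get-Next request (in practice supplied by an SNMP library or the snmpgetnext utility), assumed to return the next ⟨OID,value⟩ pair or None on error.

def walk(prefix, get_next):
    # get_next(oid) is assumed to return the (oid, value) pair for the first
    # leaf OID strictly following oid, or None if the agent returns an error.
    results = []
    oid = prefix
    while True:
        pair = get_next(oid)
        if pair is None:                        # no next OID we are allowed to read
            break
        oid, value = pair
        if not oid.startswith(prefix + "."):    # we have walked past the subtree
            break
        results.append((oid, value))
    return results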
Here is a diagram of the above sequence of GetNext() operations involved in traversing our partial interfaces table; it shows the GetNext(T.11.4) at the table's lower right returning the beyond-the-table ⟨OID,value⟩ pair ⟨A,avalue⟩:

Some manager utilities performing this kind of table retrieval (often called a walk) will present the data in the order retrieved, one column at a time, and some will (perhaps as an option) format the data visually to look more like a table. See Net-SNMP's snmpwalk and snmptable in 20.9 MIB Browsing.
20.8.2 Set()
SNMP also allows a Set() operation. Not all attributes are writable, of course, and a manager must have
write authority for those that are writable. And there is no SetNext(); the exact OID of each value must
be supplied.
In the system group, the sysName attribute is writable. It has OID 1.3.6.1.2.1.1.5.0; to update this we
would invoke
Set(1.3.6.1.2.1.1.5.0, newsysname)
Like Get(), Set() can be invoked on multiple attributes. Suppose table T has three columns T.1, T.2
and T.3, and two rows:
T.1   T.2    T.3
10    eth0   3.14159
11    eth1   2.71828
The OID here is that of ifTable. Multiple OIDs are permitted for snmpget and snmpgetnext but not
for snmpwalk.
If the appropriate MIB files are loaded, the above command can also be entered as
snmpwalk -v 1 -c tengwar localhost ifTable
This entails putting the RFC1213-MIB file into a special directory (eg $HOME/.snmp/mibs), and then either
adding a line like
mibs +RFC1213-MIB
to the snmp.conf file, or by including the MIB name on the command line:
snmpwalk -v 1 -c tengwar -m RFC1213-MIB localhost ifTable
Note that RFC1213-MIB is the identifier assigned to the MIB file in its first line, as below, and not necessarily the name of the file containing the MIB.
RFC1213-MIB DEFINITIONS ::= BEGIN
If the appropriate MIB file is loaded, the OIDs may be displayed symbolically rather than numerically, but
the data presentation will not change:
iso.3.6.1.2.1.1.5.0 = STRING: "valhal"           ;; without MIB
RFC1213-MIB::sysName.0 = STRING: "valhal"        ;; with MIB
(Actually, if the data is of type Object ID, then without the MIB the OID will be displayed numerically, and
with the appropriate MIB all or part of it may be displayed symbolically.)
Finally, the Net-SNMP manager package includes snmptable, which is like snmpwalk except that the
data is displayed as a table rather than one column at a time. For the snmptable command, the appropriate
MIB file must be installed. Net-SNMP comes with a mib browser with a graphical interface, tkmib.
The Java SNMP open-source package at http://jsevy.com/snmp/ includes both a Java library to encode and
decode SNMP packets and the SNMP Inquisitor browser.
The personal edition of the iReasoning MIB browser is not open-source, but it is free, and works well on
windows, macs and linux; SNMPv3 is not supported. The license does prohibit the publication of benchmark
tests without consent.
20.10 MIB-2
We can now turn to the MIB that represents the core of SNMPv1 data, known as MIB-2, and defined in RFC
1213. The 2 here (often represented with a Roman numeral: MIB-II) represents the second iteration of
the definition, but it is still part of SNMPv1. The predecessor MIB-1 was first documented in RFC 1066,
1988.
As we saw in 20.4 MIBs, the MIB-2 OID prefix is 1.3.6.1.2.1.
MIB-2 has since been extended. We look at a few extensions below in 20.13.7 SNMPv2 MIB Changes,
20.13.9 IF-MIB and ifXTable, 20.13.12 TCP-MIB and 20.13.11.2 IP-Forward MIB. In general, serious
network management should make use of these newer versions. However, MIB-2 is still an excellent place
to get started, even if partly obsolete.
The original MIB-2 is divided into ten groups, not all of which are in current use:
system(1): above
interfaces(2): above, in brief
at(3): the ARP cache, 7.7 Address Resolution Protocol: ARP
Some of the interfaces-group definitions have later been updated. See, for example, RFC 2863, which also
redefines the group to use the additional features of the SNMPv2 SMI. We will return to an extension of the
interfaces group in 20.13.9 IF-MIB and ifXTable.
MIB table definitions almost always involve a two-level process: an OID is defined for the table, and then
a second OID is defined for a table entry, that is, for one row of the table. This second OID is usually
generated from the first by appending .1, and it is this second OID that represents the table prefix in the
sense of 20.7 SNMP Tables, denoted there by T.
The actual ASN.1, slightly annotated, is as follows:
ifTable OBJECT-TYPE
SYNTAX SEQUENCE OF IfEntry
-- note UPPER-CASE-I IfEntry
ACCESS not-accessible
STATUS mandatory
DESCRIPTION
"A list of interface entries. The number of
entries is given by the value of ifNumber."
::= { interfaces 2 }
-- that is, 1.3.6.1.2.1.2.2
ifEntry OBJECT-TYPE
-- note lower-case-i ifEntry
SYNTAX IfEntry
-- note UPPER-CASE-I IfEntry
ACCESS not-accessible
STATUS mandatory
DESCRIPTION
"An interface entry containing objects at the
subnetwork layer and below for a particular interface."
INDEX
{ ifIndex }
::= { ifTable 1 }
-- that is, 1.3.6.1.2.1.2.2.1
Both these entries are not-accessible as they do not represent leaf nodes. The second declaration
above is for the lower-case-i ifEntry; the next definition in RFC 1213 is for the UPPER-CASE-I version.
The latter represents the complete list of all columns of an ifEntry/IfEntry object, together with their types
from (in most cases) 20.5 SNMPv1 Data Types. The PhysAddress type is defined in RFC 1213 as a
synonym for OCTET STRING. Definitions for the OIDs associated with each column come later.
It is ifEntry that represents an actual row, and thus includes an INDEX entry to specify the attribute or
attributes that make up the primary key for that row.
We now define IfEntry:
IfEntry ::=
    SEQUENCE {
        ifIndex             INTEGER,
        ifDescr             DisplayString,
        ifType              INTEGER,
        ifMtu               INTEGER,
        ifSpeed             Gauge,
        ifPhysAddress       PhysAddress,
        ifAdminStatus       INTEGER,
        ifOperStatus        INTEGER,
        ifLastChange        TimeTicks,
        ifInOctets          Counter,
        ifInUcastPkts       Counter,
        ifInNUcastPkts      Counter,
        ifInDiscards        Counter,
        ifInErrors          Counter,
        ifInUnknownProtos   Counter,
        ifOutOctets         Counter,
        ifOutUcastPkts      Counter,
        ifOutNUcastPkts     Counter,
        ifOutDiscards       Counter,
        ifOutErrors         Counter,
        ifOutQLen           Gauge,
        ifSpecific          OBJECT IDENTIFIER
    }
Attributes
There can be a certain eye-glazing tedium in some of SNMP's lengthy attribute lists, such as the one above. This can be replaced by panic in a hurry, though, when a problem has arisen and the proper attribute to diagnose it doesn't seem to be included anywhere.
It would have been possible to plug the above definition of IfEntry into the SYNTAX specification of the
previous ifEntry, but that is cumbersome. The IfEntry definition is not a stand-alone OBJECT-TYPE and
does not have its own OID.
If all we wanted to do was to implement the interfaces table using the shortest possible OIDs, we would
not have created separate ifTable and ifEntry OIDs. This would mean, however, that we could not use
the OBJECT-TYPE macro to define ifTable, which would have been less consistent (especially as the
ASN.1 syntax also determines the packet encoding, as in 20.12 SNMP and ASN.1 Encoding). Essentially
every table prefix in SNMP is defined using two additional OID levels, as here, rather than one.
We now turn to the 22 specific interface attributes. Here is the definition for the first, ifIndex; it defines
column 1 of the interfaces table to be, in effect, ifEntry.1. As we are now talking about a leaf node, once the
OID suffix is appended to represent the index, the ACCESS is no longer not-accessible.
ifIndex OBJECT-TYPE
SYNTAX INTEGER
ACCESS read-only
STATUS mandatory
DESCRIPTION
"A unique value for each interface. Its value ranges
between 1 and the value of ifNumber. The value for each
interface must remain constant at least from one
reinitialization of the entity's network management system
to the next reinitialization."
::= { ifEntry 1 }
The DESCRIPTION clearly indicates that the values for ifIndex, used to specify interfaces, are to be
consecutive integers. Compliance with this rule has been an early casualty, for various reasons, and is
formally withdrawn in RFC 2863. Some vendors simply number physical interfaces non-consecutively. In
other cases, there is some underlying issue with consecutive numbering. For example, one of the author's
systems running Net-SNMP returns ifNumber = 3, and then the following table values:
ifIndex   ifName
1         lo
2         eth0
549       ppp0
It turns out that ppp0 is a virtual interface corresponding to a VPN tunnel, 3.1 Virtual Private Network,
and the underlying tunnel regularly fails and is then automatically re-instantiated. Each time it does so, the
ifIndex is incremented by 1.
The rule that interfaces be numbered consecutively was formally deprecated in RFC 2863, an SNMPv2
update of the interface group. This, in turn, makes the ifNumber value rather less useful than it might be;
most SNMP tables are not associated with a count attribute and seem to do just fine.
Most SNMP data values correspond straightforwardly with attributes defined by the hardware and the underlying operating system. The ifIndex value does not, at least not necessarily. Generally the agent must
maintain at least some state to keep the ifIndex value consistent. In some cases, the ifIndex value
may be taken from the relative position of the interface in some internal operating-system table. This is not,
however, universally the case, as with the ifIndex value of 549 for ppp0 in the table above.
The ifIndex value is widely used throughout SNMP, and is often referenced in other tables. SNMPv2 even
defines a special type (a TEXTUAL CONVENTION) for it, named InterfaceIndex (20.13.1 SNMPv2
SMI and Data Types).
As we mentioned earlier, there is no reason to include ifIndex as an actual column in the table; the value
of ifIndex can always be calculated from the OID of any component of the row. The SNMPv2 approach
here basically to define ifIndex as not-accessible is described below in 20.13.4 SNMPv2
Indexes.
The ifDescr is a textual description of the interface; it is usually the device name associated with the
interface. RFC 1213 states this string should include the name of the manufacturer, the product name and
the version of the hardware interface, but this is inconsistent with linux device-naming conventions. The
current rule is that it is merely unique.
The ifType attribute is our first example of how ASN.1 handles enumerated types. The value is a small
integer and the hardware type associated with each integer is spelled out as follows. The majority of the
networking technologies in this 1991 list have pretty much vanished from the face of the earth.
ifType OBJECT-TYPE
SYNTAX INTEGER {
other(1),
-- none of the following
regular1822(2),
hdh1822(3),
ddn-x25(4),
rfc877-x25(5),
ethernet-csmacd(6),
iso88023-csmacd(7),
iso88024-tokenBus(8),
iso88025-tokenRing(9),
iso88026-man(10),
starLan(11),
proteon-10Mbit(12),
proteon-80Mbit(13),
hyperchannel(14),
fddi(15),
lapb(16),
sdlc(17),
ds1(18),
-- T-1
e1(19),
-- European equiv. of T-1
basicISDN(20),
primaryISDN(21),
propPointToPointSerial(22),   -- proprietary serial
ppp(23),
softwareLoopback(24),
eon(25),
-- CLNP over IP [11]
ethernet-3Mbit(26),
nsip(27),
-- XNS over IP
slip(28),
-- generic SLIP
ultra(29),
-- ULTRA technologies
ds3(30),
-- T-3
sip(31),
-- SMDS
frame-relay(32)
}
ACCESS read-only
STATUS mandatory
DESCRIPTION
"The type of interface, distinguished according to
the physical/link protocol(s) immediately below
the network layer in the protocol stack."
::= { ifEntry 3 }
A more serious problem is that over two hundred new technologies are unlisted here. To address this, the
values for ifType have been placed under control of the IANA, as defined in IANAifType; see RFC
2863 and https://www.iana.org/assignments/ianaiftype-mib/ianaiftype-mib. The IANA can then add new
types without formally updating any RFC.
For the meaning of ifMtu, the interface MTU, see 7.4 Fragmentation. The 32-bit ifSpeed value will
be unusable once speeds exceed 2 Gbps; RFC 2863 defines an ifHighSpeed object with speed measured
in units of Mbps. This entry is part of the ifXTable table, 20.13.9 IF-MIB and ifXTable. RFC 2863
also clarifies that, for virtual interfaces that do not really have a bandwidth, the value to be reported is zero
(though in the example earlier the loopback interface lo was reported to have a bandwidth of 10 Mbps).
The ifPhysAddress is, on Ethernets, the Ethernet address of the interface.
The ifAdminStatus and ifOperStatus attributes are enumerated types: up(1), down(2), testing(3).
If the ifAdminStatus is up(1), and the ifOperStatus disagrees, then there is a likely hardware
malfunction. The ifLastChange attribute reflects the last time, in TimeTicks, there was a change in
ifOperStatus.
The next eleven entries count bytes, packets and errors. The ifInOctets and ifOutOctets count
bytes received and sent (including framing bytes, 4.1.5 Framing, if applicable). The problem with these
is that they may wrap around too fast: in 34 seconds at 1 Gbps. The MIB-2 values are still used, but are
generally supplemented with 64-bit counters defined in ifXTable, 20.13.9 IF-MIB and ifXTable.
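The 34-second figure is straightforward arithmetic: a 32-bit counter wraps after 2^32 bytes, about 4.3×10^9 bytes or 3.4×10^10 bits, and at 10^9 bits per second that takes roughly 34 seconds.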
The packet counters are for unicast packets, non-unicast packets, discarded packets, errors, and, for inbound
packets only, packets with unknown protocols. The count of non-unicast packets is separated in ifXTable
into separate counts of broadcast and multicast packets.
The interface queue length is available in ifOutQLen. It takes a considerable amount of traffic to make
this anything other than 0. RFC 1213 says nothing about the timescale for averaging the queue length.
The last member of the classic MIB-2 interfaces group is ifSpecific, which has type ObjectID. It was meant to return an OID that could be queried for additional ifType-specific information about the interface. It was
formally deprecated in RFC 2233.
The first table is the ipAddrTable, a list of all IP addresses assigned to this node together with interfaces
and netmasks.
The second table is the ipRouteTable, that is, the forwarding table. We looked briefly at this in
20.7 SNMP Tables. Note that the index consists solely of the destination IP address ipRouteDest,
which is not sufficient for CIDR-based forwarding in which the forwarding-table key is the dest,netmask
pair. Traffic to 10.38.0.0/16 might be routed differently than traffic to 10.38.0.0/24!
This table contains four attributes ipRouteMetric1 through ipRouteMetric4. RFC 1213 states
that unused metrics should be set to -1, but many agents simply omit such columns entirely. Agents may
similarly omit ipRouteAge.
The third table is ipNetToMediaTable, which is the ARP-cache table and which replaces a similar now-deprecated table in the at group. The ipNetToMedia table adds one column, ipNetToMediaType, not present in the at-group table; this column indicates whether a physical-to-IP-address mapping is invalid(2),
dynamic(3) or static(4). Most ARP entries are dynamic.
For an updated version of the ip group, see 20.13.11.1 IP-MIB and 20.13.11.2 IP-Forward MIB.
tcpInSegs: the count of received TCP packets, including errors and duplicates (though duplicate
reception is relatively rare)
tcpOutSegs: the count of sent TCP packets, not including retransmissions
tcpRetransSegs: the count of TCP packets with at least one retransmitted byte; the packet may
also contain new data
tcpInErrs: the number of TCP packets received with errors, including checksum errors
tcpOutRsts: the number of RST packets sent
Perhaps surprisingly, there is no counter provided for total number of TCP bytes sent or received. There is
also no entry for the congestion-management strategy, eg Reno v NewReno v TCP Cubic (15 Newer TCP
Implementations).
The tcp group also includes the tcpConnTable table, which lists, for each connection (identified by
localAddress,localPort,remoteAddress,remotePort) the connection state. We looked at this earlier in
20.7 SNMP Tables. The table is noteworthy in that four of its five columns are part of the INDEX.
A consequence is that to extract all the information from the table, a manager need only retrieve the
tcpConnState column: the other four attributes will all be encoded in the OID index that is returned
with each tcpConnState value.
We will look at the newer SNMPv2 replacement in 20.13.12 TCP-MIB.
The community string identifies a manager as a member of a designated community (in the conventional use of the word) of managers. Of course, at many sites there will be only one manager that interacts with
any given agent.
Community strings are sent unencrypted, and so are vulnerable to sniffing. They should be obscure, and
changed frequently. In actual practice, far and away the most popular value (the default for many agents) is the string public.
If a manager sends an incorrect community string, then in SNMPv1 and SNMPv2c there is no reply. SNMPv3 relaxed this rule somewhat, 20.15.3 SNMPv3 Engines.
A single agent can support multiple community strings. Each community has an associated subset of the
MIB (a view) that it allows. For example, an agent can be configured so that the community string system
allows the manager access to the system group, the community string tengwar allows access to the entire
MIB-2 group, and the community string galadriel allows access to the preceding plus the private
group(s). Using Net-SNMP (the most common agent on linux and Macintosh machines) this would be
achieved by the following entries in the snmpd.conf file:
rocommunity system    default  1.3.6.1.2.1.1
rocommunity tengwar   default  1.3.6.1.2.1
rocommunity galadriel default  -V mib2+private
view mib2+private included     1.3.6.1.2.1
view mib2+private included     1.3.6.1.4.1
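With this configuration, for example, a manager presenting the community string system can retrieve sysName.0 (OID 1.3.6.1.2.1.1.5.0), which lies within the view 1.3.6.1.2.1.1, but a request from that community for ifDescr (1.3.6.1.2.1.2.2.1.2) will fail, as that OID lies outside the community's view; the tengwar and galadriel communities can retrieve both.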
In order that the galadriel community could contain two disconnected OID subtrees, it was necessary to make use of the View-based Access Control Model (VACM). Community galadriel has access to the OID-tree view named mib2+private; this view is in turn defined in the last two lines above.
For our purposes here, VACM can be seen as an implementation mechanism for specifying what portion of
the OID tree is accessible to a given community. VACM allows read and write permissions to be granted
to specific OID trees, and also to be excluded from specific subtrees (eg table columns). Using the mask
mechanism, access can even be granted to specific rows of a table. We return to VACM for SNMPv3 very
briefly in 20.15.9 VACM for SNMPv3.
On Microsoft operating systems, an SNMP agent is generally included but must be enabled, eg from Windows components or Programs and Features. After that, the agent must still be configured. This is done
by accessing Services (eg through Control Panel Administrative Tools or by launching services.msc),
selecting SNMP service, and clicking on Properties. This applet only allows setting the community
name and read vs write permissions; specifying collections of visible OID subtrees (views) is not supported
here (though it may be via SNMP itself).
The community mechanism can offer a reasonable degree of security, if community names are changed
frequently and if eavesdropping is not a concern. Perhaps the real problem with community-based security
is that just how much information can leak out if an attacker knows the community string is not always well
understood. SNMP access can reveal a site's complete network and host structure, including VPNs, subnets,
TCP connections, host-to-host trust relationships, and much more. Most sites block the SNMP port 161 at
their firewall; some even go so far as to run SNMP only on a hidden network largely invisible even within
the organization.
The view model for OID-tree access is formalized in RFC 3415 as part of SNMPv3; it is called the View-based Access Control Model or VACM. It allows the creation of named views. The vacmAccessTable
spells out the viewing rights assigned to a given VACM group, which, in a rough sense, corresponds to an
SNMPv1/SNMPv2c community. VACM supports, in addition to views consisting of disjoint unions of OID
subtrees, table views that limit access to a specific set of columns.
The first two bits of the type field, the class bits, identify the context. Universal types such as INTEGER
and OBJECT IDENTIFIER have class bits 00; application-specific types such as Counter32 and TimeTicks
have class bits 01.
The third bit is 0 for primitive types and 1 for constructed types such as STRUCTURE.
The rest of the first byte is the type tag. If a second byte is needed, the tag bits of the first byte are 11111.
00010   INTEGER
00100   OCTET STRING
00101   NULL
00110   OBJECT IDENTIFIER
SNMP also defines several application-specific tags (the following are from RFC 2578):
00000   IpAddress
00001   Counter32
00010   Gauge32
00011   TimeTicks
00100   Opaque
00110   Counter64
The second field of the type-length-value structure is the length of the value portion, in bytes. Most lengths
of primitive types will fit into a single byte. If a data item is longer than 127 bytes (true for many composite
types), the multi-byte integer encoding below is used.
The actual data is then encoded into the value field. For nonnegative integers, the integer is converted
to a twos-complement bitstring and then encoded in as few bytes as possible, provided there is at least
one leading 0-bit to represent the sign. Similarly, negative numbers must have at least one leading 1-bit to
represent the sign.
For example, 127 can be encoded as a length=1 INTEGER with value byte 0111 1111; 128 must be
encoded as a length=2 integer with value bytes 0000 0000 and 1000 0000; the INTEGER represented by a
length of 1 and a value byte of 1000 0000 is -128. Similarly, to encode decimal 10,000,000 (0x989680),
four value bytes are needed as otherwise the leading bit would be 1 and the number would be interpreted
as negative.
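The rule is easy to check with a few lines of Python; the following is only an illustration of the encoding of the value bytes, not an excerpt from any BER library.

def ber_int_bytes(n):
    # Minimal two's-complement encoding: keep adding bytes until the value
    # fits with a correct sign bit (to_bytes raises OverflowError otherwise).
    length = 1
    while True:
        try:
            return n.to_bytes(length, "big", signed=True)
        except OverflowError:
            length += 1

print(ber_int_bytes(127).hex())         # 7f        (one byte)
print(ber_int_bytes(128).hex())         # 0080      (two bytes)
print(ber_int_bytes(-128).hex())        # 80        (one byte)
print(ber_int_bytes(10000000).hex())    # 00989680  (four bytes)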
For OCTET STRINGs, the string bytes are placed in the value portion and the number of bytes is placed
in the length portion.
Object IDs are generally encoded using one byte per level; there are two exceptions. First, the initial two
levels x.y of the OID are encoded using a single level with value 40x + y; all SNMP OIDs begin with 1.3
and so the first byte is 43 (0x2b). Second, if a level is greater than 127, it is encoded as multiple bytes. The
first bit of the last byte is 0; the first bit of each of the preceding bytes is 1. The seven remaining bits of each
byte contain the bits of the OID level. For example, 1.3.6.1.2.1.128.9 would have a value encoding of the
following bytes in hexadecimal:
2b 06 01 02 01 81 00 09
where 128 is represented as the two seven-bit blocks 0000001 0000000 and the first block is prefixed by 1
and the second block by 0.
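The OID rule can be illustrated the same way; this Python sketch (again, illustration only) reproduces the byte string above.

def ber_oid_bytes(oid):
    # First two levels x.y become one byte 40*x+y; each later level is written
    # base 128, with the high bit set on every byte except the last.
    levels = [int(s) for s in oid.split(".")]
    out = [40 * levels[0] + levels[1]]
    for level in levels[2:]:
        chunk = [level & 0x7f]
        level >>= 7
        while level:
            chunk.append((level & 0x7f) | 0x80)
            level >>= 7
        out.extend(reversed(chunk))
    return bytes(out)

print(ber_oid_bytes("1.3.6.1.2.1.128.9").hex())    # 2b06010201810009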
There are also several SNMP-specific composite types, used for the protocol operations below; the first three bits of the tag field for these will be 101.
00000
00001
00010
00011
00100
Get-Request
Get-Next-Request
Get-Response
Set-Request
Trap
ObjectSyntax is specified as a CHOICE that can contain any of the tagged SNMP primitive types
above; the CHOICE syntax adds no bytes so the value is simply encoded as above. The VarBind pair
⟨1.3.6.1.2.1.1.3.0, 8640000⟩ (the OID is sysUpTime and the value is 24 hours) would then be represented in hexadecimal bytes as below; the hexadecimal representation of 8640000 is 0x83d600.

30 10 06 08 2b 06 01 02 01 01 03 00 02 04 00 83 d6 00
The first byte of 0x30 marks this as a SEQUENCE; recall that the type byte for a SEQUENCE has a P/C
bit of 1 and five low-order bits of 10000, for a numeric value of 48 decimal or 0x30.
The VarBindList is defined to be a SEQUENCE OF VarBind. If we have only the single VarBind
above, the resultant enclosing VarBindList would be as follows; the length field is 0x12 = 18.
30 12 30 10 06 08 2b 06 01 02 01 01 03 00 02 04 00 83 d6 00
The BER encoding rules do not stop with the VarBindList. A slightly simplified ASN.1 specification
for an entire SNMPv1 Get-Request packet portion (or protocol data unit) is
SEQUENCE {
    request-id          INTEGER,
    error-status        INTEGER,
    error-index         INTEGER,
    variable-bindings   VarBindList
}
The encoding of the whole packet would also be by the above BER rules.
The BER encoding mechanism represents a very different approach from the fixed-field layout of, say, the
IP and TCP headers (7.1 The IPv4 Header and 12.2 TCP Header). The latter approach is generally quite
a bit more compact, as only four bytes are needed for a larger integer versus six under BER, and no bytes
are used for SEQUENCE specifications. The biggest advantage of the SNMP BER approach, however, is
that all objects, from entire packets down to individual values, are self-describing. Given the variety of
types used by SNMP, the fact that many are of variable length, and the fact that value readers such as MIB
browsers often operate without having all the type-specifying MIBs loaded, this self-describing feature is
quite useful.
20.13 SNMPv2
SNMPv2 introduced multiple evolutionary changes: to the SMI, to various MIBs, and to the basic protocol
operation. Many new MIBs were added. SNMPv2 also contained a proposal for improved security, but this
was not widely adopted. Eventually most of the SNMP community settled on SNMPv2c, the version of
SNMPv2 that stayed with the community-based security model.
Most of the specification of SNMPv2c is in RFC 1901 through RFC 1909.
Generally SNMPv1 and SNMPv2c agents and managers can interoperate quite easily. Essentially all SNMPv2 agents also support SNMPv1 queries, and answer according to the version of the request received.
There is slightly greater confusion between SNMPv1 and SNMPv2 MIB files, but almost all browsers and
managers support both.
Suppose, for example, that table T has three columns T.1, T.2 and T.3, and five rows with index values 11 through 15, so that the leaf OIDs are as follows:

T.1.11   T.2.11   T.3.11
T.1.12   T.2.12   T.3.12
T.1.13   T.2.13   T.3.13
T.1.14   T.2.14   T.3.14
T.1.15   T.2.15   T.3.15
Then a GetBulk request for (T.1,T.2,T.3) with a repetition count of 3 will return the first three rows, with
the following OIDs:
T.1.11 T.2.11 T.3.11
T.1.12 T.2.12 T.3.12
T.1.13 T.2.13 T.3.13
To continue, the next such request would include OIDs (T.1.13, T.2.13, T.3.13) and the result (again assuming a count of 3) would be of values with these OIDs:
T.1.14 T.2.14 T.3.14
T.1.15 T.2.15 T.3.15
T.2.11 T.3.11 A
where A is the next leaf OID above and to the right of T.
Note the third row of the second request: the first leaf OID following T.1.15 (the last row of column 1) is T.2.11, that is, the first row of column 2. Similarly, T.3.11 follows T.2.15. As T.3.15 is the last leaf OID in
the table, its leaf-OID successor (A) is outside the table.
The GetBulk request format actually partitions the list of requested OIDs into two parts: those OIDs that are
to be requested only once and those for which the request is repeated. Two additional parameters besides
the OID list are included: non-repeaters indicating the number of OIDs to be requested only once and
max-repetitions indicating the number of times the remaining OIDs are retrieved.
As with GetNext, it is possible that a request for rows of the table will return ⟨OID,value⟩ pairs outside the
table, that is, following the table in the OID-tree order.
If the total number of OIDs in the request is N, with N ≥ non-repeaters, then the return packet will contain a list of ⟨OID,value⟩ variable bindings of up to length

non-repeaters + (N - non-repeaters)×max-repetitions
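In the example above, N = 3, non-repeaters = 0 and max-repetitions = 3, so each response may contain up to 0 + (3-0)×3 = 9 bindings, which is just what the two requests shown returned.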
GetBulk has considerable potential to return more data than there is room for, so an agent may return fewer
repetitions as it sees fit.
sysOREntry OBJECT-TYPE
SYNTAX
SysOREntry
MAX-ACCESS not-accessible
STATUS
current
DESCRIPTION
"An entry (conceptual row) in the sysORTable."
INDEX
{ sysORIndex }
::= { sysORTable 1 }
...
sysORIndex OBJECT-TYPE
SYNTAX
INTEGER (1..2147483647)
MAX-ACCESS not-accessible
STATUS
current
DESCRIPTION ...
::= { sysOREntry 1 }
The INDEX attribute appears in the declaration of sysOREntry in the usual way. But it is now classed as
an auxiliary object, and access to sysORIndex is not-accessible; in SNMPv1 it would have been
read-only and thus an ordinary column.
When direct access to index attributes is suppressed this way, the data is still available in the accompanying
OID, but it is no longer tagged by type as in 20.12 SNMP and ASN.1 Encoding. The sysORIndex value
above, for example, would appear as a single OID level, but the manager would have to use the MIB to
determine that it was meant as an INTEGER and not a TimeTicks or Counter. An IPv4 address used as an
index would appear in the OID as four levels, readily recognizable as an IPv4 address provided the manager
knew where to look. When STRING values appear in the index, the lack of an index column can be a
particular nuisance; for example, the only indication of the username alice in the usmUserTable of
20.15.9.1 The usmUserTable might be the OID fragment 97.108.105.99.101, representing the ASCII codes
for the letters a.l.i.c.e.
Generally, an SNMPv2 agent will send back the noSuchObject special value (see 20.13.2 SNMPv2 Get
Semantics) when asked for a not-accessible auxiliary object.
20.13.5 TestAndIncr
SNMPv2 introduced the TestAndIncr textual convention, which introduces something of an aberration
to the usual semantics of Set(). The underlying type is INTEGER, in the range 0..2^31-1, and Get()
works the usual way. However, if TI is the OID name of a TestAndIncr object, then Set(TI,val) never
sets the value of TI to val. Instead, if the value of the object is already equal to val, then the Set() succeeds
and the value of the object is incremented, that is, set to val + 1. If the value of the object is not equal to val,
an error occurs and no change is made.
A TestAndIncr object acts like a kind of semaphore, though not exactly as there is no way to decrement
the object (though the value does wrap around from 2^31-1 back to 0). The goal here is to provide a voluntary mechanism to enforce serialization when more than one manager may be writing to the same set of
attributes.
As we saw in the example at the end of 20.8.2 Set(), such serialization is not automatic. But now let us revisit
that example using TestAndIncr. Recall that two managers are updating attributes with OIDs X and Y;
this time, however, they agree also to include a TestAndIncr object with OID TI. Then serialization is
assured as long as each manager executes each multi-attribute Set() in the following form, where val :=
Get(TI) means that the manager uses Get() to retrieve the value of TI and stores it in its own local variable
val.
val := Get(TI)
Set((TI,val), (X,xval), (Y,yval))
To see this, suppose manager B's Get(TI) occurs after manager A's Set(TI,val) has incremented the value of TI. Then manager B's Set() operations occur even later, after A's Set() has completed. The alternative is that both managers Get() the same value val. Let A be the manager who first succeeds with Set(TI,val). Now the other manager B will have its Set(TI,val) fail, as the value stored at TI is now val+1. Thus all of B's Set() operations will fail.
A consequence here is that manager B will have to try again, probably immediately. However, the likelihood
of conflict is low, and B can expect to succeed soon.
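In code, the manager's side of this discipline might look like the following Python sketch; snmp_get() and snmp_set() are hypothetical stand-ins for the manager's Get() and multi-attribute Set() operations, with snmp_set() returning False when the SNMP Set fails.

def serialized_set(snmp_get, snmp_set, TI, bindings):
    # bindings is a list of (OID, value) pairs to be written as one Set().
    # The Set succeeds only if TI still holds the value we just read.
    while True:
        val = snmp_get(TI)
        if snmp_set([(TI, val)] + bindings):
            return                  # our Set won; TI has now been incremented to val+1
        # another manager's Set(TI,val) got there first; retry with the new value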
Usually only one TestAndIncr object needs to be provided for an entire table, not one per row. For an
actual example, see 20.15.9.1 The usmUserTable.
The intent here is that every ifTable row is now extended to include the nineteen additional values defined
for IfXEntry; that is, there is a one-to-one correspondence between rows of ifTable and ifXTable.
If, on the other hand, the new table extends only a few rows of the original table, ie is a sparse extension,
then the new table entry should have an INDEX clause that repeats that of the original table. An example is the EtherLike-MIB of 20.13.10 ETHERLIKE-MIB, in which the dot3StatsTable extends the
MIB-2 ifTable by providing additional information for those interfaces that behave like Ethernets. The
dot3StatsEntry definition is
dot3StatsEntry OBJECT-TYPE
SYNTAX
Dot3StatsEntry
MAX-ACCESS not-accessible
STATUS
current
DESCRIPTION "Statistics for a particular interface to an ethernet-like medium."
INDEX
{ dot3StatsIndex }
::= { dot3StatsTable 1 }
The INDEX is dot3StatsIndex, which is then defined as follows; note the statement in the DESCRIPTION (and the REFERENCE) that the dot3StatsIndex is to correspond to ifIndex.
dot3StatsIndex OBJECT-TYPE
SYNTAX
InterfaceIndex
MAX-ACCESS read-only
STATUS
current
DESCRIPTION "An index value that uniquely identifies an interface
to an ethernet-like medium. The interface identified
by a particular value of this index is the same interface
as identified by the same value of ifIndex."
REFERENCE
"RFC 2863, ifIndex"
::= { dot3StatsEntry 1 }
Finally, it is possible that a new table has a many-to-one, or dense, correspondence to the rows (or a subset
of the rows) of an existing table. In this case, the new table will have an INDEX clause that will include the
index attributes of the original table, and one or more additional attributes. An example is the EtherLikeMIB dot3CollTable, which keeps, for each interface, a set of histogram buckets giving, for each N, a
count of the number of packets that experienced exactly N collisions before successful transmission. The
dot3CollEntry definition is as follows:
dot3CollEntry OBJECT-TYPE
SYNTAX
Dot3CollEntry
MAX-ACCESS not-accessible
STATUS
current
DESCRIPTION ...
INDEX
{ ifIndex, dot3CollCount }
::= { dot3CollTable 1 }
The ifIndex entry in the INDEX represents the original table, as before except that here there is also a
second, new INDEX attribute, dot3CollCount.
20.13.8 sysORTable
The original system group contained the attribute sysObjectID that identifies the agent and at the same
time suggests a private OID tree that could provide additional information about the agent (20.10.1 The
system Group).
The sysORTable, 1.3.6.1.2.1.1.9 or mib-2.system.9, is an attempt to extend this. It consists of a list of
OIDs that can be queried for further agent information; each OID also has an associated description string
and a sysORUpTime value indicating the time that OID was added.
For example, my system lists the following (where snmpModules = 1.3.6.1.6.3 and mib-2 = 1.3.6.1.2.1):
snmpModules.11.3.1.1
snmpModules.15.2.1.1
snmpModules.10.3.1.1
snmpModules.1
mib-2.49
mib-2.4
mib-2.50
snmpModules.16.2.2.1
snmpModules.13.3.1.3
mib-2.92
Each of the above OID prefixes can theoretically then be accessed for further information. Unfortunately,
on my system several of them are not configured, and a query returns nothing, but sysORTable does not
know that.
20.13.9 IF-MIB and ifXTable

The IF-MIB of RFC 2863 extends the MIB-2 interfaces group; its ifXTable adds to each interface row the nineteen columns of IfXEntry below, including 64-bit (Counter64) versions of the octet and packet counters:

IfXEntry ::=
    SEQUENCE {
        ifName                       DisplayString,
        ifInMulticastPkts            Counter32,
        ifInBroadcastPkts            Counter32,
        ifOutMulticastPkts           Counter32,
        ifOutBroadcastPkts           Counter32,
        ifHCInOctets                 Counter64,
        ifHCInUcastPkts              Counter64,
        ifHCInMulticastPkts          Counter64,
        ifHCInBroadcastPkts          Counter64,
        ifHCOutOctets                Counter64,
        ifHCOutUcastPkts             Counter64,
        ifHCOutMulticastPkts         Counter64,
        ifHCOutBroadcastPkts         Counter64,
        ifLinkUpDownTrapEnable       INTEGER,
        ifHighSpeed                  Gauge32,
        ifPromiscuousMode            TruthValue,
        ifConnectorPresent           TruthValue,
        ifAlias                      DisplayString,
        ifCounterDiscontinuityTime   TimeStamp
    }
The original MIB-2 interfaces group counted multicast and broadcast packets together, eg in
ifInNUcastPkts.
20.13.10 ETHERLIKE-MIB
RFC 3635 (originally RFC 1650) defines a MIB for Ethernet-like interfaces. The primary goal is to
enable the collection of statistics on collisions and other Ethernet-specific behaviors. Several new tables are
defined.
The table dot3StatsTable contains additional per-interface attributes; the name refers to the IEEE
designation for Ethernet of 802.3. The table represents a sparse extension of the original ifTable, in the
sense of 20.13.6 Table Augmentation (where this table was used as the example).
The rows of the table mostly consist of counters for various errors and other noteworthy conditions:
Alignment errors: the number of bits in the frame is not divisible by 8
CRC checksum failure
Frames that experienced exactly one collision
Frames that experienced more than one collision
Signal quality errors. SQE is specific to 10 Mbps Ethernet
Deferred transmissions; when the station tried to send, the line was not idle
Late collisions: the only way a collision can occur after the slot time is passed is if the physical
Ethernet is too big or if collision-detection is failing. See 2.1.3 The Slot Time and Collisions
Excessive collisions: the frame experienced 16 collisions and the sender gave up
Other hardware errors (dot3StatsInternalMacTransmitErrors and dot3StatsInternalMacReceiveErrors)
Carrier sense errors (carrier sense refers to the collision-detection mechanism; there is no actual
carrier)
Frames longer than 1500 octets
For Ethernets that encode data as symbols (eg 100 Mbps Ethernets 4B/5B), frames arriving with a
corrupted symbol
are assigned to what interfaces and where these addresses came from (eg DHCP, 7.8 Dynamic Host Configuration Protocol (DHCP), or Prefix Discovery, 8.6.2 Prefix Discovery). Indexed by the IP address
itself (and also the ipAddressAddrType), these tables thus support the possibility that one interface has
multiple IP addresses (this is particularly common for IPv6).
The ipNetToPhysicalTable represents the map from local IP addresses to physical LAN addresses, as
created by either ARP for IPv4 or Neighbor Discovery for IPv6. In addition to the interface, the IP address
and the physical address, the table also contains a timestamp indicating when a given entry was last updated
or refreshed, an indication of whether the address mapping is dynamic, static or invalid, and, finally, an
attribute ipNetToPhysicalState. The values for this last are reachable(1), stale(2) for expired reachability, delay(3) and probe(4) relating to active updates of the reachability, invalid(5),
unknown(6) and incomplete(7) indicating that ARP is in progress. See 7.7.1 ARP Finer Points.
There is also a simple version of the forwarding table known as the Default Router Table. This contains
a list of default, or, more accurately, initially configured routes. While this does represent a genuine
forwarding table, it is intended for nodes that do not act as routers and do not engage in routing-update
protocols. The table represents a list of default routes by IP address and interface, and also contains
route-lifetime and route-preference values.
The ipv6RouterAdvertTable is used for specifying timers and other attributes of IPv6 router advertisements, 8.6.1 Router Advertisement.
Finally, the IP-MIB contains two tables for ICMP statistics icmpStatsTable and icmpMsgTable.
The latter keeps track, for example, of how many pings (ICMP Echo) and other ICMP messages were sent
out; see 7.9 Internet Control Message Protocol.
20.13.11.2 IP-Forward MIB
Information specific to a host's IP-forwarding capability was first split out from the MIB-2 ip group in RFC
1354; it was updated to SNMPv2 in RFC 2096 and the current version is RFC 4292. The original MIB-2
ip group left off at OID mib-2.ip.23; the new IP-Forward MIB begins at mib-2.ip.24.
There have been three iterations of an SNMP-viewable IP forwarding table since the original RFC 1213 ip
group's ipRouteTable at mib-2.ip.21. Here are all four:
ipRouteTable
ipForwardTable
ipCidrRouteTable
inetCidrRouteTable
Each new version has formally deprecated its predecessors.
The first replacement was ipForwardTable, described in RFC 1354. It defines the OID ipForward
to be mib-2.ip.24; the new table is at ipForward.2. This table added several routing attributes, but
perhaps more importantly changed the indexing. The index for ipRouteTable was the IP destination network ipRouteDest, alone. The new table's index also includes a quality-of-service attribute
ipForwardPolicy, usually representing the IPv4 Type of Service field (now usually known as the DS
field, 7.1 The IPv4 Header). This inclusion allows the ipForwardTable to accurately represent routing
based on ⟨dest,QoS⟩, as discussed in 9 Routing-Update Algorithms. Such routing is sometimes called
multipath routing, because it allows multiple paths to a given destination based on different QoS values. However, the mask length is not included in the index, making ipForwardTable inadequate for
representing CIDR routing.
The index also includes the next_hop, which for the actual forwarding table does not make sense (the next_hop is what one is looking up, given the destination) but which works fine for SNMP. See the comments about SNMP indexes with more attributes than expected in 20.7 SNMP Tables. The index even includes an attribute ipForwardProto that represents the routing-update protocol that is the source of the
table entry: icmp(4), rip(8) (a common distance-vector implementation), is-is(9) and ospf(13)
(two link-state implementations) and bgp(14).
In addition to the next_hop, this table also includes attributes for ipForwardType (eg local vs remote),
ipForwardAge (the time since the last update), ipForwardInfo, ipForwardNextHopAS, and several routing metrics. The purpose of ipForwardInfo is to provide an OID that can be accessed to provide additional information specific to the routing-update algorithm in use. The ipForwardNextHopAS
allows the specification of the next_hop Autonomous System number, usually relevant only when BGP
(10.6 Border Gateway Protocol, BGP) is involved. (If the AS number is not relevant, it is set to zero.)
The second iteration of the SNMP-viewable IP forwarding table is ipCidrRouteTable, appearing in
RFC 2096 and located at ipForward.4 (and returning to the practice of calling it a route rather than a
forward table). This table adds the address mask, ipCidrRouteMask, to the index, finally allowing
distinct routes to 10.38.0.0/16 and 10.38.0.0/24. The quality-of-service field ipCidrRouteTos remains
in the index (as does the destination), and is now firmly identified with the IPv4 Type of Service (DS) field.
The routing-update algorithm was dropped from the index.
This table also adds an attribute ipCidrRouteStatus of type RowStatus and used for the creation
and deletion of entire rows (that is, forwarding table entries) under the control of SNMP. We will return to
this process in 20.14 Table Row Creation.
The third (and still current) version of the IP forwarding table is inetCidrRouteTable, introduced in
RFC 4292 and located at ipForward.7. The main change introduced by this table is the extension to
support IPv6: the IP-address columns (eg for destination and next_hop) have companion columns for the
address type: ipv4(1) and ipv6(2).
The next_hop attribute (now two columns, with the addition of the address type) is still part of the index.
The address mask used in ipCidrRouteTable is now updated to be a prefix length,
inetCidrRoutePfxLen. The quality-of-service field inetCidrRoutePolicy is an Object ID, declared to be an opaque object without any defined semantics; that is, it is at the implementer's discretion.
The IPv4 ToS/DS field evolved in IPv6 to the Traffic Class field, 8.1 The IPv6 Header.
Finally, routes are no longer required to list a single associated interface. The table makes use of the
InterfaceIndexOrZero textual convention of RFC 2863, covering just this situation.
20.13.12 TCP-MIB
RFC 4022 contains some extensions to MIB-2's tcp group. The SNMPv2-based MIB embedded in RFC
4022 repeats the MIB-2 tcp group and then adds new features within the mib-2.tcp (1.3.6.1.2.1.6) tree. RFC
1213 stopped at mib-2.tcp.15; RFC 4022 defines new objects starting at mib-2.tcp.17.
The new tcpConnectionTable is defined at mib-2.tcp.19, versus the original tcpConnTable at mib-2.tcp.13. The newer table supports IPv6; like the inetCidrRouteTable above, each IP-address column
now also has a companion address-type column. For IPv4 users used to the earlier tcpConnTable, this
means that there are extra 1s prefixing the IP addresses in the index, as in this (fictitious) example of host
10.0.0.5 connecting from port 54321 to web server 147.126.1.230:
tcpConnectionState.1.10.0.0.5.54321.1.147.126.1.230.80
The new table also includes a column representing the process ID of the process that has opened the local end of the connection.
This table adheres to the SNMPv2 convention that index columns are not included in the data; the index attributes are marked not-accessible. There are only two accessible columns, tcpConnectionState and tcpConnectionProcess.
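To make the index encoding concrete, here is a small Python sketch, not part of any SNMP library, that unpacks an OID suffix like the (fictitious) example above into its components. Real agents may also include a length sub-identifier for each address; this sketch simply follows the example as written.

def decode_tcp_connection_index(suffix: str):
    # eg suffix "1.10.0.0.5.54321.1.147.126.1.230.80" from tcpConnectionState above
    parts = [int(x) for x in suffix.split(".")]
    local_type = parts[0]                            # 1 = ipv4
    local_addr = ".".join(map(str, parts[1:5]))
    local_port = parts[5]
    remote_type = parts[6]
    remote_addr = ".".join(map(str, parts[7:11]))
    remote_port = parts[11]
    return (local_type, local_addr, local_port, remote_type, remote_addr, remote_port)

print(decode_tcp_connection_index("1.10.0.0.5.54321.1.147.126.1.230.80"))
# (1, '10.0.0.5', 54321, 1, '147.126.1.230', 80)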
TCP-MIB does not make available per-interface TCP statistics, eg the number of TCP bytes sent by eth0.
Nor does it make available per-connection statistics such as packet-loss and retransmission counts or total
bytes transmitted each way.
fruitT (T.2)     primeT (T.3)
apple            37
blueberry        59
cantaloupe       67
Then a new row ⟨13,durian,101⟩ might be added with the following multi-attribute Set() operation. The entries of the new row will have OIDs T.1.13, T.2.13 and T.3.13, and all we have to do to create the row is assign to these. Note that we are assigning to OIDs that, in the agent's current database, do not yet exist.
Set((T.1.13,13), (T.2.13,durian), (T.3.13,101))
Of course, this raises some questions. Will the agent actually allow this? What happens if these Set()
operations are performed individually rather than as a group? Is there any way to delete this newly added
row? And, more seriously, what happens if some other manager tries to insert the row ⟨13,feijoa,103⟩ at the same time?
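The all-or-nothing semantics of a multi-attribute Set() are central to making such row creation safe. The following Python sketch is purely illustrative (the dictionary-backed "agent" and its method names are invented here); it shows the intended behavior: either every varbind in the Set() is applied, or none is.

class ToyAgent:
    def __init__(self):
        self.db = {}                        # OID string -> value

    def set(self, *varbinds):
        # validate everything first; this is where row-creation rules would apply
        for oid, value in varbinds:
            if not self.acceptable(oid, value):
                return "error"              # nothing at all is changed
        for oid, value in varbinds:         # only now commit, atomically
            self.db[oid] = value
        return "noError"

    def acceptable(self, oid, value):
        return True                         # stand-in for real validation

agent = ToyAgent()
agent.set(("T.1.13", 13), ("T.2.13", "durian"), ("T.3.13", 101))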
We now turn to the specific row-creation mechanism of RMON.
20.14.1 RMON
RMON, for Remote MONitoring, was an early attempt (first appearing in RFC 1271 eight months after
MIB-2's RFC 1213) at having an SNMP agent take on some monitoring responsibilities of its own. The
current version is in RFC 2819. The original RMON, now often called RMON1, only implemented LAN-layer monitoring, but this was later extended to the IP and Transport layers in RMON2, RFC 4502. We will here consider only RMON1.
RMON implements only passive monitoring; there is no capability for the remote agent to send out its own
SNMP queries, or even pings (though see 20.14.3 PING-MIB). Monitoring is implemented by putting
the designated interface into promiscuous mode (2.1 10-Mbps classic Ethernet) and capturing all traffic.
In modern fully-switched Ethernets, hosts simply do not see traffic not actually addressed to them, and so
RMON would need to be implemented on a switch or router to be of much practical use.
An agent's RMON activity is controlled by an SNMP manager through the insertion of new rows in various
control tables. The mechanism for doing this is our primary concern here.
RMON statistics are divided into ten groups, of which we will consider only the following:
statistics: counts of errors and counts of packets in size ranges 0-64, 65-127, 128-255, 256-511, 512-1023 and 1024-1518 octets.
history: The statistics group data, taken at regular intervals
hosts: The Ethernet senders and receivers seen by an interface
host top N: The top-N senders or receivers
matrix: Information on traffic by ⟨sender,receiver⟩ pair
20.14.1.1 Statistics
The etherStatsTable is, by default, empty, and is indexed by etherStatsIndex which is an
opaque INTEGER. The column etherStatsDataSource represents the OID of a specific interface
number as defined by ifTable; for example, the interface with ifIndex = 6 would be represented by
mib-2.2.2.1.1.6. One column, etherStatsStatus, has type EntryStatus as follows:
valid(1),
createRequest(2),
underCreation(3),
invalid(4)
however, may be creating new rows in the agent at the same moment, and so a row index that was available
moments before may now be unavailable. If that happens, though, the Set() operation will fail (recall that
in a multi-attribute Set() all the assignments will succeed or all will fail), and the manager can choose a
different index and try again.
One common strategy for reducing the chance of such row-creation collisions is to choose a value for the
index at random.
It is often possible for a manager to create the row with a single Set() operation, though RFC 2819
prohibits setting Status initially to valid. If T is the tree OID etherStatsEntry and 137 is the manager's chosen index, and the manager wants to monitor ifIndex = 6, it could send the following as a
single operation.
Set((T.Status.137,createRequest), (T.DataSource.137,mib-2.2.2.1.1.6), (T.Owner.137,owner))
at which point the agent will create the row with status underCreation. The value createRequest is
used only in manager requests and never appears in actual rows on the agent. The manager will then follow
with
Set(T.Status.137,valid)
If the manager wants or needs to create the entry piecemeal, it can do so as follows:
Set(T.Status.137, createRequest)
Set(T.DataSource.137, mib-2.2.2.1.1.6)
Set(T.Owner.137, owner)
...
Set(T.Status.137, valid)
Immediately following the first Set(T.Status.137, createRequest) the agent will again create the row and
mark it as underCreation, but this time the new row will be missing several columns.
The Status attribute for a row being created should be specified in the very first Set() operation for the
row.
Existing rows may not have their Status set to createRequest. The primary legal state transitions are from createRequest to underCreation (applied by the agent when the row is created), from underCreation to valid and from valid back to underCreation (set by the manager), and from either of these states to invalid (set by the manager to delete the row).
In some cases, a row can be edited by the manager by changing the status from valid back to underCreation, making the changes, and then setting the status to valid again.
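As an illustration of how an agent might police these transitions, here is a small Python sketch, invented for this discussion and not taken from any RMON implementation; the transition table encodes the moves described above, with createRequest appearing only as a manager-requested starting point.

VALID, CREATE_REQUEST, UNDER_CREATION, INVALID = 1, 2, 3, 4

# manager-requested new status -> statuses from which the request is acceptable
ALLOWED = {
    VALID:          {UNDER_CREATION, VALID},
    UNDER_CREATION: {VALID, UNDER_CREATION},     # eg to edit an existing row
    INVALID:        {VALID, UNDER_CREATION, INVALID},
}

def set_status(current, requested):
    # current is the row's present status, or None if the row does not yet exist
    if requested == CREATE_REQUEST:
        if current is not None:
            raise ValueError("createRequest not allowed on an existing row")
        return UNDER_CREATION                    # the agent creates the row as underCreation
    if current is None or requested not in ALLOWED or current not in ALLOWED[requested]:
        raise ValueError("illegal transition")
    return requested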
20.14.1.2 History
The history group records the statistics-group data at regular intervals. The manager creates a row in the historyControlTable specifying, among other things, the DataSource, the number of records ("buckets") to keep, and the sampling interval; suppose a manager has created such a row with historyControlIndex 491, asking for 70 buckets taken at one-minute intervals. After one hour, the data table holds records ⟨491,1⟩ through ⟨491,60⟩. After one hour and eleven minutes, record ⟨491,71⟩ replaces record ⟨491,1⟩. After two hours, records with sample indexes of 51-120 are available. The manager might return once an hour and retrieve the most recent 60 records.
A manager might also create two control-table records, one holding 25 records taken at 1-hour intervals and
another holding 60 records taken at 1-minute intervals. If all is well, that manager might download the first
table once a day, and entirely ignore the second table. The manager always has available these 1-minute
records for the last hour, though, and can access them as needed if a problem arises (perhaps signaled by
something else entirely).
A manager can easily retrieve only its own rows from the etherHistoryTable. Let T be the root of the
etherHistoryTable, which has columns 1-15. Suppose again a manager has created its controlTable
row with a value for historyControlIndex of 491; the manager can then retrieve the first of its data
rows with the following; note that each OID contains the column number and part of the row index.
GetNext(T.1.491, T.2.491, ..., T.15.491)
If the history-table rows associated with 491 have sample-index values ranging from 37 to 66, the above GetNext() will return the row indexed by ⟨491,37⟩; that is, the values paired with the following OIDs:
T.1.491.37, T.2.491.37, ..., T.15.491.37
Subsequent GetNext()s will return the subsequent rows associated with control entry 491: row ⟨491,38⟩, row ⟨491,39⟩, etc. If the etherHistoryTable had been indexed in the reverse order, with the sample
index first and the historyControlIndex second, a substantial linear search would be necessary to
locate the first row with a given value for the control index.
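The indexing argument can be made concrete with a toy model of GetNext(): an agent's variables are simply a lexicographically sorted list of OIDs, and each GetNext() returns the next one. The Python sketch below (the names are invented for illustration) walks all rows whose OID begins with a given column-and-control-index prefix; because the control index comes first, the matching rows are consecutive and the walk stops as soon as the prefix no longer matches.

import bisect

def walk_prefix(sorted_oids, prefix):
    # sorted_oids: list of OID tuples, eg (1, 491, 37) standing for T.1.491.37
    results = []
    i = bisect.bisect_right(sorted_oids, prefix)       # first OID after the bare prefix
    while i < len(sorted_oids) and sorted_oids[i][:len(prefix)] == prefix:
        results.append(sorted_oids[i])                 # one GetNext() per row retrieved
        i += 1
    return results

oids = sorted((1, 491, s) for s in range(37, 67))      # column 1, control index 491
print(len(walk_prefix(oids, (1, 491))))                # 30 rows, retrieved consecutively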
20.14.1.3 Hosts
The host group allows an agent to keep track of what other hosts (identified in RMON1 by their Ethernet address) are currently active.
Like the historyControlTable, the hostControlTable allows a manager to specify
DataSource, Status and Owner; the manager also specifies TableSize.
Once the control-table entry is valid, the agent starts recording hosts in the hostTable, which is indexed by the control-table index and the host Ethernet address. The agent also records each new host's CreationOrder value, an integer record number starting at 1.
The hostTable also maintains counters for the following per-host attributes; these are updated whenever
the agent sees another packet to or from that host. We will revisit these in the hostTopNTable, following.
InPackets
OutPackets
InOctets
OutOctets
OutErrors
OutBroadcastPkts
OutMulticastPkts
When the number of host entries for a particular control-table index value exceeds TableSize (that is, the new host's CreationOrder would exceed TableSize), old entries are removed and all hosts in the
table are given updated CreationOrder values. That is, if the table of size three contains entries for
hosts A, B and C with creationOrder values of 1, 2 and 3, and host D comes along, then A will be
deleted and B, C and D will be given creationOrder values of 1, 2 and 3 respectively. This is quite
different from the record-number assignments in the etherHistoryTable, where the SampleIndex
record numbers are immutable.
A consequence of this is that the CreationOrder values are always contiguous integers starting at 1.
Entries are deleted based on order of creation, not order of last update. The hostTable does not even
have an attribute representing the time a given host's entry (or entries) was last updated.
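Here is a small Python sketch (invented for illustration) of the eviction-and-renumbering policy just described: when a new host arrives and the table is full, the entry with CreationOrder 1 is dropped and the remaining entries are renumbered, so CreationOrder values remain the contiguous integers 1 through TableSize.

class HostList:
    def __init__(self, table_size):
        self.table_size = table_size
        self.hosts = []                      # position i corresponds to CreationOrder i+1

    def see(self, mac):
        if mac in self.hosts:
            return                           # counters would be updated; order is unchanged
        if len(self.hosts) == self.table_size:
            self.hosts.pop(0)                # drop CreationOrder 1; the rest renumber
        self.hosts.append(mac)

    def creation_order(self, mac):
        return self.hosts.index(mac) + 1

t = HostList(3)
for mac in ["A", "B", "C", "D"]:
    t.see(mac)
print([(m, t.creation_order(m)) for m in t.hosts])     # [('B', 1), ('C', 2), ('D', 3)]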
As a convenience, the data in hostTable is also made available in hostTimeTable, but there indexed by the control-table index and the CreationOrder value. This makes for potentially faster lookup; for
example, the manager always knows that the record with CreationOrder = 1 is the oldest record. More
importantly, this alternative index allows the manager to download the most recent entries in a single step.
Whenever a host is deleted because the table is full, the CreationOrder values assigned to other hosts
all change, and so the indexing to hostTimeTable changes. Thus, a manager downloading rows from
hostTimeTable one at a time must be prepared for the possibility that what had earlier been row 3 is
now row 2, and that a host might be duplicated or skipped. To help managers deal with this, the control table
has an entry LastDeleteTime representing the time in TimeTicks since startup of the last deletion
and thus CreationOrder renumbering. If a manager sees that this value changes, it can, for example,
start the data request over from the beginning.
20.14.1.4 Host Top N
The hostTopN table is a report prepared by the agent about the top N entries (where N is manager-supplied)
in the hostTable, over an interval of time. The manager specifies in the control table the Status and
Owner attributes, the HostIndex control-table index value from hostControlTable, and also the
Duration (in seconds) and the RateBase indicating which of the following hostTable statistics is to
be used in the ranking:
InPackets
OutPackets
InOctets
OutOctets
OutErrors
OutBroadcastPkts
OutMulticastPkts
The value of N is in the attribute RequestedSize; the agent may reduce this and communicates any
change (or lack of it) through the attribute GrantedSize.
Once the control-table row becomes valid, the agent then starts maintaining counters for all the entries
in the part of hostTable indexed by HostIndex, and at the end of the Duration sorts the data and
places its top-N results in the hostTopN table. If, during the interval, some hosts were removed from the
hostTable because the table was full, the results may be inaccurate.
All result statistics are inaccessible until the Duration has elapsed and the particular hostTopN
report has run its course, at which point the results become read-only.
20.14.1.5 Matrix
The matrix group allows an agent to collect information on traffic flow indexed by the source and destination
addresses. The manager begins the process by supplying attributes for Status, Owner, the usual Index,
the interface DataSource, and the maximum table size.
Once the row is valid, the agent begins collecting, for every ⟨source,destination⟩ pair, counts of the number of packets, octets and errors. The record for ⟨A,B⟩ counts these things sent by A; the corresponding record for ⟨B,A⟩ counts the reverse direction. The actual matrixSDTable is indexed by the manager-supplied
Index value and the source and destination Ethernet addresses in that order.
A companion table (or view) is also maintained called matrixDSTable, that lists the same information
but indexed by destination first and then source. This view is not present to supply information about the
reverse direction; that would be obtained by reversing source and destination in the matrixSDTable.
Rather, the matrixDSTable allows a manager to extract all information about a single destination D in a
single SNMP tree-walk operation of the prefix matrixDSTable.Index.D. This is similar to the indexing
discussion at the end of 20.14.1.2 History. See exercise 9.
Here is a simplified RowStatus state-transition diagram. Not all links to destroy, or from a node back
to itself, are shown. See RFC 2579, p 9, for more details.
After the manager issues a createAndWait, the agent fills in the attributes provided by the manager, and
any other default attributes it has available. The manager, if desired, can now Get() the entire row, and
find out what values are still missing. These values will be reported as noSuchInstance. Alternatively,
the manager may simply know that it has more attributes to Set().
If the agent knows the row is missing attributes necessary for activation, it will set the RowStatus to
notReady, otherwise to notInService. A notInService row can, of course, still have undefined
read-only attributes that the agent will later set after activation. A notInService row can also still
have attributes the manager intends to set before activation, but the agent has given default values in the
interim.
Once the manager has set all the attributes required for activation, it sets the RowStatus attribute to
active, and agent activity begins.
20.14.3 PING-MIB
The idea behind the ping MIB is to allow a manager to ask an agent to repeatedly ping a target, 7.9 Internet Control Message Protocol, and then to report the success rate. Cisco Systems has a ping MIB
at ftp://ftp.cisco.com/pub/mibs/v2/CISCO-PING-MIB.my, but we will use the IETF alternative in RFC
2925/RFC 4560. The latter is officially titled DISMAN-PING-MIB (DISMAN is short for DIStributed
MANagement). Both ping MIBs make use of the RowStatus convention of the previous section.
The manager can ping the target itself, if there are no firewalls in the way, but the result may well be different from that obtained by the agent.
The actual MIB supports multiple types of ping besides the usual ICMP Echo Request, but we will consider
only the latter.
The DISMAN version has control table pingCtlTable and the results appear in pingResultsTable;
in the Cisco version the same table is used for both control and results. The results include the minimum,
maximum and average RTT, the number of pings sent and the number of responses. The results table also
has an attribute OperStatus to indicate whether the test has stopped.
The pingCtlTable contains a wide range of options for the actual ping request, including options to
specify the outgoing IP address for multi-homed agents. The more familiar options, however, include
pingCtlTargetAddress: the target address (and an address type, as IPv6 is supported)
pingCtlDataSize: how big each ping packet is
pingCtlTimeout: how long before the agent gives up on any one ping
pingCtlProbeCount: the number of pings to be sent
pingCtlFrequency: the interval between pings
The table also includes pingCtlRowStatus of type RowStatus, above, and pingCtlOwnerIndex
of type SnmpAdminString, which is a fancier way of identifying the manager than the RMON Owner
attributes, and which can include some SNMPv3 credentials as desired.
The new row is created, RowStatus is set to active (perhaps through createAndGo), and the test
begins.
By setting the control table's RowStatus to destroy, a manager can attempt to halt a ping series in
progress.
20.15 SNMPv3
SNMP version 3 added authentication and encryption to SNMPv2c, but made relatively few other changes
except to nomenclature. The original definitions are in RFC 3410 through RFC 3415; RFC 3410 first
appeared as RFC 2570. SNMPv3 introduced the User Security Model, or USM, in which agents allow
manager access only if the manager has presented an appropriate key, derived ultimately from a passphrase.
The agent's response can be either digitally signed or encrypted, as desired.
SNMPv3 did make several terminology changes. Any SNMP node, manager or agent, is now known as an SNMP entity. An entity is divided into two parts; the first is the SNMP engine, consisting of
the message dispatcher, the message processor, and the security and access-control subsystems. The second
part consists of the various SNMP applications, consisting, for agents, primarily of the command responder
(responding to Get() and GetNext(), etc). One goal of this architectural division is to separate the
applications from the authentication mechanisms. Another goal, however, is to provide a framework in
which future new applications can easily be supported.
It is the SNMP engine that must implement all the new security provisions.
(20.15.5 Passwords and Keys and 21.5.1 Secure Hashes and Authentication), but hash(m‖k) is a good example of the basic concept.
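As a concrete, if simplified, rendering of the hash(m‖k) idea, here is a Python sketch assuming a shared key is already in place; this is just the basic concept, not the actual USM mechanism (and not HMAC, below).

import hashlib

def sign(msg: bytes, key: bytes) -> bytes:
    return hashlib.md5(msg + key).digest()            # hash(m || k)

def verify(msg: bytes, key: bytes, tag: bytes) -> bool:
    return sign(msg, key) == tag

key = b"shared-secret"
msg = b"Set(sysContact.0, ...)"
tag = sign(msg, key)              # sent along with the message
print(verify(msg, key, tag))      # True; without key, an eavesdropper cannot forge tag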
The SNMP encryption mechanism is also based on shared secret keys (21.6 Shared-Key Encryption);
that is, public-key encryption is not used. RFC 3414 describes the use of the Data Encryption Standard
cipher, or DES, which uses 56-bit keys. Later, RFC 3826 introduced the use of the Advanced Encryption
Standard cipher, or AES, which in SNMP uses a 128-bit key. (Both DES and AES are discussed briefly in
21.6.3 Block Ciphers.) DES is, if anything, even more vulnerable than MD5 and SHA-1, due to the limited
key length; AES is a much stronger choice.
Shared-secret encryption is based, abstractly, on an encrypting function E(p,k) that takes a plaintext message
p and a key k and returns the encrypted ciphertext. Similarly, there is a decrypting function D(c,k) that takes
an encrypted ciphertext message c and the key k and returns the original plaintext.
noAuthNoPriv. The authoritative side sends a response containing its engineID. The second step is to send a request, again with an empty varBindList but now containing a valid ⟨username,key⟩ pair. The value for ⟨Time,Boots⟩ is ⟨0,0⟩. The authoritative engine now responds with a message including its actual ⟨Time,Boots⟩.
This is now the ⟨user,agent⟩ shared key, or local key. It must be entered (or computed) on the agent, and
stored there. For MD5 it is 16 bytes long; for SHA-1 it is 20 bytes long.
This mechanism is, strictly speaking, optional; an agent does not know how manager keys were generated, and thus cannot enforce any particular mechanism when a manager's key is later changed as below. Any secure way to generate a unique key kl for each user and each agent would be sufficient. The mechanism here, though, has the advantage that any manager node can compute a user's local key directly from the
password and the agent engineID; no keys need be stored by the manager. This allows a manager to use one
password for multiple agents; compromise of any one agent and its attendant local key kl should not affect
the security of other agents or of the original password.
At this point, random and delta are sent in the clear from the manager to the agent. The agent can
use random and oldkey to compute temp, and thus newkey = delta XOR temp. An eavesdropper
cannot use random to find out anything about temp without knowing oldkey, and cannot get anything
useful out of delta unless either newkey or temp is known.
Note that if an eavesdropper saves random and delta, and later discovers the user's oldkey, then
newkey can be calculated easily. Using the language of 21.8.2 Forward Secrecy, forward secrecy fails
badly. However, if an eavesdropper later discovers newkey, there is no obvious way to find oldkey; see
exercise 10.
The actual process is to combine random and delta into a single 2N-byte keyChange string.
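Here is a Python sketch of the mechanism as described, assuming MD5, 16-byte keys, and temp = hash(oldkey‖random) as in RFC 3414; the function names are invented. The manager builds the 2N-byte keyChange string, and the agent recovers newkey from it and oldkey.

import hashlib, os

def make_keychange(oldkey: bytes, newkey: bytes) -> bytes:
    random = os.urandom(len(oldkey))
    temp = hashlib.md5(oldkey + random).digest()          # temp = hash(oldkey || random)
    delta = bytes(t ^ n for t, n in zip(temp, newkey))    # delta = temp XOR newkey
    return random + delta                                 # the 2N-byte keyChange string

def apply_keychange(oldkey: bytes, keychange: bytes) -> bytes:
    n = len(oldkey)
    random, delta = keychange[:n], keychange[n:]
    temp = hashlib.md5(oldkey + random).digest()
    return bytes(t ^ d for t, d in zip(temp, delta))      # newkey = temp XOR delta

old, new = os.urandom(16), os.urandom(16)
assert apply_keychange(old, make_keychange(old, new)) == new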
view names are _all_, _none_ and systemonly; we created a view mib2+private in 20.11 SNMPv1
communities and security.
20.15.9.1 The usmUserTable
Now it is time to look at the actual table in which agents store user names and their associated authentication
credentials, usmUserTable. It has the attributes below, which should all be prefixed by usmUser; the
index attributes are the first two, EngineID and Name. We will consider only those attributes related to
authentication-only security.
EngineID
Name
SecurityName
CloneFrom
AuthProtocol
AuthKeyChange
OwnAuthKeyChange
PrivProtocol
PrivKeyChange
OwnPrivKeyChange
Public
StorageType
Status
One can think of the table as also having an implicit authKey column, representing the local key corresponding to Name, that is never directly readable or writable. RFC 3414 states flatly that the authKey is not accessible via SNMP. However, the agent must still keep the authKey somewhere, tied to the Name, so it can validate a given user based on the Name and authParameter supplied in a request.
Because EngineID is usually the agent's own EngineID, the table is de facto indexed just by Name.
Recall that, as was discussed in 20.13.4 SNMPv2 Indexes, the index attributes, EngineID and Name, will
not be directly accessible, but will be encoded in the OID associated with every other retrieved attribute.
The username alice will thus be encoded, using the ASCII string encoding, as 97.108.105.99.101, not
necessarily easily readable by human managers.
We are now in a position to explain the cloning process and the key-change process as they play out with this table.
A specific semantic rule for this table is that the use of Set() to assign a ⟨random,delta⟩ keyChange object to AuthKeyChange or OwnAuthKeyChange causes that user's hidden authKey to be updated via the
process of 20.15.7 Key Changes. The user cannot confirm directly that this change succeeded, as a read
of these keyChange attributes returns the empty string, so the usual recommended strategy is also to write a
random value to the Public attribute; because Set() operations are atomic, a change to the latter means
the former change succeeded as well.
Because it is possible for multiple managers to be updating the table simultaneously, a single
TestAndIncr object named usmUserSpinLock may be used to enforce serialization, as in 20.13.5 TestAndIncr. So the full recommended sequence for updating a key is as follows, where keyChange is the ⟨random,delta⟩ keyChange object and index encodes the EngineID and the Name:
.1.3.6.1.6.3.10.1.1.2 0x7293f49a82fc950f5c344efd94dbb7db
.1.3.6.1.6.3.10.1.2.1 0x 0x
The new localized key is 0x7293f49a82fc950f5c344efd94dbb7db; the first hex string beginning 0x80001 is
the engineID.
For this new account to be able to do anything, we must also add the following permissions entry to
/etc/snmp/snmpd.conf:
rwuser master
The effect of this entry is to create an entry in vacmSecurityToGroupTable associating user
master with its own security group (Net-SNMP names it grpmaster), and then an entry in the
vacmAccessTable granting this new group SNMPv3 read and write access to the entire OID tree. (In
general we could also have used rouser, except that we will need master to be able to create new users.)
After permissions (at least read permissions) are enabled, the following snmpget should work
snmpget -v 3 -u master -l authNoPriv -a MD5 -A saskatchewan
localhost 1.3.6.1.2.1.1.4.0
The hostname is localhost if this is being run on the same machine that the Net-SNMP agent is running
on; it can also of course be run remotely.
We can make this a little shorter by editing the Net-SNMP per-user manager configuration file
$HOME/.snmp/snmp.conf to add the following lines:
defSecurityName master
defAuthType MD5
defSecurityLevel authNoPriv
defAuthPassphrase saskatchewan
Now the snmpget command can be shortened to
snmpget -v 3 localhost 1.3.6.1.2.1.1.4.0
If we take the password, saskatchewan, and repeat it to 2^20 (1,048,576) bytes, the MD5 checksum is 0x3da9dbfc3a78acb675e436746e1f4f8a; this is the digest of 20.15.5 Passwords and Keys. From the /var/lib/snmp/snmpd.conf file (or from the created usmUser entry above) we find the engineID is 0x80001f8880889cb038b1aca650. If we convert these two strings to binary data, concatenate them as digest‖engineID‖digest, and take the MD5 checksum, we indeed get
7293f49a82fc950f5c344efd94dbb7db
which is exactly the key entry in the /var/lib/snmp/snmpd.conf file.
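This check is easy to script. Here is a minimal Python sketch of the computation just described; the function name is ours, and the password, engineID and resulting key are the values above.

import hashlib

def md5_localized_key(password: str, engine_id_hex: str) -> str:
    pw = password.encode()
    repeated = (pw * (2**20 // len(pw) + 1))[:2**20]      # repeat password to 2^20 bytes
    digest = hashlib.md5(repeated).digest()               # the "digest" of 20.15.5
    engine_id = bytes.fromhex(engine_id_hex)
    return hashlib.md5(digest + engine_id + digest).hexdigest()   # digest||engineID||digest

print(md5_localized_key("saskatchewan", "80001f8880889cb038b1aca650"))
# should print 7293f49a82fc950f5c344efd94dbb7db, matching /var/lib/snmp/snmpd.conf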
20.15.9.2.1 Cloning in Net-SNMP
We can now clone this account while SNMP is running, using the snmpusm command-line utility. The
following line creates an account pld cloned from master, using the master account (and assuming that
master is an rwuser and not an rouser); we have abbreviated by auth the full credentials -u master -l
authNoPriv -a MD5 -A saskatchewan. We have switched from localhost to HOST to emphasize that this can be run remotely.
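Assuming the standard snmpusm syntax (the agent, then the create subcommand, then the new user and the account to clone from), the invocation would look something like the following, with auth standing for the credentials just listed:

snmpusm auth HOST create pld master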
20.16 Exercises
1. Consider the table below. The first column is the index.
index:  1, 3, 57, 92
count:  401, 523, 607, 727
veggie: kale, kohlrabi, mâche, okra
Give the OID for each data value, assuming this table were encoded in SNMP. Assume the columns are
assigned OID levels 1, 2, and 3 in order (including the index column), and the root of the subtree (the
tableEntry OID) is represented by T. Note that some data values are missing.
2. Recall that GetBulk() acts like a repeated GetNext(). Why is there no GetBulk() equivalent of
Get()?
3. What happens if we have a three-row, three-column table and ask, using GetBulk(), for the first two
columns with a repetition count of four? What is retrieved? Assume entries are of the form T.col.row, for
col and row each ranging from 1 to 3.
4. Consider the following multi-attribute GetNext() as presented in 20.8.1 Multi-attribute Get():
(a). Suppose we want all the table data about a given destination D, and have only the matrixSDTable.
Explain why every row of matrixSDTable would need to be examined.
(b). Now suppose we repeat the investigation of (a), but this time the manager has previously downloaded
the complete list of all hosts on the LAN by using the RMON hosts tables. If N is the number of hosts on
the LAN, explain how to find all hosts that communicated with D using only N retrieval requests.
(c). Explain how to find all hosts S that sent packets to D even more quickly using the matrixDSTable.
If M ≤ N is the number of such S, your answer should involve M+1 retrieval requests.
10. In the keychange operation of 20.15.7 Key Changes, suppose the manager simply transmitted delta2
= oldkey XOR newkey to the agent.
(a). Suppose an eavesdropper discovers delta2 and also knows a few bits of oldkey. What can the
eavesdropper learn about newkey? Would the same vulnerability apply to the mechanism of 20.15.7 Key
Changes?
(b). Suppose an eavesdropper later discovers newkey. Explain how to recover oldkey, and why this does
not work when the mechanism of 20.15.7 Key Changes is used.
11. List the OID prefixes for which a manager would need to be granted write permission if the manager
were to be able to modify all settings in .1.3.6.1.2.1 and .1.3.6.1.4.1, and change their own local key, but
not have access to columns in usmUserTable that would allow modification of other manager accounts.
(The VACM table has an exception option to make this easier).
12. Suppose a manager has write permission only for those columns in usmUserTable necessary for
password changes, but has full write access to the VACM tables. Explain how the manager can modify the
local keys of other managers.
13. Use Wireshark to monitor localhost traffic while you use the Net-SNMP snmpusm command to
change a manager's own local key. Does the command use the usmUserAuthKeyChange column or
the usmUserOwnAuthKeyChange column?
21 SECURITY
Finally, the third category here includes any form of eavesdropping. If the password for a login-shell account
is obtained this way, a first-category attack may follow. The usual approach to prevention is the use of
encryption. Encryption is closely related to secure authentication; encryption and authentication are
addressed in this chapter in 21.5 Secure Hashes through 21.9 SSH and TLS.
Encryption does not always work as desired. In 2006 intruders made off with 40 million credit-card records
from TJX Corp by breaking the WEP Wi-Fi encryption (21.6.8 Wi-Fi WEP Encryption Failure) used by
the company, and thus gaining access to account credentials and to file servers. Albert Gonzalez pleaded
guilty to this intrusion in 2009. This was the largest retail credit-card breach until the Target hack of late
2013.
In the diagram below, the left side shows the normal layout: the current stack frame has a buffer into which
the attacker's message is read. When the current function exits, control is returned to the address stored in
return_address.
The right side shows the result after the attacker has written shellcode to the buffer, and, by overflow, has
also overwritten the return_address field on the stack so that it now points to the shellcode. When the
function exits, control will be passed to the code at return_address, that is, to the shellcode rather than
to the original caller. The shellcode is here shown below return_address in memory, but it can also be
above it.
but supply an incorrect (and much too large) value for the parameter bufsize. This approach has the
practical advantage of allowing the attacker to supply a buffer with NUL characters (zero bytes) and with
additional embedded newline characters.
On the client side (the attacker's side) we need to come up with a suitable shellcode and create a too-large buffer containing, in the correct place, the address of the start of the shellcode. Guessing this address used to be easy, in the days when every process always started with the same virtual-memory address. It is now much harder, precisely to make this kind of buffer-overflow attack more difficult; we will cheat and have our server print out an address on startup that we can then supply to the client.
An attack like this depends on knowing the target operating system, the target cpu architecture (so as to
provide an executable shellcode), the target memory layout, and something about the target server implementation (so the attacker knows what overflow to present, and when). Alas, all but the last of these are
readily guessable. Once upon a time vulnerabilities in server implementations were discovered by reading
source code, but it has long been possible to search for overflow opportunities making use only of the binary
executable code.
21.2.2.1 The server
The overflow-vulnerable server, oserver.c, is more-or-less a C implementation of the tcp simplex-talk server
of 12.5 TCP simplex-talk; note the explicit call to bind() which was handled by the ServerSocket()
constructor in the Java version. For each new connection, a new process to handle it is created with fork();
that new process then calls process_connection().
The process_connection() function then reads a line at a time from the client into a buffer pcbuf
of 80 bytes. Unfortunately for the server, it may read well more than 80 bytes into pcbuf.
For the stack overflow to work, it is essential that the function that read in the oversized buffer (thus corrupting the stack) must return. Therefore the protocol has been modified so that process_connection() returns if the arriving string begins with "quit".
We must also be careful that the stack overflow does not so badly corrupt any local variables
that the code fails afterwards to continue running, even before returning. All local variables in
process_connection() are overwritten by the overflow, so we save the socket itself in the global
variable gsock.
We also call setstdinout(gsock) so that the standard input and standard output within
process_connection() is the network connection. This allows the use of gets(), which reads
from the standard input (alternatively, recv() with an incorrect value for bufsize may be used). It also
means that the shell launched by the shellcode will have its standard input and output correctly set up; we
could, of course, make the appropriate dup()/fcntl() calls from the shellcode, but that would increase
its complexity and size.
Because the standard output is now the TCP connection, the server prints incoming strings to the terminal
via the standard-error stream.
On startup, the server prints an address (that of mbuf[]) within its stack frame; we will refer to this as
mbuf_addr. The attacking client must know this value. No real server is so accommodating as to print its
internal addresses, but in the days before address randomization, 21.2.3.2 ASLR, the server's stack address was typically easy to guess.
Whenever a connection is made, the server here also prints out the distance, in bytes, between the start of mbuf[] in main() and the start of pcbuf (the buffer that receives the overflow) in process_connection(). This latter number, which we will call addr_diff, is constant, and must be compiled into the exploit program (it does change if new variables are added to the server's main() or process_connection()). The actual address of pcbuf[] is thus mbuf_addr − addr_diff. This will
be the address where the shellcode is located, and so is the address with which we want to overwrite the
stack. We return to this below in 21.2.2.3 The exploit, where we introduce a NOPslide so that the attacker
does not have to get the address of pcbuf[] exactly right.
Linux provides some protection against overflow attacks (21.2.3 Defenses Against Buffer Overflows), so the
server must disable these. As mentioned above, one protection, address randomization, we defeat by having
the server print a stack address. The server must also be compiled with the -fno-stack-protector
option to disable the stack canary of 21.2.3.1 Stack canary, and the -z execstack option to disable
making the stack non-executable, 21.2.3.3 Making the stack non-executable.
gcc -fno-stack-protector -z execstack -o oserver oserver.c
Even then we are dutifully warned by both the compiler and the linker:
warning: gets is deprecated ....
warning: the gets function is dangerous and should not be used.
In other words, getting this vulnerability still to work in 2014 takes a bit of effort.
The server here does work with the simplex-talk client of 12.5 TCP simplex-talk, but with the #USE_GETS
option enabled it does not handle a client exit gracefully.
We need eax to contain the numeric value 11, the 32-bit-linux syscall value corresponding to execve
(perhaps defined in /usr/include/i386-linux-gnu/asm/unistd_32.h). (64-bit linux uses 59
as the syscall value for execve.) We load this with mov al, 11; al is a shorthand for the low-order byte of register eax. We first zero eax by subtracting it from itself. We can also, of course, use mov eax, 11,
but then the 11 expands into four bytes 0x0b000000, and we want to avoid including NUL bytes in the code.
We also need ebx to point to the NUL-terminated string /bin/sh. The register ecx should point to an array of pointers [/bin/sh, 0] in memory (the null-terminated argv[]), and edx should point to a null word
in memory (the null-terminated envp[]). We include in the shellcode the string /bin/shNAAAABBBB,
and have the shellcode insert a NUL byte to replace the N and a zero word to replace the BBBB, as the
shellcode must contain no NUL bytes. The shellcode will also replace the AAAA with the address of
/bin/sh. We then load ecx with the address of AAAA (now containing the address of /bin/sh followed
by a zero word) and edx with the address of BBBB (now just a zero word).
Loading the address of a string is tricky in the x86 architecture. We want to calculate this address relative to
the current instruction pointer IP, but no direct access is provided to IP. The workaround in the code below
is to jump to shellstring near the end, but then invoke call start, where start: is the label for
our main code up at the top. The action of call start is to push the address of the byte following call
start onto the stack; this happens to be the address of shellstring:. Back up at start:, the pop
ebx pops this address off the stack and leaves it in ebx, where we want it.
Our complete shellcode is as follows (the actual code is in shellcode.s):
jmp short shellstring
start:
pop ebx
sub eax, eax
mov [ebx+7 ], al
mov [ebx+8 ], ebx
mov [ebx+12], eax
mov al, 11
lea ecx, [ebx+8]
lea edx, [ebx+12]
sub ecx,ecx
sub edx,edx
int 0x80
shellstring:
call start
db '/bin/shNAAAABBBB'
We can test this with a simple C program defining the above and including
int main(int argc, char **argv) {
int (*func)();
func = (int (*)()) shellcode;
(int)(*func)();
}
We can verify that the resulting shell has not inherited the parent environment with the env and set
commands.
Additional shellcode examples can be found in [AHLR07].
21.2.2.3 The exploit
Now we can assemble the actual attack. We start with a C implementation of the simplex-talk client, and add a feature so that when the input string is "doit", the client
sends the oversized buffer containing the shellcode, terminated with a newline to make gets() happy
sends "quit", terminated with a newline, to force the server's process_connection() to return, and thus to transfer control to the code pointed to by the now-overwritten return_address field of the stack
begins a loop, copylines(), to copy the local terminal's standard input to the network connection (hopefully now with a shell at the other end), and to copy the network connection to the local terminal's standard output
On startup, the client accepts a command-line parameter representing the address (in hex) of a variable close
to the start of the servers main() stack frame. When the server starts, it prints this address out; we simply
copy that when starting the client.
The full client code is in netsploit.c.
All that remains is to describe the layout of the malicious oversized buffer, created by buildbadbuf().
We first compute our guess for the address of the start of the vulnerable buffer pcbuf in the server's process_connection() function: we take the address passed in on the command line, which is actually the address of mbuf in the server's main(), and add to it the known constant (pcbuf - mbuf). This latter value, 147 in the version tested by the author, is stored in netsploit.c's BUF_OFFSET.
This calculation of the address of the server's pcbuf should be spot-on; if we now store our shellcode at the
start of pcbuf and arrange for the server to jump to our calculated address, the shellcode should run. Real
life, however, is not always so accommodating, so we introduce a NOP slide: we precede our shellcode
with a run of NOP instructions. A jump anywhere into the NOPslide should lead right into the shellcode. In
our example here, we make the NOPslide 20 bytes long, and add a fudge factor of between 0 and 20 to our
calculated address (FUDGE is 10 in the actual code).
We need the attack buffer to be large enough that it overwrites the stack return-address field, but small
enough that the server does not incur a segmentation fault when overfilling its buffer. We settle on a BADBUFSIZE of 161 (160 plus 1 byte for a final newline character); it should be comparable to, but perhaps slightly larger than, the BUF_OFFSET value (in our case, 147).
The attack buffer is now
20 bytes of NOPs
50 bytes of shellcode
90 bytes worth of the repeated calculated address (baseaddr-BUF_OFFSET+FUDGE in the code)
1 byte newline, as the server's gets() expects a newline-terminated string
Here is a diagram like the one above, but labeled for this particular attack. Not all memory regions are drawn
to scale, and there are more addresses between the two stack frames than just return address.
We double-check the bad buffer to make sure it contains no NUL bytes and no other newline characters.
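For concreteness, here is a rough Python rendering of what buildbadbuf() does; the real exploit is C, in netsploit.c, and the shellcode bytes and exact sizes here are placeholders following the layout above.

import struct

NOP = b"\x90"
BUF_OFFSET = 147          # the (mbuf - pcbuf) distance reported by the server in this example
FUDGE = 10                # aim partway into the NOPslide

def build_bad_buf(mbuf_addr: int, shellcode: bytes) -> bytes:
    target = mbuf_addr - BUF_OFFSET + FUDGE          # guessed address of pcbuf, nudged forward
    buf = NOP * 20 + shellcode                       # 20-byte NOPslide, then the shellcode
    guess = struct.pack("<I", target)                # little-endian 32-bit address
    buf += guess * ((160 - len(buf)) // 4)           # pad toward 160 bytes with repeated guesses
    assert b"\x00" not in buf and b"\n" not in buf   # the double-check described below
    return buf + b"\n"                               # final newline for the server's gets()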
If we wanted a longer NOPslide, we would have to hope there was room above the stack's return-address
field. In that case, the attack buffer would consist of a bunch of repeated address guesses, then the NOPslide,
and finally the shellcode.
After the command doit, the netsploit client prompt changes to 1>. We can then type ls, and, if the
shellcode has successfully started, we get back a list of files on the server. As earlier, we can also type env
and set to verify that the shell did not inherit its environment from any normal shell.
in guessing the shellcode address. With a NOPslide of length 2^10 = 1024 bytes, guessing the correct stack address will take only 2^6 = 64 tries. (Some implementations squeeze out 19 bits of address-space entropy, meaning that guessing the correct address increases to 2^9 = 512 tries.)
For 64-bit systems, however, ASLR is much more effective. Brute-force guessing of the stack address takes
a prohibitively long time.
ASLR also changes the heap layout and the location of libraries each time the server process is restarted. This is to prevent return-to-libc attacks, 21.2.1 Return to libc. For a concrete example of an attacker's use of non-randomized library addresses, see 21.3.2 A JPEG heap vulnerability.
On linux systems, ASLR can be disabled by writing a 0 to /proc/sys/kernel/randomize_va_space; values 1
and 2 correspond to partial and full randomization.
Windows systems since Vista (2007) have had ASLR support, though earlier versions of the linker required
the developer to request ASLR with the /DYNAMICBASE switch.
21.2.3.3 Making the stack non-executable
A more sophisticated idea, if the virtual-memory hardware supports it, is to mark those pages of memory
allocated to the stack as non-executable, meaning that if the processors instruction register is loaded with
an address on those pages (due to branching to stack-based shellcode), a hardware-level exception will
immediately occur. This immediately prevents attacks that place a shellcode on the stack, though return-tolibc attacks (21.2.1 Return to libc) are still possible.
In the x86 world, AMD introduced the per-page NX bit, for No eXecute, in their x86-64 architecture;
CPUs with this architecture began appearing in 2003. Intel followed with its XD, for eXecute Disabled,
bit. Essentially all x86-64 CPUs now provide hardware NX/XD support; support on 32-bit CPUs generally
requires so-called Physical Address Extension mode.
The NX idea applies to all memory pages, not just stack pages. This sometimes causes problems with
applications such as just-in-time compilation, where the code page is written and then immediately executed.
As a result, it is common to support NX-bit disabling in software. On linux systems, the compiler option
-z execstack disables NX-bit protection; this was used above in 21.2.2.1 The server. Windows has a
similar /NXCOMPAT option for requesting NX-bit protection.
While a non-executable stack prevents the stack-overflow attack described above, injecting shellcode onto
the heap is still potentially possible. The OpenBSD operating system introduced write xor execute in 2003; this is abbreviated W^X after the use of ^ as the XOR operator in C. A memory page may be writeable or
executable, but not both. This is strong protection against virtually all shellcode-injection attacks, but may
still be vulnerable to some return-to-libc attacks (21.2.1 Return to libc).
See [AHLR07], chapter 14, for some potential attack strategies against systems with non-executable pages.
clear proximity to an obvious return address. Despite that difference, heap overflows can also be used to
enable remote-code-execution attacks.
Perhaps the simplest heap overflow is to take advantage of the fact that some heap pages contain executable
code, representing application-loaded library pages. If the page with the overflowable buffer is pointed to
by p, and the following page in memory pointed to by q contains code, then all an attacker has to do is to
have the overflow fill the q page with a long NOPslide and a shellcode. When at some point a call is made
to the code pointed to by q, the shellcode is executed instead. A drawback to this attack is that the layout of
heap pages like this is seldom known. On the other hand, the fact that heap pages do sometimes legitimately
contain executable code means that uniformly marking the heap as non-executable, as is commonly done
with the stack, may not be an option.
As with the stack-overflow example, the gets(p) results in an overflow from block p into block q, overwriting not just the data in block q but also the block headers maintained by malloc(). While there is
no guarantee in general that block q will immediately follow block p in memory, in practice this usually
happens unless there has been a great deal of previous allocate/free activity.
The vulnerability here relies on some features of the 2003-era (glibc-2.2.4) malloc(). All blocks are
either allocated to the user or are free; free blocks are kept on a doubly linked list. We will assume the data
portion of each block is always a power of 2; there is a separate free-block list for each block size. When
two adjacent blocks are both free, malloc() coalesces them and moves the joined block to the free-block
list of the next size up.
All blocks are prefixed by a four-byte field containing the size, which we will assume here is the actual size (1032, including the header and alignment) rather than the user-available size of 1024. As the three low-order
bits of the size are always zero, one of these bits is used as a flag to indicate whether the block is allocated
or free. Crucially, another bit is used to indicate whether the previous block is allocated or free.
In addition to the size-plus-flags field, the first two 32-bit words of a free block are the forward and backward
pointers linking the block into the doubly linked free-block list; the size-plus-flag field is also replicated as
the last word of the block.
The strategy of the attacker, in brief, is to overwrite the p block in such a way that, when block q is freed,
malloc() thinks block p is also free, and so attempts to coalesce them, in the process unlinking block
p. But p was not in fact free, and the forward and backward pointers manipulated by malloc() as
part of the unlinking process are in fact provided by the attacker. As we shall see, this allows writing two
attacker-designated 32-bit words to two attacker-designated memory locations; if one of the locations holds
a return address and is updated so as to point to attacker-supplied shellcode also in block p, the system has
been compromised.
The normal operation of free(q), for an arbitrary block q, includes the following:
Get the block size (size) and flags at address q-4
Check the following block at address q+size to see if it is free; if so, merge (we are not interested
in this case)
Check the flags to see if the preceding block is free; if so, load its size prev_size from address
q-8, the address of the copy of size field in the diagram above; from that calculate a pointer to
the previous block as p = q - prev_size; then unlink block p (as the coalesced block will go
on a different free-block list).
For our specific block q, however, the attacker can overwrite the final size field of block p, prev_size
above, and can also overwrite the flag at address q-4 indicating whether or not block p is free. The attacker
cannot overwrite the header of block p, which would properly indicate that block p is still in use, but the free() code does not double-check that.
We will suppose that the attacker has overwritten block p to include the following:
setting the previous-block-is-free flag in the header of block q to true
setting the final size field of block p to a desired value, badsize
placing the value ADDR_F at address q-badsize; this is where free() will believe the previous block's forward pointer is located
placing the value ADDR_B at address q-badsize+4; this is where free() will believe the previous block's backward pointer is located
When the free(q) operation is now executed, the system will calculate the previous block as at address
p1 = q-badsize and attempt to unlink the false block p1. The normal unlink consists of the following two operations, where flink and blink are the forward and backward pointers read from the false block p1, that is, the attacker-supplied values ADDR_F and ADDR_B:
blink -> forward = flink
flink -> backward = blink
where, for the pointer increment in the second line, we take the type of ADDR_F to be char * or void
*.
At this point the jig is pretty much up. If we take ADDR_B to be the location of a return address on the
stack, and ADDR_F to be the location of our shellcode, then the shellcode will be executed when the current
stack frame returns. Extending this to a working example still requires a fair bit of attention to details; see
[AB03].
One important detail is that the data we use to overwrite block p generally cannot contain NUL bytes,
and yet a small positive number for badsize will have several NUL bytes. One workaround is to have
badsize be a small negative number, meaning that the false previous-block pointer p1 will actually come
after q in memory.
21.3.2 A JPEG heap vulnerability
The first of these simplifies to *blink = flink, as the offset to field forward is 0; this action allows
the attacker to write any word of memory (at address blink) with any desired value.
The JPEG-comment-overflow operation eventually runs out of heap and results in a segmentation fault, at
which point the heap manager attempts to allocate more blocks. However, the free list has already been
overwritten, and so, as above, this block-allocation attempt instead executes *blink = flink.
The attacker's conceptual goal is to have flink hold an instruction that branches to the shellcode, and blink hold the address of an instruction that will soon be executed, or, equivalently, to have flink hold the address of the shellcode and blink represent a location soon to be loaded and used as the target of a branch. The catch is that the attacker doesn't exactly know where the heap is, so a variant of the return-to-libc approach described in 21.2.1 Return to libc is necessary. The strategy described in the remainder of this section, described in [JA05], is one of several possible approaches.
In Windows XP SP1, location 0x77ed73b4 holds a pointer to the entry point of the Unhandled Exception Filter handler; if an otherwise-unhandled program exception occurs, Windows creates an EXCEPTION_POINTERS structure and branches to the address stored here. It is this address, which we will refer to as UEF, the attack will overwrite, by setting blink = UEF. A call to the Unhandled Exception Filter will be triggered by a suitable subsequent program crash.
When the exception occurs (typically by the second operation above, flink -> backward = blink),
before the branch to the address loaded from UEF, the EXCEPTION_POINTERS structure is created on the
heap, overwriting part of the JPEG comment buffer. The address of this structure is stored in register edi.
It turns out that, scattered among some standard libraries, there are half a dozen instructions at known
addresses of the form call DWORD [edi+0x74], that is, call the subroutine at 32-bit address edi +
0x74 ([AHLR07], p 186). All these call instructions are intended for contexts in which register edi
has been set up by immediately preceding instructions to point to something appropriate. In our attacker's context, however, edi points to the EXCEPTION_POINTERS structure; 0x74 bytes past edi is part of the attacker's overflowed JPEG buffer that is safely past this structure and will not have been overwritten by it.
One such call instruction happens to be at address 0x77d92a34, in user32.dll. This address is the
value the attacker will put in flink.
So now, when the exception comes, control branches to the address stored in UEF. This address now points
to the above call DWORD [edi+0x74], so control branches again, this time into the attacker-controlled
buffer. At this point, the processor lands on the NOPslide and ends up executing the shellcode (sometimes
one more short jump instruction is needed, depending on layout).
This attack depends on the fact that a specific instruction, call DWORD [edi+0x74], can be found at a
specific, fixed address, 0x77d92a34. Address-space layout randomization (21.2.3.2 ASLR) would have
prevented this; it was introduced by Microsoft in Windows Vista in 2007.
Unless the website in question does careful html filtering of what users upload, any other site visitor who so
much as views this comment will have the do_something_bad() script executed by his or her browser.
The script might email information about the target user to the attacker, or might attempt to exploit a browser
vulnerability on the target system in order to take it over completely. The script and its enclosing tags will
not appear in what the victim actually sees on the screen.
The do_something_bad() code block will usually include javascript function definitions as well as
function calls.
In general, the attacker can achieve the same effect if the victim visits the attacker's website. However, the popularity (and apparent safety) of the third-party website is usually important in practice; it is also common for the attack to involve obtaining private information from the victim's account on that website.
21.4.1 Evasion
The NIDS will have to reassemble TCP streams (and sometimes sequences of UDP packets) in order to
match signatures. This raises the possibility of evasion: the attacker might arrange the packets so that the
NIDS reassembles them one way and the target another way. The possibilities for evasion are explored in
great detail in [PN98]; see also [SP03].
One simple way to do this is with overlapping TCP segments. What happens if one packet contains bytes 1-6 of a TCP connection as "helpfu" and a second packet contains bytes 2-7 as "armful"?
h   e   l   p   f   u
    a   r   m   f   u   l
These can be reassembled as either helpful or harmful; the TCP specification does not say which is preferred and different operating systems routinely reassemble these in different ways. If the NIDS reassembles
the packets one way, and the target the other, the attack goes undetected. If the attack is spread over multiple
packets, there may be many more than two ways that reassembly can occur, increasing the probability that
the NIDS and the target will differ.
Another possibility is that one of the overlapping segments has a header irregularity (in either the IP or TCP
header) that causes it to be rejected by the target but not by the NIDS, or vice-versa. If the packets are
h   e   l   p   f   u   l
h   a   r   m   f   u   l
and both systems normally prefer data from the first segment received, then both would reassemble as
helpful. But if the first packet is rejected by the target due to a header flaw, then the target receives
harmful. If the flaw is not recognized by the NIDS, the NIDS does not detect the problem.
A very similar problem occurs with IPv4 fragment reassembly, although IPv4 fragments are at this point
intrinsically suspicious.
One approach to preventing evasion is to configure the NIDS with information about how the actual target
does its reassembly, so the NIDS can match it. An even safer approach is to have the NIDS reassemble any
overlapping packets and then forward the result on to the potential target.
In the MD5 hash function, the input is processed in blocks of 64 bytes. Each block is divided into sixteen
32-bit words. One such block of input results in 64 iterations from a set of sixteen round functions Hi, each applied four times in all. Each 32-bit input word is used as the key parameter to one of the Hi four times.
If the total input length is not a multiple of 512 bits, it is padded; the padding includes a length attribute so
two messages differing only by the amount of padding should not hash to the same value.
While this framework in general is believed secure, and is also used by the SHA-2 family, it does suffer
from what is sometimes called the length-extension vulnerability. If h = hash(m), then the value h is simply the final state after the above mechanism has processed message m. An attacker knowing only h can then initialize the above algorithm with h, and continue it to find the hash h′ = hash(m‖m′), for an arbitrary message m′ concatenated to the end of m, without knowing m. If the original message m was padded to message mp, then the attacker will find h′ = hash(mp‖m′), but that is often enough. This vulnerability must
be considered when using secure-hash functions for message authentication, next.
The SHA-3 family of hash functions does not use the Merkle-Damgård construction and is believed not
vulnerable to length-extension attacks.
The values 0x36 (0011 0110) and 0x5c (0101 1100) are not critical, but the XOR of them has, by intent, a
reasonable number of both 0-bits and 1-bits; see [BCK96].
The HMAC algorithm is, somewhat surprisingly, believed to be secure even when the underlying hash
function is vulnerable to some kinds of collision attacks; see [MB06] and RFC 6151. That said, a hash
function vulnerable to collision attacks may have other vulnerabilities as well, and so HMAC-MD5 should
still be phased out.
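For concreteness, here is the standard RFC 2104 HMAC construction in Python, using the 0x36 and 0x5c pads just mentioned; the final line checks the sketch against the standard library's hmac module.

import hashlib, hmac

def hmac_md5(key: bytes, msg: bytes) -> bytes:
    # H((K xor opad) || H((K xor ipad) || msg)), with a 64-byte block size
    block = 64
    if len(key) > block:
        key = hashlib.md5(key).digest()
    key = key.ljust(block, b"\x00")
    ipad = bytes(b ^ 0x36 for b in key)
    opad = bytes(b ^ 0x5c for b in key)
    inner = hashlib.md5(ipad + msg).digest()
    return hashlib.md5(opad + inner).digest()

assert hmac_md5(b"k", b"m") == hmac.new(b"k", b"m", hashlib.md5).digest()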
Typical key lengths for shared-key ciphers believed secure range from 128 bits to 256 bits. Recent key-length recommendations for public-key RSA encryption are 2048 bits. The difference here is because for most shared-key ciphers there are no known attacks that are much faster than brute force, and 2^256 ≈ 10^77 is quite a large number. For most public-key encryption mechanisms, on the other hand, there are shortcuts; for RSA an attacker can use a factoring shortcut.
Shared-key ciphers can be either block ciphers, encrypting data in units of blocks that might typically be 8
bytes long, or stream ciphers, which generate a pseudorandom keystream. Each byte (or even bit) of the
message is then encrypted by (typically) XORing it with the corresponding byte of the keystream.
In the Feistel-network approach, the block to be encrypted is divided into two halves, which we can call L and H for Low and High. A typical round function is now of the following form, where
K is the key and F(x,K) takes a word x and the key K and returns a new word:
⟨L,H⟩ ⟶ ⟨H, L XOR F(H,K)⟩
Visually, this is often diagrammed with the two half-blocks crossing between rounds (diagram omitted here).
One round here scrambles only half the block, but the other half gets scrambled in the next round (sometimes the operation above is called a half-round for that reason). The total number of rounds is typically
somewhere between 10 and 50. Much of the art of designing high-quality block ciphers is to come up with
round functions that result in overall very fast execution, without introducing vulnerabilities. The more
rounds, the slower.
The internal function F, often different for each round, may look at only a designated subset of the bits of
the key K. Note that the operation above is invertible (that is, can be decrypted) regardless of F; given the
right-hand side the receiver can compute F(H,K) and thus, by XORing this with L XOR F(H,K), can recover
L. This remains true if, as is sometimes the case, the XOR operation is replaced with ordinary addition.
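Here is a minimal python3 sketch of a single such round, with 32-bit words and a deliberately simplistic stand-in for F; the only point being illustrated is that the round can be undone no matter what F computes:

MASK = 0xffffffff                  # keep values to 32 bits

def F(x, k):                       # stand-in round function; not meant to be secure
    return ((x + k) ^ (x >> 3)) & MASK

def round_encrypt(L, H, K):
    # new <L,H> is <old H, old L XOR F(old H, K)>
    return H, (L ^ F(H, K)) & MASK

def round_decrypt(L, H, K):
    # the received first word is the old H, so F(old H, K) can be recomputed
    return (H ^ F(L, K)) & MASK, L

L, H, K = 0x01234567, 0x89abcdef, 0x0badf00d
assert round_decrypt(*round_encrypt(L, H, K), K) == (L, H)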
Crypto Law
The Salsa20 cipher mentioned here is a member of Daniel Bernstein's "snuffle" family of ciphers based
on secure-hash functions. In the previous century, the US government banned the export of ciphers but
not secure-hash functions. They also at one point banned the export (and thus publication) of one of
Bernstein's papers; he sued. In 1999, the US Court of Appeals for the Ninth Circuit found in his favor.
The decision, 176 F.3d 1132, is the only appellate decision of which the author is aware that contains
(in the footnotes) not only samples of C code, but also of Lisp.
A simple F might return the result of XORing H and a subset of the bits of K. This is usually a little too
simple, however. Ordinary modulo-32 addition of H and a subset of K often works well; the interaction of
addition and XORing introduces considerable bit-shuffling (or diffusion in the language of cryptography).
Other operations used in F(H,K) include Boolean And and Or operations. 32-bit multiplication introduces
considerable bit-shuffling, but is often computationally more expensive. The Salsa20 cipher of [DB08] uses
only XOR and addition, for speed.
It is not uncommon for the round function also to incorporate bit rotation of one or both of L and H; the
result of bit-rotation of 1000 1110 two places to the left is 0011 1010.
If a larger blocksize is desired, say 128 bits, but we still want to avail ourselves of the efficiency of 32-bit
operations, the block can be divided into A,B,C,D. The round function might then become
⟨A,B,C,D⟩ ⟶ ⟨B,C,D, A XOR F(B,C,D,K)⟩
As mentioned above, many secure-hash functions use block-cipher round functions that then use chunks of
the message being hashed as the key. In the MD5 algorithm, block A above is transformed into the 32-bit sum of input-message fragment M, a constant K, and G(B,C,D), which can be any of several Boolean
combinations of B, C and D.
An alternative framework for block ciphers is the so-called substitution-permutation network.
The first block cipher in widespread use was the federally sanctioned Data Encryption Standard, or DES
(commonly pronounced "dez"). It was developed at IBM by 1974 and then selected by the US National
Bureau of Standards (NBS) as a US standard after some alterations recommended by the National Security
Agency. One of the NSA's recommendations was that a key size of 56 bits was sufficient; this was in an
era when the US government was very concerned about the spread of strong encryption. For years, many
people assumed the NSA had intentionally introduced other weaknesses in DES to facilitate government
eavesdropping, but after forty years no such vulnerability has been discovered and this no longer seems so
likely. The suspicion that the NSA had in the 1970s the resources to launch brute-force attacks against
DES, however, has greater credence.
In 1997 an academic team launched a successful brute-force attack on DES. The following year the Electronic Frontier Foundation (EFF) built a hardware DES cracker for about US$250,000 that could break DES
in a few days.
In an attempt to improve the security of DES, triple-DES or 3DES was introduced. It did not become an
official standard until the late 1990s, but a two-key form was proposed in 1978. 3DES involves three
applications of DES with keys K1, K2, K3; encrypting a plaintext P to ciphertext C is done by C =
E_K3(D_K2(E_K1(P))). The middle deciphering option D_K2 means the algorithm reduces to DES when K1 =
K2 = K3; it also reduces exposure to a particular vulnerability known as meet-in-the-middle (no relation to man-in-the-middle). In [MH81] it is estimated that 3DES with three distinct keys has a strength
roughly equivalent to 2×56 = 112 bits. That same paper also uses the meet-in-the-middle attack to show
that straightforward double-DES encryption C = E_K2(E_K1(P)) has an effective keystrength of only 56 bits,
no better than single DES, if sufficient quantities of plaintext and corresponding ciphertext are known.
As concerns about the security of DES continued to mount, the US National Institute of Standards and
Technology (NIST, the new name for the NBS) began a search for a replacement. The result was the
Advanced Encryption Standard, AES, officially approved in 2001. AES works with key lengths of 128, 192
and 256 bits. The algorithm is described in [DR02], and is based on the Rijndael family of ciphers; the name
Rijndael ("rain-dahl") is a combination of the authors' names.
Earlier non-DES ciphers include IDEA, the International Data Encryption Algorithm, described in [LM91],
and Blowfish, described in [BS93]. Both use 128-bit keys. The IDEA algorithm was patented; Blowfish
was intentionally not patented. A successor to Blowfish is Twofish.
to MALLORY
Amount $1000
If the attacker also knows the ciphertext for Amount $100,000, and is in a position to rewrite
the message (or to inject a new message), then the attacker can combine the first two blocks above
with the third $100,000 block to create a rather different message, without knowing the key. At
http://en.wikipedia.org/wiki/Block_cipher_mode_of_operation there is a remarkable example of the failure
of ECB to effectively conceal an encrypted image.
As a result, ECB is seldom used. A common alternative is cipher block chaining or CBC mode. In this
mode, each plaintext block is XORed with the previous ciphertext block before encrypting:
Ci = E(K,Ci-1 XOR Pi )
To decrypt, we use
Pi = D(K,Ci ) XOR Ci-1
If we stop here, this means that if two messages begin with several identical plaintext blocks, the encrypted
messages will also begin with identical blocks. To prevent this, the first ciphertext block C0 is a random
string, known as the initialization vector, or IV. The plaintext is then taken to start with block P1 . The IV
is sent in the clear, but contains no private information.
See exercise 5.
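The chaining itself is easy to experiment with; the following python3 sketch uses a stand-in block cipher E/D (a key-driven byte substitution, chosen only to keep the example self-contained; a real implementation would use AES or another standard block cipher):

import os, random

BLOCK = 8

def _perm(key):                       # key-dependent byte permutation
    p = list(range(256))
    random.Random(key).shuffle(p)
    return p

def E(key, block):                    # stand-in block encryption
    p = _perm(key)
    return bytes(p[b] for b in block)

def D(key, block):                    # the corresponding decryption
    p = _perm(key)
    inv = [0]*256
    for i, v in enumerate(p): inv[v] = i
    return bytes(inv[b] for b in block)

def xor(a, b): return bytes(x ^ y for x, y in zip(a, b))

def cbc_encrypt(key, blocks):
    C = [os.urandom(BLOCK)]           # C0: the random initialization vector
    for P in blocks:
        C.append(E(key, xor(C[-1], P)))    # Ci = E(K, Ci-1 XOR Pi)
    return C

def cbc_decrypt(key, C):
    return [xor(D(key, C[i]), C[i-1]) for i in range(1, len(C))]   # Pi = D(K,Ci) XOR Ci-1

blocks = [b'ABCDEFGH', b'ABCDEFGH', b'12345678']
C = cbc_encrypt(1234, blocks)
print(C[1] != C[2])                   # True: identical plaintext blocks encrypt differently
print(cbc_decrypt(1234, C) == blocks) # True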
then the attacker can XOR the two bytes over the ^ with 0 XOR 2, changing the character 0 to a 2
and the 2 to a 0. The attacker does this to the encrypted stream, but the decrypted plaintext stream retains
the change. Appending an authentication code such as HMAC, 21.5.1 Secure Hashes and Authentication,
prevents this.
Stream ciphers are, in theory, well suited to the encryption of data sent a single byte at a time, such as data
from a telnet session. The ssh protocol (21.9.1 SSH), however, generally uses block ciphers; when it has to
send a single byte it pads the block with random data.
21.6.5.1 RC4
The RC4 stream cipher is quite widely used. It was developed by Ron Rivest at RSA Security in 1987, but
never formally published. The code was leaked, however, and so the internal details are widely available.
Unofficial implementations are sometimes called ARC4 or ARCFOUR, the A for Alleged.
RC4 was designed for very fast implementation in software; to this end, all operations are on whole bytes.
RC4 generates a keystream of pseudorandom bytes, each of which is XORed with the corresponding plaintext byte. The keystream is completely independent of the plaintext.
The key length can range from 5 up to 256 bytes. The unofficial protocol contains no explicit mechanism
for incorporating an initialization vector, but an IV is well-nigh essential for stream ciphers; otherwise an
identical keystream is generated each time. One simple approach is to create a session key by concatenating
the IV and the secret RC4 key; alternatively (and perhaps more safely) the IV and the RC4 key can be
XORed together.
RC4 was the basis for the ill-fated WEP Wi-Fi encryption, 21.6.8 Wi-Fi WEP Encryption Failure, due in
part to WEP's requirement that the 3-byte IV precede the 5-byte RC4 key. The flaw there did not affect
other applications of RC4, but newer attacks have suggested that RC4 be phased out.
Internally, an RC4 implementation maintains the following state variables:
An array S[] representing a permutation i → S[i] of all bytes
two 8-bit integer indexes to S[], i and j
S[] is guaranteed to represent a permutation (ie is guaranteed to be a 1-1 map) because it is initialized to the
identity and then all updates are transpositions (involving swapping S[i] and S[j]).
Initially, we set S[i] = i for all i, and then introduce 256 transpositions. This is known as the key-scheduling
algorithm. In the following, keylength is the length of the key key[] in bytes.
J = 0;
for I = 0 to 255:
    J = J + S[I] + key[I mod keylength];
    swap S[I] and S[J]
As we will see in 21.6.8 Wi-Fi WEP Encryption Failure, 256 transpositions is apparently not enough.
After initialization, I and J are reset to 0. Then, for each keystream byte needed, the following is executed,
where I and J retain their values between calls:
I = (I+1) mod 256
J = J + S[I] mod 256
swap S[I] and S[J]
return S[ S[I] + S[J] mod 256 ]
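These loops translate directly into python3; the sketch below packages the keystream as a generator and XORs it with the data (encryption and decryption are the same operation):

def rc4(key):                         # key is a sequence of byte values
    S = list(range(256))              # identity permutation
    J = 0
    for I in range(256):              # key-scheduling algorithm
        J = (J + S[I] + key[I % len(key)]) % 256
        S[I], S[J] = S[J], S[I]
    I = J = 0
    while True:                       # keystream generation
        I = (I + 1) % 256
        J = (J + S[I]) % 256
        S[I], S[J] = S[J], S[I]
        yield S[(S[I] + S[J]) % 256]

def rc4_xor(key, data):               # XOR the data with the keystream
    ks = rc4(key)
    return bytes(b ^ next(ks) for b in data)

c = rc4_xor(b'Key', b'Plaintext')
print(c.hex())                        # can be checked against the widely published Key/Plaintext test vector
print(rc4_xor(b'Key', c))             # b'Plaintext'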
encrypt-then-MAC: encrypt the plaintext and calculate the MAC of the ciphertext; append the MAC
to the ciphertext
These are analyzed in [BN00], in which it is proven that encrypt-then-MAC in general satisfies some stronger
cryptographic properties than the others, although these properties may hold for the others in special cases.
Encrypt-then-MAC means that no decryption is attempted of a forged or spoofed message.
The first value of J, when I = 0, is K[0], which is 3; the first transposition therefore swaps S[0] and S[3].
For the next iteration, I is 1 and J is 3+1+(-1) = 3 again; the swap exchanges S[1] and S[3].
Next, I is 2 and J is 3+2+5 = 10. In general, if the IV were 3,-1,X, J would be 5+X.
At this point, we have processed the three-byte IV; the next iteration involves the first secret byte of K. I is
3 and J becomes 10+1+K[3] = 15, so S[3] and S[15] are swapped.
Recall that the first byte returned in the keystream is S[ S[1] + S[S[1]] ]. At this point, that would
be S[0+3] = 15.
From this value, 15, and the IV 3,-1,5, the attacker can calculate, by repeating the process above, that
K[3] = 4.
If the IV were 3,-1,X, then, as noted above, in the I=2 step we would have J = 5+X, and then in the I=3
step we would have J = 6 + X + K[3]. If Y is the value of the first byte of the keystream, using S[] as of
this point, then K[3] = Y - X - 6 (assuming that X is not -5 or -4, so that S[0] and S[1] are not changed
in step 3).
If none of S[0], S[1] and S[3] are changed in the remaining 252 iterations, the value we predict here
after three iterations (eg 15) would be the first byte of the actual keystream. Treating future selections of
the value J as random, the probability that one iteration does not select J in the set {0,1,3} and thus leaves
these values alone is 253/256. The probability that the remaining iterations do not change these values in
S[] is thus about (253/256)^252 ≈ 5%.
A 5% success rate is not terribly impressive, on the face of it. But 5% of the time we identify K[3]
correctly, and the other 95% of the time the (incorrect) values we calculate for K[3] are uniformly, randomly
distributed in the range 0 to 255. With 60 guesses, we expect about three correct guesses. The other 57 are
spread about with, most likely, no more than two guesses for any one value. If we look for the value guessed
most often, it is likely to be the true value of K[3]; increasing the number of 3,-1,X IVs will increase the
certainty here.
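This voting strategy is easy to simulate in python3. The sketch below uses a made-up secret key with K[3] = 4, generates weak IVs of the form 3,-1,X with random X, runs the full key schedule, and tallies the guess Y - X - 6 for each; the true K[3] should emerge as the most frequent guess.

import random
from collections import Counter

def first_keystream_byte(key):        # full 256-step key schedule, then one keystream step
    S = list(range(256)); J = 0
    for I in range(256):
        J = (J + S[I] + key[I % len(key)]) % 256
        S[I], S[J] = S[J], S[I]
    I, J = 1, S[1] % 256
    S[I], S[J] = S[J], S[I]
    return S[(S[I] + S[J]) % 256]

secret = [4, 10, 20, 30, 40]          # hypothetical secret key bytes K[3]..K[7]; K[3] is 4
votes = Counter()
for _ in range(300):
    X = random.randrange(256)
    Y = first_keystream_byte([3, 255, X] + secret)   # IV is 3,-1,X
    votes[(Y - X - 6) % 256] += 1     # the weak-IV guess for K[3]

print(votes.most_common(3))           # 4 should be the clear winner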
For the sake of completeness, in the next two iterations, I is 4 and 5 respectively, and J is 15+4+(-14) = 5
and 5+4+3 = 12 respectively.
Now that we know K[3], the next step is to find K[4]. Here, we search the traffic for IVs of the form
4,-1,X, but the general strategy above still works.
The [FMS01] attack, as originally described, requires on average about 60 weak IVs for each secret key
byte, and a weak IV turns up about every 2^16 packets. Each key byte, however, requires a different IV, so
any one IV has one chance per key byte to be a weak IV. To guess a key thus requires 60×65536 ≈ 4 million
packets.
But we can do quite a bit better. In [TWP07], the number of packets needed is only about 40,000 on average.
At that point the attack is entirely practical.
This vulnerability was used in the attack on TJX Corp mentioned in the opening section.
21.8.1 RSA
Public-key encryption was outlined in [DH76], but the best-known technique is arguably the RSA algorithm
of [RSA78]. (The RSA algorithm was discovered in 1973 by Clifford Cocks at GCHQ, but was classified.)
To construct RSA keys, one first finds two very large primes p and q, perhaps 1024 binary digits each. These
primes should be chosen at random, by choosing a random N and then testing N, N+1, ... for primality until
success. Let n = pq.
It now follows from Fermat's little theorem that, for any integer m, m^((p-1)(q-1)) = 1 mod n (it suffices to show
m^((p-1)(q-1)) = 1 mod p and m^((p-1)(q-1)) = 1 mod q).
One then finds positive integers e and d so that ed = 1 mod (p-1)(q-1); given any e relatively prime to both
p-1 and q-1 it is possible to find d using the Extended Euclidean Algorithm (a simple Python implementation
appears below in 21.9.3 Creating an RSA Key). From the claim in the previous paragraph, we now know
m^(ed) = (m^e)^d = m mod n.
If we take m as a message (that is, as a bit-string of length less than the bit-length of n, rather than as an
integer), we encrypt it as c = m^e mod n. We can decrypt c and recover m by calculating c^d mod n = m^(ed)
mod n = m.
The public key is the pair ⟨n,e⟩; the private key is d.
Elliptic-curve cryptography
One of the concerns with RSA is that a faster factoring algorithm will someday make the encryption
useless. A newer alternative is the use of elliptic curves; for example, the set of solutions modulo a
large prime p of the equation y^2 = x^3 + ax + b. This set has a natural (but nonobvious) product operation
completely unrelated to modulo-p multiplication, and, as with modulo-p arithmetic, finding n given B
= A^n appears to be quite difficult. Several cryptographic protocols based on elliptic curves have been
proposed; see Wikipedia.
As with Diffie-Hellman-Merkle (21.7.1 Fast Arithmetic), the operations above can all be done in polynomial time with respect to the bit-lengths of p and q.
The theory behind the security of RSA is that, if one knows the public key, the only way to find d in practice
is to factor n. And factoring is a hard problem, with a long history. There are faster ways to factor than to
try every candidate less than n, but it is still believed to require, in general, exponential time with respect
to the bit-length of n.
RSA encryption is usually thousands of times slower than comparably secure shared-key ciphers. However,
RSA needs only to be used to encrypt a secret key for a shared-key cipher; it is this latter key that actually
protects the document.
For this reason, if a message is encrypted for multiple recipients, it is usually not much larger than if it were
encrypted for only one: the additional space needed for each additional recipient is just the encrypted key
for the shared-key cipher.
Similarly, for digital-signature purposes Alice doesn't have to encrypt the entire message with her private
key; it is sufficient (and much faster) to encrypt a secure hash of the message.
If Alice and Bob want to negotiate a session key using public-key encryption, it is sufficient for Alice to
know Bob's public key; Alice does not even need a public key. Alice can choose a session key, encrypt it
with Bob's public key, and send it to Bob.
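As a sketch of this hybrid approach (using the third-party Python cryptography package and its Fernet symmetric cipher purely for illustration):

from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes
from cryptography.fernet import Fernet

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# Bob's key pair; in practice the public half would come from a certificate
bob_priv = rsa.generate_private_key(public_exponent=65537, key_size=2048)
bob_pub = bob_priv.public_key()

# Alice: choose a session key, encrypt it with Bob's public key,
# and encrypt the actual document with the session key
session_key = Fernet.generate_key()
wrapped_key = bob_pub.encrypt(session_key, oaep)
ciphertext = Fernet(session_key).encrypt(b'the actual document')

# Bob: recover the session key with his private key, then the document
key = bob_priv.decrypt(wrapped_key, oaep)
print(Fernet(key).decrypt(ciphertext))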
We will look at an actual calculation of an RSA key and then an encrypted message in 21.9.3 Creating an
RSA Key.
will have been signed by a certificate authority (21.9.2.1 Certificate Authorities); Alice presumably trusts
this certificate authority. (SSH does now support certificate authorities too, but their use with SSH seems
not yet to be common.)
Both SSH and TLS eventually end up negotiating a shared-secret session key, which is then used for most
of the actual data encryption.
21.9.1 SSH
The SSH protocol establishes an encrypted communications channel between two hosts, after establishing
the identities of each endpoint. Its primary use is to enable secure remote-command execution with input
and output via the secure channel; this includes the remote execution of an interactive shell, which is in
effect a telnet-style terminal login with encryption. The companion program scp uses the SSH protocol to
implement secure file transfer. Finally, ssh supports secure port forwarding from one machine (and port)
to another; unrelated applications can then connect to one machine and find themselves securely talking to
another. The current version of the SSH protocol is 2.0, and is defined in RFC 4251. The authentication
and transport sub-protocols are defined in RFC 4252 and RFC 4253 respectively.
One of the first steps in an SSH connection is for the endpoints to negotiate which secret-key cipher (and
mode) to use. Support of the following ciphers is recommended and there is a much longer list of optional ciphers (which include RC4 and Blowfish); the table below includes those added by RFC 4344:
cipher    modes       nominal keylength
3DES      CBC, CTR    168 bits
AES       CBC, CTR    128 bits
AES       CTR         192 bits
AES       CTR         256 bits
SSH supports a special name format for including new ciphers for local use.
The SSH protocol also allows the endpoints to negotiate a public-key-encryption mechanism, a secure-hash
function, and even a key-exchange algorithm although only minor variants of Diffie-Hellman-Merkle key
exchange are implemented.
If Alice wishes to connect to a server S, the server clearly wants to verify Alices identity, either through a
password or some other means. But it is just as important for Alice to verify the identity of S, to prevent
man-in-the-middle attacks and the possibility that an attacker masquerading as S is simply collecting Alice's
password for later use.
In the following subsections we focus on the common SSH configuration and ignore some advanced
options.
21.9.1.1 Server Authentication
To this end, one of the first steps in SSH connection negotiation is for the server to send the public half of
its host key to Alice. Alice verifies this key, which is typically in her known_hosts file. Alice also asks
S to sign something with its host key. If Alice can then decrypt this with the public host key of S, she is
confident that she is talking to the real S.
If this is Alice's first attempt to connect to S, however, she should get a warning message that the authenticity of S cannot be established, asking her whether she wants to continue connecting.
In several command-line implementations of ssh, the various stages of authentication can be observed from
the client side by using the ssh or slogin command with the -v option.
21.9.1.4 The Session
Once an SSH connection has started, a new session key is periodically negotiated. RFC 4253 recommends
this after one hour or after 1 GB of data is exchanged.
Data is then sent in packets generally with size a multiple of the block size of the cipher. If not enough data
is available, eg because only a single keystroke (byte) is being sent, the packet is padded with random data
as needed. Every packet is required to be padded with at least four bytes of random data to thwart attacks
based on known plaintext/ciphertext pairs. Included in the encrypted part of the packet is a byte indicating
the length of the padding.
21.9.2 TLS
Transport Layer Security, or TLS, is an IETF extension of the Secure Socket Layer (SSL) protocol originally
developed by Netscape Communications. Its primary role is encrypting web connections and verifying for
the client the authenticity of the server; its current specification is RFC 5246.
Unlike SSH, client authentication, while possible, is not common; web servers often have no pre-existing
relationship with the client. Also unlike SSH, the public-key mechanisms are all based on established
certificate authorities, or CAs, whereas the most common way for an SSH server's host key to end up on
a client is for it to have been accepted by the user during the first connection. Browsers (and other TLS
applications as necessary) have embedded lists of certificate authorities trusted by the browser vendor. SSH
requires no such centralized trust.
If Bob wishes to use TLS on his web server SBob , he must first arrange for a certificate authority, say CA1 ,
to sign his certificate. A certificate contains the full DNS name of SBob , say bob.int, a public key KS used
by SBob , and also an expiration date. The use of the ITU-T X.509 certificate format is common.
Now imagine that Alice connects to Bob's server SBob using TLS. Early in the process SBob will send her its
signed certificate, claiming public key KS. Alice's browser will note that the certificate is signed by CA1,
and will look up CA1 on its list of trusted certificate authorities. If found, the next step is for the browser to
use CA1's public key, also on the list, to verify the signature on the certificate SBob sent.
If everything checks out, Alice's browser now knows that CA1 certifies that bob.int has public key KS. As SBob
has presented this key KS , and is able to verify that it possesses the matching private key, this is proof that
SBob is legitimately the server with domain name bob.int.
Assuming, of course, that CA1 is correct.
As with SSH, once Alice has accepted the public key KS of SBob , a secret-key cipher is negotiated and the
remainder of the exchange is encrypted.
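In practice this certificate checking is delegated to a TLS library. In python3, for example, the ssl module's default context loads the system's trusted-CA list and performs the hostname check automatically; the host name below is just a placeholder:

import socket, ssl

host = 'example.com'
ctx = ssl.create_default_context()          # system CA list, hostname checking
with socket.create_connection((host, 443)) as sock:
    with ctx.wrap_socket(sock, server_hostname=host) as tls:
        print(tls.version())                # negotiated TLS version
        cert = tls.getpeercert()
        print(cert['subject'])              # who the certificate vouches for
        print(cert['notAfter'])             # its expiration date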
21.9.2.1 Certificate Authorities
A certificate authority, or CA, is just an entity in the business of signing certificates. The purpose of the
certificate authority is to prevent man-in-the-middle attacks (21.8.3 Trust and the Man in the Middle); Alice
wants to be sure that the server she reaches really is SBob. If Mallory can interpose his own server SBad between Alice and SBob, along the following lines,

(diagram: Alice's connection goes to SBad, which in turn opens its own connection to SBob)
then Alice can be tricked into connecting to SBad instead of SBob . Alice will request a certificate, but from
SBad instead of SBob, and get Mallory's certificate from CA2 instead of Bob's actual certificate from CA1. Mallory
opens a second connection from SBad to SBob (this is easy, as Bob makes no attempt to verify Alice's
identity), and forwards information from one connection to the other. As far as TLS is concerned, everything
checks out. From Alice's perspective, Mallory's false certificate vouches for the key Kbad of SBad, CA2 has
signed this certificate, and CA2 is trusted by Alice's browser. There is no search at any point through the
other CAs to see if any of them have any contrary information about SBob . In fact, there is not necessarily
even contact with CA2 , though see 21.9.2.2 Certificate revocation below.
If Alice is very careful, she may click on the lock icon in her browser and see that the certificate authority
signing her connection to SBad is CA2 rather than CA1. But if Alice has a secure way of finding Bob's real
certificate authority, she might as well use it to find Bob's key KS. As she probably does not, this is of
limited utility in practice.
The second certificate authority CA2 might be a legitimate certificate authority that has been tricked, coerced
or bribed into signing Mallory's certificate. Alternatively, it might be Mallory's own creation, inserted by
Mallory into Alice's browser through some other vulnerability.
Mallory's machinations here do require the man-in-the-middle attack as well as the bad certificate. If Alice
is able to establish a direct connection with SBob, then the latter will send its true key KS signed by CA1.
As another attack, Mallory might obtain a certificate for b0b.int and hope Alice doesn't notice the spelling
difference between B0B and BOB. When this is done, Mallory often also sends Alice a disguised link to
b0b.int in the hope she will click on it. Unicode domain names make this even easier, as Unicode provides
many character pairs that are different but which look identical.
Extended-Validation certificates were introduced in 2007 as a way of providing greater assurances that the
certificate issued to bob.int was in fact generated by a request from Bob himself; Mallory should in theory
have a much harder time obtaining an EV certificate for bob.int from CA2 . Browsers that have secured a
TLS connection using an EV certificate typically add the name of the domain owner, highlighted in green
and/or with a green padlock icon, to the address bar. Financial institutions often use EV certificates. So does
mozilla.org.
21.9.2.2 Certificate revocation
Suppose the key KS underlying the certificate for bob.int has been compromised, and Mallory has the private
key. Then, if Mallory can trick Alice into connecting to SBad instead of SBob, the original CA1 will validate
the connection. SBad can prove to Alice that it has the secret key corresponding to KS, and all the certificate
does is to attest that bob.int has key KS. Mallory's deception works even if Bob noticed the compromise and
updated SBob's key to K2; the point is that Mallory still has the original key, and CA1's certificate attesting
to that original key is still valid.
Fortunately, there is a mechanism by which a certificate can be revoked. Revocation information, however,
must be kept at some central directory; a server can continue to serve up a revoked certificate and unless the
clients actively check, they will be none the wiser. This is one reason certificates have expiration dates.
At one time, for efficiency reasons browsers by default did not check for revoked certificates. Now mostly
they do.
The original revocation mechanism was the global certificate revocation list. A newer alternative is the
Online Certificate Status Protocol, OCSP, described in RFC 6960. If Alice receives a certificate signed by
CA1 , she can send the serial number of the certificate to a designated OCSP responder run by or on behalf
of CA1 ; this site will confirm the validity of the original certificate.
Of course, an eavesdropper watching Alice's traffic arriving at the OCSP responder now knows that Alice is
visiting bob.int. An eavesdropper closer to Alice, however, knows that anyway.
This is in the so-called PEM format, which means that the two lines in the middle are the base64 encoding
of the ASN.1 encoding (20.12 SNMP and ASN.1 Encoding) of the actual data. Despite the PRIVATE KEY
label, the file in fact contains both private and public keys. SSH private keys, typically generated with the
ssh-keygen command, are also in PEM format.
We can extract the PEM-file data with the next command:
openssl rsa -in key96.pem -text
The default OpenSSL encryption exponent, denoted e in 21.8.1 RSA, is 65537 = 2^16 + 1.
We next convert all these hex numbers to decimal; the corresponding notation of 21.8.1 RSA is in parentheses.
modulus (n)             52252327837124407964427358327
privateExponent (d)     48545702997494592199601992577
prime1 (p)              238813258387343
prime2 (q)              218799945153689
exponent1               201481036403145
exponent2               185951742453425
coefficient             42985747170220
We now verify some arithmetic, using any tool that supports large integers (eg python3, used here, or the
unix bc command). First we check n = pq:
>>> 238813258387343 * 218799945153689
52252327837124407964427358327
To encrypt a message m, we must use efficient mod-n calculations; here is an implementation of the repeated-squaring algorithm (mentioned above in 21.7.1 Fast Arithmetic) in python3. (This function is built into
python3 as pow(x,e,n).)
def power(x,e,n):       # computes x^e mod n
    pow = 1
    while e>0:
        if e%2 == 1: pow = pow*x % n
        x = x*x % n
        e = e//2        # // denotes integer division
    return pow
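As a quick check of the machinery, we can encrypt and decrypt using the numbers above; the string Rivest encoded as an integer is the message value that appears later in this section:

n = 52252327837124407964427358327
e = 65537
d = 48545702997494592199601992577

m = int.from_bytes(b'Rivest', 'big')    # 90612911403892
c = power(m, e, n)                      # encryption; same as pow(m, e, n)
print(power(c, d, n) == m)              # True: decryption recovers m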
What about the last three numbers in the PEM file, exponent1, exponent2 and coefficient? These
are pre-computed values to speed up decryption. Their values are
exponent1 = d mod (p-1)
exponent2 = d mod (q-1)
coefficient is the solution of coefficient×q = 1 mod p
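These are easy to verify against the decimal values in the table above (a python3 sketch; pow(q, -1, p) requires python 3.8 or later):

p = 238813258387343
q = 218799945153689
d = 48545702997494592199601992577

print(d % (p-1))        # should print exponent1
print(d % (q-1))        # should print exponent2
print(pow(q, -1, p))    # coefficient, ie the inverse of q mod p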
The factors are indeed the values of p and q, above. Factoring took 2.584 seconds on the author's laptop. Of
course, 96-bit RSA keys were never secure; recall that the current recommendation is to use 2048-bit keys.
The Gnu/Linux factor command uses Pollard's rho algorithm, and, while serviceable, is not especially
well suited to factoring the product of two large primes. The author was able to factor a 200-bit modulus in
just over 5 seconds using the msieve program, one of several large-number-factoring programs available on
the Internet.
We are almost done; we now need to find the decryption key d, knowing e, p-1 and q-1. For this we need an
implementation of the extended Euclidean algorithm; the following Python implementation is taken from
WikiBooks:
def egcd(a, b):
    if a == 0:
        return (b, 0, 1)
    else:
        g, y, x = egcd(b % a, a)
        return (g, x - (b // a) * y, y)
A call to egcd(a,b) returns a triple (g,x,y) where g is the greatest common divisor of a and b, and
x and y are solutions to g = ax + by. From 21.8.1 RSA, we need d to be positive and to satisfy 1 =
de + (p-1)(q-1)y. The x value (the second value) returned by egcd(e, (p-1)*(q-1)) satisfies the
second part, but it may be negative in which case we need to add (p-1)(q-1) to get a positive value which is
congruent mod (p-1)(q-1). This x value is -3706624839629358151621824719; after adding (p-1)(q-1) we
get d=48545702997494592199601992577.
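Putting the steps above together in python3:

e = 65537
p = 238813258387343
q = 218799945153689
phi = (p-1)*(q-1)

g, x, y = egcd(e, phi)     # g is 1; x may be negative
d = x % phi                # python's % yields the positive representative
print(d)                   # 48545702997494592199601992577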
This is the value of d we started with. If c is the ciphertext, we now calculate m = pow(c,d,n) as before,
yielding m=90612911403892, or, in hex, 52:69:76:65:73:74, Rivest.
21.10 Exercises
1. Modify the netsploit.c program of 21.2.2.3 The exploit so that the NOPslide is about 1024 bytes long,
and the NOPslide and shellcode are now above the overwritten return address in memory.
2. Disable ASLR on a linux system by writing the appropriate value to /proc/sys/kernel/randomize_va_space. Now get the netsploit attack to work without having the attacked server print out a stack address. You are allowed beforehand to run a test copy of the server that prints its address.
3. Below is a set of four possible TCP-segment reassembly rules. These rules (and more) appear in [SP03].
Recall that, once values for all positions j≤i are received, these values are acknowledged and released to the
application; there must be some earlier hole for segments to be subject to reassembly at all.
(b).
a a a a a
b b b b b
c c c c c c
d d d
e
4. Suppose a network intrusion-detection system is 10 hops from the attacker, but the target is 11 hops,
behind one more router R. Outline a strategy of tweaking the TTL field so that the NIDS receives
TCP stream "helpful" and the target receives "harmful". Segments should either be disjoint or cover
exactly the same range of bytes; you may assume that the target accepts the first segment for any given
range of bytes.
5. Suppose Alice encrypts blocks P1, P2 and P3 using CBC mode (21.6.4 Cipher Modes). The initialization
vector is C0. The encrypt and decrypt operations are E(P) = C and D(C) = P. We have
C1 = E(C0 XOR P1)
C2 = E(C1 XOR P2)
C3 = E(C2 XOR P3)
Suppose Mallory intercepts C2 in transit, and replaces it with C2′ = C2 XOR M; C1 and C3 are transmitted
normally; that is, Bob receives [C1′,C2′,C3′] where C1′ = C1 and C3′ = C3. Bob now attempts to decrypt;
C1′ decrypts normally to P1 while C2′ decrypts to gibberish.
Show that C3′ decrypts to P3 XOR M. (This is sometimes used in attacks to flip a specific bit or byte, in
cases where earlier gibberish does not matter.)
6. Suppose Alice uses a block-based stream cipher (21.6.6 Block-cipher-based stream ciphers); block i
of the keystream is Ki and Ci = Ki XOR Pi. Alice sends C1, C2 and C3 as in the previous exercise, and
Mallory again replaces C2 by C2 XOR M. What are the three plaintext blocks Bob deciphers?
7. Show that if p and q are primes with p = 2q + 1, then g is a primitive root mod p if g≠1, g^2≠1, and g^q≠1.
(This exercise requires familiarity with modular arithmetic and primitive roots.)
8. Suppose we have a short message m, eg a bank PIN number. Alice wants to send message M to Bob that,
without revealing m, can be used later to verify that Alice knew m.
(a). Suppose Alice simply sends M = hash(m). Explain how Bob can quickly recover m.
(b). How can Alice construct M using a secure-hash function, avoiding the problem of (a)? Hint: as part of
the later verification, Alice can supply additional information to Bob.
9. In the example of 21.6.8 Wi-Fi WEP Encryption Failure, suppose the IV is 4,-1,5 and the first two
bytes of the key are 10,20. What is the first keystream byte S[ S[1] + S[S[1]] ]?
10. Suppose Alice uses encryption exponent e=3 with three friends, Bob, Charlie and Deborah, with respective encryption moduli n_B, n_C and n_D, all of which are relatively prime. Alice sends message m to each,
encrypted as

    C_B = m^3 mod n_B
    C_C = m^3 mod n_C
    C_D = m^3 mod n_D

If Mallory intercepts all three encrypted messages, explain how he can efficiently decrypt m. Hint: the
Chinese Remainder Theorem implies that Mallory can find C < n_B·n_C·n_D such that

    C = C_B mod n_B
    C = C_C mod n_C
    C = C_D mod n_D

(One simple way to avoid this risk is for Alice to include a timestamp and the recipient's name in each
message, ensuring that she never sends exactly the same message twice.)
11. Repeat the key-creation of 21.9.3 Creating an RSA Key using a 110-bit key. Extract the modulus from
the key file, convert it to decimal, and attempt to factor it. Can you do the factoring in under a minute?
12. Below are a series of public RSA keys and encrypted messages; the encrypted message is c and the
modulus is n=pq. In each case, find the original message, using the methods of 21.9.3.1 Breaking the
key; you will have to factor n and then find d. For some keys, the Gnu/Linux factor command will be
sufficient; for the larger keys consider msieve or some other fast factorer.
Each number below is in decimal. The encryption exponent e is always 65537; the encryption is c =
power(message,e,n). Each message is an ASCII string; that is, after the numeric message is converted
to a string, the byte values are each in the range 32-127. The following Python function may be useful in
converting numeric messages to strings:
def int2ascii(n):
    if n==0: return ""
    return int2ascii(n // 256) + chr(n % 256)
n=26294146550372428669569992924076340150116542153388301312743129088600749884420889043685502979
BIBLIOGRAPHY
[AO96] Aleph One (Elias Levy), Smashing The Stack For Fun And Profit, Phrack vol 7 number 49, 1996,
available at http://insecure.org/stf/smashstack.html.
[AP99] Mark Allman and Vern Paxson, On Estimating End-to-End Network Path Properties, Proceedings
of ACM SIGCOMM 1999, August 1999.
[AHLR07] Chris Anley, John Heasman, Felix FX Linder and Gerardo Richarte, The Shellcoder's Handbook, second edition, Wiley, 2007.
[JA05] John Assalone, Exploiting the GDI+ JPEG COM Marker Integer Underflow Vulnerability, Global Information Assurance Certification technical note, January 2005, available at http://www.giac.org/paper/gcih/679/exploiting-gdi-plus-jpeg-marker-integer-underflowvulnerability/106878.
[PB62] Paul Baran, On Distributed Computing Networks, Rand Corporation Technical Report P-2626,
1962.
[MB06] Mihir Bellare, New Proofs for NMAC and HMAC: Security without Collision-Resistance, Advances in Cryptology - CRYPTO 06 Proceedings, Lecture Notes in Computer Science volume 4117,
Springer-Verlag, 2006.
[BCK96] Mihir Bellare, Ran Canetti and Hugo Krawczyk, Keying Hash Functions for Message Authentication, Advances in Cryptology - CRYPTO 96 Proceedings, Lecture Notes in Computer Science
volume 1109, Springer-Verlag, 1996.
[BN00] Mihir Bellare and Chanathip Namprempre, Authenticated Encryption: Relations among notions
and analysis of the generic composition paradigm, Advances in Cryptology ASIACRYPT 2000 /
Lecture Notes in Computer Science volume 1976, Springer-Verlag, 2000; updated version July 2007.
[BZ97] Jon Bennett and Hui Zhang, Hierarchical Packet Fair Queueing Algorithms, IEEE/ACM Transactions on Networking, vol 5, October 1997.
[DB08] Daniel Bernstein, The Salsa20 family of stream ciphers, Chapter, New Stream Cipher Designs,
Matthew Robshaw and Olivier Billet, editors, Springer-Verlag, 2008.
[JB05] John Bickett, Bit-rate Selection in Wireless Networks, MS Thesis, Massachusetts Institute of
Technology, 2005.
[BP95] Lawrence Brakmo and Larry Peterson, TCP Vegas: End to End Congestion Avoidance on a Global
Internet, IEEE Journal on Selected Areas in Communications, vol 13, no 8, 1995.
[FMS01] Scott Fluhrer, Itsik Mantin and Adi Shamir, Weaknesses in the Key Scheduling Algorithm of
RC4, SAC 01 Revised Papers from the 8th Annual International Workshop on Selected Areas in
Cryptography, Springer-Verlag, 2001.
[FL03] Cheng Fu and Soung Liew, TCP Veno: TCP Enhancement for Transmission over Wireless Access
Networks, IEEE Journal on Selected Areas in Communications, vol 21 number 2, February 2003.
[LG01] Lixin Gao, On Inferring Autonomous System Relationships in the Internet, IEEE/ACM Transactions on Networking, vol 9, December 2001.
[GR01] Lixin Gao and Jennifer Rexford, Stable Internet Routing without Global Coordination,
IEEE/ACM Transactions on Networking, vol 9, December 2001.
[JG93] Jose J Garcia-Lunes-Aceves, Loop-Free Routing Using Diffusing Computations, IEEE/ACM
Transactions on Networking, vol 1, February 1993.
[GP11] L Gharai and C Perkins, RTP with TCP Friendly Rate Control, Internet Draft, http://tools.ietf.org/html/draft-gharai-avtcore-rtp-tfrc-00.
[GV02] Sergey Gorinsky and Harrick Vin, Extended Analysis of Binary Adjustment Algorithms, Technical Report TR2002-39, Department of Computer Sciences, University of Texas at Austin, 2002.
[GM03] Luigi Grieco and Saverio Mascolo, End-to-End Bandwidth Estimation for Congestion Control in
Packet Networks, Proceedings of the Second International Workshop on Quality of Service in Multiservice IP Networks, 2003.
[GM04] Luigi Grieco and Saverio Mascolo, Performance Evaluation and Comparison of Westwood+, New
Reno, and Vegas TCP Congestion Control, ACM SIGCOMM Computer Communication Review, vol
34 number 2, April 2004.
[HRX08] Sangtae Ha, Injong Rhee and Lisong Xu, CUBIC: A New TCP-Friendly High-Speed TCP Variant, ACM SIGOPS Operating Systems Review - Research and developments in the Linux kernel, vol
42 number 5, July 2008.
[SH04] Steve Hanna, Shellcoding for Linux and Windows Tutorial, http://www.vividmachines.com/shellcode/shellcode.html, July 2004.
[MH04] Martin Hellman, Oral history interview with Martin Hellman, Charles Babbage Institute, 2004.
Retrieved from the University of Minnesota Digital Conservancy, http://purl.umn.edu/107353.
[JH96] Janey Hoe, Improving the Start-up Behavior of a Congestion Control Scheme for TCP, ACM
SIGCOMM Symposium on Communications Architectures and Protocols, August 1996.
[HVB01] Gavin Holland, Nitin Vaidya and Paramvir Bahl, A rate-adaptive MAC protocol for multi-Hop
wireless networks, MobiCon 01: Proceedings of the 7th annual International Conference on Mobile
Computing and Networking, 2001.
[CH99] Christian Huitema, Routing in the Internet, second edition, Prentice Hall, 1999.
[HBT99] Paul Hurley, Jean-Yves Le Boudec and Patrick Thiran, A Note on the Fairness of Additive Increase and Multiplicative Decrease, Proceedings of ITC-16, 1999.
[JK88] Van Jacobson and Michael Karels, Congestion Avoidance and Control, Proceedings of the Sigcomm 88 Symposium, vol 18(4), 1988.
[JWL04] Cheng Jin, David Wei and Steven Low, FAST TCP: Motivation, Architecture, Algorithms, Performance, IEEE INFOCOM, 2004.
[KM97] Ad Kamerman and Leo Monteban, WaveLAN-II: A high-performance wireless LAN for the unlicensed band, AT&T Bell Laboratories Technical Journal, vol 2 number 3, 1997.
[SK88] Srinivasan Keshav, REAL: A Network Simulator (Technical Report), University of California at
Berkeley, 1988.
[KKCQ06] Jongseok Kim, Seongkwan Kim, Sunghyun Choi and Daji Qiao, CARA: Collision- Aware
Rate Adaptation for IEEE 802.11 WLANs, IEEE INFOCOM 2006 Proceedings, April 2006.
[LH06] Mathieu Lacage and Thomas Henderson, Yet Another Network Simulator, Proceedings of WNS2
06: Workshop on ns-2: the IP network simulator, 2006.
[LM91] Xuejia Lai and James L. Massey, A Proposal for a New Block Encryption Standard, EUROCRYPT 90 Proceedings of the workshop on the theory and application of cryptographic techniques on
Advances in cryptology, Springer-Verlag, 1991.
[LKCT96] Eliot Lear, Jennifer Katinsky, Jeff Coffin and Diane Tharp, Renumbering: Threat or Menace?,
Tenth USENIX System Administration Conference, Chicago, 1996.
[LSL05] DJ Leith, RN Shorten and Y Lee, H-TCP: A framework for congestion control in high-speed and
long-distance networks, Hamilton Institute Technical Report, August 2005.
[LSM07] DJ Leith, RN Shorten and G McCullagh, Experimental evaluation of Cubic-TCP, Extended
version of paper presented at Proc. Protocols for Fast Long Distance Networks, 2007.
[LBS06] Shao Liu, Tamer Basar and R Srikant, TCP-Illinois: A Loss and Delay-Based Congestion Control
Algorithm for High-Speed Networks, Proceedings of the 1st international conference on Performance
evaluation methodologies and tools, 2006.
[AM90] Allison Mankin, Random Drop Congestion Control, ACM SIGCOMM Symposium on Communications Architectures and Protocols, 1990.
[MCGSW01] Saverio Mascolo, Claudio Casetti, Mario Gerla, MY Sanadidi, Ren Wang, TCP westwood:
Bandwidth estimation for enhanced transport over wireless links, MobiCon 01: Proceedings of the
7th annual International Conference on Mobile Computing and Networking, 2001.
[McK90] Paul McKenney, Stochastic Fairness Queuing, IEEE INFOCOM 90 Proceedings, June 1990.
[RM78] Ralph Merkle, Secure Communications over Insecure Channels, Communications of the ACM,
volume 21, April 1978.
[RM88] Ralph Merkle, A Digital Signature Based on a Conventional Encryption Function, Advances in
Cryptology CRYPTO 87, Lecture Notes in Computer Science volume 293, Springer-Verlag, 1988.
[MH81] Ralph Merkle and Martin Hellman, On the Security of Multiple Encryption, Communications of
the ACM, volume 24, July 1981.
[MB76] Robert Metcalfe and David Boggs, Ethernet: Distributed Packet Switching for Local Computer
Networks, Communications of the ACM, vol 19 number 7, 1976.
[MW00] Jeonghoon Mo and Jean Walrand, Fair End-to-End Window-Based Congestion Control,
IEEE/ACM Transactions on Networking, vol 8 number 5, October 2000.
[JM92] Jeffrey Mogul, Observing TCP Dynamics in Real Networks, ACM SIGCOMM Symposium on
Communications Architectures and Protocols, 1992.
[MM94] Mart Molle, A New Binary Logarithmic Arbitration Method for Ethernet, Technical Report
CSRI-298, Computer Systems Research Institute, University of Toronto, 1994.
[RTM85] Robert T Morris, A Weakness in the 4.2BSD Unix TCP/IP Software, AT&T Bell Laboratories
Technical Report, February 1985
[OKM96] Teunis Ott, JHB Kemperman and Matt Mathis, The stationary behavior of ideal TCP congestion
avoidance, 1996.
[PFTK98] Jitendra Padhye, Victor Firoiu, Don Towsley and Jim Kurose, Modeling TCP Throughput: A
Simple Model and its Empirical Validation, ACM SIGCOMM conference on Applications, technologies, architectures, and protocols for computer communication, 1998.
[PG93] Abhay Parekh and Robert Gallager, A Generalized Processor Sharing Approach to Flow Control
in Integrated Services Networks - The Single-Node Case, IEEE/ACM Transactions on Networking,
vol 1 number 3, June 1993.
[PG94] Abhay Parekh and Robert Gallager, A Generalized Processor Sharing Approach to Flow Control
in Integrated Services Networks - The Multiple Node Case, IEEE/ACM Transactions on Networking,
vol 2 number 2, April 1994.
[VP97] Vern Paxson, End-to-End Internet Packet Dynamics, ACM SIGCOMM conference on Applications, technologies, architectures, and protocols for computer communication, 1997.
[CP09] Colin Percival, Stronger Key Derivation Via Sequential Memory-Hard Functions, BSDCan - The
Technical BSD Conference, May 2009.
[PB94] Charles Perkins and Pravin Bhagwat, Highly Dynamic Destination-Sequenced Distance-Vector
Routing (DSDV) for Mobile Computers, ACM SIGCOMM Computer Communications Review, vol
24 number 4, October 1994.
[PR99] Charles Perkins and Elizabeth Royer, Ad-hoc On-Demand Distance Vector Routing, Proceedings
of the Second IEEE Workshop on Mobile Computing Systems and Applications, February 1999.
[RP85] Radia Perlman, An Algorithm for Distributed Computation of a Spanning Tree in an Extended
LAN, ACM SIGCOMM Computer Communication Review 15(4), 1985.
[PN98] Thomas Ptacek and Timothy Newsham, Insertion, Evasion, and Denial of Service: Eluding Network Intrusion Detection, Technical report, Secure Networks Inc, January 1998.
[RJ90] Kadangode Ramakrishnan and Raj Jain, A Binary Feedback Scheme for Congestion Avoidance in
Computer Networks, ACM Transactions on Computer Systems, vol 8 number 2, May 1990.
[RX05] Injong Rhee and Lisong Xu, Cubic: A new TCP-friendly high-speed TCP variant, 3rd International Workshop on Protocols for Fast Long-Distance Networks, February 2005.
[RR91] Ronald Rivest, The MD4 message digest algorithm, Advances in Cryptology - CRYPTO 90
Proceedings, Springer-Verlag, 1991.
[RSA78] Ronald Rivest, Adi Shamir and Leonard Adelman, A Method for Obtaining Digital Signatures
and Public-Key Cryptosystems, Communications of the ACM, volume 21, February 1978.
[SRC84] Jerome Saltzer, David Reed and David Clark, End-to-End Arguments in System Design, ACM
Transactions on Computer Systems, vol 2 number 4, November 1984.
[BS93] Bruce Schneier, Description of a New Variable-Length Key, 64-Bit Block Cipher (Blowfish), Fast
Software Encryption, Cambridge Security Workshop Proceedings (December 1993), Springer-Verlag,
1994.
[SM90] Nachum Shacham and Paul McKenney, Packet recovery in high-speed networks using coding and
buffer management, IEEE INFOCOM 90 Proceedings, June 1990.
[SP03] Umesh Shankar and Vern Paxson, Active Mapping: Resisting NIDS Evasion without Altering
Traffic, Proceedings of the 2003 IEEE Symposium on Security and Privacy, 2003.
[SV96] M Shreedhar and George Varghese, Efficient Fair Queuing Using Deficit Round Robin,
IEEE/ACM Transactions on Networking, vol 4 number 3, June 1996.
[TWHL05] Ao Tang, Jintao Wang, Sanjay Hegde and Steven Low, Equilibrium and Fairness of Networks
Shared by TCP Reno and Vegas/FAST, Telecommunications Systems special issue on High Speed
Transport Protocols, 2005.
[TWP07] Erik Tews, Ralf-Philipp Weinmann and Andrei Pyshkin, Breaking 104-bit WEP in less than 60
seconds, WISA07 Proceedings of the 8th International Conference on Information Security Applications, Springer-Verlag, 2007.
[VGE00] Kannan Varadhan, Ramesh Govindan and Deborah Estrin, Persistent Route Oscillations in Interdomain Routing, Computer Networks, vol 32, January, 2000.
[SV02] Serge Vaudenay, Security Flaws Induced by CBC Padding Applications to SSL, IPSEC,
WTLS..., EUROCRYPT 02 Proceedings, 2002.
[WJLH06] David Wei, Cheng Jin, Steven Low and Sanjay Hegde, FAST TCP: Motivation, Architecture,
Algorithms, Performance, ACM Transactions on Networking, December 2006.
[LZ89] Lixia Zhang, A New Architecture for Packet Switching Network Protocols, PhD Thesis, Massachusetts Institute of Technology, 1989.
[ZSC91] Lixia Zhang, Scott Shenker and David Clark, Observations on the Dynamics of a Congestion
Control Algorithm: The Effects of Two-Way Traffic, ACM SIGCOMM Symposium on Communications Architectures and Protocols, 1991.
INDEX
Symbols
2-D parity, 104
2.4 GHz, 57
3DES, 567
4B/5B, 87
4G, 68
5 GHz, 57
802.11, 57
802.16, 68
802.1Q, 51
802.1X, IEEE, 64
A
accelerated open, TCP, 256
access point, Wi-Fi, 62
accurate costs, 176
ACD, IPv4, 141
ACK compression, 331
ACK, TCP, 239
acknowledgment, 22
acknowledgment number, TCP, 238
ACKs of unsent data, TCP, 251
ACK[N], 109
active close, TCP, 250
active queue management, 311
active subqueue, 419
ad hoc configuration, Wi-Fi, 62
ad hoc wireless network, 66
additive increase, multiplicative decrease, 266
address, 8
address randomization, 554
Address Resolution Protocol, 140
Administratively Prohibited, 146
admission control, RSVP, 467
advertised window size, 257
AES, 567
AF drop precedence, 471
AfriNIC, 21
agent configuration, SNMP, 509
agent, SNMP, 485
AIMD, 266, 309
algorithm, distance-vector, 169
algorithm, DSDV, 178
algorithm, EIGRP, 179
algorithm, exponential backoff, 38
algorithm, fair queuing bit-by-bit round-robin, 421
algorithm, fair queuing GPS, 423
algorithm, fair queuing, quantum, 429
algorithm, hierarchical weighted fair queuing, 436
algorithm, Karn/Partridge, 259
algorithm, link-state, 180
algorithm, loop-free distance vector, 177
algorithm, Nagle, 258
algorithm, Shortest-Path First, 181
algorithm, spanning-tree, 47
Alice, 565
all-nodes multicast address, 156
all-routers multicast address, 156
ALOHA, 41
AMI, 89
anycast address, 155
Aodh, 565
AODV, 178, 413
APNIC, 21
ARC4, 569
ARCFOUR, 569
architecture, network, 484
ARIN, 21
arithmetic, fast, 574
ARP, 140
ARP cache, 141
ARP failover, 143
ARP spoofing, 142
ARPANET, 27
B
B8ZS, 89
backbone, 20
backoff, Ethernet, 38
backup link, BGP, 205
backwards compatibility, TCP, 324
bad news, distance-vector, 170
band width, radio, 57
bandwidth, 7
bandwidth delay, 98, 114
bandwidth delay, 95
bandwidth guarantees, 445
base station, WiMAX, 68
basic encoding rules, 511
BBRR, 421
beacon, Wi-Fi, 63
BER, 511
Berkeley Unix, 28
best-effort, 17, 22
best-path selection, BGP, 200
BGP, 197
BGP relationships, 207
BGP speaker, 197
big-endian, 9
binary data, 222
bind(), 218
bit stuffing, 88
bit-by-bit round robin, 421
BLAM, 40
Blowfish, 567
Bob, 565
border gateway protocol, 197
border routers, 177
bottleneck link, 116, 117, 290
BPDU message, spanning tree, 47
bps, 7, 14
broadcast IP address, 131
broadcast, Ethernet, 15
BSD, 28
buffer overflow, 24
buffer overflow, heap, 556
byte stuffing, 88
C
CA, 581
Canopy, 72
capture effect, Ethernet, 39
care-of address, 149
carrier Ethernet, 56
CBC mode, 568
cell-loss priority bit, ATM, 78
certificate authorities, 577
certificate authority, 581
certificate revocation, 583
certificate revocation list, 583
CFB mode, 570
channel, Wi-Fi, 57
Christmas day attack, 255
CIDR, 189
cipher feedback mode, 570
cipher modes, 567
Cisco, 51, 179, 204, 281
class A/B/C addressing, 16
Class Selector PHB, 468
class, queuing discipline, 417
classful queuing discipline, 417
Classless Internet Domain Routing, 189
clear-to-send, Wi-Fi, 60
client, 23
cliff, 12, 265
CLNP, 27
clock recovery, 85
clock synchronization, 85
CMIP, 484
CMNS, 27
CMOT, 484
collision, 14, 33
D
DAD, IPv6, 161
DALLY, TFTP, 227
data rate, 7
data types, SNMPv1, 489
datagram forwarding, 8
Data[N], 109
E
EAP, 64
F
fair queuing, 418
fair queuing and AF, 471
fair queuing and EF, 470
fairness, 24
fairness, TCP, 290, 297, 304, 324
fast arithmetic, 574
Fast Ethernet, 43
G
generalized processor sharing, 423
generic hierarchical queuing, 432
geographical routing, 196
getAllByName(), java, 219
GetBulk, SNMPv2, 514
getByName(), java, 221
gigabit Ethernet, 44
glibc-2.2.4, 556
global scope, IPv6 addresses, 164
goodput, 7, 349
GPS, fair queuing, 423
granularity, loss counting, 376
gratuitous ARP, 141
greediness in TCP, 301
H
H-TCP, 335
half-closed, TCP, 250
Hamilton TCP, 335
Hamming code, 104
Hash Message Authentication Code, 563
hash, password, 564
HDLC, 88
head-of-line blocking, 24, 216, 460
header, 8
header, Ethernet, 34
header, IPv4, 127
header, IPv6, 153
header, TCP, 238
header, UDP, 215
heap buffer overflow, 555
heap vulnerability, 556
Hello, spanning-tree, 47
henhouse, 263
hidden-node collisions, 60
hidden-node problem, 60
hierarchical routing, 135, 189, 191
hierarchical token bucket, 448
hierarchical token bucket, linux, 450
high-bandwidth TCP problem, 313
Highspeed TCP, 325
HMAC, 536, 563
hold down, 175
home address, 149
home agent, 149
host identifier, ipv6, 155
host key, ssh, 579
Host Top N, RMON, 529
Host Unreachable, 146
host-specific forwarding, 56
hot-potato routing, 195
htb, linux, 450
htonl, 224
htons, 224
https, 578
hub, Ethernet, 34
hulu, 459
Hybla, TCP, 333
I
IAB, 26
IANA, 17, 21, 506
ICMP, 145
ICMPv6, 164
IDEA, 567
idempotent, 232
IDENT field, 132
IEEE, 33, 36
IEEE 802.11, 57
IEEE 802.1Q, 51
IEEE 802.1X, 64
IETF, 26
ifDescr, 505
ifIndex, 504
IFS, Wi-Fi, 59
ifType, SNMP, 505
ifXTable, SNMP, 519
Illinois, TCP, 333
implementations, at least two, 28
import filtering, BGP, 200
incarnation, connection, 224
inflation, cwnd, 276
infrastructure configuration, Wi-Fi, 62
initial sequence number, TCP, 239, 242, 253
initialization vector, 568
instability, BGP, 208
integrated services, 461
integrated services, generic, 457
interface, 129
interface table, extended, SNMP, 519
interior routing, 169
Internet Architecture Board, 26
Internet checksum, 101
Internet Engineering Task Force, 26
Internet exchange point, 194
Internet Society, 26
intrusion detection, 560
IntServ, 461
IP, 7, 16
IP forwarding, 18
IP fragmentation, 131
IP multicast, 462
J
jail, staying out of, 194
java getAllByName(), 219
java getByName(), 221
javascript, 559
jitter, 22, 459, 477
join, multicast, 464
JPEG heap vulnerability, 558
JSON, 224
jumbogram, IPv6, 157
K
Karn/Partridge algorithm, 259
KeepAlive, TCP, 259
key-scheduling algorithm, RC4, 569
key-signing parties, 577
keystream, 568
kings, 27
knee, 12, 265, 326
known_hosts, ssh, 579
L
LACNIC, 21
ladder diagram, 97
LAN, 7, 14
LAN layer, 36
LARTC, 183
layers, 7
M
MAC address, 15
MAC layer, 36
MAC-then-encrypt, 570
MAE, 194
MAE-East, 194
man-in-the-middle attack, 574, 577
managed device, SNMP, 485
manager, SNMP, 485
Manchester encoding, 86
MANET, 66
N
Nagle algorithm, 258
NAT, 25
NAT, IPv6-to-IPv4, 167
Neighbor Advertisement, IPv6, 160
Neighbor Discovery, IPv6, 159
Neighbor Solicitation, IPv6, 160
net neutrality, 458
Net-SNMP, 485, 489, 500, 502, 510
Net-SNMP and SNMPv3, 539
network address, 17
network address translation, 25
network architecture, 484
network entry, WiMAX, 69
Network File Sharing, 232
network interface, Ethernet, 15
network management, 416
Network Management System, 484
network model
five layer, 7
four layer, 7
seven layer, 27
network number, 17
network prefix, 17
network prefix, IPv6, 158
Network Unreachable, 146
NewReno, TCP, 276
next_hop, 9
NEXT_HOP attribute, BGP, 202
NFS, 232
NMS, 484
no-transit, BGP, 205
no-valley theorem, BGP, 208
non-compliant, token-bucket, 440
non-congestive loss, 314
non-executable stack, 555
nonpersistence, 40
NOPslide, 553
noSuchObject, SNMP, 514
NRZ, 85
NRZI, 86
ns-2 trace file, 352
ns-2 tracefiles, reading with python, 355
NSFNet, 27
NSIS, 472
ntohl, 224
ntohs, 224
NX page bit, 555
O
Object ID, SNMP, 486
OBJECT-IDENTITY, SNMP, 514
OBJECT-TYPE, SNMP, 490
OC-3, 92
OCSP, 583
OID, SNMP, 486
old duplicate packets, 224
old duplicates, TCP, 251, 252
OLSR, 413
one-time pad, 568
ones-complement, 101
OpenBSD, 555
optimistic DAD, 161
opus, 459
OSI, 26
OSPF, 180
overhead, ns-2, 369
P
packet loss rate, 305
packet pairs, 296
packet size, 99
Parekh-Gallager claim, 428
Parekh-Gallager theorem, 451
parking-lot topology, 304
partial ACKs, 276
passive close, TCP, 250
password hash, 564
password sniffing, 15, 143
path attributes, BGP, 202
path bandwidth, 117
Path MTU Discovery, 133, 146
path MTU discovery, TCP, 256
PATH packet, RSVP, 466
PAWS, TCP, 255
PCF, Wi-Fi, 65
peer, BGP, 207
per-hop behaviors, DiffServ, 468
perfect forward secrecy, 576
persist timer, TCP, 258
persistence, 40
phase effects, TCP, 366
PHBs, DiffServ, 468
physical address, Ethernet, 15
physical layer, 7
PIFS, Wi-Fi, 66
PIM-SM, 463
ping, 146
ping6, 165
pipe drain, 116
pipelining, SMTP, 105
playback buffer, 459
point-to-point protocol, 88
poison reverse, 175
policing, 440
policy routing, 197, 200
Q
QoS, 457
QoS and routing, 169
quality of service, 10, 18, 169, 457
quantum algorithm, fair queuing, 429
queue capacity, typical, 281
queue overflow, 18
queue utilization, token bucket, 445
queue-competition rule, 291
queuing delay, 95
queuing discipline, 417
R
RADIUS, 64
radvd, 159
random drop, 289
Random Early Detection, 312
ranging intervals, WiMAX, 69
ranging, WiMAX, 69
rate control, Wi-Fi, 61
rate scaling, Wi-Fi, 61
rate-adaptive traffic, 459
RC4, 569
Real-Time Protocol, 308
real-time traffic, 457, 459
Real-time Transport Protocol, 24
reassembly, 17
reboots, 225
Record Route, 129
RED, 312
regional registry, 21
registry, regional, 21
reliable, 237
reliable flooding, 180
Remote Procedure Call, 230
rendezvous point, multicast, 464
Reno, 28, 265
repeater, Ethernet, 34
request for comment, 26
request-to-send, Wi-Fi, 60
request/reply, 230, 237
reservations, 18
reset, TCP, 239
RESV packet, RSVP, 466
retransmit-on-duplicate, 111
retransmit-on-timeout, 109
return-to-libc attack, 548
RFC, 26
RFC 1065, 486
RFC 1066, 501
RFC 1122, 10, 128, 142, 215, 253, 257, 260
RFC 1123, 227
RFC 1155, 486–489, 491
RFC 1213, 486, 488, 501–503, 505, 507, 508, 522–524
RFC 1271, 524, 527
RPC, 230
RSA, 575
RST, 239
RSVP, 461, 466
RTCP, 477
RTCP measurement of, 477
RTO, TCP, 259
RTP, 24, 308
RTP and VoIP, 476
RTP mixer, 475
RTS, Wi-Fi, 60
RTT, 97
RTT bias in TCP, 301
RTT inflation, 119
RTT-noLoad, 98
S
SACK TCP, 279
SACK TCP in ns-2, 381
Salsa20, 566
satellite Internet, 73
satellite-link TCP problem, 315
sawtooth, TCP, 266, 270, 314, 351, 377, 388
scalable routing, 127
scheduled transmission, WiMAX, 69
search warrant, 578
secure hash functions, 562
secure shell, 579
security, 543
segment, 8
segmentation, 17
segments, TCP, 239
Selective ACKs, TCP, 279
self-ARP, 141
self-clocking, 114
sequence number, TCP, 238
serialization, RPC, 233
server, 23
session key, 565
session layer, 27
SHA-1, 562
SHA-2, 562
shaping, 440
shared-key ciphers, 564
shellcode, 550
Shortest-Path First algorithm, 181
SHOULD, 27
sibling, BGP, 207
SIFS, Wi-Fi, 59
signaling losses, 312
signatures, 561
Simple Network Management Protocol, 484
simplex talk, UDP, 216
simplex-talk, TCP, 244
simultaneous open, TCP, 250
single link-state, 20
singlebell network topology, 292
site-local IPv6 address, 158
size, packet, 99
SLAAC, 160, 161
SLAAC privacy extensions, 162
sliding windows, 23, 113
sliding windows, TCP, 257
slot time, Wi-Fi, 59
slow convergence, 174
small-packet priority, WFQ, 425
SMI, 491
SMTP, 105, 584
SNMP, 484
SNMP agent configuration, 509
SNMP agents and managers, 485
SNMP enumerated type, 505
SNMP versions, 486
SNMPv1 data types, 489
SNMPv3 engines, 534
socket, 23
socket address, 23
soft state, 461
SONET, 91
Sorcerer's Apprentice bug, 112
Source Quench, 146, 312
source-specific multicast tree, 465
spanning-tree algorithm, 47
sparse-mode multicast, 463
spatial streams, Wi-Fi, 58
speex, 459
split horizon, 174
SQL injection, 560
SSH, 143
ssh, 578, 579
ssh host key, 579
SSID, Wi-Fi, 63
ssl, 578
SSRC, 475
stack canary, 554
STARTTLS, 584
T
T/TCP, 256
T1 line, 90
T3 line, 90
tables, SNMP, 491
Tahoe, 28, 265
tail drop, 289
TCP, 7, 235
TCP accelerated open, 256
TCP checksum offloading, 244
TCP Cubic, 335
TCP fairness, 297, 304
TCP Fast Open, 256
TCP Friendliness, 307
TCP Hamilton, 335
TCP header, 238
TCP Hybla, 333
V
VACM, 510
VarBind list, 495
VCI, 75
video, streaming, 460
virtual circuit, 8, 17, 74
Virtual LANs, 51
virtual link, 55
virtual private network, 55
virtual tributary, 92
VLANs, 51
voice over IP, 18
VoIP, 18
W
W^X, 555
weighted fair queuing, 418
WEP encryption failure, 546
WEP, Wi-Fi, 64
Westwood, TCP, 330
Wi-Fi, 57
Wi-Fi fragmentation, 61
Wi-Fi polling mode, 65
WiMAX, 68
window, 114
window scale option, TCP, 257
window size, 23, 113
Windows, 129, 251
Windows XP SP1 vulnerability, 558
winsize, 113
wireless, 57
wireless, fixed, 70
wireless, satellite, 73
wireless, terrestrial, 70
WireShark, 405
WireShark, TCP example, 243
work-conserving queuing, 418
WPA authenticator, 64
WPA supplicant, 64
WPA, Wi-Fi, 64
WPA-Enterprise, 64
WPA-Personal, 64
write-or-execute, 555