
AN INTRODUCTION TO COMPUTER NETWORKS

Peter Lars Dordal


Loyola University of Chicago


This text is disseminated via the Open Education Resource (OER) LibreTexts Project (https://LibreTexts.org) and, like the hundreds
of other texts available within this powerful platform, it is freely available for reading, printing and "consuming." Most, but not all,
pages in the library have licenses that may allow individuals to make changes, save, and print this book. Carefully
consult the applicable license(s) before pursuing such efforts.
Instructors can adopt existing LibreTexts texts or Remix them to quickly build course-specific resources to meet the needs of their
students. Unlike traditional textbooks, LibreTexts' web-based origins allow powerful integration of advanced features and new
technologies to support learning.

The LibreTexts mission is to unite students, faculty and scholars in a cooperative effort to develop an easy-to-use online platform
for the construction, customization, and dissemination of OER content to reduce the burdens of unreasonable textbook costs to our
students and society. The LibreTexts project is a multi-institutional collaborative venture to develop the next generation of open-
access texts to improve postsecondary education at all levels of higher learning by developing an Open Access Resource
environment. The project currently consists of 14 independently operating and interconnected libraries that are constantly being
optimized by students, faculty, and outside experts to supplant conventional paper-based books. These free textbook alternatives are
organized within a central environment that is both vertically (from advanced to basic level) and horizontally (across different fields)
integrated.
The LibreTexts libraries are Powered by NICE CXOne and are supported by the Department of Education Open Textbook Pilot
Project, the UC Davis Office of the Provost, the UC Davis Library, the California State University Affordable Learning Solutions
Program, and Merlot. This material is based upon work supported by the National Science Foundation under Grant Nos. 1246120,
1525057, and 1413739.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not
necessarily reflect the views of the National Science Foundation or the US Department of Education.
Have questions or comments? For information about adoptions or adaptations contact [email protected]. More information on our
activities can be found via Facebook (https://facebook.com/Libretexts), Twitter (https://twitter.com/libretexts), or our blog
(http://Blog.Libretexts.org).
This text was compiled on 11/27/2024
TABLE OF CONTENTS
Preface
Licensing

1: An Overview of Networks
1.1: Layers
1.2: Data Rate, Throughput and Bandwidth
1.3: Packets
1.4: Datagram Forwarding
1.5: Topology
1.6: Routing Loops
1.7: Congestion
1.8: Packets Again
1.9: LANs and Ethernet
1.10: IP - Internet Protocol
1.11: DNS
1.12: Transport
1.13: Firewalls
1.14: Some Useful Utilities
1.15: IETF and OSI
1.16: Berkeley Unix
1.E: An Overview of Networks (Exercises)
Index

2: Ethernet
2.1: Prelude to Ethernet
2.2: 10-Mbps Classic Ethernet
2.3: 100 Mbps (Fast) Ethernet
2.4: Gigabit Ethernet
2.5: Ethernet Switches
2.6: Spanning Tree Algorithm and Redundancy
2.7: Virtual LAN (VLAN)
2.8: TRILL and SPB
2.9: Software-Defined Networking
2.E: Ethernet (Exercises)
Index

3: Other LANs
3.1: Virtual Private Networks
3.2: Carrier Ethernet
3.3: Token Ring
3.4: Virtual Circuits
3.5: Asynchronous Transfer Mode - ATM
3.6: Adventures in Radioland
3.7: Wi-Fi
3.8: WiMAX and LTE
3.9: Fixed Wireless
3.10: Epilog and Exercises
Index

4: Links
4.1: Encoding and Framing
4.2: Time-Division Multiplexing
4.E: Links (Exercises)
Index

5: Packets
6: Abstract Sliding Windows
6.1: Building Reliable Transport - Stop-and-Wait
6.2: Sliding Windows
6.3: Linear Bottlenecks
6.4: Epilog and Exercises
Index

7: IP version 4
7.1: Prelude to IP version 4
7.2: The IPv4 Header
7.3: Interfaces
7.4: Special Addresses
7.5: Fragmentation
7.6: The Classless IP Delivery Algorithm
7.7: IPv4 Subnets
7.8: Network Address Translation
7.9: DNS
7.10: Address Resolution Protocol - ARP
7.11: Dynamic Host Configuration Protocol (DHCP)
7.12: Internet Control Message Protocol
7.13: Unnumbered Interfaces
7.14: Mobile IP
7.15: Epilog and Exercises
Index

8: IP version 6
8.1: Prelude to IP version 6
8.2: The IPv6 Header
8.3: IPv6 Addresses
8.4: Network Prefixes
8.5: IPv6 Multicast
8.6: IPv6 Extension Headers
8.7: Neighbor Discovery
8.8: IPv6 Host Address Assignment
8.9: Globally Exposed Addresses
8.10: ICMPv6
8.11: IPv6 Subnets
8.12: Using IPv6 and IPv4 Together
8.13: IPv6 Examples Without a Router
8.14: IPv6 Connectivity via Tunneling
8.15: IPv6-to-IPv4 Connectivity
8.16: Epilog and Exercises
Index

9: Routing-Update Algorithms
9.1: Prelude to Routing-Update Algorithms
9.2: Distance-Vector Routing-Update Algorithm
9.3: Distance-Vector Slow-Convergence Problem
9.4: Observations on Minimizing Route Cost
9.5: Loop-Free Distance Vector Algorithms
9.6: Link-State Routing-Update Algorithm
9.7: Routing on Other Attributes
9.8: ECMP
9.9: Epilog and Exercises
Index

10: Large-Scale IP Routing


10.1: Classless Internet Domain Routing - CIDR
10.2: Hierarchical Routing
10.3: Legacy Routing
10.4: Provider-Based Routing
10.5: Geographical Routing
10.6: Border Gateway Protocol, BGP
10.7: Epilog and Exercises
Index

11: UDP Transport

12: TCP Transport

13: TCP Reno and Congestion Management

14: Dynamics of TCP


15: Newer TCP Implementations

16: Network Simulations - ns-2

17: The ns-3 Network Simulator

18: Mininet

19: Queuing and Scheduling

20: Quality of Service

21: Network Management and SNMP

22: Security

23: Selected Solutions


Bibliography

Index

Glossary
Detailed Licensing

Licensing
A detailed breakdown of this resource's licensing can be found in Back Matter/Detailed Licensing.

CHAPTER OVERVIEW

1: An Overview of Networks
Somewhere there might be a field of interest in which the order of presentation of topics is well agreed upon. Computer
networking is not it.
There are many interconnections in the field of networking, as in most technical fields, and it is difficult to find an order of
presentation that does not involve endless “forward references” to future chapters; this is true even if – as is done here – a largely
bottom-up ordering is followed. I have therefore taken here a different approach: this first chapter is a summary of the essentials –
LANs, IP and TCP – across the board, and later chapters expand on the material here.
Local Area Networks, or LANs, are the “physical” networks that provide the connection between machines within, say, a home,
school or corporation. LANs are, as the name says, “local”; it is the IP, or Internet Protocol, layer that provides an abstraction for
connecting multiple LANs into, well, the Internet. Finally, TCP deals with transport and connections and actually sending user
data.
This chapter also contains some important other material. The section on datagram forwarding, central to packet-based switching
and routing, is essential. This chapter also discusses packets generally, congestion, and sliding windows, but those topics are
revisited in later chapters. Firewalls and network address translation are also covered here and not elsewhere.
1.1: Layers
1.2: Data Rate, Throughput and Bandwidth
1.3: Packets
1.4: Datagram Forwarding
1.5: Topology
1.6: Routing Loops
1.7: Congestion
1.8: Packets Again
1.9: LANs and Ethernet
1.10: IP - Internet Protocol
1.11: DNS
1.12: Transport
1.13: Firewalls
1.14: Some Useful Utilities
1.15: IETF and OSI
1.16: Berkeley Unix
1.E: An Overview of Networks (Exercises)
Index

This page titled 1: An Overview of Networks is shared under a CC BY-NC-ND license and was authored, remixed, and/or curated by Peter Lars
Dordal.

1.1: Layers

These three topics – LANs, IP and TCP – are often called layers; they constitute the Link layer, the Internetwork layer, and the
Transport layer respectively. Together with the Application layer (the software you use), these form the “four-layer model” for
networks. A layer, in this context, corresponds strongly to the idea of a programming interface or library, with the understanding
that a given layer communicates directly only with the two layers immediately above and below it. An application hands off a
chunk of data to the TCP library, which in turn makes calls to the IP library, which in turn calls the LAN layer for actual delivery.
An application does not interact directly with the IP and LAN layers at all.
The LAN layer is in charge of actual delivery of packets, using LAN-layer-supplied addresses. It is often conceptually subdivided
into the “physical layer” dealing with, eg, the analog electrical, optical or radio signaling mechanisms involved, and above that an
abstracted “logical” LAN layer that describes all the digital – that is, non-analog – operations on packets (Section 2.1.4). The
physical layer is generally of direct concern only to those designing LAN hardware; the kernel software interface to the LAN
corresponds to the logical LAN layer.

    Application
    Transport
    IP
    Logical LAN
    Physical LAN

This LAN physical/logical division gives us the Internet five-layer model. This is less a formal hierarchy than an ad hoc
classification method. We will return to this below in 1.15 IETF and OSI, where we will also introduce two more rather obscure
layers that complete the seven-layer model.

This page titled 1.1: Layers is shared under a CC BY-NC-ND license and was authored, remixed, and/or curated by Peter Lars Dordal.

1.2: Data Rate, Throughput and Bandwidth

Any one network connection – eg at the LAN layer – has a data rate: the rate at which bits are transmitted. In some LANs (eg Wi-
Fi) the data rate can vary with time. Throughput refers to the overall effective transmission rate, taking into account things like
transmission overhead, protocol inefficiencies and perhaps even competing traffic. It is generally measured at a higher network
layer than the data rate.
The term bandwidth can be used to refer to either of these, though we here use it mostly as a synonym for data rate. The term
comes from radio transmission, where the width of the frequency band available is proportional, all else being equal, to the data
rate that can be achieved.
In discussions about TCP, the term goodput is sometimes used to refer to what might also be called “application-layer throughput”:
the amount of usable data delivered to the receiving application. Specifically, retransmitted data is counted only once when
calculating goodput but might be counted twice under some interpretations of “throughput”.
Data rates are generally measured in kilobits per second (kbps) or megabits per second (Mbps); the use of the lower-case "b" here
denotes bits. In the context of data rates, a kilobit is 10³ bits (not 2¹⁰) and a megabit is 10⁶ bits. Somewhat inconsistently, we follow
the tradition of using kB and MB to denote data volumes of 2¹⁰ and 2²⁰ bytes respectively, with the upper-case B denoting bytes.
The newer abbreviations KiB [en.Wikipedia.org/wiki/Kibibyte] and MiB [en.Wikipedia.org/wiki/Mebibyte] would be more precise,
but the consequences of confusion are modest.
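
To make the 10⁶-versus-2²⁰ distinction concrete, here is a short Python sketch (the numbers are our own illustration) computing how long one MB of data takes to send at 1 Mbps:

    # Illustrative sketch of the data-rate unit conventions above.
    MBIT = 10**6                # a megabit is 10^6 bits (data-rate convention)
    MBYTE = 2**20               # we use MB for 2^20 bytes (data-volume convention)

    file_bits = 1 * MBYTE * 8   # one megabyte of data, in bits
    rate_bps = 1 * MBIT         # a 1 Mbps link

    print(f"{file_bits / rate_bps:.3f} s")   # 8.389 s, not exactly 8 s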

This page titled 1.2: Data Rate, Throughput and Bandwidth is shared under a CC BY-NC-ND license and was authored, remixed, and/or curated
by Peter Lars Dordal.

1.3: Packets

Packets are modest-sized buffers of data, transmitted as a unit through some shared set of links. Of necessity, packets need to be
prefixed with a header containing delivery information. In the common case known as datagram forwarding, the header contains
a destination address; headers in networks using so-called virtual-circuit forwarding contain instead an identifier for the
connection. Almost all networking today (and for the past 50 years) is packet-based, although we will later look briefly at some
“circuit-switched” options for voice telephony.

[Figure: single and multiple headers. A packet is shown as ⟨header, data⟩, and with nested headers as ⟨header1, header2, data⟩.]

At the LAN layer, packets can be viewed as the imposition of a buffer (and addressing) structure on top of low-level serial lines;
additional layers then impose additional structure. Informally, packets are often referred to as frames at the LAN layer, and as
segments at the Transport layer.
The maximum packet size supported by a given LAN (eg Ethernet, Token Ring or ATM) is an intrinsic attribute of that LAN.
Ethernet allows a maximum of 1500 bytes of data. By comparison, TCP/IP packets originally often held only 512 bytes of data,
while early Token Ring packets could contain up to 4 kB of data. While there are proponents of very large packet sizes, larger even
than 64 kB, at the other extreme the ATM (Asynchronous Transfer Mode) protocol uses 48 bytes of data per packet, and there are
good reasons for believing in modest packet sizes.
One potential issue is how to forward packets from a large-packet LAN to (or through) a small-packet LAN; in later chapters we
will look at how the IP (or Internet Protocol) layer addresses this.
Generally each layer adds its own header. Ethernet headers are typically 14 bytes, IP headers 20 bytes, and TCP headers 20 bytes.
If a TCP connection sends 512 bytes of data per packet, then the headers amount to 10% of the total, a not-unreasonable overhead.
For one common Voice-over-IP option, packets contain 160 bytes of data and 54 bytes of headers, making the header about 25% of
the total. Compressing the 160 bytes of audio, however, may bring the data portion down to 20 bytes, meaning that the headers are
now 73% of the total; see 20.11.4 RTP and VoIP.
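
The percentages above are straightforward arithmetic; the following sketch simply recomputes them:

    # Recomputing the header-overhead percentages from the examples above.
    def overhead(header_bytes, data_bytes):
        """Return header overhead as a percentage of total packet size."""
        return 100.0 * header_bytes / (header_bytes + data_bytes)

    # Ethernet (14) + IP (20) + TCP (20) = 54 bytes of headers
    print(round(overhead(54, 512), 1))   # 9.5: the "about 10%" TCP case
    print(round(overhead(54, 160), 1))   # 25.2: the uncompressed VoIP case
    print(round(overhead(54, 20), 1))    # 73.0: the compressed VoIP case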
In datagram-forwarding networks the appropriate header will contain the address of the destination and perhaps other delivery
information. Internal nodes of the network called routers or switches will then try to ensure that the packet is delivered to the
requested destination.
The concept of packets and packet switching was first introduced by Paul Baran in 1962 ([PB62]). Baran’s primary concern was
with network survivability in the event of node failure; existing centrally switched protocols were vulnerable to central failure. In
1964, Donald Davies independently developed many of the same concepts; it was Davies who coined the term “packet”.
It is perhaps worth noting that packets are buffers built of 8-bit bytes, and all hardware today agrees what a byte is (hardware agrees
by convention on the order in which the bits of a byte are to be transmitted). 8-bit bytes are universal now, but it was not always so.
Perhaps the last great non-byte-oriented hardware platform, which did indeed overlap with the Internet era broadly construed, was
the DEC-10, which had a 36-bit word size; a word could hold five 7-bit ASCII characters. The early Internet specifications
introduced the term octet (an 8-bit byte) and required that packets be sequences of octets; non-octet-oriented hosts had to be able to
convert. Thus was chaos averted. Note that there are still byte-oriented data issues; as one example, binary integers can be
represented as a sequence of bytes in either big-endian or little-endian byte order (11.1.5 Binary Data). RFC 1700
[https://tools.ietf.org/html/rfc1700.html] specifies that Internet protocols use big-endian byte order, therefore sometimes called
network byte order.
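
Python's standard struct module makes the byte-order distinction easy to see; "!" selects network (big-endian) order:

    # Big-endian ("network") vs little-endian byte order for a 16-bit integer.
    import struct

    value = 0x1234
    print(struct.pack("!H", value).hex())   # 1234: network (big-endian) order
    print(struct.pack("<H", value).hex())   # 3412: little-endian order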

This page titled 1.3: Packets is shared under a CC BY-NC-ND license and was authored, remixed, and/or curated by Peter Lars Dordal.

1.4: Datagram Forwarding

In the datagram-forwarding model of packet delivery, packet headers contain a destination address. It is up to the intervening
switches or routers to look at this address and get the packet to the correct destination.
In datagram forwarding this is achieved by providing each switch with a forwarding table of ⟨destination,next_hop⟩ pairs. When a
packet arrives, the switch looks up the destination address (presumed globally unique) in its forwarding table and finds the
next_hop information: the immediate-neighbor address to which – or interface by which – the packet should be forwarded in order
to bring it one step closer to its final destination. The next_hop value in a forwarding table is a single entry; each switch is
responsible for only one step in the packet’s path. However, if all is well, the network of switches will be able to deliver the packet,
one hop at a time, to its ultimate destination.
The “destination” entries in the forwarding table do not have to correspond exactly with the packet destination addresses, though in
the examples here they do, and they do for Ethernet datagram forwarding. However, for IP routing, the table “destination” entries
will correspond to prefixes of IP addresses; this leads to a huge savings in space. The fundamental requirement is that the switch
can perform a lookup operation, using its forwarding table and the destination address in the arriving packet, to determine the next
hop.
Just how the forwarding table is built is a question for later; we will return to this for Ethernet switches in 2.4.1 Ethernet Learning
Algorithm and for IP routers in 9 Routing-Update Algorithms. For now, the forwarding tables may be thought of as created through
initial configuration.
In the diagram below, switch S1 has interfaces 0, 1 and 2, and S2 has interfaces 0,1,2,3. If A is to send a packet to B, S1 must have
a forwarding-table entry indicating that destination B is reached via its interface 2, and S2 must have an entry forwarding the
packet out on interface 3.

[Figure: two switches S1 and S2, with interface numbers shown. Host A attaches to S1 at interface 0 and host C at interface 1; S1's interface 2 connects to S2's interface 0; hosts D, E and B attach to S2 at interfaces 1, 2 and 3 respectively.]

A complete forwarding table for S1, using interface numbers in the next_hop column, would be:
S1

    destination   next_hop
    A             0
    C             1
    B             2
    D             2
    E             2

The table for S2 might be as follows, where we have consolidated destinations A and C for visual simplicity.
S2

    destination   next_hop
    A,C           0
    D             1
    E             2
    B             3

In the network diagrammed above, all links are point-to-point, and so each interface corresponds to the unique immediate neighbor
reached by that interface. We can thus replace the interface entries in the next_hop column with the name of the corresponding
neighbor. For human readers, using neighbors in the next_hop column is usually much more readable. S1’s table can now be
written as follows (with consolidation of the entries for B, D and E):
S1

    destination   next_hop
    A             A
    C             C
    B,D,E         S2
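
Viewed as a data structure, such a table is simply a map from destination to next_hop; the following minimal Python sketch (our own construction, not from the text) shows S1's per-destination table and the lookup a switch performs on each arriving packet:

    # S1's forwarding table as a simple exact-match map; the consolidated
    # entry "B,D,E -> S2" is shown expanded into individual entries.
    s1_table = {"A": "A", "C": "C", "B": "S2", "D": "S2", "E": "S2"}

    def forward(table, destination):
        """Return the next_hop for an arriving packet's destination address."""
        return table[destination]       # raises KeyError if destination is unknown

    print(forward(s1_table, "B"))       # S2: one hop closer to B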

A central feature of datagram forwarding is that each packet is forwarded “in isolation”; the switches involved do not have any
awareness of any higher-layer logical connections established between endpoints. This is also called stateless forwarding, in that
the forwarding tables have no per-connection state. RFC 1122 [https://tools.ietf.org/html/rfc1122.html] put it this way (in the
context of IP-layer datagram forwarding):

    To improve robustness of the communication system, gateways are designed to be
    stateless, forwarding each IP datagram independently of other datagrams. As a
    result, redundant paths can be exploited to provide robust service in spite of
    failures of intervening gateways and networks.
The fundamental alternative to datagram forwarding is virtual circuits, 3.4 Virtual Circuits. In virtual-circuit networks, each router
maintains state about each connection passing through it; different connections can be routed differently. If packet forwarding
depends, for example, on per-connection information – eg both TCP port numbers – it is not datagram forwarding. (That said, it
arguably still is datagram forwarding if web traffic – to TCP port 80 – is forwarded differently than all other traffic, because that
rule does not depend on the specific connection.)
Datagram forwarding is sometimes allowed to use other information beyond the destination address. In theory, IP routing can be
done based on the destination address and some quality-of-service information, allowing, for example, different routing to the
same destination for high-bandwidth bulk traffic and for low-latency real-time traffic. In practice, most Internet Service Providers
(ISPs) ignore user-provided quality-of-service information in the IP header, except by prearranged agreement, and route only based
on the destination.
By convention, switching devices acting at the LAN layer and forwarding packets based on the LAN address are called switches
(or, originally, bridges; some still prefer that term), while such devices acting at the IP layer and forwarding on the IP address are
called routers. Datagram forwarding is used both by Ethernet switches and by IP routers, though the destinations in Ethernet
forwarding tables are individual nodes while the destinations in IP routers are entire networks (that is, sets of nodes).
In IP routers within end-user sites it is common for a forwarding table to include a catchall default entry, matching any IP address
that is nonlocal and so needs to be routed out into the Internet at large. Unlike the consolidated entries for B, D and E in the table
above for S1, which likely would have to be implemented as actual separate entries, a default entry is a single record representing
where to forward the packet if no other destination match is found. Here is a forwarding table for S1, above, with a default entry
replacing the last three entries:
S1

    destination   next_hop
    A             0
    C             1
    default       2

Default entries make sense only when we can tell by looking at an address that it does not represent a nearby node. This is common
in IP networks because an IP address encodes the destination network, and routers generally know all the local networks. It is
however rare in Ethernets, because there is generally no correlation between Ethernet addresses and locality. If S1 above were an
Ethernet switch, and it had some means of knowing that interfaces 0 and 1 connected directly to individual hosts, not switches –
and S1 knew the addresses of these hosts – then making interface 2 a default route would make sense. In practice, however,
Ethernet switches do not know what kind of device connects to a given interface.
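
As a sketch, a default entry turns the exact-match lookup above into a lookup with a fallback:

    # Exact-match lookup with a catchall default, as in the table above.
    s1_table = {"A": 0, "C": 1}         # interface numbers as next_hop values
    DEFAULT_INTERFACE = 2

    def forward(table, destination):
        # Try an exact match first; otherwise use the default route.
        return table.get(destination, DEFAULT_INTERFACE)

    print(forward(s1_table, "E"))       # 2: no exact match, so the default applies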

This page titled 1.4: Datagram Forwarding is shared under a CC BY-NC-ND license and was authored, remixed, and/or curated by Peter Lars
Dordal.

1.5: Topology

In the network diagrammed in the previous section, there are no loops; graph theorists might describe this by saying the network
graph is acyclic, or is a tree. In a loop-free network there is a unique path between any pair of nodes. The forwarding-table
algorithm has only to make sure that every destination appears in the forwarding tables; the issue of choosing between alternative
paths does not arise.
However, if there are no loops then there is no redundancy: any broken link will result in partitioning the network into two pieces
that cannot communicate. All else being equal (which it is not, but never mind for now), redundancy is a good thing. However,
once we start including redundancy, we have to make decisions among the multiple paths to a destination. Consider, for a moment,
the following network:

C D

1 1

A 0 S1 2 0 S2 3 B
2

Two switches S1 and S2,


with interface numbers shown E

Should S1 list S2 or S3 as the next_hop to B? Both paths A─S1─S2─S4─B and A─S1─S3─S4─B get there. There is no right
answer. Even if one path is “faster” than the other, taking the slower path is not exactly wrong (especially if the slower path is, say,
less expensive). Some sort of protocol must exist to provide a mechanism by which S1 can make the choice (though this
mechanism might be as simple as choosing to route via the first path discovered to the given destination). We also want protocols
to make sure that, if S1 reaches B via S2 and the S2─S4 link fails, then S1 will switch over to the still-working S1─S3─S4─B
route.
As we shall see, many LANs (in particular Ethernet) prefer “tree” networks with no redundancy, while IP has complex protocols in
support of redundancy (9 Routing-Update Algorithms).

1.5.1 Traffic Engineering


In some cases the decision above between routes A─S1─S2─S4─B and A─S1─S3─S4─B might be of material significance –
perhaps the S2–S4 link is slower than the others, or is more congested. We will use the term traffic engineering to refer to any
intentional selection of one route over another, or any elevation of the priority of one class of traffic. The route selection can either
be directly intentional, through configuration, or can be implicit in the selection or tuning of algorithms that then make these route-
selection choices automatically. As an example of the latter, the algorithms of 9.1 Distance-Vector Routing-Update Algorithm build
forwarding tables on their own, but those tables are greatly influenced by the administrative assignment of link costs.
With pure datagram forwarding, used at either the LAN or the IP layer, the path taken by a packet is determined solely by its
destination, and traffic engineering is limited to the choices made between alternative paths. We have already, however, suggested
that datagram forwarding can be extended to take quality-of-service information into account; this may be used to have voice
traffic – with its relatively low bandwidth but intolerance for delay – take an entirely different path than bulk file transfers.
Alternatively, the network manager may simply assign voice traffic a higher priority, so it does not have to wait in queues behind
file-transfer traffic.
The quality-of-service information may be set by the end-user, in which case an ISP may wish to recognize it only for designated
users, which in turn means that the ISP will implicitly use the traffic source when making routing decisions. Alternatively, the
quality-of-service information may be set by the ISP itself, based on its best guess as to the application; this means that the ISP
may be using packet size, port number (1.12 Transport) and other contents as part of the routing decision. For some explicit
mechanisms supporting this kind of routing, see 9.6 Routing on Other Attributes.

At the LAN layer, traffic-engineering mechanisms are historically limited, though see 2.8 Software-Defined Networking. At the IP
layer, more strategies are available; see 20 Quality of Service.

This page titled 1.5: Topology is shared under a CC BY-NC-ND license and was authored, remixed, and/or curated by Peter Lars Dordal.

1.6: Routing Loops

A potential drawback to datagram forwarding is the possibility of a routing loop: a set of entries in the forwarding tables that cause
some packets to circulate endlessly. For example, in the previous picture we would have a routing loop if, for (nonexistent)
destination C, S1 forwarded to S2, S2 forwarded to S4, S4 forwarded to S3, and S3 forwarded to S1. A packet sent to C would not
only not be delivered, but in circling endlessly it might easily consume a large majority of the bandwidth. Routing loops typically
arise because the creation of the forwarding tables is often “distributed”, and there is no global authority to detect inconsistencies.
Even when there is such an authority, temporary routing loops can be created due to notification delays.
Routing loops can also occur in networks where the underlying link topology is loop-free; for example, in the previous diagram we
could, again for destination C, have S1 forward to S2 and S2 forward back to S1. We will refer to such a case as a linear routing
loop.
All datagram-forwarding protocols need some way of detecting and avoiding routing loops. Ethernet, for example, avoids nonlinear
routing loops by disallowing loops in the underlying network topology, and avoids linear routing loops by not having switches
forward a packet back out the interface by which it arrived. IP provides for a one-byte “Time to Live” (TTL) field in the IP header;
it is set by the sender and decremented by 1 at each router; a packet is discarded if its TTL reaches 0. This limits the number of
times a wayward packet can be forwarded to the initial TTL value, typically 64.
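
The TTL rule itself is simple enough to sketch directly; the names below are our own:

    # Sketch of per-router TTL handling (IPv4 "Time to Live").
    def handle_ttl(ttl):
        """Decrement TTL; return the new value, or None if the packet is dropped."""
        ttl -= 1                        # each router decrements by 1
        if ttl <= 0:
            return None                 # packet discarded once TTL reaches 0
        return ttl

    ttl, hops = 64, 0                   # 64 is a typical initial value
    while True:
        ttl = handle_ttl(ttl)
        if ttl is None:
            break
        hops += 1
    print(hops)                         # 63: routers forward it 63 times; the 64th discards it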
In datagram routing, a switch is responsible only for the next hop to the ultimate destination; if a switch has a complete path in
mind, there is no guarantee that the next_hop switch or any other downstream switch will continue to forward along that path.
Misunderstandings can potentially lead to routing loops. Consider this network:
[Figure: five nodes A, B, C, D and E, with links A–B, B–C, A–D, D–E and E–C.]

D might feel that the best path to B is D–E–C–B (perhaps because it believes the A–D link is to be avoided). If E similarly decides
the best path to B is E–D–A–B, and if D and E both choose their next_hop for B based on these best paths, then a linear routing
loop is formed: D routes to B via E and E routes to B via D. Although each of D and E have identified a usable path, that path is
not in fact followed. Moral: successful datagram routing requires cooperation and a consistent view of the network.

This page titled 1.6: Routing Loops is shared under a CC BY-NC-ND license and was authored, remixed, and/or curated by Peter Lars Dordal.

1.7: Congestion
Switches introduce the possibility of congestion: packets arriving faster than they can be sent out. This can happen with just two
interfaces, if the inbound interface has a higher bandwidth than the outbound interface; another common source of congestion is
traffic arriving on multiple inputs and all destined for the same output.
Whatever the reason, if packets are arriving for a given outbound interface faster than they can be sent, a queue will form for that
interface. Once that queue is full, packets will be dropped. The most common strategy (though not the only one) is to drop any
packets that arrive when the queue is full.
The term “congestion” may refer either to the point where the queue is just beginning to build up, or to the point where the queue is
full and packets are lost. In their paper [CJ89], Chiu and Jain refer to the first point as the knee; this is where the slope of the load
vs throughput graph flattens. They refer to the second point as the cliff; this is where packet losses may lead to a precipitous decline
in throughput. Other authors use the term contention for knee-congestion.
In the Internet, most packet losses are due to congestion. This is not because congestion is especially bad (though it can be, at
times), but rather that other types of losses (eg due to packet corruption) are insignificant by comparison.

When to Upgrade?
Deciding when a network really does have insufficient bandwidth is not a technical issue but an economic one. The number of
customers may increase, the cost of bandwidth may decrease or customers may simply be willing to pay more to have data
transfers complete in less time; “customers” here can be external or in-house. Monitoring of links and routers for congestion
can, however, help determine exactly what parts of the network would most benefit from upgrade.

We emphasize that the presence of congestion does not mean that a network has a shortage of bandwidth. Bulk-traffic senders
(though not real-time senders) attempt to send as fast as possible, and congestion is simply the network’s feedback that the
maximum transmission rate has been reached. For further discussion, including alternative definitions of longer-term congestion,
see [BCL09].
Congestion is a sign of a problem in real-time networks, which we will consider in 20 Quality of Service. In these networks losses
due to congestion must generally be kept to an absolute minimum; one way to achieve this is to limit the acceptance of new
connections unless sufficient resources are available.

This page titled 1.7: Congestion is shared under a CC BY-NC-ND license and was authored, remixed, and/or curated by Peter Lars Dordal.

1.8: Packets Again
Perhaps the core justification for packets, Baran’s concerns about node failure notwithstanding, is that the same link can carry, at
different times, different packets representing traffic to different destinations and from different senders. Thus, packets are the key
to supporting shared transmission lines; that is, they support the multiplexing of multiple communications channels over a single
cable. The alternative of a separate physical line between every pair of machines grows prohibitively complex very quickly (though
virtual circuits between every pair of machines in a datacenter are not uncommon; see 3.4 Virtual Circuits).
From this shared-medium perspective, an important packet feature is the maximum packet size, as this represents the maximum
time a sender can send before other senders get a chance. The alternative of unbounded packet sizes would lead to prolonged
network unavailability for everyone else if someone downloaded a large file in a single 1 Gigabit packet. Another drawback to
large packets is that, if the packet is corrupted, the entire packet must be retransmitted; see 5.3.1 Error Rates and Packet Size.
When a router or switch receives a packet, it (generally) reads in the entire packet before looking at the header to decide to what
next node to forward it. This is known as store-and-forward, and introduces a forwarding delay equal to the time needed to read
in the entire packet. For individual packets this forwarding delay is hard to avoid (though some switches do implement cut-
through switching to begin forwarding a packet before it has fully arrived), but if one is sending a long train of packets then by
keeping multiple packets en route at the same time one can essentially eliminate the significance of the forwarding delay; see 5.3
Packet Size.
Total packet delay from sender to receiver is the sum of the following:
- Bandwidth delay, ie sending 1000 Bytes at 20 Bytes/millisecond will take 50 ms. This is a per-link delay.
- Propagation delay due to the speed of light. For example, if you start sending a packet right now on a 5000-km cable across the US with a propagation speed of 200 m/µsec (= 200 km/ms, about 2/3 the speed of light in vacuum), the first bit will not arrive at the destination until 25 ms later. The bandwidth delay then determines how much after that the entire packet will take to arrive.
- Store-and-forward delay, equal to the sum of the bandwidth delays out of each router along the path.
- Queuing delay, or waiting in line at busy routers. At bad moments this can exceed 1 sec, though that is rare. Generally it is less than 10 ms and often is less than 1 ms. Queuing delay is the only delay component amenable to reduction through careful engineering.
See 5.1 Packet Delay for more details.
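
These components combine by simple addition; the sketch below redoes the arithmetic for a 1000-byte packet on the 5000-km path above, with a single intermediate store-and-forward router added as our own assumption:

    # Worked delay calculation using the numbers from the list above.
    packet_bytes = 1000
    rate = 20                     # bytes per millisecond, from the example
    distance_km = 5000
    prop_speed = 200              # km per millisecond (about 2/3 the speed of light)

    bandwidth_delay = packet_bytes / rate            # 50 ms on the first link
    propagation_delay = distance_km / prop_speed     # 25 ms end to end
    store_and_forward = packet_bytes / rate          # 50 ms: one assumed router re-sends the packet
    queuing_delay = 1                                # ms; our assumed value, often less

    total = bandwidth_delay + propagation_delay + store_and_forward + queuing_delay
    print(total)                                     # 126.0 ms one-way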

This page titled 1.8: Packets Again is shared under a CC BY-NC-ND license and was authored, remixed, and/or curated by Peter Lars Dordal.

1.9: LANs and Ethernet
A local-area network, or LAN, is a system consisting of

- physical links that are, ultimately, serial lines
- common interfacing hardware connecting the hosts to the links
- protocols to make everything work together
We will explicitly assume that every LAN node is able to communicate with every other LAN node. Sometimes this will require
the cooperation of intermediate nodes acting as switches.
Far and away the most common type of (wired) LAN is Ethernet, originally described in a 1976 paper by Metcalfe and Boggs
[MB76]. Ethernet's popularity is due to low cost more than anything else, though the primary reason Ethernet's cost is low is that
high demand has led to manufacturing economies of scale.
The original Ethernet had a bandwidth of 10 Mbps (megabits per second; we will use lower-case “b” for bits and upper-case “B”
for bytes), though nowadays most Ethernet operates at 100 Mbps and gigabit (1000 Mbps) Ethernet (and faster) is widely used in
server rooms. (By comparison, as of this writing (2015) the data transfer rate to a typical faster hard disk is about 1000 Mbps.)
Wireless (“Wi-Fi”) LANs are gaining popularity, and in many settings have supplanted wired Ethernet to end-users.
Many early Ethernet installations were unswitched; each host simply tapped in to one long primary cable that wound through the
building (or floor). In principle, two stations could then transmit at the same time, rendering the data unintelligible; this was called
a collision. Ethernet has several design features intended to minimize the bandwidth wasted on collisions: stations, before
transmitting, check to be sure the line is idle, they monitor the line while transmitting to detect collisions during the transmission,
and, if a collision is detected, they execute a random backoff strategy to avoid an immediate recollision. See 2.1.5 The Slot Time
and Collisions. While Ethernet collisions definitely reduce throughput, in the larger view they should perhaps be thought of as a
part of a remarkably inexpensive shared-access mediation protocol.
In unswitched Ethernets every packet is received by every host and it is up to the network card in each host to determine if the
arriving packet is addressed to that host. It is almost always possible to configure the card to forward all arriving packets to the
attached host; this poses a security threat and “password sniffers” that surreptitiously collected passwords via such eavesdropping
used to be common.

Password Sniffing
In the fall of 1994 at Loyola University I remotely changed the root password on several CS-department unix machines at the
other end of campus, using telnet. I told no one. Within two hours, someone else logged into one of these machines, using the
new password, from a host in Europe. Password sniffing was the likely culprit.
Two months later was the so-called “Christmas Day Attack” (12.10.1 ISNs and spoofing). One of the hosts used to launch this
attack was Loyola’s hacked apollo.it.luc.edu. It is unclear the degree to which password sniffing played a role in that exploit.

Due to both privacy and efficiency concerns, almost all Ethernets today are fully switched; this ensures that each packet is
delivered only to the host to which it is addressed. One advantage of switching is that it effectively eliminates most Ethernet
collisions; while in principle it replaces them with a queuing issue, in practice Ethernet switch queues so seldom fill up that they
are almost invisible even to network managers (unlike IP router queues). Switching also prevents host-based eavesdropping, though
arguably a better solution to this problem is encryption. Perhaps the more significant tradeoff with switches, historically, was that
Once Upon A Time they were expensive and unreliable; tapping directly into a common cable was dirt cheap.
Ethernet addresses are six bytes long. Each Ethernet card (or network interface) is assigned a (supposedly) unique address at the
time of manufacture; this address is burned into the card’s ROM and is called the card’s physical address or hardware address or
MAC (Media Access Control) address. The first three bytes of the physical address have been assigned to the manufacturer; the
subsequent three bytes are a serial number assigned by that manufacturer.
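
The manufacturer/serial split is easy to see by slicing the address bytes; the address below is purely illustrative:

    # Splitting an Ethernet address into manufacturer (OUI) and serial parts.
    mac = "00:1a:2b:3c:4d:5e"           # an illustrative address, not a real assignment
    octets = mac.split(":")

    oui = ":".join(octets[:3])          # first three bytes: assigned to the manufacturer
    serial = ":".join(octets[3:])       # last three bytes: the manufacturer's serial number
    print(oui, serial)                  # 00:1a:2b 3c:4d:5e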
By comparison, IP addresses are assigned administratively by the local site. The basic advantage of having addresses in hardware is
that hosts automatically know their own addresses on startup; no manual configuration or server query is necessary. It is not
unusual for a site to have a large number of identically configured workstations, for which all network differences derive ultimately
from each workstation’s unique Ethernet address.

The network interface continually monitors all arriving packets; if it sees any packet containing a destination address that matches
its own physical address, it grabs the packet and forwards it to the attached CPU (via a CPU interrupt).
Ethernet also has a designated broadcast address. A host sending to the broadcast address has its packet received by every other
host on the network; if a switch receives a broadcast packet on one port, it forwards the packet out every other port. This broadcast
mechanism allows host A to contact host B when A does not yet know B’s physical address; typical broadcast queries have forms
such as “Will the designated server please answer” or (from the ARP protocol) “will the host with the given IP address please tell
me your physical address”.
Traffic addressed to a particular host – that is, not broadcast – is said to be unicast.
Because Ethernet addresses are assigned by the hardware, knowing an address does not provide any direct indication of where that
address is located on the network. In switched Ethernet, the switches must thus have a forwarding-table record for each individual
Ethernet address on the network; for extremely large networks this ultimately becomes unwieldy. Consider the analogous situation
with postal addresses: Ethernet is somewhat like attempting to deliver mail using social-security numbers as addresses, where each
postal worker is provided with a large catalog listing each person’s SSN together with their physical location. Real postal mail is, of
course, addressed “hierarchically” using ever-more-precise specifiers: state, city, zipcode, street address, and name / room#.
Ethernet, in other words, does not scale well to “large” sizes.
Switched Ethernet works quite well, however, for networks with up to 10,000-100,000 nodes. Forwarding tables with size in that
range are straightforward to manage.
To forward packets correctly, switches must know where all active destination addresses in the LAN are located; traditional
Ethernet switches do this by a passive learning algorithm. (IP routers, by comparison, use “active” protocols, and some newer
Ethernet switches take the approach of 2.8 Software-Defined Networking.) Typically a host physical address is entered into a
switch’s forwarding table when a packet from that host is first received; the switch notes the packet’s arrival interface and source
address and assumes that the same interface is to be used to deliver packets back to that sender. If a given destination address has
not yet been seen, and thus is not in the forwarding table, Ethernet switches still have the backup delivery option of flooding:
forwarding the packet to everyone by treating the destination address like the broadcast address, and allowing the host Ethernet
cards to sort it out. Since this broadcast-like process is not generally used for more than one packet (after that, the switches will
have learned the correct forwarding-table entries), the risks of excessive traffic and of eavesdropping are minimal.
The ⟨host,interface⟩ forwarding table is often easier to think of as ⟨host,next_hop⟩, where the next_hop node is whatever switch or
host is at the immediate other end of the link connecting to the given interface. In a fully switched network where each link
connects only two interfaces, the two perspectives are equivalent.
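
The learning algorithm amounts to two small rules per packet: record the interface on which the source address was heard, then either forward to a known interface or flood. A minimal sketch, our own structure rather than production switch code:

    # Minimal sketch of a passively learning Ethernet switch.
    BROADCAST = "ff:ff:ff:ff:ff:ff"

    class LearningSwitch:
        def __init__(self, num_ports):
            self.ports = range(num_ports)
            self.table = {}                       # host address -> interface

        def receive(self, src, dst, in_port):
            # Learn: packets back to src should go out the arrival interface.
            self.table[src] = in_port
            if dst != BROADCAST and dst in self.table:
                return [self.table[dst]]          # unicast to a known destination
            # Broadcast, or unknown destination: flood everywhere except in_port.
            return [p for p in self.ports if p != in_port]

    sw = LearningSwitch(4)
    print(sw.receive("mac-A", "mac-B", 0))        # [1, 2, 3]: B unknown, flood
    print(sw.receive("mac-B", "mac-A", 3))        # [0]: A was learned above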

This page titled 1.9: LANs and Ethernet is shared under a CC BY-NC-ND license and was authored, remixed, and/or curated by Peter Lars
Dordal.

1.10: IP - Internet Protocol
To solve the scaling problem with Ethernet, and to allow support for other types of LANs and point-to-point links as well, the
Internet Protocol was developed. Perhaps the central issue in the design of IP was to support universal connectivity (everyone can
connect to everyone else) in such a way as to allow scaling to enormous size (in 2013 there appear to be around 10⁹ nodes,
although IP should work to 10¹⁰ nodes or more), without resulting in unmanageably large forwarding tables (currently the largest
tables have about 300,000 entries).
In the early days, IP networks were considered to be “internetworks” of basic networks (LANs); nowadays users generally ignore
LANs and think of the Internet as one large (virtual) network.
To support universal connectivity, IP provides a global mechanism for addressing and routing, so that packets can actually be
delivered from any host to any other host. IP addresses (for the most-common version 4, which we denote IPv4) are 4 bytes (32
bits), and are part of the IP header that generally follows the Ethernet header. The Ethernet header only stays with a packet for one
hop; the IP header stays with the packet for its entire journey across the Internet.
An essential feature of IPv4 (and IPv6) addresses is that they can be divided into a network part (a prefix) and a host part (the
remainder). The “legacy” mechanism for designating the IPv4 network and host address portions was to make the division
according to the first few bits:

    first few bits   first byte   network bits   host bits   name      application
    0                0-127        8              24          class A   a few very large networks
    10               128-191      16             16          class B   institution-sized networks
    110              192-223      24             8           class C   sized for smaller entities

For example, the original IP address allocation for Loyola University Chicago was 147.126.0.0, a class B. In binary, 147 is
10010011.
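
The class rules above amount to testing the leading bits of the first byte; a short sketch using Loyola's 147:

    # Classifying a legacy IPv4 address by the leading bits of its first byte.
    def legacy_class(first_byte):
        if first_byte < 128:            # leading bit 0
            return "A"
        if first_byte < 192:            # leading bits 10
            return "B"
        if first_byte < 224:            # leading bits 110
            return "C"
        return "D or E"                 # 1110 (multicast) or reserved

    print(f"{147:08b}")                 # 10010011: leading bits are 10
    print(legacy_class(147))            # B, as with Loyola's 147.126.0.0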
IP addresses, unlike Ethernet addresses, are administratively assigned. Once upon a time, you would get your Class B network
prefix from the Internet Assigned Numbers Authority, or IANA (they now delegate this task), and then you would in turn assign the
host portion in a way that was appropriate for your local site. As a result of this administrative assignment, an IP address usually
serves not just as an endpoint identifier but also as a locator, containing embedded location information (at least in the sense of
location within the IP-address-assignment hierarchy, which may not be geographical). Ethernet addresses, by comparison, are
endpoint identifiers but not locators.
The Class A/B/C definition above was spelled out in 1981 in RFC 791 [https://tools.ietf.org/html/rfc791.html], which introduced
IP. Class D was added in 1986 by RFC 988 [https://tools.ietf.org/html/rfc988.html]; class D addresses must begin with the bits
1110. These addresses are for multicast, that is, sending an IP packet to every member of a set of recipients (ideally without
actually transmitting it more than once on any one link).
Nowadays the division into the network and host bits is dynamic, and can be made at different positions in the address at different
levels of the network. For example, a small organization might receive a /27 address block (1/8 the size of a class-C /24) from its
ISP, eg 200.1.130.96/27. The ISP routes to the organization based on this /27 prefix. At some higher level, however, routing might
be based on the prefix 200.1.128/18; this might, for example, represent an address block assigned to the ISP (note that the first 18
bits of 200.1.130.x match 200.1.128; the first two bits of 128 and 130, taken as 8-bit quantities, are “10”). The network/host
division point is not carried within the IP header; routers negotiate this division point when they negotiate the next_hop forwarding
information. We will return to this in 7.5 The Classless IP Delivery Algorithm.
The network portion of an IP address is sometimes called the network number or network address or network prefix. As we
shall see below, most forwarding decisions are made using only the network prefix. The network prefix is commonly denoted by
setting the host bits to zero and ending the resultant address with a slash followed by the number of network bits in the address: eg
12.0.0.0/8 or 147.126.0.0/16. Note that 12.0.0.0/8 and 12.0.0.0/9 represent different things; in the latter, the second byte of any host
address extending the network address is constrained to begin with a 0-bit. An anonymous block of IP addresses might be referred
to only by the slash and following digit, eg “we need a /22 block to accommodate all our customers”.
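
Python's standard ipaddress module can verify the prefix arithmetic in the /27-within-/18 example above (the host address here is our own illustration):

    # Checking prefix containment with the standard-library ipaddress module.
    import ipaddress

    org_block = ipaddress.ip_network("200.1.130.96/27")    # the organization's block
    isp_block = ipaddress.ip_network("200.1.128.0/18")     # the ISP's larger block
    host = ipaddress.ip_address("200.1.130.100")           # an illustrative host

    print(org_block.subnet_of(isp_block))   # True: the /27 lies within the /18
    print(host in org_block)                # True: the host matches the /27 prefix
    print(host in isp_block)                # True: and therefore the /18 as well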

All hosts with the same network address (same network bits) are said to be on the same IP network and must be located together
on the same LAN; as we shall see below, if two hosts share the same network address then they will assume they can reach each
other directly via the underlying LAN, and if they cannot then connectivity fails. A consequence of this rule is that outside of the
site only the network bits need to be looked at to route a packet to the site.
Usually, all hosts (or more precisely all network interfaces) on the same physical LAN share the same network prefix and thus are
part of the same IP network. Occasionally, however, one LAN is divided into multiple IP networks.
Each individual LAN technology has a maximum packet size it supports; for example, Ethernet has a maximum packet size of
about 1500 bytes but the once-competing Token Ring had a maximum of 4 kB. Today the world has largely standardized on
Ethernet and almost entirely standardized on Ethernet packet-size limits, but this was not the case when IP was introduced and
there was real concern that two hosts on separate large-packet networks might try to exchange packets too large for some small-
packet intermediate network to carry.
Therefore, in addition to routing and addressing, the decision was made that IP must also support fragmentation: the division of
large packets into multiple smaller ones (in other contexts this may also be called segmentation). The IP approach is not very
efficient, and IP hosts go to considerable lengths to avoid fragmentation. IP does require that packets of up to 576 bytes be
supported, and so a common legacy strategy was for a host to limit a packet to at most 512 user-data bytes whenever the packet
was to be sent via a router; packets addressed to another host on the same LAN could of course use a larger packet size. Despite its
limited use, however, fragmentation is essential conceptually, in order for IP to be able to support large packets without knowing
anything about the intervening networks.
IP is a best effort system; there are no IP-layer acknowledgments or retransmissions. We ship the packet off, and hope it gets there.
Most of the time, it does.
Architecturally, this best-effort model represents what is known as connectionless networking: the IP layer does not maintain
information about endpoint-to-endpoint connections, and simply forwards packets like a giant LAN. Responsibility for creating and
maintaining connections is left for the next layer up, the TCP layer. Connectionless networking is not the only way to do things: the
alternative could have been some form of connection-oriented internetworking, in which routers do maintain state information about
individual connections. Later, in 3.4 Virtual Circuits, we will examine how virtual-circuit networking can be used to implement a
connection-oriented approach; virtual-circuit switching is the primary alternative to datagram switching.
Connectionless (IP-style) and connection-oriented networking each have advantages. Connectionless networking is conceptually
more reliable: if routers do not hold connection state, then they cannot lose connection state. The path taken by the packets in some
higher-level connection can easily be dynamically rerouted. Finally, connectionless networking makes it hard for providers to bill
by the connection; once upon a time (in the era of dollar-a-minute phone calls) this was a source of mild astonishment to many new
users. (This was not always a given; the paper [CK74] considers, among other things, the possibility of per-packet accounting.)
The primary advantage of connection-oriented networking, on the other hand, is that the routers are then much better positioned to
accept reservations and to make quality-of-service guarantees. This remains something of a sore point in the current Internet: if
you want to use Voice-over-IP, or VoIP, telephones, or if you want to engage in video conferencing, your packets will be treated by
the Internet core just the same as if they were low-priority file transfers. There is no “priority service” option.
The most common form of IP packet loss is router queue overflows, representing network congestion. Packet losses due to packet
corruption are rare (eg less than one in 10⁴; perhaps much less). But in a connectionless world a large number of hosts can
simultaneously attempt to send traffic through one router, in which case queue overflows are hard to avoid.
Although we will often assume, for simplicity, that routers have a fixed input queue size, the reality is often a little more
complicated. See 14.8 Active Queue Management and 19 Queuing and Scheduling.

1.10.1 IP Forwarding
IP routers use datagram forwarding, described in 1.4 Datagram Forwarding above, to deliver packets, but the “destination” values
listed in the forwarding tables are network prefixes – representing entire LANs – instead of individual hosts. The goal of IP
forwarding, then, becomes delivery to the correct LAN; a separate process is used to deliver to the final host once the final LAN
has been reached.

The entire point, in fact, of having a network/host division within IP addresses is so that routers need to list only the network
prefixes of the destination addresses in their IP forwarding tables. This strategy is the key to IP scalability: it saves large amounts
of forwarding-table space, it saves time as smaller tables allow faster lookup, and it saves the bandwidth and overhead that would
be needed for routers to keep track of individual addresses. To get an idea of the forwarding-table space savings, there are currently
(2013) around a billion hosts on the Internet, but only 300,000 or so networks listed in top-level forwarding tables.
With IP’s use of network prefixes as forwarding-table destinations, matching an actual packet address to a forwarding-table entry is
no longer a matter of simple equality comparison; routers must compare appropriate prefixes.
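To illustrate what prefix comparison involves, here is a short sketch in Python (not from the text; real routers do this in hardware, and must also handle prefixes of different lengths via longest-match lookup):

# A sketch, not from the text: an address matches a forwarding-table
# entry if the two agree on the first prefix_len bits. Addresses are
# handled as 32-bit integers.

def to_int(dotted):
    a, b, c, d = (int(x) for x in dotted.split('.'))
    return (a << 24) | (b << 16) | (c << 8) | d

def matches(addr, prefix, prefix_len):
    mask = ((1 << prefix_len) - 1) << (32 - prefix_len)   # /24 gives 255.255.255.0
    return (addr & mask) == (prefix & mask)

# 200.0.1.37 lies within 200.0.1.0/24 but not within 200.0.0.0/24:
print(matches(to_int('200.0.1.37'), to_int('200.0.1.0'), 24))   # True
print(matches(to_int('200.0.1.37'), to_int('200.0.0.0'), 24))   # False
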
IP forwarding tables are sometimes also referred to as “routing tables”; in this book, however, we make at least a token effort to use
“forwarding” to refer to the packet forwarding process, and “routing” to refer to mechanisms by which the forwarding tables are
maintained and updated. (If we were to be completely consistent here, we would use the term “forwarding loop” rather than
“routing loop”.)
Now let us look at an example of how IP forwarding (or routing) works. We will assume that all network nodes are either hosts –
user machines, with a single network connection – or routers, which do packet-forwarding only. Routers are not directly visible to
users, and always have at least two different network interfaces representing different networks that the router is connecting.
(Machines can be both hosts and routers, but this introduces complications.)
Suppose A is the sending host, sending a packet to a destination host D. The IP header of the packet will contain D’s IP address in
the “destination address” field (it will also contain A’s own address as the “source address”). The first step is for A to determine
whether D is on the same LAN as itself or not; that is, whether D is local. This is done by looking at the network part of the
destination address, which we will denote by Dnet. If this net address is the same as A’s (that is, if it is equal numerically to Anet),
then A figures D is on the same LAN as itself, and can use direct LAN delivery. It looks up the appropriate physical address for D
(probably with the ARP protocol, 7.9 Address Resolution Protocol: ARP), attaches a LAN header to the packet in front of the IP
header, and sends the packet straight to D via the LAN.
If, however, Anet and Dnet do not match – D is non-local – then A looks up a router to use. Most ordinary hosts use only one router
for all non-local packet deliveries, making this choice very simple. A then forwards the packet to the router, again using direct
delivery over the LAN. The IP destination address in the packet remains D in this case, although the LAN destination address will
be that of the router.
When the router receives the packet, it strips off the LAN header but leaves the IP header with the IP destination address. It extracts
the destination D, and then looks at Dnet. The router first checks to see if any of its network interfaces are on the same LAN as D;
recall that the router connects to at least one additional network besides the one for A. If the answer is yes, then the router uses
direct LAN delivery to the destination, as above. If, on the other hand, Dnet is not a LAN to which the router is connected directly,
then the router consults its internal forwarding table. This consists of a list of networks each with an associated next_hop address.
These ⟨net,next_hop⟩ tables compare with switched-Ethernet’s ⟨host,next_hop⟩ tables; the former type will be smaller because
there are many fewer nets than hosts. The next_hop addresses in the table are chosen so that the router can always reach them via
direct LAN delivery via one of its interfaces; generally they are other routers. The router looks up Dnet in the table, finds the
next_hop address, and uses direct LAN delivery to get the packet to that next_hop machine. The packet’s IP header remains
essentially unchanged, although the router most likely attaches an entirely new LAN header.
The packet continues being forwarded like this, from router to router, until it finally arrives at a router that is connected to Dnet; it is
then delivered by that final router directly to D, using the LAN.
To make this concrete, consider the following diagram:

A    B    C                        F    E    D: 200.0.1.37
│    │    │                        │    │    │
┴────┴────┴─R1──────R2──────R3─────┴────┴────┴
 200.0.0/24                         200.0.1/24

Two LANs joined by three routers

With Ethernet-style forwarding, R2 would have to maintain entries for each of A,B,C,D,E,F. With IP forwarding, R2 has just two
entries to maintain in its forwarding table: 200.0.0/24 and 200.0.1/24. If A sends to D, at 200.0.1.37, it puts this address into the IP
header, notes that 200.0.0 ≠ 200.0.1, and thus concludes D is not a local delivery. A therefore sends the packet to its router R1,
using LAN delivery. R1 looks up the destination network 200.0.1 in its forwarding table and forwards the packet to R2, which in
turn forwards it to R3. R3 now sees that it is connected directly to the destination network 200.0.1, and delivers the packet via the
LAN to D, by looking up D’s physical address.

In this diagram, IP addresses for the ends of the R1–R2 and R2–R3 links are not shown. They could be assigned global IP
addresses, but they could also use “private” IP addresses. Assuming these links are point-to-point links, they might not actually
need IP addresses at all; we return to this in 7.12 Unnumbered Interfaces.
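The forwarding steps above can also be written out as a short sketch (illustrative only: prefixes are simplified to /24 dotted strings, and the table contents follow the diagram):

def net24(addr):                       # the /24 network part of an address
    return '.'.join(addr.split('.')[:3])

# R2's forwarding table from the example: one <net, next_hop> entry per LAN.
# R2's own interfaces are the point-to-point links to R1 and R3, so no
# destination network is directly connected to R2.
r2_table = {'200.0.0': 'R1', '200.0.1': 'R3'}
r2_connected = set()

def r2_forward(dest):
    dnet = net24(dest)
    if dnet in r2_connected:
        return 'direct LAN delivery to ' + dest
    return 'forward to ' + r2_table[dnet]

print(r2_forward('200.0.1.37'))        # forward to R3
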
One can think of the network-prefix bits as analogous to the “zip code” on postal mail, and the host bits as analogous to the street
address. The internal parts of the post office get a letter to the right zip code, and then an individual letter carrier (the LAN) gets it
to the right address. Alternatively, one can think of the network bits as like the area code of a phone number, and the host bits as
like the rest of the digits. Newer protocols that support different net/host division points at different places in the network –
sometimes called hierarchical routing – allow support for addressing schemes that correspond to, say, zip/street/user, or
areacode/exchange/subscriber.

The Invertebrate Internet


The backbone is not as essential as it once was. Once Upon A Time, all traffic between different providers passed through the
backbone. The legacy backbone still exists, but today it is also common for traffic from large providers such as Google
[https://google.com] to take a backbone-free path; such providers connect (or “peer”) directly with large residential ISPs such
as Comcast [http://comcast.com]. Google refers to this as their “Edge Network”; see peering.google.com
[peering.google.com/] and also 10.6.7.1 MED values and traffic engineering.

We will refer to the Internet backbone as those IP routers that specialize in large-scale routing on the commercial Internet, and
which generally have forwarding-table entries covering all public IP addresses; note that this is essentially a business definition
rather than a technical one. We can revise the table-size claim of the previous paragraph to state that, while there are many private
IP networks, there are about 800,000 separate network prefixes (as of 2019) visible to the backbone. (In 2012, the year this book
was started, there were about 400,000 prefixes.) A forwarding table of 800,000 entries is quite feasible; a table a hundred times
larger is not, let alone a thousand times larger. For a graph of the growth in network prefixes / forwarding-table entries, see 10.6.5
BGP Table Size.
IP routers at non-backbone sites generally know all locally assigned network prefixes, eg 200.0.0/24 and 200.0.1/24 above. If a
destination does not match any locally assigned network prefix, the packet needs to be routed out into the Internet at large; for
typical non-backbone sites this almost always means the packet is sent to the ISP that provides Internet connectivity. Generally
the local routers will contain a catchall default entry covering all nonlocal networks; this means that the router needs an explicit
entry only for locally assigned networks. This greatly reduces the forwarding-table size. The Internet backbone can be
approximately described, in fact, as those routers that do not have a default entry.
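As a sketch of default-entry lookup (the table contents and the ISP-router name here are hypothetical):

# Sketch of a non-backbone router's table: explicit entries only for the
# locally assigned networks, plus a catchall default toward the ISP.
table = {'200.0.0': 'direct', '200.0.1': 'direct'}
DEFAULT = 'ISP-router'                 # hypothetical next_hop to the provider

def lookup(dnet):
    return table.get(dnet, DEFAULT)    # fall back to the default entry

print(lookup('200.0.1'))               # direct: locally assigned
print(lookup('5.6.7'))                 # ISP-router: everything nonlocal
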
For most purposes, the Internet can be seen as a combination of end-user LANs together with point-to-point links joining these
LANs to the backbone; additional point-to-point links tie the backbone together. Both LANs and point-to-point links appear in the
diagram above.
Just how routers build their ⟨destnet,next_hop⟩ forwarding tables is a major topic itself, which we cover in 9 Routing-Update
Algorithms. Unlike Ethernet, IP routers do not have a “flooding” delivery mechanism as a fallback, so the tables must be
constructed in advance. (There is a limited form of IP broadcast, but it is basically intended for reaching the local LAN only, and
does not help at all with delivery in the event that the destination network is unknown.)
Most forwarding-table-construction algorithms used on a set of routers under common management fall into either the distance-
vector or the link-state category; these are described in 9 Routing-Update Algorithms. Routers not under common management –
that is, neighboring routers belonging to different organizations – exchange information through the Border Gateway Protocol,
BGP (10 Large-Scale IP Routing). BGP allows routing decisions to be based on a fusion of “technical” information (which sites are
reachable at all, and through where) together with “policy” information representing legal or commercial agreements: which
outside routers are “preferred”, whose traffic an ISP will carry even if it isn’t to one of the ISP’s customers, etc.
Most common residential “routers” involve network address translation in addition to packet forwarding. See 7.7 Network
Address Translation.

1.10.2 The Future of IPv4


As mentioned earlier, allocation of blocks of IP addresses is the responsibility of the Internet Assigned Numbers Authority. IANA
long ago delegated the job of allocating network prefixes to individual sites; they limited themselves to handing out /8 blocks (class A blocks) to the five regional registries, which are
ARIN [www.arin.net/] – North America
RIPE [www.ripe.net/] – Europe, the Middle East and parts of Asia
APNIC [www.apnic.net/] – East Asia and the Pacific
AfriNIC [www.afrinic.net/] – most of Africa
LACNIC [www.lacnic.net/] – Central and South America
At the end of January 2011, IANA finally ran out of /8 blocks. There is a table at http://www.iana.org/assignments/ipv4-address-space/ipv4-address-space.xml of all IANA assignments of /8 blocks; examination of the table shows that all have now been allocated.
In September 2015, ARIN ran out of its pool of IPv4 addresses [www.arin.net/announcements/2...20150924.html]. Most of ARIN’s
customers are ISPs, which can now obtain new IPv4 addresses only by buying unused address blocks from other organizations.
A few months after the IANA pool ran out, Microsoft purchased 666,624 IP addresses (2604 Class-C blocks) in a Nortel
bankruptcy auction for $7.5 million. By a year later, IP-address prices appeared to have retreated only slightly. It is possible that the
market for IPv4 address blocks will continue to develop; alternatively, this turn of events may accelerate implementation of IPv6,
which has 128-bit addresses.
An IPv4 address price in the range of $10 is unlikely to have much impact in residential Internet access, where annual connection
fees are often $600. Large organizations use NAT (7.7 Network Address Translation) extensively, leading to the need for only a
small number of globally visible addresses. The IPv4 address shortage does not even seem to have affected wireless networking. It
does, however, lead to inefficient routing tables, as a site that once had a single /17 address block – and thus a single backbone
forwarding-table entry – might now be spread over more than a hundred /24 blocks and concomitant forwarding entries.

1.11: DNS
IP addresses are hard to remember (nearly impossible in IPv6). The domain name system, or DNS (7.8 DNS), comes to the rescue
by creating a way to convert hierarchical text names to IP addresses. Thus, for example, one can type www.luc.edu instead of
147.126.1.230. Virtually all Internet software uses the same basic library calls to convert DNS names to actual addresses.
One thing DNS makes possible is changing a website’s IP address while leaving the name alone. This allows moving a site to a
new provider, for example, without requiring users to learn anything new. It is also possible to have several different DNS names
resolve to the same IP address, and – through some modest trickery – have the http (web) server at that IP address handle the
different DNS names as completely different websites.
DNS is hierarchical and distributed. In looking up cs.luc.edu, four different DNS servers may be queried: for the so-called “DNS root zone”, for edu, for luc.edu, and for cs.luc.edu. Searching a hierarchy can be cumbersome, so DNS search results are normally cached locally. If a name is not found in the cache, the lookup may take a couple of seconds. The DNS
hierarchy need have nothing to do with the IP-address hierarchy.
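As a sketch of those library calls, using Python's standard socket module (the addresses printed will depend on the local resolver and on luc.edu's current DNS records):

import socket

# gethostbyname() returns a single IPv4 address for a DNS name:
print(socket.gethostbyname('www.luc.edu'))

# getaddrinfo() returns all addresses, IPv4 and IPv6 alike:
for family, _, _, _, sockaddr in socket.getaddrinfo('www.luc.edu', 80):
    print(family, sockaddr)
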

1.12: Transport
The IP layer gets packets from one node to another, but it is not well-suited to transport. First, IP routing is a “best-effort”
mechanism, which means packets can and do get lost sometimes. Additionally, data that does arrive can arrive out of order. Finally,
IP only supports sending to a specific host; normally, one wants to send to a given application running on that host. Email and web
traffic, or two different web sessions, should not be commingled!
The Transport layer is the layer above the IP layer that handles these sorts of issues, often by creating some sort of connection
abstraction. Far and away the most popular mechanism in the Transport layer is the Transmission Control Protocol, or TCP. TCP
extends IP with the following features:
reliability: TCP numbers each packet, and keeps track of which are lost and retransmits them after a timeout. It holds early-
arriving out-of-order packets for delivery at the correct time. Every arriving data packet is acknowledged by the receiver;
timeout and retransmission occurs when an acknowledgment packet isn’t received by the sender within a given time.
connection-orientation: Once a TCP connection is made, an application sends data simply by writing to that connection. No
further application-level addressing is needed. TCP connections are managed by the operating-system kernel, not by the
application.
stream-orientation: An application using TCP can write 1 byte at a time, or 100 kB at a time; TCP will buffer and/or divide up
the data into appropriately sized packets.
port numbers: these provide a way to specify the receiving application for the data, and also to identify the sending
application.
throughput management: TCP attempts to maximize throughput, while at the same time not contributing unnecessarily to
network congestion.
TCP endpoints are of the form ⟨host,port⟩; these pairs are known as socket addresses, or sometimes as just sockets though the
latter refers more properly to the operating-system objects that receive the data sent to the socket addresses. Servers (or, more
precisely, server applications) listen for connections to sockets they have opened; the client is then any endpoint that initiates a
connection to a server.
When you enter a host name in a web browser, it opens a TCP connection to the server’s port 80 (the standard web-traffic port),
that is, to the server socket with socket-address ⟨server,80⟩. If you have several browser tabs open, each might connect to the same
server socket, but the connections are distinguishable by virtue of using separate ports (and thus having separate socket addresses)
on the client end (that is, your end).
A busy server may have thousands of connections to its port 80 (the web port) and hundreds of connections to port 25 (the email
port). Web and email traffic are kept separate by virtue of the different ports used. All those clients to the same port, though, are
kept separate because each comes from a unique ⟨host,port⟩ pair. A TCP connection is determined by the ⟨host,port⟩ socket address
at each end; traffic on different connections does not intermingle. That is, there may be multiple independent connections to
⟨www.luc.edu,80⟩. This is somewhat analogous to certain business telephone numbers of the “operators are standing by” type,
which support multiple callers at the same time to the same toll-free number. Each call to that number is answered by a different
operator (corresponding to a different cpu process), and different calls do not “overhear” each other.
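As a concrete sketch of the client side of such a connection, in Python (standard socket-library calls; the first line of output shows the kernel-assigned client port that keeps this connection distinct from any other):

import socket

# Connect to the server socket <www.luc.edu,80>; the client-side port
# is assigned automatically by the kernel.
s = socket.create_connection(('www.luc.edu', 80))
print(s.getsockname())    # our own <host,port>: the client end
print(s.getpeername())    # the server end, port 80
s.sendall(b'HEAD / HTTP/1.0\r\nHost: www.luc.edu\r\n\r\n')
print(s.recv(100))        # the first bytes of the server's reply
s.close()
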
TCP uses the sliding-windows algorithm, 6 Abstract Sliding Windows, to keep multiple packets en route at any one time. The
window size represents the number of packets simultaneously in transit (TCP actually keeps track of the window size in bytes, but
packets are easier to visualize). If the window size is 10 packets, for example, then at any one time 10 packets are in transit
(perhaps 5 data packets and 5 returning acknowledgments). Assuming no packets are lost, then as each acknowledgment arrives the
window “slides forward” by one packet. The data packet 10 packets ahead is then sent, to maintain a total of 10 packets on the
wire. For example, consider the moment when the ten packets 20-29 are in transit. When ACK[20] is received, the number of
packets outstanding drops to 9 (packets 21-29). To keep 10 packets in flight, Data[30] is sent. When ACK[21] is received, Data[31]
is sent, and so on.
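The bookkeeping above can be written as a toy model (not a protocol implementation; “sending” here is just list manipulation):

WINDOW = 10
next_seq = 20                          # next Data[] packet to send
in_flight = []                         # packets currently in transit

def fill_window():
    global next_seq
    while len(in_flight) < WINDOW:     # keep WINDOW packets outstanding
        in_flight.append(next_seq)     # "send" Data[next_seq]
        next_seq += 1

def ack_arrives(n):
    in_flight.remove(n)                # ACK[n]: packet n is off the wire
    fill_window()                      # slide forward: one new packet goes out

fill_window()                          # Data[20]..Data[29] in transit
ack_arrives(20)
print(in_flight)                       # [21, 22, ..., 30]
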
Sliding windows minimizes the effect of store-and-forward delays, and propagation delays, as these then only count once for the
entire windowful and not once per packet. Sliding windows also provides an automatic, if partial, brake on congestion: the queue at
any switch or router along the way cannot exceed the window size. In this it compares favorably with constant-rate transmission,
which, if the available bandwidth falls below the transmission rate, always leads to overflowing queues and to a significant
percentage of dropped packets. Of course, if the window size is too large, a sliding-windows sender may also experience dropped
packets.
The ideal window size, at least from a throughput perspective, is such that it takes one round-trip time to send an entire window, so
that the next ACK will always be arriving just as the sender has finished transmitting the window. Determining this ideal size,
however, is difficult; for one thing, the ideal size varies with network load. As a result, TCP approximates the ideal size. The most
common TCP strategy – that of so-called TCP Reno – is that the window size is slowly raised until packet loss occurs, which TCP
takes as a sign that it has reached the limit of available network resources. At that point the window size is reduced to half its
previous value, and the slow climb resumes. The effect is a “sawtooth” graph of window size with time, which oscillates (more or
less) around the “optimal” window size. For an idealized sawtooth graph, see 13.1.1 The Somewhat-Steady State; for some “real”
(simulation-created) sawtooth graphs see 16.4.1 Some TCP Reno cwnd graphs.
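The sawtooth is easy to generate numerically. In the toy model below, the network's capacity is a fixed, hypothetical 60 packets, so loss occurs whenever the window reaches that point; on a real network the ceiling itself shifts as competing traffic comes and goes.

CEILING = 60               # hypothetical capacity, in packets
cwnd = 30                  # congestion window
for rtt in range(50):
    cwnd += 1              # additive increase: +1 packet per RTT
    if cwnd >= CEILING:    # queue overflows; a packet is lost
        cwnd = cwnd // 2   # multiplicative decrease: halve the window
    print(rtt, cwnd)       # plotting cwnd vs rtt gives the sawtooth
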
While this window-size-optimization strategy has its roots in attempting to maximize the available bandwidth, it also has the effect
of greatly limiting the number of packet-loss events. As a result, TCP has come to be the Internet protocol charged with reducing
(or at least managing) congestion on the Internet, and – relatedly – with ensuring fairness of bandwidth allocations to competing
connections. Core Internet routers – at least in the classical case – essentially have no role in enforcing congestion or fairness
restrictions at all. The Internet, in other words, places responsibility for congestion avoidance cooperatively into the hands of end
users. While “cheating” is possible, this cooperative approach has worked remarkably well.
While TCP is ubiquitous, the real-time performance of TCP is not always consistent: if a packet is lost, the receiving TCP host will
not turn over anything further to the receiving application until the lost packet has been retransmitted successfully; this is often
called head-of-line blocking. This is a serious problem for sound and video applications, which can tolerate modest
losses but which have much more difficulty with sudden large delays. A few lost packets ideally should mean just a few brief voice
dropouts (pretty common on cell phones) or flicker/snow on the video screen (or just reuse of the previous frame); both of these are
better than pausing completely.
The basic alternative to TCP is known as UDP, for User Datagram Protocol. UDP, like TCP, provides port numbers to support
delivery to multiple endpoints within the receiving host, in effect to a specific process on the host. As with TCP, a UDP socket
consists of a ⟨host,port⟩ pair. UDP also includes, like TCP, a checksum over the data. However, UDP omits the other TCP features:
there is no connection setup, no lost-packet detection, no automatic timeout/retransmission, and the application must manage its
own packetization. This simplicity should not be seen as all negative: the absence of connection setup means data transmission can
get started faster, and the absence of lost-packet detection means there is no head-of-line blocking. See 11 UDP Transport.
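For comparison with the TCP client above, here is a self-contained UDP sketch (port 5432 is an arbitrary choice for the example):

import socket

# No connection setup: each sendto() becomes exactly one datagram,
# so the application controls its own packetization.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(('localhost', 5432))

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b'hello', ('localhost', 5432))   # no ACK will come back

data, addr = receiver.recvfrom(1500)   # addr is the sender's <host,port>
print(data, addr)
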
The Real-time Transport Protocol, or RTP, sits above UDP and adds some additional support for voice and video applications.

1.12.1 Transport Communications Patterns


The two “classic” traffic patterns for Internet communication are these:
Interactive or bursty communications such as via ssh or telnet, with long idle times between short bursts
Bulk file transfers, such as downloading a web page
TCP handles both of these well, although its congestion-management features apply only when a large amount of data is in transit
at once. Web browsing is something of a hybrid; over time, there is usually considerable burstiness, but individual pages now often
exceed 1 MB.
To the above we might add request/reply operations, eg to query a database or to make DNS requests. TCP is widely used here as
well, though most DNS traffic still uses UDP. There are periodic calls for a new protocol specifically addressing this pattern,
though at this point the use of TCP is well established. If a sequence of request/reply operations is envisioned, a single TCP
connection makes excellent sense, as the connection-setup overhead is minimal by comparison. See also 11.5 Remote Procedure
Call (RPC) and 12.22.2 SCTP.
This century has seen an explosion in streaming video (20.3.2 Streaming Video), in lengths from a few minutes to a few hours.
Streaming radio stations might be left playing indefinitely. TCP generally works well here, assuming the receiver can get, say, a
minute ahead, buffering the video that has been received but not yet viewed. That way, if there is a dip in throughput due to
congestion, the receiver has time to recover. Buffering works a little less well for streaming radio, as the listener doesn’t want to get
too far behind, though ten seconds is reasonable. Fortunately, audio bandwidth is smaller.

Another issue with streaming video is the bandwidth demand. Most streaming-video services attempt to estimate the available
throughput, and then adapt to that throughput by changing the video resolution (20.3 Real-time Traffic).
Typically, video streaming operates on a start/stop basis: the sender pauses when the receiver’s playback buffer is “full”, and
resumes when the playback buffer drops below a certain threshold.
If the video (or, for that matter, voice audio) is interactive, there is much less opportunity for stream buffering. If someone asks a
simple question on an Internet telephone call, they generally want an answer more or less immediately; they do not expect to wait
for the answer to make it through the other party’s stream buffer. 200 ms of buffering is noticeable. Here we enter the realm of
genuine real-time traffic (20.3 Real-time Traffic). UDP is often used to avoid head-of-line blocking. Lower bandwidth helps; voice-
grade communications traditionally need only 8 kB/sec, less if compression is used. On the other hand, there may be constraints on
the variation in delivery time (known as jitter; see 20.11.3 RTP Control Protocol for a specific numeric interpretation). Interactive
video, with its much higher bandwidth requirements, is more difficult; fortunately, users seem to tolerate the common pauses and
freezes.
Within the Transport layer, essentially all network connections involve a client and a server. Often this pattern is repeated at the
Application layer as well: the client contacts the server and initiates a login session, or browses some web pages, or watches a
movie. Sometimes, however, Application-layer exchanges fit the peer-to-peer model better, in which the two endpoints are more-
or-less co-equals. Some examples include
Internet telephony: there is no benefit in designating the party who places the call as the “client”
Message passing in a CPU cluster, often using 11.5 Remote Procedure Call (RPC)
The routing-communication protocols of 9 Routing-Update Algorithms. When router A reports to router B we might call A the
client, but over time, as A and B report to one another repeatedly, the peer-to-peer model makes more sense.
So-called peer-to-peer file-sharing, where individuals exchange files with other individuals (and as opposed to “cloud-based”
file-sharing in which the “cloud” is the server).
RFC 5694 [https://tools.ietf.org/html/rfc5694.html] contains additional discussion of peer-to-peer patterns.

1.12.2 Content-Distribution Networks


Sites with an extremely large volume of content to distribute often turn to a specialized communication pattern called a content-
distribution network or CDN. To reduce the amount of long-distance traffic, or to reduce the round-trip time, a site replicates its
content at multiple datacenters (also called Points of Presence (PoPs), nodes, access points or edge servers). When a user makes a
request (eg for a web page or a video), the request is routed to the nearest (or approximately nearest) datacenter, and the content is
delivered from there.

CDN Mapping
For a geographical map of the servers in the NetFlix [www.netflix.com] CDN as of 2016, see [BCTCU16]. The map was
created solely through end-user measurements. Most of the servers are in North and South America and Europe.

Large web pages typically contain both static content and also individualized dynamic content. On a typical Facebook page, for
example, the videos and javascript might be considered static, while the individual wall posts might be considered dynamic. The
CDN may cache all or most of the static content at each of its edge servers, leaving the dynamic content to come from a centralized
server. Alternatively, the dynamic content may be replicated at each CDN edge node as well, though this introduces some real-time
coordination issues.
If dynamic content is not replicated, the CDN may include private high-speed links between its nodes, allowing for rapid low-
congestion delivery to any node. Alternatively, CDN nodes may simply communicate using the public Internet. Finally, the CDN
may (or may not) be configured to support fast interactive traffic between nodes, eg teleconferencing traffic, as is outlined in 20.6.1
A CDN Alternative to IntServ.
Organizations can create their own CDNs, but often turn to specialized CDN providers, which typically combine their CDN services
with website-hosting services.
In principle, all that is needed to create a CDN is a multiplicity of datacenters, each with its own connection to the Internet; private
links between datacenters are also common. In practice, many CDN providers also try to build direct connections with the ISPs that
serve their customers; the Google Edge Network above does this. This can improve performance and reduce traffic costs; we will
return to this in 10.6.7.1 MED values and traffic engineering.
Finding the edge server that is closest to a given user is a tricky issue. There are three techniques in common use. In the first, the
edge servers are all given different IP addresses, and DNS is configured to have users receive the IP address of the closest edge
server, 7.8 DNS. In the second, each edge server has the same IP address, and anycast routing is used to route traffic from the user
to the closest edge server, 10.6.8 BGP and Anycast. Finally, for HTTP applications a centralized server can look up the
approximate location of the user, and then redirect the web page to the closest edge server.

1.13: Firewalls
One problem with having a program on your machine listening on an open TCP port is that someone may connect and then, using
some flaw in the software on your end, do something malicious to your machine. Damage can range from the unintended
downloading of personal data to compromise and takeover of your entire machine, making it a distributor of viruses and worms or
a steppingstone in later break-ins of other machines.
A strategy known as buffer overflow (22.2 Stack Buffer Overflow) has been the basis for a great many total-compromise attacks.
The idea is to identify a point in a server program where it fills a memory buffer with network-supplied data without careful length
checking; almost any call to the C library function gets(buf) will suffice. The attacker then crafts an oversized input string
which, when read by the server and stored in memory, overflows the buffer and overwrites subsequent portions of memory,
typically containing the stack-frame pointers. The usual goal is to arrange things so that when the server reaches the end of the
currently executing function, control is returned not to the calling function but instead to the attacker’s own payload code located
within the string.
A firewall is a mechanism to block connections deemed potentially risky, eg those originating from outside the site. Generally
ordinary workstations do not ever need to accept connections from the Internet; client machines instead initiate connections to
(better-protected) servers. So blocking incoming connections works reasonably well; when necessary (eg for games) certain ports
can be selectively unblocked.
The original firewalls were built into routers. Incoming traffic to servers was often blocked unless it was sent to one of a modest
number of “open” ports; for non-servers, typically all inbound connections were blocked. This allowed internal machines to operate
reasonably safely, though being unable to accept incoming connections is sometimes inconvenient.
Nowadays per-host firewalls – in addition to router-based firewalls – are common: you can configure your workstation not to
accept inbound connections to most (or all) ports regardless of whether software on the workstation requests such a connection.
Outbound connections can, in many cases, also be prevented.
The typical home router implements something called network-address translation (7.7 Network Address Translation), which, in
addition to conserving IPv4 addresses, also provides firewall protection.

1.14: Some Useful Utilities
There exists a great variety of useful programs for probing and diagnosing networks. Here we list a few of the simpler, more
common and available ones; some are addressed in more detail in subsequent chapters. Some of these, like ping, are
generally present by default; others will have to be installed from somewhere.
ping
Ping is useful to determine if another machine is accessible, eg

ping www.cs.luc.edu
ping 147.126.1.230

See 7.11 Internet Control Message Protocol for how it works. Sometimes ping fails because the necessary packets are blocked
by a firewall.
ifconfig, ipconfig, ip
To find your own IP address you can use ipconfig on Windows, ifconfig on Linux and Macintosh systems, or the
newer ip addr list on Linux. The output generally lists all active interfaces but can be restricted to selected interfaces if
desired. The ip command in particular can do many other things as well. The Windows command
netsh interface ip show config also provides IP addresses.
nslookup, dig and host
These three programs, all developed by the Internet Systems Consortium [http://isc.org], are used for DNS lookups. They differ
in convenience and options. The oldest is nslookup , the one with the most options (by a rather wide margin) is dig , and the
newest and arguably most convenient for normal usage is host .

nslookup intronetworks.cs.luc.edu

Non-authoritative answer:
Name: intronetworks.cs.luc.edu
Address: 162.216.18.28

dig intronetworks.cs.luc.edu

...
;; Answer SECTION:
intronetworks.cs.luc.edu. 86400 IN A 162.216.18.28
...

host intronetworks.cs.luc.edu

intronetworks.cs.luc.edu has address 162.216.18.28


intronetworks.cs.luc.edu has IPv6 address 2600:3c03::f03c:91ff:fe69:f438

See 7.8.1 nslookup (and dig).


traceroute
This lists the route from you to a remote host:

traceroute intronetworks.cs.luc.edu

1 147.126.65.1 (147.126.65.1) 0.751 ms 0.753 ms 0.783 ms


2 147.126.95.54 (147.126.95.54) 1.319 ms 1.286 ms 1.253 ms
3 12.31.132.169 (12.31.132.169) 1.225 ms 1.231 ms 1.193 ms
4 cr83.cgcil.ip.att.net (12.123.7.46) 4.983 ms cr84.cgcil.ip.att.net (12.123.7.170
5 cr83.cgcil.ip.att.net (12.123.7.46) 4.926 ms 4.904 ms 4.888 ms
6 cr1.cgcil.ip.att.net (12.122.99.33) 5.043 ms cr2.cgcil.ip.att.net (12.122.132.109
7 gar13.cgcil.ip.att.net (12.122.132.121) 3.879 ms 18.347 ms ggr4.cgcil.ip.att.net
8 chi-b21-link.telia.net (213.248.87.253) 2.344 ms 2.305 ms 2.409 ms
9 nyk-bb2-link.telia.net (80.91.248.197) 24.065 ms nyk-bb1-link.telia.net (213.155
10 nyk-b3-link.telia.net (62.115.112.255) 23.557 ms 23.548 ms nyk-b3-link.telia.net
11 netaccess-tic-133837-nyk-b3.c.telia.net (213.248.99.90) 23.957 ms 24.382 ms 24
12 0.e1-4.tbr1.mmu.nac.net (209.123.10.101) 24.922 ms 24.737 ms 24.754 ms
13 207.99.53.42 (207.99.53.42) 24.024 ms 24.249 ms 23.924 ms

The last router (and intronetworks.cs.luc.edu itself) don’t respond to the traceroute packets, so the list is not quite
complete. The Windows tracert utility is functionally equivalent. See 7.11.1 Traceroute and Time Exceeded for further
information.
Traceroute sends, by default, three probes for each router. Sometimes the responses do not all come back from the same router, as
happened above at routers 4, 6, 7, 9 and 10. Router 9 sent back three distinct responses.
On Linux systems the mtr [en.Wikipedia.org/wiki/MTR_(software)] command may be available as an alternative to traceroute; it
repeats the traceroute at one-second intervals and generates cumulative statistics.
route and netstat
The commands route , route print (Windows), ip route show (Linux), and netstat -r (all systems)
display the host’s local IP forwarding table. For workstations not acting as routers, this includes the route to the default router and,
usually, not much else. The default route is sometimes listed as destination 0.0.0.0 with netmask 0.0.0.0 (equivalent to 0.0.0.0/0).
The command netstat -a shows the existing TCP connections and open UDP sockets.
netcat
The netcat program, often called nc , allows the user to create TCP or UDP connections and send lines of text back and
forth. It is seldom included by default. See 11.1.4 netcat and 12.6.2 netcat again.
WireShark
This is a convenient combination of packet capture and packet analysis, from wireshark.org [http://wireshark.org]. See 12.4 TCP
and WireShark and 8.11 Using IPv6 and IPv4 Together for examples.
WireShark was originally named Ethereal. An earlier command-line-only packet-capture program is tcpdump
[www.tcpdump.org/], though WireShark has greatly expanded support for packet-format decoding. Both WireShark and tcpdump
support both live packet capture and reading from .pcap (packet capture) and .pcapng (next generation) files.
WireShark is the only non-command-line program listed here. It is sometimes desired to monitor packets on a remote system. If X-
windows is involved (eg on Linux), this can be done by logging in from one’s local system using ssh -X , which enables X-
windows forwarding, and then starting wireshark (or perhaps sudo wireshark ) from the command line. Other
alternatives include tcpdump and tshark; the latter is part of the WireShark distribution and supports the same packet-decoding
facilities as WireShark. Finally, there is termshark [https://termshark.io], a frontend for tshark that offers a terminal-based interface
reasonably similar to WireShark’s graphical interface.

1.15: IETF and OSI
The Internet protocols discussed above are defined by the Internet Engineering Task Force, or IETF (under the aegis of the
Internet Architecture Board, or IAB, in turn under the aegis of the Internet Society, ISOC). The IETF publishes “Request For
Comment” or RFC documents that contain all the formal Internet standards; these are available at http://www.ietf.org/rfc.html
(note that, by the time a document appears here, the actual comment-requesting period is generally long since closed). The five-
layer model is closely associated with the IETF, though it is not an official standard.
RFC standards sometimes allow modest flexibility. With this in mind, RFC 2119 [https://tools.ietf.org/html/rfc2119.html] declares
official understandings for the words MUST and SHOULD. A feature labeled with MUST is “an absolute requirement for the
specification”, while the term SHOULD is used when

there may exist valid reasons in particular circumstances to ignore a particular


item, but the full implications must be understood and carefully weighed before
choosing a different course.
The original ARPANET network was developed by the US government’s Defense Advanced Research Projects Agency, or
DARPA; it went online in 1969. The National Science Foundation began NSFNet in 1986; this largely replaced ARPANET. In
1991, operation of the NSFNet backbone was turned over to ANSNet, a private corporation. The ISOC was founded in 1992 as the
NSF continued to retreat from the networking business.

Hallmarks of the IETF design approach were David Clark’s declaration


We reject: kings, presidents and voting.
We believe in: rough consensus and running code.
and RFC Editor Jon Postel [www.postel.org/postel.html]’s Robustness Principle
Be liberal in what you accept, and conservative in what you send.

Postel’s aphorism has come in for criticism in recent years, especially with regard to cryptographic protocols, for which lax
enforcement can lead to security vulnerabilities. To be fair, however, Postel wrote this in an era when protocol specifications
sometimes failed to fully spell out the rules in every possible situation; today’s cryptographic protocols are generally much more
complete. One way to read Postel’s rule is that protocol implementations should be as strict as necessary, but no stricter.
There is a persistent – though false – notion that the distributed-routing architecture of IP was due to a US Department of Defense
mandate that the original ARPAnet be able to survive a nuclear attack. In fact, the developers of IP seemed unconcerned with this.
However, Paul Baran did write, in his 1962 paper outlining the concept of packet switching, that

If [the number of stations] is made sufficiently large, it can be shown that highly
survivable system structures can be built – even in the thermonuclear era.
In 1977 the International Organization for Standardization, or ISO, founded the Open Systems Interconnection project, or OSI, a
process for creation of new network standards. OSI represented an attempt at the creation of networking standards independent of
any individual government.
The OSI project is today perhaps best known for its seven-layer networking model: between Transport and Application were
inserted the Session and Presentation layers. The Session layer was to handle “sessions” between applications (including the
graceful closing of Transport-layer connections, something included in TCP, and the re-establishment of “broken” Transport-layer
connections, which TCP could sorely use), and the Presentation layer was to handle things like defining universal data formats (eg
for binary numeric data, or for non-ASCII character sets), and eventually came to include compression and encryption as well.
Data presentation and session management are important concepts, but in many cases it has not proved necessary, or even
particularly useful, to make them into true layers, in the sense that a layer communicates directly only with the layers adjacent to it.
What often happens is that the Application layer manages its own Transport connections, and is responsible for reading and writing
data directly from and to the Transport layer. The application then uses conventional libraries for Presentation actions such as
encryption, compression and format translation, and for Session actions such as handling broken Transport connections and
multiplexing streams of data over a single Transport connection. Version 2 of the HTTP protocol, for example, contains a
subprotocol for managing multiple streams; this is generally regarded as part of the Application layer.

However, the SSL/TLS transport-encryption service, 22.10.2 TLS, can be viewed as an example of a true Presentation layer.
Applications generally read and write data directly to the SSL/TLS endpoint, which in turn mostly encapsulates the underlying TCP
connection. The encapsulation is incomplete, though, in that SSL/TLS applications generally are responsible for creating their own
Transport-layer (TCP) connections; see 22.10.3 A TLS Programming Example and the note at the end of 22.10.3.2 TLSserver.
OSI has its own version of IP and TCP. The IP equivalent is CLNP, the ConnectionLess Network Protocol, although OSI also
defines a connection-oriented protocol CMNS. The TCP equivalent is TP4; OSI also defines TP0 through TP3 but those are for
connection-oriented networks.
It seems clear that the primary reasons the OSI protocols failed in the marketplace were their ponderous bureaucracy for protocol
management, their principle that protocols be completed before implementation began, and their insistence on rigid adherence to
the specifications to the point of non-interoperability; indeed, Postel’s aphorism above may have been intended as a response to this
last point. In contrast, the IETF had (and still has) a “two working implementations” rule for a protocol to become a “Draft
Standard”. From RFC 2026 [https://tools.ietf.org/html/rfc2026.html]:

A specification from which at least two independent and interoperable implementations from different code bases have been
developed, and for which sufficient successful operational experience has been obtained, may be elevated to the “Draft
Standard” level. [emphasis added]

This rule has often facilitated the discovery of protocol design weaknesses early enough that the problems could be fixed. The OSI
approach is a striking failure for the “waterfall” design model, when competing with the IETF’s cyclic “prototyping” model.
However, it is worth noting that the IETF has similarly been unable to keep up with rapid changes in html, particularly at the
browser end; the OSI mistakes were mostly evident only in retrospect.
Trying to fit protocols into specific layers is often both futile and irrelevant. By one perspective, the Real-Time Protocol RTP lives
at the Transport layer, but just above the UDP layer; others have put RTP into the Application layer. Parts of the RTP protocol
resemble the Session and Presentation layers. A key component of the IP protocol is the set of various router-update protocols;
some of these freely use higher-level layers. Similarly, tunneling might be considered to be a Link-layer protocol, but tunnels are
often created and maintained at the Application layer.
A sometimes-more-successful approach to understanding “layers” is to view them instead as parts of a protocol graph. Thus, in
the following diagram we have two protocol sublayers within the transport layer (UDP and RTP), and one protocol (ARP) not
easily assigned to a layer.

(The figure here is a protocol graph: UDP and RTP appear as sublayers within the Transport layer, with ARP off to one side, not assigned to any layer.)

1.16: Berkeley Unix
Though not officially tied to the IETF, the Berkeley Unix releases became de facto reference implementations for most of the
TCP/IP protocols. 4.1BSD (BSD for Berkeley Software Distribution) was released in 1981, 4.2BSD in 1983, 4.3BSD in 1986,
4.3BSD-Tahoe in 1988, 4.3BSD-Reno in 1990, and 4.4BSD in 1994. Descendants today include FreeBSD, OpenBSD and NetBSD.
The TCP implementations TCP Tahoe and TCP Reno (13 TCP Reno and Congestion Management) took their names from the
corresponding 4.3BSD releases.

1.E: An Overview of Networks (Exercises)

1.18 Exercises
Exercises are given fractional (floating point) numbers, to allow for interpolation of new exercises. Exercise 2.5 is distinct, for
example, from exercises 2.0 and 3.0. Exercises marked with a ♢ have solutions or hints at 24.1 Solutions for An Overview of
Networks.
1.0. Give forwarding tables for each of the switches S1-S4 in the following network with destinations A, B, C, D. For the next_hop
column, give the neighbor on the appropriate link rather than the interface number.

A         B         C
│         │         │
S1────────S2────────S3
          │
          │
          S4───D

2.0. Give forwarding tables for each of the switches S1-S4 in the following network with destinations A, B, C, D. Again, use the
neighbor form of next_hop rather than the interface form. Try to keep the route to each destination as short as possible. What
decision has to be made in this exercise that did not arise in the preceding exercise?

A───S1──────S2───B
    │       │
    │       │
D───S4──────S3───C

2.5. In the network of the previous exercise, suppose that destinations directly connected to an immediate neighbor are always
reached via that neighbor; eg S1’s forwarding table will always include ⟨B,S2⟩ and ⟨D,S4⟩. This leaves only routes to the
diagonally opposite nodes undetermined (eg S1 to C). Show that, no matter which next_hop entries are chosen for the diagonally
opposite destinations, no routing loops can ever be formed. (Hint: the number of links to any diagonally opposite switch is always
2.)
2.7.♢ Give forwarding tables for each of the switches A-E in the following network. Destinations are A-E themselves. Keep all
route lengths the minimum possible (one hop for an immediate neighbor, two hops for everything else). If a destination is an
immediate neighbor, you may list its next_hop as direct or local for simplicity. Indicate destinations for which there is more than
one choice for next_hop.
        B
       / \
      1   1
     /     \
    A───1───C
    │       │
    1       1
    │       │
    D───1───E

3.0. Consider the following arrangement of switches and destinations. Give forwarding tables (in neighbor form) for S1-S4 that
include default forwarding entries; the default entries should point toward S5. The default entries will thus automatically forward
to the “possible other destinations” shown below right.
Eliminate all table entries that are implied by the default entry (that is, if the default entry is to S3, eliminate all other entries for
which the next hop is S3).

A───S1
    │         D
    │         │
C───S3────────S4─────────S5────... possible other destinations
    │         │
    │         E
B───S2

4.0. Four switches are arranged as below. The destinations are S1 through S4 themselves.

S1──────S2
│       │
│       │
S4──────S3

(a). Give the forwarding tables for S1 through S4 assuming packets to adjacent nodes are sent along the connecting link, and
packets to diagonally opposite nodes are sent clockwise.
(b). Give the forwarding tables for S1 through S4 assuming the S1–S4 link is not used at all, not even for S1⟷S4 traffic.

5.0. Suppose we have switches S1 through S4; the forwarding-table destinations are the switches themselves. The tables for
S2 and S3 are as below, where the next_hop value is specified in neighbor form:
S2: ⟨S1,S1⟩ ⟨S3,S3⟩ ⟨S4,S3⟩
S3: ⟨S1,S2⟩ ⟨S2,S2⟩ ⟨S4,S4⟩

From the above we can conclude that S2 must be directly connected to both S1 and S3 as its table lists them as next_hops;
similarly, S3 must be directly connected to S2 and S4.
(a). The given tables are consistent with the network diagrammed in exercise 4.0. Are the tables also consistent with a network in
which S1 and S4 are not directly connected? If so, give such a network; if not, explain why S1 and S4 must be connected.
(b). Now suppose S3’s table is changed to the following. Find a network layout consistent with these tables in which S1 and S4 are
not directly connected. Do not add additional switches.
S3: ⟨S1,S4⟩ ⟨S2,S2⟩ ⟨S4,S4⟩
While the table for S4 is not given, you may assume that forwarding does work correctly. However, you should not assume that
paths are the shortest possible. Hint: It follows from the S3 table above that the path from S3 to S1 starts S3 ⟶ S4; how will this
path continue? The next switch along the path cannot be S1, because of the hypothesis that S1 and S4 are not directly connected.
6.0. (a) Suppose a network is as follows, with the only path from A to C passing through B:

... ──A────B────C── ...

Explain why a single routing loop cannot include both A and C. Hint: if the loop involves destination D, how does B forward to D?
(b). Suppose a routing loop follows the path A──S1──S2── … ──Sn──A, where none of the Si are equal to A. Show that all the
Si must be distinct. (A corollary of this is that any routing loop created by datagram-forwarding either involves forwarding back
and forth between a pair of adjacent switches, or else involves an actual graph cycle in the network topology; linear loops of length
greater than 1 are impossible.)
7.0 Consider the following arrangement of switches:

S1─────S4─────S10──A──E
│      │
│      │
S2─────S5─────S11──B
│      │
│      │
S3─────S6─────S12──C──D──F

Suppose S1-S6 have the forwarding tables below. For each of the following destinations, suppose a packet is sent to the destination
from S1.
(a). A
(b). B
(c). C
(d).♢ D
(e). E
(f). F
Give the switches the packet will pass through, including the initial switch S1, up until the final switch S10-S12.

S1: (A,S4), (B,S2), (C,S4), (D,S2), (E,S2), (F,S4)


S2: (A,S5), (B,S5), (D,S5), (E,S3), (F,S3)
S3: (B,S6), (C,S2), (E,S6), (F,S6)
S4: (A,S10), (C,S5), (E,S10), (F,S5)
S5: (A,S6), (B,S11), (C,S6), (D,S6), (E,S4), (F,S2)
S6: (A,S3), (B,S12), (C,S12), (D,S12), (E,S5), (F,S12)
7.5 Suppose a set of nodes A-F and switches S1-S6 are connected as shown.

A────S1───5───S4────D
     │        │
     1        1
     │        │
B────S2───2───S5────E
     │        │
     8        1
     │        │
C────S3───4───S6────F

The links between switches are labeled with weights, which are used by some routing applications. The weights represent the cost
of using that link. You are to find the path through S1-S6 with lowest total cost (that is, with smallest sum of weights), for each of
the following transmissions. For example, the lowest-cost path from A to E is A–S1–S2–S5–E, for a total cost of 1+2=3; the
alternative path A–S1–S4–S5–E has total cost 5+1=6.
(a).♢ A→F
(b). A→D
(c). A→C
(d). Give the complete forwarding table for S2, where all routes are selected for lowest total cost.
8.0 In exercise 7.0, the routes taken by packets A-D are reasonably direct, but the routes for E and F are rather circuitous.
(a). Assign weights to the seven links S1–S2, S2–S3, S1–S4, S2–S5, S3–S6, S4–S5 and S5–S6, as in exercise 7.5, so that
destination E’s route in exercise 7.0 becomes the optimum (lowest total link weight) path.
(b). Assign weights to the seven links that make destination F’s route in exercise 7.0 optimal. (This will be a different set of weights
from part (a).)

Hint: you can do this by assigning a weight of 1 to all links except to one or two “bad” links; the “bad” links get a weight of 10. In
each of (a) and (b) above, the route taken will be the route that avoids all the “bad” links. You must treat (a) entirely differently
from (b); there is no assignment of weights that can account for both routes.
9.0 Suppose we have the following three Class C IP networks, joined by routers R1–R4. There is no connection to the outside
Internet. Give the forwarding table for each router. For networks directly connected to a router (eg 200.0.1/24 and R1), include the
network in the table but list the next hop as direct or local.

R1── 200.0.1/24
│
│
R4──────────R2── 200.0.2/24
│
│
R3── 200.0.3/24

CHAPTER OVERVIEW

2: Ethernet
We now turn to a deeper analysis of the ubiquitous Ethernet LAN protocol. User-level Ethernet today is usually 100
Mbps, with Gigabit and 10 Gigabit Ethernet standard in server rooms and backbones, but because the potential for collisions makes
Ethernet speeds scale in odd ways, we will start with the 10 Mbps formulation. While the 10 Mbps speed is obsolete, and while
even the Ethernet collision mechanism is largely obsolete, collision management itself continues to play a significant role in
wireless networks.
2.1: Prelude to Ethernet
2.2: 10-Mbps Classic Ethernet
2.3: 100 Mbps (Fast) Ethernet
2.4: Gigabit Ethernet
2.5: Ethernet Switches
2.6: Spanning Tree Algorithm and Redundancy
2.7: Virtual LAN (VLAN)
2.8: TRILL and SPB
2.9: Software-Defined Networking
2.E: Ethernet (Exercises)
Index

2.1: Prelude to Ethernet
The original Ethernet specification was the 1976 paper of Metcalfe and Boggs, [MB76]. The data rate was 10 megabits per second,
and all connections were made with coaxial cable instead of today’s twisted pair. The authors described their architecture as
follows:

We cannot afford the redundant connections and dynamic routing of store-and-forward


packet switching to assure reliable communication, so we choose to achieve reliability
through simplicity. We choose to make the shared communication facility passive so that
the failure of an active element will tend to affect the communications of only a single
station.
Classic Ethernet was indeed simple, and – mostly – passive. In its most basic form, the Ethernet medium was one long piece of
coaxial cable, onto which stations could be connected via taps. If two stations happened to transmit at the same time – most likely
because they were both waiting for a third station to finish – their signals were lost to the resultant collision. The only active
components besides the stations were repeaters, originally intended simply to make end-to-end joins between cable segments.
Repeaters soon evolved into multiport devices, allowing the creation of arbitrary tree (that is, loop-free) topologies. At this point
the standard wiring model shifted from one long cable, snaking from host to host, to a “star” network, where each host connected
directly to a central multipoint repeater. This shift allowed for the replacement of expensive coaxial cable by the much-cheaper
twisted pair; links could not be as long, but they did not need to be.
Repeaters, which forwarded collisions, soon gave way to switches, which did not (2.5 Ethernet Switches). Switches thus
partitioned an Ethernet into disjoint collision domains, or physical Ethernets, through which collisions could propagate; an
aggregation of physical Ethernets connected by switches was then sometimes known as a virtual Ethernet. Collision domains
became smaller and smaller, eventually down to individual links and then vanishing entirely.
Throughout all these changes, Ethernet never implemented true redundant connections, in that at any one instant the topology was
always required to be loop-free. However, Ethernet did adopt a mechanism by which idle backup links can quickly be placed into
service after a primary link fails; 2.6 Spanning Tree Algorithm and Redundancy.

2.2: 10-Mbps Classic Ethernet
Originally, Ethernet consisted of a long piece of cable (possibly spliced by repeaters). When a station transmitted, the data went
everywhere along that cable. Such an arrangement is known as a broadcast bus; all packets were, at least at the physical layer,
broadcast onto the shared medium and could be seen, theoretically, by all other nodes. Logically, however, most packets would
appear to be transmitted point-to-point, not broadcast. This was because between each station CPU and the cable there was a
peripheral device (that is, a card) known as a network interface, which would take care of the details of transmitting and receiving.
The network interface would (and still does) decide when a received packet should be forwarded to the host, via a CPU interrupt.

A         B         C         D
│         │         │         │
NI────────NI────────NI────────NI──────── (shared coax cable)

Whenever two stations transmitted at the same time, the signals would collide, and interfere with one another; both transmissions
would fail as a result. Proper handling of collisions was an essential part of the access-mediation strategy for the shared medium. In
order to minimize collision loss, each station implemented the following:
1. Before transmission, wait for the line to become quiet
2. While transmitting, continually monitor the line for signs that a collision has occurred; if a collision is detected, cease
transmitting
3. If a collision occurs, use a backoff-and-retransmit strategy
These properties can be summarized with the CSMA/CD acronym: Carrier Sense, Multiple Access, Collision Detect. (The term
“carrier sense” was used by Metcalfe and Boggs as a synonym for “signal sense”; there is no literal carrier frequency to be sensed.)
It should be emphasized that collisions are a normal event in Ethernet, well-handled by the mechanisms above.
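
As an illustration of how these three rules fit together, here is a minimal Python sketch of a single station’s transmit loop. The line object and all of its methods are hypothetical stand-ins for the physical-layer hardware, and the backoff details are simplified.

import random

BACKOFF_CAP = 10   # classic Ethernet caps the backoff window at 2**10 slots

def csma_cd_send(line, frame):
    """One station's send loop; `line` and its methods are hypothetical."""
    attempt = 0
    while True:
        while line.busy():                # 1. carrier sense: wait for a quiet line
            line.wait()
        line.transmit(frame)              # 2. transmit while monitoring for collisions
        if not line.collision_detected():
            return                        # success
        attempt += 1                      # 3. collision: back off and retry
        k = min(attempt, BACKOFF_CAP)
        line.wait_slots(random.randrange(2 ** k))   # exponential backoff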

IEEE 802 Network Standards


The IEEE network standards all begin with 802: 802.3 is Ethernet, 802.11 is Wi-Fi, 802.16 is WiMAX, and there are many
others. One sometimes encounters the claim that 802 represents the date of an early meeting: February 1980. However, the
IEEE has a continuous stream of standards (with occasional gaps): 799: Handling and Disposal of Transformer PCBs, 800: D-
C Aircraft Rotating Machines, 803: Recommended Practice for Unique Identification in Power Plants, etc.

Classic Ethernet came in version 1 [1980, DEC-Intel-Xerox], version 2 [1982, DIX], and IEEE 802.3. There are some minor
electrical differences between these, and one rather substantial packet-format difference, below. In addition to these, the Berkeley
Unix trailing-headers packet format was used for a while.
There were three physical formats for 10 Mbps Ethernet cable: thick coax (10BASE-5), thin coax (10BASE-2), and, last to arrive,
twisted pair (10BASE-T). Thick coax was the original; economics drove the successive development of the later two. The cheaper
twisted-pair cabling eventually almost entirely displaced coax, at least for host connections.
The original specification included support for repeaters, which were in effect signal amplifiers although they might attempt to
clean up a noisy signal. Repeaters processed each bit individually and did no buffering. In the telecom world, a repeater might be
called a digital regenerator. A repeater with more than two ports was commonly called a hub; hubs allowed branching and thus
much more complex topologies.
It was the rise of hubs that enabled star topologies in which each host connects directly to the hub rather than to one long run of
coax. This in turn enabled twisted-pair cable: while this supported maximum runs of about 100 meters, versus the 500 meters of
thick coax, each run simply had to go from the host to the central hub in the wiring closet. This was much more convenient than
having to snake coax all around the building. A hub failure would bring the network down, but hubs proved largely reliable.
Bridges – later known as switches – came along a short time later. While repeaters act at the bit layer, a switch reads in and
forwards an entire packet as a unit, and the destination address is consulted to determine to where the packet is forwarded. Except
for possible collision-related performance issues, hubs and switches are interchangeable. Eventually, most wiring-closet hubs were
replaced with switches.

Hubs propagate collisions; switches do not. If the signal representing a collision were to arrive at one port of a hub, it would, like
any other signal, be retransmitted out all other ports. If a switch were to detect a collision on one port, no other ports would be
involved; only packets received successfully are ever retransmitted out other ports.
Originally, switches were seen as providing interconnection (“bridging”) between separate physical Ethernets; a switch for such a
purpose needed just two ports. Later, a switched Ethernet was seen as one large “virtual” Ethernet, composed of smaller collision
domains. Although the term “switch” is now much more common than “bridge”, the latter is still in use, particularly by the IEEE.
For some, a switch is a bridge with more than two ports, though that distinction is relatively meaningless as it has been years since
two-port bridges were last manufactured. We return to switching below in 2.5 Ethernet Switches.
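
For a concrete picture of per-packet switch behavior, here is a toy Python sketch of the learning-and-flooding strategy (a simplification of the learning algorithm of 2.5.1; all names here are made up):

from collections import namedtuple

Frame = namedtuple("Frame", ["src", "dst"])

class LearningSwitch:
    """Toy switch: learns source addresses, forwards by destination,
    and falls back to flooding for unknown destinations."""
    def __init__(self, ports):
        self.ports = ports   # eg [1, 2, 3]
        self.table = {}      # address -> port on which it was last seen

    def receive(self, frame, in_port):
        self.table[frame.src] = in_port      # learn the sender's location
        out = self.table.get(frame.dst)
        if out is not None:
            if out != in_port:
                self.send(frame, out)        # known destination: forward
        else:
            for p in self.ports:             # unknown: flood all other ports
                if p != in_port:
                    self.send(frame, p)

    def send(self, frame, port):
        print(f"{frame.src}->{frame.dst} out port {port}")

sw = LearningSwitch([1, 2, 3])
sw.receive(Frame("A", "B"), in_port=1)   # B unknown: flooded out ports 2 and 3
sw.receive(Frame("B", "A"), in_port=2)   # A known: forwarded out port 1 only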
In the original thick-coax cabling, connections were made via taps, often literally drilled into the coax central conductor. Thin coax
allowed the use of T-connectors to attach hosts. Twisted-pair does not allow mid-cable attachment; it is only used for point-to-point
links between hosts, switches and hubs. Mid-cable attachment, however, was always simply a way of avoiding the need for active
devices like hubs and switches.
There is still a role for hubs today when one wants to monitor the Ethernet signal from A to B (eg for intrusion detection analysis),
although some switches now also support a form of monitoring.
All three cable formats could interconnect, although only through repeaters and hubs, and all used the same 10 Mbps transmission
speed. While twisted-pair cable is still used by 100 Mbps Ethernet, it generally needs to be a higher-performance version known as
Category 5, versus the 10 Mbps Category 3.
Data in 10 Mbps Ethernets was transmitted using Manchester encoding; see 4.1.3 Manchester. This meant that the electronics had
to operate, in effect, at 20 Mbps. Faster Ethernets use different encodings.

2.3: 100 Mbps (Fast) Ethernet
Classic Ethernet, at 10 Mbps, is quite slow by modern standards, and so by 1995 the IEEE had created standards for Ethernet that
operated at 100 Mbps. Ethernet at this speed is commonly known as Fast Ethernet; this name is used even today as “Fast”
Ethernet is being supplanted by Gigabit Ethernet (below). By far the most popular form of 100 Mbps Ethernet is officially known
as 100BASE-TX; it operates over twisted-pair cable.
In the previous analysis of 10 Mbps Ethernet, the bandwidth, minimum packet size and maximum network diameter were all
interrelated, in order to ensure that collisions could always be detected by the sender. Increasing the speed means that at least one of
the other constraints must be scaled as well. For example, if the network physical diameter were to remain the same when moving
to 100 Mbps, then the Fast-Ethernet round-trip time would be the same in microseconds but would be 10-fold larger measured in
bits; this might mean a minimum packet size of 640 bytes instead of 64 bytes. (Actually, the minimum packet size might be
somewhat smaller, partly because the “jam signal” doesn’t have to become longer, and partly because some of the numbers in the
10 Mbps delay budget above were larger than necessary, but it would still be large enough that a substantial amount of bandwidth
would be consumed by padding.) The designers of Fast Ethernet felt that such a large minimum-packet size was impractical.
However, Fast Ethernet was developed at a time (~1995) when reliable switches (below) were widely available; the quote above at
2 Ethernet from [MB76] had become obsolete. Large “virtual” Ethernet networks could be formed by connecting small physical
Ethernets with switches, effectively eliminating the need to support large-diameter physical Ethernets. So instead of increasing the
minimum packet size, the decision was made to ensure collision detectability by reducing the network diameter instead. The
network diameter chosen was a little over 400 meters, with reductions to account for the presence of hubs. At 2.3 meters/bit, 400
meters is 174 bits, for a round-trip of 350 bits. The slot time (and minimum packet size) remains 512 bits – now 5.12 µsec – which
is safely large enough to ensure collision detection.
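
The arithmetic here is easy to check; the following Python lines simply restate the numbers from the text:

# 10 Mbps classic Ethernet: slot time is 512 bits, ie 51.2 microseconds
slot_bits = 512
slot_usec_10  = slot_bits / 10        # 51.2 usec at 10 Mbps
slot_usec_100 = slot_bits / 100       # 5.12 usec at 100 Mbps

# Keeping the 10 Mbps network diameter at 100 Mbps would require a slot
# 10x larger measured in bits, hence roughly a 640-byte minimum packet:
min_packet_same_diameter = 10 * 64    # 640 bytes

# Fast Ethernet instead shrank the diameter: at 2.3 meters/bit,
# 400 meters is about 174 bits, or about 350 bits round trip,
# safely below the 512-bit slot time:
round_trip_bits = 2 * 400 / 2.3       # about 348
print(slot_usec_100, min_packet_same_diameter, round(round_trip_bits))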
This 400-meter diameter, however, may be misleading: the specific 100BASE-TX standard, which uses so-called Category 5
twisted-pair cabling (or better), limits the length of any individual cable segment to 100 meters. The maximum 100BASE-TX
network diameter – allowing for hubs – is just over 200 meters. The 400-meter distance does apply to optical-fiber-based
100BASE-FX in half-duplex mode, but this is not common.
The 100BASE-TX network-diameter limit of 200 meters might seem small; it amounts in many cases to a single hub with multiple
100-meter cable segments radiating from it. In practice, however, such “star” configurations could easily be joined with switches.
As we will see below in 2.5 Ethernet Switches, switches partition an Ethernet into separate “collision domains”; the network-
diameter rules apply to each domain separately but not to the aggregated whole. In a fully switched (that is, no hubs) 100BASE-TX
LAN, each collision domain is simply a single twisted-pair link, subject to the 100-meter maximum length.
Fast Ethernet also introduced the concept of full-duplex Ethernet: two twisted pairs could be used, one for each direction. Full-
duplex Ethernet is limited to paths not involving hubs, that is, to single station-to-station links, where a station is either a host or a
switch. Because such a link has only two potential senders, and each sender has its own transmit line, full-duplex Ethernet is
entirely collision-free.
Fast Ethernet (at least the 100BASE-TX form) uses 4B/5B encoding, covered in 4.1.4 4B/5B. This means that the electronics have
to handle 125 Mbps, versus the 200 Mbps if Manchester encoding were still used.
Fast Ethernet 100BASE-TX does not particularly support links between buildings, due to the maximum-cable-length limitation.
However, fiber-optic point-to-point links are an effective alternative here, provided full-duplex is used to avoid collisions. We
mentioned above that the fiber-based 100BASE-FX standard allowed a maximum half-duplex run of 400 meters, but 100BASE-FX
is much more likely to use full duplex, where the maximum cable length rises to 2,000 meters.

2.4: Gigabit Ethernet
The problem of scaling Ethernet to handle collision detection gets harder as the transmission rate increases. If we were to continue
to maintain the same 512-bit slot time but raise the transmission rate to 1000 Mbps, the maximum network diameter would now be
20-40 meters. Instead of that, Gigabit Ethernet moved to a 4096-bit (512-byte, or 4.096 µsec) slot time, at least for the twisted-pair
versions. Short frames need to be padded, but this padding is done by the hardware. Gigabit Ethernet 1000Base-T uses so-called
PAM-5 encoding, below, which supports a special pad pattern (or symbol) that cannot appear in the data. The hardware pads the
frame with these special patterns, and the receiver can thus infer the unpadded length as set by the host operating system.
Gigabit vs Disks
Once a network has reached Gigabit speed, the network is generally as fast as reading from or writing to a disk. Keeping data on
another node no longer slows things down. This greatly expands the range of possibilities for constructing things like clustered
databases.
However, the Gigabit Ethernet slot time is largely irrelevant, as full-duplex (bidirectional) operation is almost always supported.
Combined with the restriction that each length of cable is a station-to-station link (that is, hubs are no longer allowed), this means
that collisions simply do not occur and the network diameter is no longer a concern. (10 Gigabit Ethernet has officially abandoned
any pretense of supporting collisions; everything must be full-duplex.)
There are actually multiple Gigabit Ethernet standards (as there are for Fast Ethernet). The different standards apply to different
cabling situations. There are full-duplex optical-fiber formulations good for many miles (eg 1000Base-LX10), and even a version
with a 25-meter maximum cable length (1000Base-CX), which would in theory make the original 512-bit slot practical.
The most common gigabit Ethernet over copper wire is 1000BASE-T (sometimes incorrectly referred to as 1000BASE-TX; while a
TX standard does exist, it requires Category 6 cable and is thus seldom used, and many devices labeled TX are in fact 1000BASE-T). For
1000BASE-T, all four twisted pairs in the cable are used. Each pair transmits at 250 Mbps, and each pair is bidirectional, thus
supporting full-duplex communication. Bidirectional communication on a single wire pair takes some careful echo cancellation at
each end, using a circuit known as a “hybrid” that in effect allows detection of the incoming signal by filtering out the outbound
signal.
On any one cable pair, there are five signaling levels. These are used to transmit two-bit symbols at a rate of 125 symbols/µsec, for
a data rate of 250 bits/µsec. Two-bit symbols in theory only require four signaling levels; the fifth level allows for some
redundancy which is used for error detection and correction, for avoiding long runs of identical symbols, and for supporting a
special pad symbol, as mentioned above. The encoding is known as 5-level pulse-amplitude modulation, or PAM-5. The target bit
error rate (BER) for 1000BASE-T is 10⁻¹⁰, meaning that the packet error rate is less than 1 in 10⁶.
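
Again the numbers can be checked directly; the 1500-byte packet size below is an assumption, used only for the error-rate estimate:

pairs = 4
symbol_rate = 125              # symbols per microsecond, per pair
bits_per_symbol = 2            # PAM-5: five levels carry 2 data bits plus redundancy

per_pair = symbol_rate * bits_per_symbol    # 250 bits/usec = 250 Mbps per pair
total = pairs * per_pair                    # 1000 Mbps in all

ber = 1e-10                    # target bit error rate
packet_bits = 1500 * 8         # assumed maximum-sized packet
per = packet_bits * ber        # ~1.2e-06: on the order of 1 in 10**6
print(total, per)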
In developing faster Ethernet speeds, economics plays at least as important a role as technology. As new speeds reach the market,
the earliest adopters often must take pains to buy cards, switches and cable known to “work together”; this in effect amounts to
installing a proprietary LAN. The real benefit of Ethernet, however, is arguably that it is standardized, at least eventually, and thus a
site can mix and match its cards and devices. Having a given Ethernet standard support existing cable is even more important
economically; the costs of replacing inter-office cable often dwarf the costs of the electronics.
As Ethernet speeds continue to climb, it has become harder and harder for host systems to keep up. As a result, it is common for
quite a bit of higher-layer processing to be offloaded onto the Ethernet hardware, for example, TCP checksum calculation. See 12.5
TCP Offloading.

2.5: Ethernet Switches
Switches join separate physical Ethernets (or sometimes Ethernets and other kinds of networks). A switch has two or more Ethernet
interfaces; when a packet is received on one interface it is retransmitted on one or more other interfaces. Only valid packets are
forwarded; collisions are not propagated. The term collision domain is sometimes used to describe the region of an Ethernet in
between switches; a given collision propagates only within its collision domain.
Switches have revolutionized Ethernet layout: all the collision-detection rules, including the rules for maximum network diameter,
apply only to collision domains, and not to the larger “virtual Ethernets” created by stringing collision domains together with
switches. As we shall see below, a switched Ethernet also offers much more resistance to eavesdropping than a non-switched (eg
hub-based) Ethernet.
Switch Costs
In the 1980’s the author once installed a two-port 10-Mbps Ethernet switch (then called a “bridge”) that cost $3000; cf the [MB76]
quote at 2 Ethernet. Today a wide variety of multiport 100-Mbps Ethernet switches are available for around $10, and almost all
installed Ethernets are fully switched.
Like simpler unswitched Ethernets, the topology for a switched Ethernet is in principle required to be loop-free. In practice,
however, most switches support the spanning-tree loop-detection protocol and algorithm, 2.6 Spanning Tree Algorithm and
Redundancy, which automatically “prunes” the network topology to make it loop-free while allowing the pruned links to be placed
back in service if a primary link fails.
While a switch does not propagate collisions, it must maintain a queue for each outbound interface in case it needs to forward a
packet at a moment when the interface is busy; on (rare) occasion packets are lost when this queue overflows.

2.6: Spanning Tree Algorithm and Redundancy
In theory, if you form a loop with Ethernet switches, any packet whose destination is not already present in the forwarding tables
will circulate endlessly, consuming most of the available throughput. Some early switches would actually do this; it was generally
regarded as a catastrophic failure.
In practice, however, loops allow redundancy – if one link breaks there is still 100% connectivity – and so can be desirable. As a
result, Ethernet switches have incorporated a switch-to-switch protocol to construct a subset of the switch-connections graph that
has no loops and yet allows reachability of every host, known in graph theory as a spanning tree. Once the spanning tree is built,
links that are not part of the tree are disabled, even if they would represent the most efficient path between two nodes. If a link that
is part of the spanning tree fails, partitioning the network, a new tree is constructed, and some formerly disabled links may now
return to service.
One might ask, if switches can work together to negotiate a spanning tree, whether they can also work together to negotiate loop-
free forwarding tables for the original non-tree topology, thus keeping all links active. The difficulty here is not the switches’
ability to coordinate, but the underlying Ethernet broadcast feature. As long as the topology has loops and broadcast is enabled,
broadcast packets might circulate forever. And disabling broadcast is not a straightforward option; switches rely on the broadcast-
based fallback-to-flooding strategy of 2.5.1 Ethernet Learning Algorithm to deliver to unknown destinations. However, we will
return to this point in 2.9 Software-Defined Networking. See also exercise 10.0.
The presence of hubs and other unswitched Ethernet segments slightly complicates the switch-connections graph. In the absence of
these, the graph’s nodes and edges are simply the hosts (including switches) and links of the Ethernet itself. If unswitched multi-
host Ethernet segments are present, then each of these becomes a single node in the graph, with a graph edge to each switch to
which it directly connects. (Any Ethernet switches not participating in the spanning-tree algorithm would be treated as hubs.)
Every switch has an ID, eg its smallest Ethernet address, and every edge that attaches to a switch does so via a particular, numbered
interface. The goal is to disable redundant (cyclical) paths while retaining the ability to deliver to any segment. The algorithm is
due to Radia Perlman, [RP85].
The switches first elect a root node, eg the one with the smallest ID. Then, if a given segment connects to two switches that both
connect to the root node, the switch with the shorter path to the root is used, if possible; in the event of ties, the switch with the
smaller ID is used. The simplest measure of path cost is the number of hops, though current implementations generally use a cost
factor inversely proportional to the bandwidth (so larger bandwidth has lower cost). Some switches permit other configuration here.
The process is dynamic, so if an outage occurs then the spanning tree is recomputed. If the outage should partition the network into
two pieces, both pieces will build spanning trees.
All switches send out regular messages on all interfaces called bridge protocol data units, or BPDUs (or “Hello” messages). These
are sent to the Ethernet multicast address 01:80:c2:00:00:00, from the Ethernet physical address of the interface. (Note that
Ethernet switches do not otherwise need a unique physical address for each interface.) The BPDUs contain
The switch ID
the ID of the node the switch believes is the root
the path cost (eg number of hops) to that root
These messages are recognized by switches and are not forwarded naively. Switches process each message, looking for
a switch with a lower ID than any the receiving switch has seen before (thus becoming the new root)
a shorter path to the existing root
an equal-length path to the existing root, but via a neighbor switch with a lower ID (the tie-breaker rule). If there are two ports
that connect to that switch, the port number is used as an additional tie-breaker.
In a heterogeneous Ethernet we would also introduce a preference for faster paths, but we will assume here that all links have the
same bandwidth.
When a switch sees a new root candidate, it sends BPDUs on all interfaces, indicating the distance. The switch includes the
interface leading towards the root.
Once this process has stabilized, each switch knows
its own path to the root

which of its ports any further-out switches will be using to reach the root
for each port, its directly connected neighboring switches
Now the switch can “prune” some (or all!) of its interfaces. It disables all interfaces that are not enabled by the following rules:
1. It enables the port via which it reaches the root
2. It enables any of its ports that further-out switches use to reach the root
3. If a remaining port connects to a segment to which other “segment-neighbor” switches connect as well, the port is enabled if the
switch has the minimum cost to the root among those segment-neighbors, or, if a tie, the smallest ID among those neighbors, or,
if two ports are tied, the port with the smaller ID.
4. If a port has no directly connected switch-neighbors, it presumably connects to a host or segment, and the port is enabled.
Rules 1 and 2 construct the spanning tree; if S3 reaches the root via S2, then Rule 1 makes sure S3’s port towards S2 is open, and
Rule 2 makes sure S2’s corresponding port towards S3 is open. Rule 3 ensures that each network segment that connects to multiple
switches gets a unique path to the root: if S2 and S3 are segment-neighbors each connected to segment N, then S2 enables its port
to N and S3 does not (because 2<3). The primary concern here is to create a path for any host nodes on segment N; S2 and S3 will
create their own paths via Rules 1 and 2. Rule 4 ensures that any “stub” segments retain connectivity; these would include all hosts
directly connected to switch ports.
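
The preference order used when processing BPDUs amounts to lexicographic comparison of (root ID, cost to root, sender ID, port) tuples, as in the following Python sketch (the field names here are made up):

from dataclasses import dataclass

@dataclass(frozen=True)
class BPDU:
    root: int     # ID of the switch the sender believes is the root
    cost: int     # sender's path cost (eg hop count) to that root
    sender: int   # the sending switch's own ID
    port: int     # the receiver's port number, as a final tie-breaker

def better(a: BPDU, b: BPDU) -> bool:
    """True if a is preferred to b: smaller root ID first, then shorter
    path to the root, then smaller sender ID, then smaller port number."""
    return (a.root, a.cost, a.sender, a.port) < (b.root, b.cost, b.sender, b.port)

print(better(BPDU(1, 2, 5, 1), BPDU(1, 2, 7, 1)))   # True: equal paths, lower neighbor ID wins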

2.7: Virtual LAN (VLAN)
What do you do when you have different people in different places who are “logically” tied together? For example, for a while the
Loyola University CS department was split, due to construction, between two buildings.
One approach is to continue to keep LANs local, and use IP routing between different subnets. However, it is often convenient
(printers are one reason) to configure workgroups onto a single “virtual” LAN, or VLAN. A VLAN looks like a single LAN,
usually a single Ethernet LAN, in that all VLAN members will see broadcast packets sent by other members and the VLAN will
ultimately be considered to be a single IP subnet (7.6 IPv4 Subnets). Different VLANs are ultimately connected together, but likely
only by passing through a single, central IP router. Broadcast traffic on one VLAN will generally not propagate to any other
VLAN; this isolation of broadcast traffic is another important justification for VLAN use.
VLANs can be visualized and designed by using the concept of coloring. We logically assign all nodes on the same VLAN the
same color, and switches forward packets accordingly. That is, if S1 connects to red machines R1 and R2 and blue machines B1
and B2, and R1 sends a broadcast packet, then it goes to R2 but not to B1 or B2. Switches must, of course, be told the color of each
of their ports.

R1                           R3
B1────S1─────────S2─────────S3────B3
B2────┤           │
R2────┘           S4────Router R

One network of switches S1-S4 divided into two VLANs, red and blue

In the diagram above, S1 and S3 each have both red and blue ports. The switch network S1-S4 will deliver traffic only when the
source and destination ports are the same color. Red packets can be forwarded to the blue VLAN only by passing through the router
R, entering R’s red port and leaving its blue port. R may apply firewall rules to restrict red–blue traffic.
When the source and destination ports are on the same switch, nothing needs to be added to the packet; the switch can keep track of
the color of each of its ports. However, switch-to-switch traffic must be additionally tagged to indicate the source. Consider, for
example, switch S1 above sending packets to S3 which has nodes R3 (red) and B3 (blue). Traffic between S1 and S3 must be
tagged with the color, so that S3 will know to what ports it may be delivered. The IEEE 802.1Q protocol is typically used for this
packet-tagging; a 32-bit “color” tag is inserted into the Ethernet header after the source address and before the type field. The first
16 bits of this field are 0x8100, which becomes the new Ethernet type field and which identifies the frame as tagged. A separate
802.3 amendment allows Ethernet packets to be slightly larger, to accommodate the tags.
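
The tag insertion itself is simple enough to sketch; in the Python below the frame contents are made up, and a real implementation would also recompute the frame checksum:

import struct

def add_vlan_tag(frame: bytes, vid: int, pcp: int = 0) -> bytes:
    """Insert an 802.1Q tag after the 6-byte destination and 6-byte source
    addresses. The first 16 bits, 0x8100, become the new type field; the
    next 16 carry a 3-bit priority, a drop-eligible bit (0 here) and the
    12-bit VLAN ID."""
    tag = struct.pack("!HH", 0x8100, (pcp << 13) | (vid & 0x0FFF))
    return frame[:12] + tag + frame[12:]

# Example with a skeletal (made-up) frame: zeroed addresses, IPv4 type field
untagged = bytes(12) + b"\x08\x00" + b"payload"
print(add_vlan_tag(untagged, vid=42).hex())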
Double-tagging is possible; this would allow an ISP to have one level of tagging and its customers to have another level.
Finally, most commercial-grade switches do provide some way of selectively allowing traffic between different VLANs; with such
switches, for example, rules could be created to allow R1 to connect to B3 without the use of the router R. One difficulty with this
approach is that there is often little standardization among switch manufacturers. This makes it difficult to create, for example,
authorization applications that allow opening inter-VLAN connections on the fly. Another issue is that some switches allow inter-
VLAN rules based only on MAC addresses, and not, for example, on TCP port numbers. The OpenFlow protocol (2.9.1 OpenFlow
Switches) has the potential to create the necessary standardization here. Even without OpenFlow, however, some specialty access-
and-authentication systems have been developed that do enable host access by dynamic creation of the appropriate switch rules.

2.8: TRILL and SPB
As Ethernets get larger, the spanning-tree algorithm becomes more and more a problem, primarily because useful links are disabled
and redundancy is lost. In a high-performance network, such as within a datacenter, disabled links are a wasted resource. A
secondary issue is that, in the event of link failure, the spanning-tree approach can take many seconds to create a new tree and
restore connectivity.
To address these problems, there are now protocols which allow Ethernet to have active loops in the topology, making first-class
use of all links. The idea is to generate forwarding tables within the Ethernet switches – or at least within some of them – that route
every packet along the shortest path – or at least an approximation to the shortest path – based on all available links. This has long
been a staple in the IP world (9 Routing-Update Algorithms), but is definitely a break with tradition at the LAN layer.
There are two competing protocols here: TRILL (TRansparent Interconnection of Lots of Links) and SPB (Shortest-Path Bridging).
TRILL is documented in [RP04] and RFC 6325 [https://tools.ietf.org/html/rfc6325.html] and companions, while SPB is
standardized by IEEE 802.1aq [en.wikipedia.org/wiki/IEEE_802.1aq]. We will focus here on TRILL.
Both TRILL and SPB envision that, initially, only a few switches will be smart enough to do shortest-path routing, just as, once
upon a time, only a few switches implemented the spanning-tree algorithm. But, with time, it is likely that eventually most if not all
Ethernet switches will be shortest-path aware. In high-performance datacenters it is particularly likely that forwarding will be based
on TRILL or SPB.
In TRILL, the Ethernet switches that are TRILL-aware are known as Router-Bridges, or RBridges (the terms RSwitches and
TRILL Switches might also be appropriate). In between the RBridges are Legacy Ethernets (called “links” in [RP04] and RFC
6325 [https://tools.ietf.org/html/rfc6325.html], though this term is misleading); Legacy Ethernets consist of maximal subnetworks
of Ethernet hosts and non-TRILL-aware switches. The intent is for the RBridges to partition the entire Ethernet into relatively small
Legacy Ethernets. In the ultimate case where all switches are RBridges, the Legacy Ethernets are simply individual hosts. In the
diagram below, four RBridges isolate Legacy Ethernets 1, 2, 3 and 4, though Legacy Ethernet 5 represents a degree of partitioning
inefficiency.

LE1 RB1 RB4 LE4

LE5

LE2 RB2 RB3 LE3

Each Legacy Ethernet elects a single connected RBridge to represent it. There is a unique choice for LE1 through LE4 above, but
LE5 must make a decision. This elected RBridge is known as the Designated RBridge, or DRB. Each Legacy Ethernet then builds
its own spanning tree, perhaps (though not necessarily) rooted at its Designated RBridge.
Traffic from a Legacy Ethernet to the outside will generally be forwarded through its Designated RBridge; connections to other
RBridges will not be used. The idea is for packets from one Legacy Ethernet to another to be delivered first to the source node’s
DRB, and then to the destination node’s DRB via true shortest-path forwarding between the RBridges, and from there to the
destination node. Of course, in the ultimate case where every switch is an RBridge, traffic will take the shortest path from start to
finish.
The one exception to this rule about forwarding through the Designated RBridge is that the DRB can delegate this forwarding task
to other RBridges for different VLANs within the Legacy Ethernet. If this is done, each VLAN will always use the same RBridge
for all its outside traffic.
The second part of the process is for the RBridges each to figure out the overall topology; that is, each builds a complete map of all
the RBridges and their interconnections. This is done using a link-state routing-update protocol, described in 9.5 Link-State
Routing-Update Algorithm. Of the two primary link-state protocols, IS-IS and OSPF, TRILL has selected the former, as it is more
easily adapted to a setting in which, as here, nodes do not necessarily have IP addresses. The RBridges each send out appropriate
“link-state packets”, using multicast and using per-RBridge databases to ensure that these packets are not re-forwarded endlessly.
These link-state packets can be compared to spanning-tree Hello messages. As is fundamental to link-state forwarding, once each
RBridge has a complete map of all the RBridges, each RBridge can calculate an optimal route to any other RBridge.

As Designated RBridges see packets from their Legacy Ethernets, they learn the MAC addresses of the active hosts within, via the
usual Ethernet learning protocol. They then share these addresses with other RBridges, using the IS-IS link-state protocol, so other
RBridges eventually learn how to reach most if not all Ethernet addresses present in the overall network.
Delivery still must make use of fallback-to-flooding, however, to deliver to previously unknown destinations. To this end, the
RBridges negotiate among themselves a spanning tree covering all the RBridges. Any packet with unknown destination is flooded
along this spanning tree, and then, as the packet reaches a Designated RBridge for a Legacy Ethernet, is flooded along the spanning
tree of that Ethernet. This process is also used for delivery of broadcast and multicast packets.
As RBridges talk to one another, they negotiate compact two-byte addresses – known as “nicknames” – for one another, versus the
standard Ethernet six-byte addresses. This saves space in the RBridge-to-RBridge communications.
As packets travel between RBridges, a special TRILL header is added. This header includes a hopcount field, otherwise not present
in Ethernet, which means any packets caught in transient routing loops will eventually be discarded. IS-IS may occasionally
generate such routing loops, though they are rare.
The TRILL header also includes the nicknames of the source and destination RBridges. This means that actual packet forwarding
between RBridges does not involve the MAC address of the destination host; that is used only after the packet has reached the
Designated RBridge for the destination Legacy Ethernet, at which point the TRILL header is removed.
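
A sketch of this header, with field widths as given in RFC 6325 (version, reserved bits, multicast flag and options length all left at zero here), might look like the following; it is an illustration, not a complete implementation:

import struct

def trill_header(hop_count: int, egress: int, ingress: int) -> bytes:
    """Six-byte TRILL header: a 16-bit word holding the 2-bit version,
    2 reserved bits, the multicast flag and a 5-bit options length (all
    zero here) plus the 6-bit hop count, followed by the 16-bit egress
    and ingress RBridge nicknames."""
    return struct.pack("!HHH", hop_count & 0x3F, egress, ingress)

# Each RBridge along the path decrements the hop count, discarding the
# packet if it reaches zero -- the loop protection described above.
hdr = trill_header(hop_count=20, egress=0x0001, ingress=0x0002)
print(hdr.hex())   # '001400010002'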
If a link between two RBridges fails, then the link’s endpoints send out IS-IS update messages to notify all the other RBridges of
the failure. The other RBridges can then recalculate their forwarding tables so as not to use the broken link. Recovery time is
typically under 0.1 seconds, a roughly hundredfold improvement over spanning-tree recovery times.
TRILL supports the use of multiple equal-cost paths to improve throughput between two RBridges; cf 9.7 ECMP. In a high-
performance datacenter, this feature is very important.
Like TRILL, SPB uses IS-IS between the SPB-aware bridges to find shortest paths, and encapsulates packets with a special header
as they travel between RBridges. SPB does not include a hopcount in the encapsulation header; instead, it more carefully controls
forwarding. SPB also uses the original destination MAC address for inter-RBridge forwarding.

2.9: Software-Defined Networking
While TRILL and SPB offer one way to handle the scaling problems of spanning trees, Software-Defined Networking, or SDN,
offers another, much more general, approach. The core idea of SDN is to place the forwarding mechanism of each participating
switch under the aegis of a controller, a user-programmable device that is capable of giving each switch instructions on how to
forward packets. Like TRILL and SPB, this approach also allows forwarding and redundant links to coexist. The controller can be
a single node on the network, or can be a distributed set of nodes. The controller manages the forwarding tables of each of the
switches.
To handle legitimate broadcast traffic, the controller can, at startup, probe the switches to determine their layout, and, from this,
construct a suitable spanning tree. The switches can then be instructed to flood broadcast traffic only along the links of this
spanning tree. Links that are not part of the spanning tree can still be used for forwarding to known destinations, however, unlike
conventional switches using the spanning tree algorithm.
Typically, if a switch sees a packet addressed to an unknown destination, it reports it to the controller, which then must figure out
what to do next. One option is to have traffic to unknown destinations flooded along the same spanning tree used for broadcast
traffic. This allows fallback-to-flooding to coexist safely with the full use of loop topologies.
Switches are often configured to report new source addresses to the controller, so that the controller can tell all the other switches
the best route to that new source.
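
In outline, the controller’s handling of a reported packet might look like the following Python sketch. The switch methods here (install_rule, lookup, send_out) are hypothetical; a real controller would speak a protocol such as OpenFlow to the switch rather than calling it directly.

class Controller:
    """Toy SDN controller. tree_ports maps each switch to the subset of
    its ports lying on the broadcast spanning tree computed at startup."""
    def __init__(self, tree_ports):
        self.tree_ports = tree_ports

    def packet_in(self, switch, src, dst, in_port):
        # Learn: future traffic addressed *to* src goes out in_port.
        switch.install_rule(dst_addr=src, out_port=in_port)
        out = switch.lookup(dst)
        if out is not None:
            switch.send_out(out)                 # known destination: forward
        else:
            for p in self.tree_ports[switch]:    # unknown: flood, but only
                if p != in_port:                 # along spanning-tree links
                    switch.send_out(p)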
SDN controllers can be configured as simple firewalls, disallowing forwarding between selected pairs of nodes for security
reasons. For example, if a datacenter has customers A and B, each with multiple nodes, then it is possible to configure the network
so that no node belonging to customer A can send packets to a node belonging to customer B. See also the following section.
At many sites, the SDN implementation is based on standardized modules. However, controller software can also be developed
locally, allowing very precise control of network functionality. This control, rather than the ability to combine loop topologies with
Ethernet, is arguably SDN’s most important feature. See [FRZ13].

2.E: Ethernet (Exercises)

2.10 Exercises
Exercises are given fractional (floating point) numbers, to allow for interpolation of new exercises. Exercise 2.5 is distinct, for
example, from exercises 2.0 and 3.0. Exercises marked with a ♢ have solutions or hints at 24.2 Solutions for Ethernet.
1.0. Simulate the contention period of five Ethernet stations that all attempt to transmit at T=0 (presumably when some sixth station
has finished transmitting), in the style of the diagram in 2.2.6 Exponential Backoff Algorithm. Assume that time is measured in slot
times, and that exactly one slot time is needed to detect a collision (so that if two stations transmit at T=1 and collide, and one of
them chooses a backoff time k=0, then that station will transmit again at T=2). Use coin flips or some other source of randomness.
2.0. Suppose we have Ethernet switches S1 through S3 arranged as below; each switch uses the learning algorithm of 2.5 Ethernet
Switches. All forwarding tables are initially empty.

S1────────S2────────S3───D
│ │ │
A B C

(a). If A sends to B, which switches see this packet?
(b). If B then replies to A, which switches see this packet?
(c). If C then sends to B, which switches see this packet?
(d). If C then sends to D, which switches see this packet?
2.7.♢ Suppose we have the Ethernet switches S1 through S4 arranged as below. All forwarding tables are empty; each switch uses
the learning algorithm of 2.5 Ethernet Switches.

             B
             │
             S4
             │
A───S1───────S2───────S3───C
             │
             D

Now suppose the following packet transmissions take place:


A sends to D
D sends to A
A sends to B
B sends to D
For each switch S1-S4, list what source addresses (eg A,B,C,D) it has seen (and thus what nodes it has learned the location of).
3.0. Repeat the previous exercise (2.7), with the same network layout, except that instead the following packet transmissions take
place:
A sends to B
B sends to A
C sends to B
D sends to A
For each switch, list what source addresses (eg A,B,C,D) it has seen (and thus what nodes it has learned the location of).

4.0. In the switched-Ethernet network below, find two packet transmissions so that, when a third transmission A⟶D occurs, the
packet is seen by B (that is, it is flooded out all ports by S2), but is not similarly seen by C (because it is forwarded to D, not
flooded, by S3). All forwarding tables are initially empty, and each switch uses the learning algorithm of 2.5 Ethernet Switches.

B C
│ │
A───S1────────S2────────S3───D

Hint: Destination D must be in S3’s forwarding table, but must not be in S2’s. So there must have been a packet sent by D that was
seen by S3 but not by S2.
5.0. Given the Ethernet network with learning switches below, with (disjoint) unspecified parts represented by ?, explain why it is
impossible for a packet sent from A to B to be forwarded by S1 directly to S2, but to be flooded by S2 out all of S2’s other ports.

? ?
| |
A───S1────────S2───B

6.0. In the diagram below, from 2.5.1 Ethernet Learning Algorithm, suppose node D is connected to S5. Now, with the tables as
shown by the labels in the diagram (that is, S5 knows about A and C, etc), D sends to B.

A B A B A,C B
A S1 S2 S3 B
C

A
C A,C
C S4 S5 D

Which switches will see this D→B packet, and thus learn about D? Of these switches, which do not already know where B is and
will use fallback-to-flooding?
7.0. Suppose two Ethernet switches are connected in a loop as follows; S1 and S2 have their interfaces 1 and 2 labeled. These
switches do not use the spanning-tree algorithm.

        1 ──────────── 1
A───0──S1              S2
        2 ──────────── 2

Suppose A attempts to send a packet to destination B, which is unknown. S1 will therefore flood the packet out interfaces 1 and 2.
What happens then? How long will A’s packet circulate?
8.0. The following network is like that of 2.6.1 Example 1: Switches Only, except that the switches are numbered differently.
Again, the ID of switch Sn is n, so S1 will be the root. Which links end up “pruned” by the spanning-tree algorithm, and why?
Diagram the network formed by the surviving links.

S1──────S4──────S6
│ │ │
│ │ │
│ │ │
S3──────S5──────S2

8.5. Consider the network below, consisting of just the first two rows from the datacenter diagram in 2.9.1 OpenFlow Switches:
S1 S2

S3 S4 S5

a. ♢ Give the network of surviving links after application of the spanning-tree algorithm. Assume the ID of switch Sn is n. In this
network, what is the path of traffic from S2 to S5?
b. Do the same as part (a) except assuming S4 has ID 0, and so will be the root, while the ID for the other Sn remains n. What will
be the path of traffic from S1 to S5?
9.0. Suppose you want to develop a new protocol so that Ethernet switches participating in a VLAN all keep track of the VLAN
“color” associated with every destination in their forwarding tables. Assume that each switch knows which of its ports (interfaces)
connect to other switches and which may connect to hosts, and in the latter case knows the color assigned to that port.
a. Suggest a way by which switches might propagate this destination-color information to other switches.
b. What must be done if a port formerly reserved for connection to another switch is now used for a host?
10.0. (This exercise assumes some familiarity with Distance-Vector routing as in 9 Routing-Update Algorithms.)
a. Suppose switches are able to identify the non-switch hosts that are directly connected, that is, reachable without passing
through another switch. Explain how the algorithm of 9.1 Distance-Vector Routing-Update Algorithm could be used to
construct optimal Ethernet forwarding tables even if loops were present in the network topology.
b. Suppose switches are allowed to “mark” packets; all packets are initially unmarked. Give a mechanism that allows switches to
detect which non-switch hosts are directly connected.
c. Explain why Ethernet broadcast (and multicast) would still be a problem.
11.0. Consider the scenario from 2.5.1 Ethernet Learning Algorithm:

A B A B A,C B
A S1 S2 S3 B
C

A
C A,C
C S4 S5

Five learning bridges after three packet transmissions

A sends to B
B sends to A
C sends to B
Now suppose that, before each packet transmission above, the sender first sends a broadcast packet, and the destination then sends
a unicast reply packet (this is roughly the ARP protocol, used to translate from IPv4 addresses to Ethernet physical addresses, 7.9
Address Resolution Protocol: ARP). After the three transmissions listed above, what destinations do the switches S1-S5 have in
their forwarding tables?
12.0. ♢ Consider the following arrangement of three hosts h1, h2, h3 and one OpenFlow switch S with ports 1, 2 and 3 and
controller C (not shown).

       1     2
h1─────S─────h2
       │
       3
       │
       h3

Four packets are then transmitted:

(a). h1→h2

(b). h2→h1
(c). h3→h1
(d). h2→h3
Assume that S reports to C all packets with unknown destination, that is, all packets for which S does not have a forwarding entry
for that packet’s destination. Packet reports include the source and destination addresses and the arrival port. On receiving a report,
if the source address is previously unknown then C installs on S a forwarding-table entry for that source address. At that point S
uses its forwarding table (including any new entries) to forward the packet, if a suitable entry exists. Otherwise S floods the packet
as usual.
For the four packets above, indicate
1. whether S reports the packet to C
2. if so, any new forwarding entry C installs on S
3. whether S is then able to forward the packet using its table, or must fall back to flooding.
(If S does not report the packet to C then S must have had a forwarding-table entry for that destination, and so S is able to forward
the packet normally.)
12.2. Consider again the arrangement of exercise 12.0 of three hosts h1, h2, h3 and one OpenFlow switch S with ports 1, 2 and 3
and controller C (not shown).

       1     2
h1─────S─────h2
       │
       3
       │
       h3

The same four packets are transmitted:

(a). h1→h2
(b). h2→h1
(c). h3→h1
(d). h2→h3
This time, assume that S reports to C all packets with unknown destination or unknown source (that is, S does not have a
forwarding entry for either the packet’s source or destination address). For the four packets above, indicate
1. whether S reports the packet to C
2. if so, any new forwarding entry C installs on S
3. whether S is then able to forward the packet using its table, or must fall back to flooding.
As before, packet reports include the source and destination addresses and the arrival port. On receiving a report, if the source
address is previously unknown then C installs on S a forwarding-table entry for that source address. At that point S uses its
forwarding table (including any new entries) to forward the packet, if a suitable entry exists. Otherwise S floods the packet as
usual. Again, if S does not report a packet to C then S must have had a forwarding-table entry for that destination, and so is able to
forward the packet normally.
13.0 Consider the following arrangement of three switches S1-S3, three hosts h1-h3 and one OpenFlow controller C.

h1        h2        h3
│         │         │
S1────────S2────────S3

As with exercise 12.0, assume that the switches report packets to C only if they do not already have a forwarding-table entry for the
packet’s destination. After each report, C installs a forwarding-table entry on the reporting switch for reaching the packet’s source
address via the arrival port. At that point the switch floods the packet (as the destination must not have been known). If a switch
can forward a packet without reporting to C, no new forwarding entries are installed.
Packets are now sent as follows:

h1→h2
h2→h1
h1→h3
h3→h1
h2→h3
h3→h2
At the end, what are the forwarding tables on S1♢, S2 and S3?
14.0 Here are the switch rules for the multiple-flow-table example in 2.9.2 Learning Switches in OpenFlow:

Table    match field    match action              no-match default
T0       destaddr       forward and send to T1    flood and send to T1
T1       srcaddr        do nothing                send to controller

Give a similar table where the matches are reversed; that is, T0 matches the srcaddr field and T1 matches the destaddr field.

CHAPTER OVERVIEW

3: Other LANs
In the wired era, one could get along quite well with nothing but Ethernet and the occasional long-haul point-to-point link joining
different sites. However, there are important alternatives out there. Some, like token ring, are mostly of historical importance;
others, like virtual circuits, are of great conceptual importance but – so far – of only modest day-to-day significance. And then there
is wireless. It would be difficult to imagine contemporary laptop networking, let alone mobile devices, without it. In both homes
and offices, Wi-Fi connectivity is the norm. Mobile networking is ubiquitous. A return to being tethered by wires is almost
unthinkable.
3.1: Virtual Private Networks
3.2: Carrier Ethernet
3.3: Token Ring
3.4: Virtual Circuits
3.5: Asynchronous Transfer Mode - ATM
3.6: Adventures in Radioland
3.7: Wi-Fi
3.8: WiMAX and LTE
3.9: Fixed Wireless
3.10: Epilog and Exercises

3.1: Virtual Private Networks
Suppose you want to connect to your workplace network from home. Your workplace, however, has a security policy that does not
allow “outside” IP addresses to access essential internal resources. How do you proceed, without leasing a dedicated
telecommunications line to your workplace?
A virtual private network, or VPN, provides a solution; it supports creation of virtual links that join far-flung nodes via the
Internet. Your home computer creates an ordinary Internet connection (TCP or UDP) to a workplace VPN server (IP-layer packet
encapsulation can also be used, and avoids the timeout problems sometimes created by sending TCP packets within another TCP
stream; see 7.13 Mobile IP). Each end of the connection is typically associated with a software-created virtual network interface;
each of the two virtual interfaces is assigned an IP address. (Virtual interfaces are not essential; VPNs created with IPsec, 22.11
IPsec, generally omit them.) When a packet is to be sent along the virtual link, it is actually encapsulated and sent along the original
Internet connection to the VPN server, wending its way through the commodity Internet; this process is called tunneling. To all
intents and purposes, the virtual link behaves like any other physical link.
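
As a bare-bones illustration of encapsulation, the following Python sketch carries each outbound packet as the payload of an ordinary UDP datagram; the server name and port are made up, and a real VPN would add authentication and, usually, encryption.

import socket

VPN_SERVER = ("vpn.example.com", 5000)   # hypothetical server and UDP port

def tunnel(raw_packet: bytes, sock: socket.socket) -> None:
    """Encapsulation: the entire original packet, headers and all,
    becomes the payload of an ordinary UDP datagram to the VPN server."""
    sock.sendto(raw_packet, VPN_SERVER)

def detunnel(sock: socket.socket) -> bytes:
    """At the far end: the payload is the original packet, ready to be
    injected into the workplace network via the virtual interface."""
    payload, _addr = sock.recvfrom(65535)
    return payload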
Tunneled packets are often encrypted as well as encapsulated, though that is a separate issue. One relatively easy-to-implement
example of a tunneling mechanism is to treat a TCP home-workplace connection as a serial line and send packets over it back-to-
back, using PPP with HDLC; see 4.1.5.1 HDLC and RFC 1661 (though this can lead to the above-mentioned TCP-in-TCP timeout
problems).
At the workplace side, the virtual network interface in the VPN server is attached to a router or switch; at the home user’s end, the
virtual network interface can now be assigned an internal workplace IP address. The home computer is now, for all intents and
purposes, part of the internal workplace network.
In the diagram below, the user’s regular Internet connection is via hardware interface eth0 . A connection is established to Site
A’s VPN server; a virtual interface tun0 is created on the user’s machine which appears to be a direct link to the VPN server.
The tun0 interface is assigned a Site-A IP address. Packets sent via the tun0 interface in fact travel over the original
connection via eth0 and the Internet.

[Figure: a user at home connects through the Internet, via hardware interface eth0, to Site A’s VPN server at 200.0.1.37, which sits
at the edge of Site A’s private network, 200.0.1/24. The tun0 interface on the user’s machine is a virtual network interface with a
Site-A address, and appears to be a direct link (the tunnel) to the VPN server; the actual connection is made via eth0.]

After the VPN is set up, the home host’s tun0 interface appears to be locally connected to Site A, and thus the home host is
allowed to connect to the private area within Site A. The home host’s forwarding table will be configured so that traffic to Site A’s
private addresses is routed via interface tun0 .
VPNs are also commonly used to connect entire remote offices to headquarters. In this case the remote-office end of the tunnel will
be at that office’s local router, and the tunnel will carry traffic for all the workstations in the remote office.
Other applications of VPNs include trying to appear geographically to be at another location, and bypassing firewall rules blocking
specific TCP or UDP ports.
To improve security, it is common for the residential (or remote-office) end of the VPN connection to use the VPN connection as
the default route for all traffic except that needed to maintain the VPN itself. This may require a so-called host-specific forwarding-
table entry at the residential end to allow the packets that carry the VPN tunnel traffic to be routed correctly via eth0 . This
routing strategy means that potential intruders cannot access the residential host – and thus the workplace internal network –
through the original residential Internet access. A consequence is that if the home worker downloads a large file from a non-
workplace site, it will travel first to the workplace, then back out to the Internet via the VPN connection, and finally arrive at the
home.
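
For the configuration just described, using the addresses from the diagram above, the residential host’s forwarding table might end up looking something like this (illustrative only):

destination            interface
200.0.1.37/32          eth0   (host-specific entry: the VPN server itself, reached via the regular ISP connection)
0.0.0.0/0 (default)    tun0   (all other traffic is sent through the tunnel)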

To improve congestion response, IP packets are sometimes marked by routers that are experiencing congestion; see 14.8.3 Explicit
Congestion Notification (ECN). If such marking is done to the outer, encapsulating, packet, and the marks are not transferred at the
remote endpoint of the VPN to the inner, encapsulated, packet, then the marks are lost. Congestion response may suffer. RFC 6040
spells out a proper re-marking strategy in general; RFC 7296 defines re-marking for IPsec (22.11 IPsec). Older VPN protocols,
however, may not support congestion re-marking.

3.2: Carrier Ethernet
Carrier Ethernet is a leased-line point-to-point link between two sites, where the subscriber interface at each end of the line looks
like Ethernet (in some flavor). The physical path in between sites, however, need not have anything to do with Ethernet; it may be
implemented however the carrier wishes. In particular, it will be (or at least appear to be) full-duplex, it will be collision-free, and
its length may far exceed the maximum permitted by any IEEE Ethernet standard.
Bandwidth can be purchased in whatever increments the carrier has implemented. The point of carrier Ethernet is to provide a layer
of abstraction between the customers, who need only install a commodity Ethernet interface, and the provider, who can upgrade the
link implementation at will without requiring change at the customer end.
In a sense, carrier Ethernet is similar to the widespread practice of provisioning residential DSL and cable routers with an Ethernet
interface for customer interconnection; again, the actual link technologies may not look anything like Ethernet, but the interface
will.
A carrier Ethernet connection looks like a virtual VPN link, but runs on top of the provider’s internal network rather than the
Internet at large. Carrier Ethernet connections often provide the primary Internet connectivity for one endpoint, unlike Internet
VPNs which assume both endpoints already have full Internet connectivity.

3.3: Token Ring
A significant part of the previous chapter was devoted to classic Ethernet’s collision mechanism for supporting shared media
access. After that, it may come as a surprise that there is a simple multiple-access mechanism that is not only collision-free, but
which supports fairness in the sense that if N stations wish to send then each will receive 1/N of the opportunities.
That method is Token Ring. Actual implementations come in several forms, from Fiber-Distributed Data Interface (FDDI) to so-
called “IBM Token Ring”. The central idea is that stations are connected in a ring:
[Figure: six stations A, B, C, D, E and F connected in a ring.]

Packets will be transmitted in one direction (clockwise in the ring above). Stations in effect forward most packets around the ring,
although they can also remove a packet. (It is perhaps more accurate to think of the forwarding as representing the default cable
connectivity; non-forwarding represents the station’s momentarily breaking that connectivity.)
When the network is idle, all stations agree to forward a special, small packet known as a token. When a station, say A, wishes to
transmit, it must first wait for the token to arrive at A. Instead of forwarding the token, A then transmits its own packet; this travels
around the network and is then removed by A. At that point (or in some cases at the point when A finishes transmitting its data
packet) A then forwards the token.
In a small ring network, the ring circumference may be a small fraction of one packet. Ring networks become “large” at the point
when some packets may be entirely in transit on the ring. Slightly different solutions apply in each case. (It is also possible that the
physical ring exists only within the token-ring switch, and that stations are connected to that switch using the usual point-to-point
wiring.)
If all stations have packets to send, then we will have something like the following:
A waits for the token
A sends a packet
A sends the token to B
B sends a packet
B sends the token to C
C sends a packet
C sends the token to D

All stations get an equal number of chances to transmit, and no bandwidth is wasted on collisions. (A station constantly sending
smaller packets will send the same number of packets as a station constantly sending larger packets, but the bandwidth will be
smaller in proportion to the smaller packet size.)
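The round-robin fairness here is easy to see in a toy simulation; the following Python sketch (the stations and rotation count are hypothetical) passes the token around the ring and counts each station's transmission opportunities.

    # Each station sends one packet per token visit; over ten full
    # rotations every station gets exactly ten opportunities.
    stations = ['A', 'B', 'C', 'D', 'E', 'F']
    sent = {s: 0 for s in stations}

    token = 0                          # index of the station holding the token
    for _ in range(10 * len(stations)):
        sent[stations[token]] += 1     # holder transmits, then passes the token on
        token = (token + 1) % len(stations)

    print(sent)                        # {'A': 10, 'B': 10, ..., 'F': 10}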
One problem with token ring is that when stations are powered off it is essential that the packets continue forwarding; this is
usually addressed by having the default circuit configuration be to keep the loop closed. Another issue is that some station has to
watch out in case the token disappears, or in case a duplicate token appears.
Because of fairness and the lack of collisions, IBM Token Ring was once considered to be the premium LAN mechanism. As such,
Token Ring hardware commanded a substantial price premium. But due to Ethernet’s combination of lower hardware costs and
higher bitrates (even taking collisions into account), the latter eventually won out.
There was also a much earlier collision-free hybrid of 10 Mbps Ethernet and Token Ring known as Token Bus: an Ethernet
physical network (often linear) was used with a token-ring-like protocol layer above that. Stations were physically connected to the
(linear) Ethernet but were assigned identifiers that logically arranged them in a (virtual) ring. Each station had to wait for the token
and only then could transmit a packet; after that it would send the token on to the next station in the virtual ring. As with “real”
Token Ring, some mechanisms need to be in place to monitor for token loss.

Token Bus Ethernet never caught on. The additional software complexity was no doubt part of the problem, but perhaps the real
issue was that it was not necessary.

3.4: Virtual Circuits
Before we can get to our final LAN example, ATM, we need to detour briefly through virtual circuits.
Virtual circuits are The Road Not Taken by IP.

The Road Not Taken


A close reading of Robert Frost’s poem referenced here reveals that the supposed great difference between the two roads exists
only in the narrator’s retrospective imaginings; the roads were in fact “really about the same”. Perhaps this would also apply to
datagram and virtual-circuit forwarding, though see below on per-connection billing.

Virtual-circuit switching (or routing) is an alternative to datagram switching, which was introduced in Chapter 1. In datagram
switching, routers know the next_hop to each destination, and packets are addressed by destination. In virtual-circuit switching,
routers know about end-to-end connections, and packets are “addressed” by a connection ID.
Before any data packets can be sent, a connection needs to be established first. For that connection, the route is computed and then
each link along the path is assigned a connection ID, traditionally called the VCI, for Virtual Circuit Identifier. In most cases, VCIs
are only locally unique; that is, the same connection may use a different VCI on each link. The lack of global uniqueness makes
VCI allocation much simpler. Although the VCI keeps changing along a path, the VCI can still be thought of as identifying the
connection. To send a packet, the host marks the packet with the VCI assigned to the host–router1 link.
Packets arrive at (and depart from) switches via one of several ports, which we will assume are numbered beginning at 0. Switches
maintain a connection table indexed by ⟨VCI,port⟩ pairs; unlike a forwarding table, the connection table has a record of every
connection through that switch at that particular moment. As a packet arrives, its inbound VCIin and inbound portin are looked up in
this table; this yields an outbound ⟨VCIout,portout⟩ pair. The VCI field of the packet is then rewritten to VCIout, and the packet is
sent via portout.
Note that typically there is no source address information included in the packet (although the sender can be identified from the
connection, which can be identified from the VCI at any point along the connection). Packets are identified by connection, not
destination. Any node along the path (including the endpoints) can in principle look up the connection and figure out the endpoints.
Note also that each switch must rewrite the VCI. Datagram switches never rewrite addresses (though they do update hopcount/TTL
fields). The advantage to this rewriting is that VCIs need be unique only for a given link, greatly simplifying the naming. Datagram
switches also do not make use of a packet’s arrival interface.
As an example, consider the network below. Switch ports are numbered 0,1,2,3. Two paths are drawn in, one from A to F in red
and one from B to D in green; each link is labeled with its VCI number in the same color.
[Figure: hosts A, B, C, D, E and F attached to switches S1-S5; switch ports are numbered 0-3. The A-to-F path (red) runs A–S1–S2–S4–S5–F, and the B-to-D path (green) runs B–S3–S1–S2–D; each link is labeled with its VCI in the matching color.]

We will construct virtual-circuit connections between


A and F (shown above in red)
A and E
A and C
B and D (shown above in green)
A and F again (a separate connection)
The following VCIs have been chosen for these connections. The choices are made more or less randomly here, but in accordance
with the requirement that they be unique to each link. Because links are generally taken to be bidirectional, a VCI used from S1 to
S3 cannot be reused from S3 to S1 until the first connection closes.

A to F: A──4──S1──6──S2──4──S4──8──S5──5──F; this path goes from S1 to S4 via S2
A to E: A──5──S1──6──S3──3──S4──8──E; this path goes, for no particular reason, from S1 to S4 via S3, the opposite
corner of the square
A to C: A──6──S1──7──S3──3──C
B to D: B──4──S3──8──S1──7──S2──8──D
A to F: A──7──S1──8──S2──5──S4──9──S5──2──F
One may verify that on any one link no two different paths use the same VCI.
We now construct the actual ⟨VCI,port⟩ tables for the switches S1-S4, from the above; the table for S5 is left as an exercise. Note that either the ⟨VCIin,portin⟩ pair or the ⟨VCIout,portout⟩ pair can be used as the key; we cannot have the same pair in both the in columns and the out columns. It may help to display the port numbers for each switch, as in the following diagram of the above red connection from A to F; the bracketed numbers are ports and the numbers on the links are VCIs:

A ──4── [0] S1 [2] ──6── [0] S2 [1] ──4── [3] S4 [2] ──8── [0] S5 [1] ──5── F

Switch S1:

    VCIin   portin   VCIout   portout   connection
      4       0        6        2       A⟶F #1
      5       0        6        1       A⟶E
      6       0        7        1       A⟶C
      8       1        7        2       B⟶D
      7       0        8        2       A⟶F #2

Switch S2:

    VCIin   portin   VCIout   portout   connection
      6       0        4        1       A⟶F #1
      7       0        8        2       B⟶D
      8       0        5        1       A⟶F #2

Switch S3:

    VCIin   portin   VCIout   portout   connection
      6       3        3        2       A⟶E
      7       3        3        1       A⟶C
      4       0        8        3       B⟶D

Switch S4:

    VCIin   portin   VCIout   portout   connection
      4       3        8        2       A⟶F #1
      3       0        8        1       A⟶E
      5       3        9        2       A⟶F #2

The namespace for VCIs is small, and compact (eg contiguous). Typically the VCI and port bitfields can be concatenated to
produce a ⟨VCI,Port⟩ composite value small enough that it is suitable for use as an array index. VCIs work best as local identifiers.
IP addresses, on the other hand, need to be globally unique, and thus are often rather sparsely distributed.
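As an illustration of how simple the lookup is, here is switch S1's connection table from above rendered as a Python dict, together with a minimal sketch of the lookup-rewrite-forward step; a real switch would of course use a hardware table rather than a dict.

    # Switch S1's connection table, keyed by the inbound (VCI, port) pair.
    s1_table = {
        (4, 0): (6, 2),   # A->F #1
        (5, 0): (6, 1),   # A->E
        (6, 0): (7, 1),   # A->C
        (8, 1): (7, 2),   # B->D
        (7, 0): (8, 2),   # A->F #2
    }

    def forward(table, vci_in, port_in):
        vci_out, port_out = table[(vci_in, port_in)]   # look up the connection
        return vci_out, port_out                       # rewrite VCI, send via port_out

    print(forward(s1_table, 4, 0))    # (6, 2): the first A-to-F connection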

Virtual-circuit switching offers the following advantages:
connections can get quality-of-service guarantees, because the switches are aware of connections and can reserve capacity at the
time the connection is made
headers are smaller, allowing faster throughput
headers are small enough to allow efficient support for the very small packet sizes that are optimal for voice connections. ATM
packets, for instance, have 48 bytes of data; see below.
Datagram forwarding, on the other hand, offers these advantages:
Routers have less state information to manage.
Router crashes and partial connection state loss are not a problem.
If a router or link is disabled, rerouting is easy and does not affect any connection state. (As mentioned in Chapter 1, this was
Paul Baran’s primary concern in his 1962 paper introducing packet switching.)
Per-connection billing is very difficult.
The last point above may once have been quite important; in the era when the ARPANET was being developed, typical daytime
long-distance rates were on the order of $1/minute. It is unlikely that early TCP/IP protocol development would have been as fertile
as it was had participants needed to justify per-minute billing costs for every project.
It is certainly possible to do virtual-circuit switching with globally unique VCIs – say the concatenation of source and destination
IP addresses and port numbers. The IP-based RSVP protocol (20.6 RSVP) does exactly this. However, the fast-lookup and small-
header advantages of a compact namespace are then lost.
Multi-Protocol Label Switching (20.12 Multi-Protocol Label Switching (MPLS)) is another IP-based application of virtual circuits.
Note that virtual-circuit switching does not suffer from the problem of idle channels still consuming resources, which is an issue
with circuits using time-division multiplexing (eg shared T1 lines).

3.5: Asynchronous Transfer Mode - ATM
ATM is a network mechanism intended to accommodate real-time traffic as well as bulk data transfer. We present ATM here as a
LAN layer, for which it is still sometimes used, but it was originally proposed as a replacement for the IP layer as well, and, to an
extent, the Transport layer. These broader plans were not greeted with universal enthusiasm within the IETF. When used as a LAN
layer, IP packets are transmitted over ATM as in 3.5.1 ATM Segmentation and Reassembly.
A distinctive feature of ATM is its small packet size. ATM has its roots in the telephone industry, and was therefore particularly
intended to support voice. A significant source of delay in voice traffic is the packet fill time: at DS0 speeds (64 kbps), voice data
accumulates at 8 bytes/ms. If we are sending 1 kB packets, this means voice is delayed by about 1/8 second, meaning in turn that
when one person stops speaking, the earliest they can hear the other's response is 1/4 second later. Even somewhat smaller levels of voice
delay can introduce an annoying echo. Smaller packets reduce the fill time and thus the delay: when voice is sent over IP (VoIP),
one common method is to send 160 bytes every 20 ms.
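The fill-time arithmetic is easy to verify; this short Python computation uses the packet sizes mentioned in this section (1 kB, the 160-byte VoIP payload, and the 48-byte ATM payload, below).

    # At DS0 speed (64 kbps) voice data accumulates at 8 bytes/ms.
    bytes_per_ms = 64000 / 8 / 1000                  # = 8.0
    for size in (1024, 160, 48):
        print(size, 'bytes:', size / bytes_per_ms, 'ms fill time')
    # 1024 bytes: 128 ms (about 1/8 second); 160 bytes: 20 ms; 48 bytes: 6 ms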
ATM took this small-packet strategy even further: packets have 48 bytes of data, plus 5 bytes of header. Such small packets are
often called cells. To manage such a small header, virtual-circuit routing is a necessity. IP packets of such small size would likely
consume more than 50% of the bandwidth on headers, if the LAN header were included.
Aside from reduced voice fill-time, other benefits to small cells are reduced store-and-forward delay and minimal queuing delay, at
least for high-priority traffic. Prioritizing traffic and giving precedence to high-priority traffic is standard, but high-priority traffic is
never allowed to interrupt transmission already begun of a low-priority packet. If you have a high-priority voice cell, and someone
else has a 1500-byte packet just started, your cell has to wait about 30 cell times, because 1500 bytes is about 30 cells. However, if
their low-priority traffic is instead made up of 30 cells, you have only to wait for their first cell to finish; the delay is 1/30 as much.
ATM also made the decision to require fixed-size cells. The penalty for one partially used cell among many is small. Having a
fixed cell size simplifies hardware design and, in theory, makes it easier to design for parallelism.
Unfortunately, the designers of ATM also chose to mandate no cell reordering. This means cells can use a smaller sequence-
number field, but also makes parallel switches much harder to build. A typical parallel switch design might involve distributing
incoming cells among any of several input queues; the queues would then handle the VCI lookups in parallel and forward the cells
to the appropriate output queues. With such an architecture, avoiding reordering is difficult. It is not clear to what extent the no-
reordering decision was related to the later decline of ATM in the marketplace.
ATM cells have 48 bytes of data and a 5-byte header. The header contains up to 28 bits of VCI information, three “type” bits, one
cell-loss priority, or CLP, bit, and an 8-bit checksum over the header only. The VCI is divided into 8-12 bits of Virtual Path
Identifier and 16 bits of Virtual Channel Identifier, the latter supposedly for customer use to separate out multiple connections
between two endpoints. Forwarding is by full switching only, and there is no mechanism for physical (LAN) broadcast.

3.5.1 ATM Segmentation and Reassembly


Due to the small packet size, ATM defines its own mechanisms for segmentation and reassembly of larger packets. Thus, individual
ATM links in an IP network are quite practical. These mechanisms are called ATM Adaptation Layers, and there are four of
them: AALs 1, 2, 3/4 and 5 (AAL 3 and AAL 4 were once separate layers, which merged). AALs 1 and 2 are used only for voice-
type traffic; we will not consider them further.
The ATM segmentation-and-reassembly mechanism defined here is intended to apply only to large data; no cells are ever further
subdivided. Furthermore, segmentation is always applied at the point where the data enters the network; reassembly is done at exit
from the ATM path. IPv4 fragmentation, on the other hand, applies conceptually to IP packets, and may be performed by routers
within the network.
For AAL 3/4, we first define a high-level “wrapper” for an IP packet, called the CS-PDU (Convergence Sublayer - Protocol Data
Unit). This prefixes 32 bits on the front and another 32 bits (plus padding) on the rear. We then chop this into as many 44-byte
chunks as are needed; each chunk goes into a 48-byte ATM payload, along with the following 32 bits worth of additional
header/trailer:
2-bit type field:
10: begin new CS-PDU
00: continue CS-PDU
01: end of CS-PDU
11: single-segment CS-PDU
4-bit sequence number, 0-15, good for catching up to 15 dropped cells
10-bit MessageID field
CRC-10 checksum.
We now have a total of 9 bytes of header for 44 bytes of data; this is more than 20% overhead. This did not sit well with the IP-
over-ATM community (such as it was), and so AAL 5 was developed.
AAL 5 moved the checksum to the CS-PDU and increased it to 32 bits from 10 bits. The MID field was discarded, as no one used
it, anyway (if you wanted to send several different types of messages, you simply created several virtual circuits). A bit from the
ATM header was taken over and used to indicate:
1: start of new CS-PDU
0: continuation of an existing CS-PDU
The CS-PDU is now chopped into 48-byte chunks, which are then used as the entire body of each ATM cell. With 5 bytes of header
for 48 bytes of data, overhead is down to 10%. Errors are detected by the CS-PDU CRC-32. This also detects lost cells (impossible
with a per-cell CRC!), as we no longer have any cell sequence number.
For both AAL3/4 and AAL5, reassembly is simply a matter of stringing together consecutive cells in order of arrival, starting a
new CS-PDU whenever the appropriate bits indicate this. For AAL3/4 the receiver has to strip off the 4-byte AAL3/4 headers; for
AAL5 the receiver has to verify the CRC-32 checksum once all cells are received. Different cells from different virtual circuits can
be jumbled together on the ATM “backbone”, but on any one virtual circuit the cells from one higher-level packet must be sent one
right after the other.
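Here is a minimal Python sketch of the AAL5-style chop-into-cells step described above. The CS-PDU trailer is simplified here to just the CRC-32 plus zero-padding; the actual trailer layout is not spelled out in this sketch.

    import binascii

    def aal5_segment(data):
        crc = binascii.crc32(data).to_bytes(4, 'big')
        cs_pdu = data + crc
        cs_pdu += bytes((-len(cs_pdu)) % 48)      # zero-pad to a multiple of 48
        cells = []
        for i in range(0, len(cs_pdu), 48):
            start_bit = 1 if i == 0 else 0        # 1 = start of new CS-PDU, 0 = continuation
            cells.append((start_bit, cs_pdu[i:i+48]))
        return cells

    cells = aal5_segment(b'x' * 1000)             # a typical ~1 kB IP packet
    print(len(cells), 'cells of 48 bytes each')   # 21 cells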
A typical IP packet divides into about 20 cells. For AAL 3/4, this means a total of 200 bits devoted to CRC codes, versus only 32
bits for AAL 5. It might seem that AAL 3/4 would be more reliable because of this, but, paradoxically, it was not! The reason for
this is that errors are rare, and so we typically have one or at most two per CS-PDU. Suppose we have only a single error, ie a
single cluster of corrupted bits small enough that it is likely confined to a single cell. In AAL 3/4 the CRC-10 checksum will fail to
detect that error (that is, the checksum of the corrupted packet will by chance happen to equal the checksum of the original packet)
with probability 1/2^10. The AAL 5 CRC-32 checksum, however, will fail to detect the error with probability 1/2^32. Even if there are
enough errors that two cells are corrupted, the two CRC-10s together will fail to detect the error with probability 1/2^20; the CRC-32
is better. AAL 3/4 is more reliable only when we have errors in at least four cells, at which point we might do better to switch to an
error-correcting code.
Moral: one checksum over the entire message is often better than multiple shorter checksums over parts of the message.
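For reference, the probabilities cited in this argument, evaluated in Python:

    p_one_cell  = 1 / 2**10    # AAL 3/4, one corrupted cell:  ~9.8e-04
    p_two_cells = 1 / 2**20    # AAL 3/4, two corrupted cells: ~9.5e-07
    p_aal5      = 1 / 2**32    # AAL 5, CRC-32:                ~2.3e-10
    print(p_one_cell, p_two_cells, p_aal5)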

3.6: Adventures in Radioland
For the remainder of this chapter we leave wires (and fiber) behind, and contemplate the transmission of packets via radio, freeing
nodes from their cable tethers. Wi-fi (3.7 Wi-Fi) and mobile wireless (3.8 WiMAX and LTE) are now ubiquitous. But radio is not
quite like wire, and wireless transmission of packets brings several changes.

3.6.1 Privacy
It’s hard to tap into wired Ethernet, especially if you are locked out of the building. But anyone can receive wireless transmissions,
often from a considerable distance. The data breach at TJX Corporation was achieved by attackers parking outside a company
building and pointing a directional antenna at it; encryption was used but it was weak (see 22 Security and 22.7.7 Wi-Fi WEP
Encryption Failure). Similarly, Internet café visitors generally don’t want other patrons to read their email. Radio communication
needs strong encryption.

3.6.2 Collisions
Ethernet-like collision detection is no longer feasible over radio. This has to do with the relative signal strength of the remote
signal at the local transmitter. Along a wire-based Ethernet the remote signal might be as weak as 1/100 of the transmitted signal
but that 1% received signal is still detectable during transmission. However, with radio the remote signal might easily be as little as
1/1,000,000 of the transmitted signal (-60 dB), as measured at the transmitting station, and it is simply overwhelmed during
transmission.
As a result, wireless protocols must be constructed appropriately. We will look at how Wi-Fi handles this in its most common mode
of operation in 3.7.1 Wi-Fi and Collisions. Wi-Fi also supports its PCF mode (3.7.7 Wi-Fi Polling Mode) that involves fewer (but
not zero) collisions through the use of central-point polling. Finally, WiMAX and LTE switch from polling to scheduling to further
reduce collisions, though the potential for collisions is still inevitable when new stations join the network.
It is also worth pointing out that, while an Ethernet collision affects every station in the physical Ethernet (the “collision domain”),
wireless collisions are local, occurring at the receiver. Two stations can transmit at the same time, and in range of one another, but
without a collision! This can happen if each of the relevant receivers is in range of only one of the two transmitting stations. As an
example, suppose three stations are arranged linearly, A–C–B, with the A–C and C–B distances just under the maximum effective
range. When A and B both transmit there is indeed a collision at C. But when C and B transmit simultaneously, A may receive C’s
signal just fine, as B’s is too weak to interfere.

3.6.3 Hidden Nodes


In wireless communication, two nodes A and B that are not in range of one another – and thus cannot detect one another – may still
have their signals interfere at a third node C. This creates an additional complication to collision handling. See 3.7.1.4 Hidden-
Node Problem.

3.6.4 Band Width


To radio engineers, “band width” means the frequency range used by a signal, not the data transmission rate. No information can be
conveyed using a single frequency; even signaling by switching a carrier frequency off and on at a low rate “blurs” the carrier into
a band of nonzero width.
In keeping with this we will for the remainder of this chapter use the term “data rate” for what we have previously called
“bandwidth”. We will use the terms “channel width” or “width of the frequency band” for the frequency range.
All else being equal, the data rate achievable with a radio signal is proportional to the channel width. The constant of
proportionality is limited by the Shannon-Hartley theorem: the maximum data rate divided by the width of the frequency band is
log2(1+SNR), where SNR is the signal to noise power ratio. Noise here is assumed to have a specific statistical form known as
Gaussian white noise. If SNR is 127, for example, and the width of the frequency band is 1 MHz, then the maximum theoretical
data rate is 7 Mbps, where 7 = log2(128). If the signal power S drops by about half so SNR=63, the data rate falls to 6 Mbps, as 6 =
log2(64); the relationship between signal power and data rate is logarithmic.
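The Shannon-Hartley computation is simple enough to express directly; the following Python snippet reproduces the two examples above.

    from math import log2

    def shannon_hartley(width_hz, snr):
        return width_hz * log2(1 + snr)       # maximum data rate, bits per second

    print(shannon_hartley(1e6, 127) / 1e6)    # 7.0 Mbps
    print(shannon_hartley(1e6, 63) / 1e6)     # 6.0 Mbps after halving signal power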

3.6.4.1 OFDM
The actual data rate achievable, for a given channel width and SNR, depends on the signal encoding, or modulation, mechanism.
Most newer modulation mechanisms use “orthogonal frequency-division multiplexing”, OFDM, or some variant.
A central feature of OFDM is that one wider frequency band is divided into multiple narrow subchannels; each subchannel then
carries a proportional fraction of the total information signal, modulated onto a subchannel-specific carrier. All the subchannels can
be allocated to one transmission at a time (time-division multiplexing, 4.2 Time-Division Multiplexing), or disjoint sets of
subchannels can be allocated to different transmissions that can then proceed (at proportionally lower data rates) in parallel. The
latter is known as frequency-division multiplexing.
In many settings OFDM comes reasonably close to the Shannon-Hartley limit. Perhaps more importantly, OFDM also performs
reasonably well with multipath interference, below, which is endemic in urban and building-interior environments with their many
reflective surfaces. Multipath interference is, however, not necessarily comparable to the Gaussian noise assumed by the Shannon-
Hartley theorem. We will not address further technical details of OFDM here, except to note that implementation usually requires
some form of digital signal processing.
The OFDMA variant, with the MA standing for Multiple Access, allows multiple users to use nonoverlapping sets of subchannels,
thus allowing simultaneous transmission. It is an option available in 802.11ax.

3.6.5 Cost
Another fundamental issue is that everyone shares the same radio spectrum. For mobile wireless providers, this constraint has
driven prices to remarkable levels; the 2014-15 FCC AWS-3 auction raised almost $45 billion for 65 MHz (usable throughout the
entire United States). This works out to somewhat over $2 per megahertz per phone. The corresponding issue for Wi-Fi users in a
dense area is that all the available Wi-Fi bandwidth may be in use. Inside busy buildings one can often see dozens of Wi-Fi access
points competing for the same Wi-Fi channel; the result is that no user will be getting close to the nominal data rates of 3.7 Wi-Fi.
Higher data rates require wider frequency bands. To reduce costs in the face of fixed demand, the usual strategy is to make the
coverage zones smaller, either by reducing power (and adding more access points as appropriate), or by using directional antennas,
or both.

3.6.6 Multipath
While a radio signal generally covers a wide area – even with ordinary directional antennas – it does so in surprisingly non-uniform
ways. A signal may reach a receiver through a line-of-sight path and also several reflected paths, possibly of varying length. In
addition to reflection, the signal may be subject to reflection-like scattering and diffraction. All of this together is known as
multipath interference (or, if analog audio is involved, multipath distortion; in the analog TV era this was ghosting).

[Figure: at left, a line-of-sight path and a reflected path from A to B; at right, the superposition of the encoded data arriving via the two paths.]

The picture above shows two transmission paths from A to B. The respective carrier paths may interfere with or supplement one
another. The longer delay of the reflecting path (red) will also delay its encoded signal. The result, shown at right, is that the line-of-
sight and reflected data symbols may overlap and interfere with each other; this is known as intersymbol interference. Multipath
interference may even change the meaning of the data symbol as seen by the receiver; for example, the red and black low data-
signal peaks above at the point of the vertical dashed line may sum together so as to be received as a higher peak (assuming the
underlying carriers are in sync).
Multipath interference tends to lead to wide fluctuations in signal intensity with a period of about half a wavelength; this
phenomenon is known as multipath fading. As an example, the wavelength of FM radio stations in the United States is about 3
meters; in fringe reception areas it is not uncommon to pull a car forward a quarter wavelength and have a station go from clear to
indecipherable, or even for reception to switch to another station (on the same frequency but transmitted from another location)
altogether.

[Figure: signal-intensity map (simulated) in a room with walls of 40% reflectivity.]


The picture above is from a mathematical simulation intended to illustrate multipath fading. The walls of the room reflect 40% of
the signal from the transmitter located in the orange ball at the lower left. The transmitter transmits an unmodulated carrier signal,
which may be reflected off the walls any number of times; at any point in the room the total signal intensity is the sum over all
possible reflection paths. On the right-hand side, the small-scale blue ripples represent the received carrier strength variation due to
multipath interference between the line-of-sight and all the reflected paths. Note that the ripple size is about half a wavelength.
In comparison to this simulated intensity map, real walls tend to have a lower reflectivity, real rooms are not two-dimensional, and
real carriers are modulated. However, real rooms also introduce scattering, diffraction and shadowing from objects within, and
significant (3× to 10×) multipath-fading signal-strength variations are common in actual wireless settings.
Multipath fading can be either flat – affecting all frequencies more or less equally – or selective – affecting some frequencies
differently than others. It is quite possible for an OFDM channel (3.6.4.1 OFDM) to encounter selective fading of only some of its
subchannel frequencies.
Generally, multipath interference is a problem that engineers go to great lengths to overcome. However, as we shall see in 3.7.3
Multiple Spatial Streams, multipath interference can sometimes be put to positive use by allowing almost-adjacent antennas to
transmit and receive independent signals, thus increasing the effective throughput.
For an alternative example of multipath interference in which the signal strength has no ripples, see exercise 13.0.

3.6.7 Power
If you are cutting the network cable and replacing it with wireless, there is a good chance you will also want to cut the power cable
as well and replace it with batteries. This tends to make power consumption a very important issue. The Wi-Fi standard has
provisions for minimizing power usage by allowing a device to “doze” most of the time, waking periodically to check if any
packets are ready to be sent to it (see 3.7.4.1 Joining a Network). The 6LoWPAN project (IPv6 Low-power Wireless Personal Area
Network) is intended to support very low-power devices; see RFC 4919 and RFC 6282.

3.6.8 Tangle
Wireless is also used simply to replace cords and their attendant tangle, and, of course, the problem of incompatible connectors.
The low-power Bluetooth wireless standard is commonly used as a cable alternative for things like computer mice and telephone
headsets. Bluetooth is also a low-power network; for many applications the working range is about 10 meters. ZigBee is another
low-power small-scale network.

3.7: Wi-Fi
Wi-Fi is a trademark of the Wi-Fi Alliance denoting any of several IEEE wireless-networking protocols in the 802.11 family,
specifically 802.11a, 802.11b, 802.11g, 802.11n, 802.11ac and 802.11ax. (Strictly speaking, these are all amendments to the
original 802.11 standard, but they are also de facto standards in their own right.) Like classic Ethernet, Wi-Fi must deal with
collisions; unlike Ethernet, however, Wi-Fi is unable to detect collisions in progress, complicating the backoff and retransmission
algorithms. See 3.6.2 Collisions above.
Unlike any wired LAN protocol we have considered so far, in addition to normal data packets Wi-Fi also uses control and
management packets that exist entirely within the Wi-Fi LAN layer; these are not initiated by or delivered to higher network
layers. Control packets are used to compensate for some of the infelicities of the radio environment, such as the lack of collision
detection. Putting radio-essential control and management protocols within the Wi-Fi layer means that the IP layer can continue to
interact with the Wi-Fi LAN exactly as it did with Ethernet; no changes are required.
Wi-Fi is designed to interoperate freely with Ethernet at the logical LAN layer. Wi-Fi MAC (physical) addresses have the same 48-
bit size as Ethernet’s and the same internal structure (2.1.3 Ethernet Address Internal Structure). They also share the same
namespace: one is never supposed to see an Ethernet and a Wi-Fi interface with the same address. As a result, data packets can be
forwarded by switches between Ethernet and Wi-Fi; in many respects a Wi-Fi LAN attached to an Ethernet LAN looks like an
extension of the Ethernet LAN. See 3.7.4 Access Points.

Microwave Ovens and Wi-Fi


The impact of a running microwave oven on Wi-Fi signals is quite evident if the oven is between the sender and receiver. For
other configurations the effect may vary. Most ovens transmit only during one half of the A/C cycle, that is, they are on 1/60
sec and then off 1/60 sec; this may allow intervening transmission time.

Traditionally, Wi-Fi used the 2.4 GHz ISM (Industrial, Scientific and Medical) band also used by microwave ovens; 802.11a, however, used a 5 GHz band, 802.11n supports 5 GHz as an option, and the newer 802.11ac has returned to using 5 GHz exclusively. The
5 GHz band has reduced ability to penetrate walls, often resulting in a lower effective range (though in offices and multi-unit
housing this can be an advantage). The 5 GHz band provides many more usable channels than the 2.4 GHz band, resulting in much
less interference in “crowded” environments.
Wi-Fi radio spectrum is usually unlicensed, meaning that no special permission is needed to transmit but also that others may be
trying to use the same frequency band simultaneously. The availability of unlicensed channels in the 5 GHz band continues to
improve.
The table below summarizes the different Wi-Fi versions. All data bit rates assume a single spatial stream; channel widths are
nominal. The names in the far-right column have been introduced by the Wi-Fi Alliance as a more convenient designation for the
newer versions.

    IEEE name   maximum bit rate    frequency    channel width   new name
    802.11a     54 Mbps             5 GHz        20 MHz
    802.11b     11 Mbps             2.4 GHz      20 MHz
    802.11g     54 Mbps             2.4 GHz      20 MHz
    802.11n     65-150 Mbps         2.4/5 GHz    20-40 MHz       Wi-Fi 4
    802.11ac    78-867 Mbps         5 GHz        20-160 MHz      Wi-Fi 5
    802.11ax    up to 1200 Mbps     2.4/5+ GHz   20-160 MHz      Wi-Fi 6

The maximum bit rate is seldom achieved in practice. The effective bit rate must take into account, at a minimum, the time spent in
the collision-handling mechanism. More significantly, all the Wi-Fi variants above use dynamic rate scaling, below; the bit rate is
reduced up to tenfold (or more) in environments with higher error rates, which can be due to distance, obstructions, competing
transmissions or radio noise. All this means that, as a practical matter, getting 150 Mbps out of 802.11n requires optimum
circumstances; in particular, no competing senders and unimpeded line-of-sight transmission. 802.11n lower-end performance can
be as little as 10 Mbps, though 40-100 Mbps (for a 40 MHz channel) may be more typical.

The 2.4 GHz ISM band is divided by international agreement into up to 14 officially designated (and mostly adjacent) channels,
each about 5 MHz wide, though in the United States use may be limited to the first 11 channels. The 5 GHz band is similarly
divided into 5 MHz channels. One Wi-Fi sender, however, needs several of these official channels; the typical 2.4 GHz 802.11g
transmitter uses an actual frequency range of up to 22 MHz, or up to five official channels. As a result, to avoid signal overlap Wi-
Fi use in the 2.4 GHz band is often restricted to official channels 1, 6 and 11. The end result is that there are generally only three
available Wi-Fi bands in the 2.4 GHz range, and so Wi-Fi transmitters can and do interact with and interfere with each other.
There are almost 200 5-MHz channels in the 5 GHz band. The United States requires users of this band to avoid interfering with
weather and military applications in the same frequency range; this may involve careful control of transmission power (under the
IEEE 802.11h amendment) and so-called “dynamic frequency selection” to choose channels with little interference, and to switch
to such channels if interference is detected later. Even so, there are many more channels than at 2.4 GHz; the larger number of
channels is one of the reasons (arguably the primary reason) that 802.11ac can run faster (below). The number of channels available
for Wi-Fi use has been increasing, often as conflicts with existing 5 GHz weather systems are resolved.
802.11ax has preliminary support for additional frequency bands in the 6-7 GHz range, though these are still awaiting (in the US)
final FCC approval.
Wi-Fi designers can improve throughput through a variety of techniques, including
1. improved radio modulation techniques
2. improved error-correcting codes
3. smaller guard intervals between symbols
4. increasing the channel width
5. allowing multiple spatial streams via multiple antennas
The first two in this list seem now to be largely tapped out; OFDM modulation (3.6.4.1 OFDM) is close enough to the Shannon-
Hartley limit that there is limited room for improvement, though 802.11ax saw fit to move to OFDMA. The third reduces the range
(because there is less protection from multipath interference) but may increase the data rate by ~10%; 802.11ax introduced support
for dynamic changing of guard-interval and symbol size. The largest speed increases are obtained from the last two items in the list.
The channel width is increased by adding additional 5 MHz channels. For example, the 65 Mbps bit rate above for 802.11n is for a
nominal frequency range of 20 MHz, comparable to that of 802.11g. However, in areas with minimal competition from other
signals, 802.11n supports using a 40 MHz frequency band; the bit rate then goes up to 135 Mbps (or 150 Mbps if a smaller guard
interval is used). This amounts to using two of the three available 2.4 GHz Wi-Fi bands. Similarly, the wide range in 802.11ac bit
rates reflects support for using channel widths ranging from 20 MHz up to 160 MHz (32 5-MHz official channels).
Using multiple spatial streams is the newest data-rate-improvement technique; see 3.7.3 Multiple Spatial Streams.
For all the categories in the table above, additional bits are used for error-correcting codes. For 802.11g operating at 54 Mbps, for
example, the actual raw bit rate is (4/3)×54 = 72 Mbps, sent in symbols consisting of six bits as a unit.

3.7.1 Wi-Fi and Collisions


We looked extensively at the 10 Mbps Ethernet collision-handling mechanisms in 2.1 10-Mbps Classic Ethernet, only to conclude
that with switches and full-duplex links, Ethernet collisions are rapidly becoming a thing of the past. Wi-Fi, however, has brought
collisions back from obscurity. An Ethernet sender will discover a collision, if one occurs, during the first slot time, by monitoring
for faint interference with its own transmission. However, as mentioned in 3.6.2 Collisions, Wi-Fi transmitting stations simply
cannot detect collisions in progress. If another station transmits at the same time, a Wi-Fi sender will see nothing amiss although its
signal will not be received. While there is a largely-collision-free mode for Wi-Fi operation (3.7.7 Wi-Fi Polling Mode), it is not
commonly used, and collision management has a significant impact on ordinary Wi-Fi performance.

3.7.1.1 Link-Layer ACKs


The first problem with Wi-Fi collisions is even detecting them. Because of the inability to detect collisions directly, the Wi-Fi
protocol adds link-layer ACK packets, at least for unicast transmission. These ACKs are our first example of Wi-Fi control
packets and are unrelated to the higher-layer TCP ACKs.
The reliable delivery of these link-layer ACKs depends on careful timing. There are three time intervals applicable (numeric values
here are for 802.11b/g in the 2.4 GHz band). The value we here call IFS is more formally known as DIFS (D for “distributed”; see
3.7.7 Wi-Fi Polling Mode).

slot time: 20 µsec
IFS, the “normal” InterFrame Spacing: 50 µsec
SIFS, the short IFS: 10 µsec
For comparison, note that the RTT between two Wi-Fi stations 100 meters apart is less than 1 µsec. At 11 Mbps, one IFS time is
enough to send about 70 bytes; at 54 Mbps it is enough to send almost 340 bytes.
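These figures are straightforward to check:

    for mbps in (11, 54):
        bits = mbps * 1e6 * 50e-6             # bits sendable in one 50 µsec IFS
        print(mbps, 'Mbps:', bits / 8, 'bytes')
    # 11 Mbps: 68.75 bytes ("about 70"); 54 Mbps: 337.5 bytes ("almost 340")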
Once a station has received a data packet addressed to it, it waits for time SIFS and sends its ACK. At this point in time the receiver
will be the only station authorized to send, because, as we will see in the next section, all other stations (including those on
someone else’s overlapping Wi-Fi) will be required to wait the longer IFS period following the end of the previous data
transmission. These other stations will see the ACK before the IFS time has elapsed and will thus not interfere with it (though see
exercise 4.0).
If a packet is involved in a collision, the receiver will send no ACK, so the sender will know something went awry. Unfortunately,
the sender will not be able to tell whether the problem was due to a collision, or electromagnetic interference, or signal blockage, or
excessive distance, or the receiver’s being powered off. But as a collision is usually the most likely cause, and as assuming the lost
packet was involved in a collision results in, at worst, a slight additional delay in retransmission, a collision will always be
assumed.
Link-Layer ACKs contain no information – such as a sequence number – that identifies the packet being acknowledged. These
ACKs simply acknowledge the most recent transmission, the one that ended one SIFS earlier. In the Wi-Fi context, this is
unambiguous. It may be compared, however, to 6.1 Building Reliable Transport: Stop-and-Wait, where at least one bit of packet
sequence numbering is required.

3.7.1.2 Collision Avoidance and Backoff


The Ethernet collision-management algorithm was known as CSMA/CD, where CD stood for Collision Detection. The
corresponding Wi-Fi mechanism is CSMA/CA, where CA stands for Collision Avoidance. A collision is presumed to have
occurred if the link-layer ACK is not received. As with Ethernet, there is an exponential-backoff mechanism as well, though it is
scaled somewhat differently.
Any sender wanting to send a new data packet waits the IFS time after first sensing the medium to see if it is idle. If no other traffic
is seen in this interval, the station may then transmit immediately. However, if other traffic is sensed, the sender must do an
exponential backoff even for its first transmission attempt; other stations, after all, are likely also waiting, and avoiding an initial
collision is strongly preferred.
The initial backoff is to choose a random k < 2^5 = 32 (recall that classic Ethernet in effect chooses an initial backoff of k < 2^0 = 1; ie
k=0). The prospective sender then waits k slot times. While waiting, the sender continues to monitor for other traffic; if any other
transmission is detected, then the sender “suspends” the backoff-wait clock. The clock resumes when the other transmission has
completed and one followup idle interval of length IFS has elapsed.
Note that, under these rules, data-packet senders always wait for at least one idle interval of length IFS before sending, thus
ensuring that they never collide with an ACK sent after an idle interval of only SIFS.
On an Ethernet, if two stations are waiting for a third to finish before they transmit, they will both transmit as soon as the third is
finished and so there will always be an initial collision. With Wi-Fi, because of the larger initial k<32 backoff range, such initial
collisions are unlikely.
If a Wi-Fi sender believes there has been a collision, it retries its transmission, after doubling the backoff range to 64, then 128,
256, 512, 1024 and again 1024. If these seven attempts all fail, the packet is discarded and the sender starts over.
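Here is a minimal Python sketch of this backoff schedule; the random slot choice and the 1024-slot cap follow the description above.

    import random

    def backoff_slots(attempt):
        ceiling = min(32 << attempt, 1024)    # 32, 64, 128, 256, 512, 1024, 1024
        return random.randrange(ceiling)      # choose k < ceiling; wait k slot times

    for attempt in range(7):                  # seven attempts, then discard the packet
        print('attempt', attempt, ': k <', min(32 << attempt, 1024),
              ', chose', backoff_slots(attempt))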
In one slot time, radio signals move 6,000 meters; the Wi-Fi slot time – unlike that for Ethernet – has nothing to do with the
physical diameter of the network. As with Ethernet, though, the Wi-Fi slot time represents the fundamental unit for backoff
intervals.
Finally, we note that, unlike Ethernet collisions, Wi-Fi collisions are a local phenomenon: if A and B transmit simultaneously, a
collision occurs at node C only if the signals of A and B are both strong enough at C to interfere with one another. It is possible that
a collision occurs at station C midway between A and B, but not at station D that is close to A. We return to this below in 3.7.1.4
Hidden-Node Problem.

3.7.1.3 Wi-Fi RTS/CTS
Wi-Fi stations optionally also use a request-to-send/clear-to-send (RTS/CTS) protocol, again negotiated with designated control
packets. Usually this is used only for larger data packets; often, the RTS/CTS “threshold” (the size of the largest packet not sent
using RTS/CTS) is set (as part of the Access Point configuration, 3.7.4 Access Points) to be the maximum packet size, effectively
disabling this feature. The idea behind RTS/CTS is that a large packet that is involved in a collision represents a significant waste
of potential throughput; for large packets, we should ask first.
The RTS control packet – which is small – is sent through the normal procedure outlined above; this packet includes the identity of
the destination and the size of the data packet the station desires to transmit. The destination station then replies with CTS after the
SIFS wait period, effectively preventing any other transmission after the RTS. The CTS packet also contains the data-packet size.
The original sender then waits for SIFS after receiving the CTS, and sends the packet. If all other stations can hear both the RTS
and CTS messages, then once the RTS and CTS are sent successfully no collisions should occur during packet transmission, again
because the only idle times are of length SIFS and other stations should be waiting for time IFS.

3.7.1.4 Hidden-Node Problem


Consider the diagram below. Each station has a 100-meter range. Stations A and B are 150 meters apart and so cannot hear one
another at all; each is 75 meters from C. If A is transmitting and B senses the medium in preparation for its own transmission, as
part of collision avoidance, then B will conclude that the medium is idle and will go ahead and send.

However, C is within range of both A and B. If A and B transmit simultaneously, then from C’s perspective a collision occurs. C
receives nothing usable. We will call this a hidden-node collision as the senders A and B are hidden from one another; the general
scenario is known as the hidden-node problem.
Note that node D receives only A’s signal, and so no collision occurs at D.
The hidden-node problem can also occur if A and B cannot receive one another’s transmissions due to a physical obstruction such
as a radio-impermeable wall:

One of the rationales for the RTS/CTS protocol is the prevention of hidden-node collisions. Imagine that, instead of transmitting its
data packet, A sends an RTS packet, and C responds with CTS. B has not heard the RTS packet from A, but does hear the CTS
from C. A will begin transmitting after a SIFS interval, but B will not hear A’s transmission. However, B will still wait, because the
CTS packet contained the data-packet size and thus, implicitly, the length of time all other stations should remain idle. Because
RTS packets are quite short, they are much less likely to be involved in collisions themselves than data packets.

3.7.1.5 Wi-Fi Fragmentation


Conceptually related to RTS/CTS is Wi-Fi fragmentation. If error rates or collision rates are high, a sender can send a large packet
as multiple fragments, each receiving its own link-layer ACK. As we shall see in 5.3.1 Error Rates and Packet Size, if bit-error
rates are high then sending several smaller packets often leads to fewer total transmitted bytes than sending the same data as one
large packet.
Wi-Fi packet fragments are reassembled by the receiving node, which may or may not be the final destination.

As with the RTS/CTS threshold, the fragmentation threshold is often set to the size of the maximum packet. Adjusting the values of
these thresholds is seldom necessary, though might be appropriate if monitoring revealed high collision or error rates.
Unfortunately, it is essentially impossible for an individual station to distinguish between reception errors caused by collisions and
reception errors caused by other forms of noise, and so it is hard to use reception statistics to distinguish between a need for
RTS/CTS and a need for fragmentation.

3.7.2 Dynamic Rate Scaling


Wi-Fi senders, if they detect transmission problems, are able to reduce their transmission bit rate in a process known as rate
scaling or rate control. The idea is that lower bit rates will have fewer noise-related errors, and so as the error rate becomes
unacceptably high – perhaps due to increased distance – the sender should fall back to a lower bit rate. For 802.11g, the standard
rates are 54, 48, 36, 24, 18, 12, 9 and 6 Mbps. Senders attempt to find the transmission rate that maximizes throughput; for
example, 36 Mbps with a packet loss rate of 25% has an effective throughput of 36 × 75% = 27 Mbps, and so is better than 24
Mbps with no losses.
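The rate-selection arithmetic can be sketched as follows; the loss rates here are hypothetical, chosen so that 36 Mbps wins as in the example above.

    rates  = [54, 48, 36, 24]              # 802.11g rates, Mbps
    losses = [0.80, 0.55, 0.25, 0.00]      # hypothetical loss rate at each rate

    def effective(rate, loss):
        return rate * (1 - loss)           # throughput net of lost packets

    best = max(zip(rates, losses), key=lambda rl: effective(*rl))
    print(best[0], 'Mbps, effective', effective(*best), 'Mbps')   # 36 -> 27.0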
Senders may update their bit rate on a per-packet basis; senders may also choose different bit rates for different recipients. For
example, if a sender sends a packet and receives no confirming link-layer ACK, the sender may fall back to the next lower bit rate.
The actual bit-rate-selection algorithm lives in the particular Wi-Fi driver in use; different nodes in a network may use different
algorithms.
The earliest rate-scaling algorithm was Automatic Rate Fallback, or ARF, [KM97]. The rate decreases after two consecutive
transmission failures (that is, the link-layer ACK is not received), and increases after ten transmission successes.
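A minimal Python sketch of the ARF counters might look like the following; the rate list is the 802.11g set given earlier, and details of real implementations (such as probe timing) are omitted.

    RATES = [6, 9, 12, 18, 24, 36, 48, 54]     # 802.11g rates, Mbps

    class ARF:
        def __init__(self):
            self.i = len(RATES) - 1            # start at the top rate
            self.fails = self.successes = 0
        def rate(self):
            return RATES[self.i]
        def on_result(self, ack_received):
            if ack_received:
                self.fails, self.successes = 0, self.successes + 1
                if self.successes >= 10 and self.i < len(RATES) - 1:
                    self.i += 1                # ten successes: try the next rate up
                    self.successes = 0
            else:
                self.successes, self.fails = 0, self.fails + 1
                if self.fails >= 2 and self.i > 0:
                    self.i -= 1                # two consecutive failures: fall back
                    self.fails = 0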
A significant problem for rate scaling is that a packet loss may be due either to low-level random noise (white noise, or thermal
noise) or to a collision (which is also a form of noise, but less random); only in the first case is a lower transmission rate likely to
be helpful. If a larger number of collisions is experienced, the longer packet-transmission times caused by the lower bit rate may
increase the frequency of hidden-node collisions. In fact, a higher transmission rate (leading to shorter transmission times) may
help; enabling the RTS/CTS protocol may also help.

Signal Strength
Most Wi-Fi drivers report the received signal strength. Newer drivers use the IEEE Received Channel Power Indicator
convention; the RCPI is an 8-bit integer proportional to the absolute power received by the antenna as measured in decibel-
milliwatts (dBm). Wi-Fi values range from -10 dBm to -90 dBm and below. For comparison, the light from the star Polaris
delivers about -97 dBm to one eye on a good night; Venus typically delivers about -73 dBm. A GPS satellite might deliver -127
dBm to your phone. (Inspired by Wikipedia on DBm.)

A variety of newer rate-scaling algorithms have been proposed; see [JB05] for a summary. One, Receiver-Based Auto Rate
(RBAR, [HVB01]), attempts to incorporate the signal-to-noise ratio into the calculation of the transmission rate. This avoids the
confusion introduced by collisions. Unfortunately, while the signal-to-noise ratio has a strong theoretical correlation with the
transmission bit-error rate, most Wi-Fi radios will report to the host system the received signal strength. This is not the same as
the signal-to-noise ratio, which is harder to measure. As a result, the RBAR approach has not been quite as effective in practice as
might be hoped.
With the Collision-Aware Rate Adaptation algorithm (CARA, [KKCQ06]), a transmitting station attempts (among other things) to
infer that its packet was lost to a collision rather than noise if, after one SIFS interval following the end of its packet transmission,
no link-layer ACK has been received and the channel is still busy. This will detect collisions only when the colliding packet is
longer than the station’s own packet, and only when the hidden-node problem isn’t an issue.
Because the actual data in a Wi-Fi packet may be sent at a rate not every participant is close enough to receive correctly, every Wi-
Fi transmission begins with a brief preamble at the minimum bit rate. Link-layer ACKs, too, are sent at the minimum bit rate.

3.7.3 Multiple Spatial Streams


The latest innovation in improving Wi-Fi (and other wireless) data rates is to support multiple simultaneous data streams, through
an antenna technique known as multiple-input-multiple-output, or MIMO. To use N streams, both sender and receiver must have N
antennas; all the antennas use the same frequency channels but each transmitter antenna sends a different data stream. At first
glance, any significant improvement in throughput might seem impossible, as the antenna elements in the respective sending and
receiving groups are each within about half a wavelength of each other; indeed, in clear space MIMO is not possible.
The reason MIMO works in most everyday settings is that it puts multipath interference to positive use. Consider again the right-
hand side of the final image of 3.6.6 Multipath, in which the signal strength varies according to the blue ripples; the peaks and
valleys have a period of about half a wavelength. We will assume initially that the signal strength is low enough that reception in
the darkest blue areas is no longer viable; a single antenna with the misfortune to be in one of these “dead zones” may receive
nothing.
We will start with two simpler cases: SIMO (single-input-multiple-output) and MISO (multiple-input-single-output). In SIMO, the
receiver has multiple antennas; in MISO, the transmitter. Assume for the moment that the multiple-antenna side has two antennas.
In the simplest implementation of SIMO, the receiver picks the stronger of the two received signals and uses that alone; as long as
at least one antenna is not in a “dead zone”, reception is successful. With two antennas under half a wavelength apart, the odds are
that at least one of them will be located outside a dead zone, and will receive an adequate signal.
Similarly, in simple MISO, the transmitter picks whichever of its antennas gets a stronger signal to the receiver. The receiver is
unlikely to be in a dead zone for both transmitter antennas. Note that for MISO the sender must get some feedback from the receiver
to know which antenna to use.
We can do quite a bit better if signal-processing techniques are utilized so the two sender or two receiver antennas can be used
simultaneously (though this complicates the mathematics considerably). Such signal-processing is standard in 802.11n and above;
the Wi-Fi header, to assist this process, includes added management packets and fields for reporting MIMO-related information.
One station may, for example, send the other a sequence of training symbols for discerning the response of the antenna system.
MISO with these added techniques is sometimes called beamforming: the sender coordinates its multiple antennas to maximize
the signal reaching one particular receiver.
In our simplistic description of SIMO and MISO above, in which only one of the multiple-antenna-side antennas is actually used,
we have suggested that the idea is to improve marginal reception. At least one antenna on the multiple-antenna side can
successfully communicate with the single antenna on the other side. MIMO, on the other hand, can be thought of as applying when
transmission conditions are quite good all around, and every antenna on one side can reach every antenna on the other side. The
key point is that, in an environment with a significant degree of multipath interference, the antenna-to-antenna paths may all be
independent, or uncorrelated. At least one receiving antenna must be, from the perspective of at least one transmitting antenna, in a
multipath-interference “gray zone” of reduced signal strength.

[Figure: MIMO antennas — sending antennas A1 and A2 at the left, receiving antennas B1 and B2 at the right, with a signal path from each sending antenna to each receiving antenna; the A1-to-B2 path is drawn dashed and blue, the other three solid and black.]

As a specific example, consider the diagram above, with two sending antennas A1 and A2 at the left and two receiving antennas B1
and B2 at the right. Antenna A1 transmits signal S1 and A2 transmits S2. There are thus four physical signal paths: A1-to-B1, A1-to-
B2, A2-to-B1 and A2-to-B2. If we assume that the signal along the A1-to-B2 path (dashed and blue) arrives with half the strength of
the other three paths (solid and black), then we have

signal received by B1: S1 + S2
signal received by B2: S1/2 + S2
From these, B can readily solve for the two independent signals S1 and S2. These signals are said to form two spatial streams,
though the spatial streams are abstract and do not correspond to any of the four physical signal paths.
The antennas are each more-or-less omnidirectional; the signal-strength variations come from multipath interference and not from
physical aiming. Similarly, while the diagonal paths A1-to-B2 and A2-to-B1 are slightly longer than the horizontal paths A1-to-B1
and A2-to-B2, the difference is not nearly enough to allow B to solve for the two signals.
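To make the linear algebra concrete, here is a minimal numerical sketch (Python with NumPy; the signal values are hypothetical, and the channel gains are the ones assumed in the example above) of how B separates the two streams:

import numpy as np

# Channel matrix from the example: B1 hears S1 + S2, while B2 hears
# S1/2 + S2 because the A1-to-B2 path arrives at half strength.
H = np.array([[1.0, 1.0],
              [0.5, 1.0]])

S = np.array([3.0, 5.0])              # hypothetical transmitted values of S1 and S2
received = H @ S                      # what B1 and B2 actually hear: [8.0, 6.5]

print(np.linalg.solve(H, received))   # [3. 5.] -- the two streams, recovered

If all four paths arrived with equal strength, H would be singular and the streams could not be separated; the multipath-induced differences are exactly what make the matrix invertible.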
In practice, overall data-rate improvement over a single antenna can be considerably less than a factor of 2 (or than N, the number
of antennas at each end).

The 802.11n standard allows for up to four spatial streams, for a theoretical maximum bit rate of 600 Mbps. 802.11ac allows for up
to eight spatial streams, for an even-more-theoretical maximum of close to 7 Gbps. MIMO support is sometimes described with an
A×B×C notation, eg 3×3×2, where A and B are the number of transmitting and receiving antennas and C ≤ min(A,B) is the number
of spatial streams.

3.7.4 Access Points


There are two standard Wi-Fi configurations: infrastructure and ad hoc. The former involves connection to a designated access
point; the latter includes individual Wi-Fi-equipped nodes communicating informally. For example, two laptops can set up an ad
hoc connection to transfer data at a meeting. Ad hoc connections are often used for very simple networks not providing Internet
connectivity. Complex ad hoc networks are, however, certainly possible; see 3.7.8 MANETs.
The infrastructure configuration is much more common. Stations in an infrastructure network communicate directly only with
their access point, which, in turn, communicates with the outside world. If Wi-Fi nodes B and C share access point AP, and B
wishes to send a packet to C, then B first forwards the packet to AP and AP then forwards it to C. While this introduces a degree of
inefficiency, it does mean that the access point and its associated nodes automatically act as a true LAN: every node can reach
every other node. (It is also often the case that most traffic is between Wi-Fi nodes and the outside world.) In an ad hoc network, by
comparison, it is common for two nodes to be able to reach each other only by forwarding through an intermediate third node; this
is in fact a form of the hidden-node scenario.
Wi-Fi access points are generally identified by their SSID (“Service Set IDentifier”), an administratively defined human-readable
string such as “linksys” or “loyola”. Ad hoc networks also have SSIDs; these are generated pseudorandomly at startup and look
like (but are not) 48-bit MAC addresses.

Portable Access Points


Being a Wi-Fi access point is a very specific job; Wi-Fi-enabled “station” devices like phones and workstations do not
generally act as access points. However, it is often possible for a station device to become an access point if the access-point
mode is supported by the underlying radio hardware, and if suitable drivers can be found. The Linux hostapd package is
one option. The FCC may or may not bestow its blessing.

Many access points can support multiple SSIDs simultaneously. For example, an access point might support SSID “guest” with
limited authentication (below), and also SSID “secure” with much stronger authentication.
Finally, Wi-Fi is by design completely interoperable with Ethernet; if station A is associated with access point AP, and AP also
connects via (cabled) Ethernet to station B, then if A wants to send a packet to B it sends it using AP as the Wi-Fi destination but
with B also included in the header as the “actual” destination. Once it receives the packet by wireless, AP acts as an Ethernet
switch and forwards the packet to B. While this forwarding is transparent to senders, the Ethernet and Wi-Fi LAN header formats
are quite different.

Ethernet:
| dest addr | src addr | type | data | CRC |

Wi-Fi Data:
| frame control | duration | receiver addr | transmitter addr | dest addr | seq control | src addr | data | CRC |

The above diagram illustrates an Ethernet header and the Wi-Fi header for a typical data packet (not using Wi-Fi quality-of-service
features). The Ethernet type field usually moves to an IEEE Logical Link Control header in the Wi-Fi region labeled “data”. The
receiver and transmitter addresses are the MAC addresses of the nodes receiving and transmitting the (unicast) packet; these may
each be different from the ultimate destination and source addresses. If station B wants to send a packet to station C in the same
network, the source and destination are B and C but the transmitter and receiver are B and the access point. In infrastructure mode
either the receiver or transmitter address is always the access point; in typical situations either the receiver is the destination or the
sender is the transmitter. In ad hoc mode, if LAN-layer routing is used then all four addresses may be distinct; see 3.7.8.1 Routing
in MANETs.
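As a small worked example (hypothetical MAC addresses; the station and access-point names follow the discussion above), here is how the four address fields would be filled in for a data frame from Wi-Fi station B to wired host C via access point AP:

# Frame from Wi-Fi station B to Ethernet host C, relayed by access point AP.
frame_addresses = {
    "receiver":    "aa:aa:aa:00:00:01",   # AP: the link-layer next hop
    "transmitter": "bb:bb:bb:00:00:02",   # B: the station actually transmitting
    "dest":        "cc:cc:cc:00:00:03",   # C: the ultimate destination
    "src":         "bb:bb:bb:00:00:02",   # B: the original sender
}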

3.7.4.1 Joining a Network
To join the network, an individual station must first discover its access point, and must associate and then authenticate to that
access point before general communication can begin. (Older forms of authentication – so-called “open” authentication and the
now-deprecated WEP authentication – came before association, but newer authentication protocols such as WPA2, WPA2-Personal
and WPA2-Enterprise (3.7.5 Wi-Fi Security) come after.) We can summarize the stages in the process as follows:
1. scanning (or active probing)
2. open-authentication and association
3. true authentication
4. DHCP (getting an IP address, 7.10 Dynamic Host Configuration Protocol (DHCP))
The association and authentication processes are carried out by an exchange of special management packets, which are confined
to the Wi-Fi LAN layer. Occasionally stations may re-associate to their Access Point, eg if they wish to communicate some status
update.
Access points periodically broadcast their SSID in special beacon packets (though for pseudo-security reasons the SSID in the
beacon packets can be suppressed). Beacon packets are one of several Wi-Fi-layer-only management packets; the default beacon-
broadcast interval is 100 ms. These broadcasts allow stations to see a list of available networks; the beacon packets also contain
other Wi-Fi network parameters such as radio-modulation parameters and available data rates.
Another use of beacons is to support the power-management doze mode. Some stations may elect to enter this power-conservation
mode, in which case they inform the access point, record the announced beacon-transmission time interval and then wake up
briefly to receive each beacon. Beacons, in turn, each contain a list (in a compact bitmap form) of each dozing station for which the
access point has a packet to deliver.
Ad hoc networks have beacon packets as well; all nodes participate in the regular transmission of these via a distributed algorithm.
A connecting station may either wait for the next scheduled beacon, or send a special probe-request packet to elicit a beacon-like
probe-response packet. These operations may involve listening to or transmitting on multiple radio channels, sequentially, as the
station does not yet know the correct channel to use. Unconnected stations often send probe-request packets at regular intervals, to
keep abreast of available networks; it is these probe packets that allow tracking by the station’s MAC address. See 3.7.4.2 MAC
Address Randomization.
Once the beacon is received, the station initiates an association process. There is still a vestigial open-authentication process that
comes before association, but once upon a time this could also be “shared WEP key” authentication (below). Later, support for a
wide range of authentication protocols was introduced, via the 802.1X framework; we return to this in 3.7.5 Wi-Fi Security. For our
purposes here, we will include open authentication as part of the association process.
Wi-Fi Drivers
Even in 2015, 100%-open-source Wi-Fi drivers are available only for selected hardware, and even then not all operations may be
supported. Something as simple in principle as changing one’s source Wi-Fi MAC address is sometimes not possible, though see
3.7.4.2 MAC Address Randomization. Using multiple MAC addresses for a host plus embedded virtual machines is another
problematic case.
In open authentication the station sends an authentication request to the access point and the access point replies. About all the
station needs to know is the SSID of the access point, though it is usually possible to configure the access point to restrict
admission to stations with MAC (physical) addresses on a predetermined list. Stations sometimes evade MAC-address checking by
changing their MAC address to an acceptable one, though some Wi-Fi drivers do not support this.
Because the SSID plays something of the role of a password here, some Wi-Fi access points are configured so that beacon packets
do not contain the SSID; such access points are said to be hidden. Unfortunately, access points hidden this way are easily
unmasked: first, the SSID is sent in the clear by any other stations that need to authenticate, and second, an attacker can often
transmit forged deauthentication or disassociation requests to force legitimate stations to retransmit the SSID. (See “management
frame protection” in 3.7.5 Wi-Fi Security for a fix to this last problem.)
The shared-WEP-key authentication was based on the (obsolete) WEP encryption mechanism (3.7.5 Wi-Fi Security). It involved a
challenge-response exchange by which the station proved to the access point that it knew the shared WEP key. Actual WEP
encryption would then start slightly later.

Once the open-authentication step is done, the next step in an infrastructure network is for the station to associate to the access
point. This involves an association request from station to access point, and an association response in return. The primary goal of
the association exchange is to ensure that the access point knows (by MAC address) what stations it can reach. This tells the access
point how to deliver packets to the associating station that come from other stations or the outside world. Association is not
necessary in an ad hoc network.
The entire connection process (including secure authentication, below, and DHCP, 7.10 Dynamic Host Configuration Protocol
(DHCP)), often takes rather longer than expected, sometimes several seconds. See [PWZMTQ17] for a discussion of some of the
causes. Some station and access-point pairs appear not to work as well together as other pairs.

3.7.4.2 MAC Address Randomization


Most Wi-Fi-enabled devices are configured to transmit Wi-Fi probe requests at regular intervals (and on all available channels), at
least when not connected. These probe requests identify available Wi-Fi networks, but they also reveal the device’s MAC address.
This allows sites such as stores to track customers by their device. To prevent such tracking, some devices now support MAC
address randomization, proposed in [GG03]: the use at appropriate intervals of a new MAC address randomly selected by the
device.
Probe requests are generally sent when the device is not joined to a network. To prevent tracking via probe requests, the simplest
approach is to change the MAC address used for probes at regular, frequent intervals. A device might even change its MAC address
on every probe.
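A device implementing this might generate each probe address along the following lines (a sketch; real implementations live in the Wi-Fi driver or supplicant). By convention, a randomized address sets the locally-administered bit and clears the multicast bit of the first octet:

import random

def random_mac():
    # Set the locally-administered bit (0x02) and clear the multicast
    # bit (0x01) in the first octet; the rest is uniformly random.
    first = (random.randrange(256) | 0x02) & 0xFE
    rest = [random.randrange(256) for _ in range(5)]
    return ":".join(f"{octet:02x}" for octet in [first] + rest)

print(random_mac())   # eg '26:5f:09:c3:71:e4'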
Changing the MAC address used for actually joining a network is also important to prevent tracking, but introduces some
complications. RFC 7844 suggests these options for selecting new random addresses:
1. At regular time intervals
2. Per connection: each time the device connects to a Wi-Fi network, it will select a new MAC address
3. Per network: like the above, except that if the device reconnects to the same network (identified by SSID), it will use the same MAC address
The first option, changing the joined MAC address at regular time intervals, breaks things. First, it will likely result in assignment
of a new IP address to the device, terminating all existing connections. Second, many sites still authenticate – at least in part –
based on the MAC address. The per-connection option prevents the first problem. The per-network option prevents both, but allows
a site at which the device actually joins the network to track repeat connections. (Configuring the device to “forget” the connection
between successive joins will usually prevent this, but may not be convenient.)
Another approach to the tracking problem is to disable probe requests entirely, except on explicit demand.
Wi-Fi MAC address randomization is, unfortunately, not a complete barrier to device tracking; there are other channels through
which devices may leak information. For example, probe requests also contain device-capability data known as Information
Elements; these values are often distinctive enough that they allow at least partial fingerprinting. Additionally, it is possible to track
many Wi-Fi devices based on minute variations in the modulated signals they transmit. MAC address randomization does nothing
to prevent such “radiometric identification”. Access points can also impersonate other popular access points, and thus trick devices
into initiating a connection with their real MAC addresses. See [BBGO08] and [VMCCP16] for these and other examples.
Finally, MAC address randomization may have applications for Ethernet as well as Wi-Fi. For example, in the original IPv6
specification, IPv6 addresses embedded the MAC address, and thus introduced the possibility of tracking a device by its IPv6
address. MAC address randomization can prevent this form of tracking as well. However, other techniques implementable solely in
the IPv6 layer appear to be more popular; see 8.2.1 Interface identifiers.

3.7.4.3 Roaming
Large installations with multiple access points can create “roaming” access by assigning all the access points the same SSID. An
individual station will stay with the access point with which it originally associated until the signal strength falls below a certain
level (as determined by the station), at which point it will seek out other access points with the same SSID and with a stronger
signal. In this way, a large area can be carpeted with multiple Wi-Fi access points, so as to look like one large Wi-Fi domain. The
access points are often connected via a wired LAN, known as the distribution system, though the use of Wi-Fi itself to provide
interconnection between access points is also an option (3.7.4.4 Mesh Networks). At any one time, a station may be associated to
only one access point. In 802.11 terminology, a multiple-access-point configuration with a single SSID is known as an “extended
service set” or ESS.

In order for such a large-area network to work, traffic to a wireless station, eg B, must find that station’s current access point, eg
AP. To help the distribution system track B’s current location, B is required, at the time it moves from APold to AP, to send to AP a
reassociation request, containing APold’s address. This sets in motion a number of things; one of them is that AP contacts APold to
verify (and terminate) the former association. This reassociation process also gives AP an opportunity – not spelled out in detail in
the standard – to notify the distribution system of B’s new location.
If the distribution system is a switched Ethernet supporting the usual learning mechanism (2.4 Ethernet Switches), one simple
approach to roaming stations is to handle this the same way as, in a wired Ethernet, traffic finds a laptop that has been unplugged,
carried to a new location, and plugged in again. Suppose our wireless node B has been exchanging packets via the distribution
system with wired node C (perhaps a router connecting B to the Internet). When B moves from APold to AP, all it has to do is send
any packet over the LAN to C, and the Ethernet switches on the path from B to C will then learn the route through the switched
Ethernet from C back to B’s new AP, and thus to B. It is also possible for B’s new AP to send this switch-updating packet, perhaps
as part of its reassociation response.
This process may leave other switches in the distribution system still holding in their forwarding tables the old location for B. This
is not terribly serious, as it will be fixed for any one switch as soon as B sends a packet to a destination reached by that switch. The
problem can be avoided entirely if, after moving, B (or, again, its new AP) sends out an Ethernet broadcast packet.
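The switch-side mechanism is ordinary source-address learning; a minimal sketch (hypothetical data structures) shows why any single frame from B, or a broadcast from its new AP, suffices to repair the path:

# Each learning switch keeps a table mapping source MAC to arrival port.
forwarding_table = {}

def frame_arrived(src_mac, dst_mac, in_port):
    forwarding_table[src_mac] = in_port     # re-learn B's location on every frame
    return forwarding_table.get(dst_mac)    # None means flood out all ports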
Running Ethernet cable to remote access points can easily dwarf the cost of the access point itself. As a result, there is considerable
pressure to find ways to allow the Wi-Fi network itself to form the distribution system. We return to this below, in 3.7.4.4 Mesh
Networks.
The IEEE 802.11r amendment introduced the standardization of fast handoffs from one access point to another within the same
ESS. It allows the station and new access point to reuse the same pairwise master keys (below) that had been negotiated between
the station and the old access point. It also slightly streamlines the reassociation process. Transitions must, however, still be
initiated by the station. The amendment is limited to handoffs; it does not address finding the access point to which a particular
station is associated, or routing between access points.
Because handoffs must be initiated by the station, sometimes all does not quite work out smoothly. Within an ESS, most newer
devices (2018) are quite good at initiating handoffs. However, this is not always the case for older devices, and is usually still not
the case for many mobile-station devices moving from one ESS to another (that is, where there is a change in the SSID). Such
devices may cling to their original access point well past the distance at which the original signal ceases to provide reasonable
bandwidth, as long as it does not vanish entirely. Many Wi-Fi “repeaters” or “extenders” (below) sold for residential use do require
a second SSID, and so will often do a poor job at supporting roaming.
Some access points support proprietary methods for dealing with older mobile stations that are reluctant to transfer to a closer
access point within the same ESS, though these techniques are now seldom necessary. By communicating amongst themselves, the
access points can detect when a station’s signal is weak enough that a handoff would be appropriate. One approach involves having
the original access point initiate a dissociation. At that point the station will reconnect to the ESS but should now connect to an
access point within the ESS that has a stronger signal. Another approach involves having the access points all use the same MAC
address, so they are indistinguishable. Whichever access point receives the strongest signal from the station is the one used to
transmit to the station.

3.7.4.4 Mesh Networks


Being able to move freely around a multiple-access-point Wi-Fi installation is very important for many users. When such a Wi-Fi
installation is created in a typical office building pre-wired for Ethernet, the access points all plug into the Ethernet, which becomes
the distribution network. However, in larger-scale residential settings, and even many offices, running Ethernet cable may be
prohibitively expensive. In such cases it makes sense to have the access points interconnect via Wi-Fi itself. If Alice associates to
access point A and sends a packet destined for the outside world, then perhaps A will have to forward the packet to Wi-Fi node B,
which will in turn forward it to C, before delivery can be complete.
This is sometimes easier said than done, however, as the original Wi-Fi standards did not provide for the use of Wi-Fi access points
as “repeaters”; there was no standard mechanism for a Wi-Fi-based distribution network.
One inexpensive approach is to use devices sometimes sold as Wi-Fi “extenders”. Such devices typically set up a new SSID, and
forward all traffic to the original SSID. Multi-hop configurations are possible, but must usually be configured manually. Because
the original access point and the extender have different SSIDs, many devices will not automatically connect to whichever is closer, preferring to stick to the SSID originally connected to until that signal essentially disappears completely. This is, for many mobile
users, reason enough to give up on this strategy.
The desire for a Wi-Fi-based distribution network has led to multiple proprietary solutions. It is possible to purchase a set of Wi-Fi
“mesh routers” (2018), often sold at a considerable premium over “standard” routers. After they are set up, these generally work
quite well: they present a single SSID, and support fast handoffs from one access point to another, without user intervention. To the
user, coverage appears seamless.
The downside of a proprietary mechanism, however, is that once you buy into one solution, equipment from other vendors will
seldom interoperate. This has led to pressure for standardization. The IEEE introduced “mesh networking” in its 802.11s
amendment, finalized as part of the 2012 edition of the full 802.11 standard; it was slow to catch on. The Wi-Fi Alliance introduced
the Wi-Fi EasyMesh solution in 2018, based on 802.11s, but, as of the initial rollout, no vendors were yet on board.
We will assume, for the time being, that Wi-Fi mesh networking is restricted to the creation of a distribution network
interconnecting the access points; ordinary stations do not participate in forwarding other users’ packets through the mesh.
Common implementations often take this approach, but in fact the 802.11s amendment allows more general approaches.
In creating a mesh network with a Wi-Fi distribution system – proprietary or 802.11s – the participating access points must address
the following issues:
- They must authenticate to one another
- They must identify the correct access point to reach a given station B
- They must correctly handle station B's movement to a different access point
- They must agree on how to route, through the mesh of access points, between the station and the connection to the Internet
Eventually the routing issue becomes the same routing problem faced by MANETs (3.7.8 MANETs), although restricted to the
(simpler) case where all nodes are under common management. Routing is not trivial; the path A→B→C might be shorter than the
alternative path A→D→E→C, but support a lower data rate.
The typical 802.11s solution is to have the multiple access points participate in a mesh BSS. This allows all the access points to
appear to be on a single LAN. In this setting, the mesh BSS is separate from the ESS seen by the user stations, and is only used for
inter-access-point communication.
One (or more) access points are typically connected to the Internet; these are referred to as root mesh stations.
In the 802.11s setting, mesh discovery is achieved via initial configuration of a mesh SSID, together with a WPA3 passphrase.
Mutual authentication is then via WPA3, below; it is particularly important that each pair of stations authenticate symmetrically to
one another.
If station B associates to access point AP, then AP uses the mesh BSS to deliver packets sent by B to the root mesh station (or to
some other AP). For reverse traffic, B’s reassociation request sent to AP gives AP an opportunity to interact with the mesh BSS to
update B’s new location. The act of B’s sending a packet via AP will also tell the mesh BSS how to find B.
Routing through the mesh BSS is handled via the HWMP protocol, 9.4.3 HWMP. This protocol typically generates a tree of
station-to-station links (that is, a subset of all links that contains no loops), based at the root station. This process uses a routing
metric that is tuned to the wireless environment, so that high-bandwidth and low-error links are preferred.
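The 802.11s metric of this kind is the airtime metric; roughly (a simplified sketch, with the overhead constant assumed), a link's cost is the expected time to push a fixed-size test frame across it, so slow links and lossy links are both penalized:

def airtime_cost_us(rate_mbps, frame_error_rate, overhead_us=75, test_frame_bits=8192):
    # Expected airtime in microseconds: fixed channel-access overhead plus
    # transmission time, inflated by the expected retries 1/(1 - error rate).
    tx_time_us = test_frame_bits / rate_mbps   # Mbps = bits per microsecond
    return (overhead_us + tx_time_us) / (1.0 - frame_error_rate)

print(airtime_cost_us(54, 0.10))   # ~252: fast but somewhat lossy
print(airtime_cost_us(6, 0.00))    # ~1440: slow but clean

Here the fast, slightly lossy link still wins by a wide margin over the slow, clean one.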
If a packet is routed through the mesh BSS from station A to station B, then more addresses are needed in the packet header. The
ultimate source and destination are A and B, and the transmitter and receiver correspond to the specific hop, but the packet also
needs a source and destination within the mesh, perhaps corresponding to the two access points to which A and B connect. 802.11s
handles this by adding a mesh control field consisting of some management fields (such as TTL and sequence number) and a
variable-length block of up to three additional addresses.
It is also possible for ordinary stations to join the 802.11s mesh BSS directly, rather than restricting the mesh BSS to the access
points. This means that the stations will participate in the mesh as routing nodes. It is hard to predict, in 2018, how popular this will
become.
The EasyMesh standard of the Wi-Fi Alliance is not exactly the same as the IEEE 802.11s standard. For one thing, the EasyMesh
standard specifies that one access point – the one connected to the Internet – will be a “Multi-AP” Controller; the other access
points are called Agents. The EasyMesh standard also incorporates parts of the IEEE 1905.1 standard for home networks, which
simplifies initial configuration.

3.7.5 Wi-Fi Security
Unencrypted Wi-Fi traffic is visible to anyone nearby with an appropriate receiver; this eavesdropping zone can be expanded by
use of a larger antenna. Because of this, Wi-Fi security is important, and Wi-Fi supports several types of traffic encryption.
The original – and now obsolete – Wi-Fi encryption standard was Wired-Equivalent Privacy, or WEP. It involved a 5-byte key,
later sometimes extended to 13 bytes. The encryption algorithm was based on RC4, 22.7.4.1 RC4. The key was a pre-shared key,
manually configured into each station.
Because of the specific way WEP made use of the RC4 cipher, it contained a fatal (and now-classic) flaw: bytes of the key could be “broken” – that is, guessed – sequentially. Knowing bytes 0 through i−1 would allow an attacker to guess byte i with a
relatively small amount of data, and so on through the entire key. See 22.7.7 Wi-Fi WEP Encryption Failure for details.
WEP was replaced with Wi-Fi Protected Access, or WPA. This used the so-called TKIP encryption algorithm that, like WEP, was
ultimately based on RC4, but which was immune to the sequential attack that made WEP so vulnerable. WPA was later replaced by
WPA2 as part of the IEEE 802.11i amendment, which uses the presumptively stronger AES encryption (22.7.2 Block Ciphers); the
variant used by WPA2 is known as CCMP. WPA2 encryption is believed to be quite secure, although there was a vulnerability in
the associated Wi-Fi Protected Setup protocol. In the 802.11i standard, WPA2 is known as the robust security network protocol.
Access points supporting WPA or WPA2 declare this in their beacon and probe-response packets; these packets also include a list
of acceptable ciphers.
WPA2 (and WPA) comes in two flavors: WPA2-Personal and WPA2-Enterprise. These use the same AES encryption, but differ
in how keys are managed. WPA2-Personal, appropriate for many smaller sites, uses a pre-shared master key, known as the PSK.
This key must be entered into the Access Point (ideally not over the air) and into each connecting station. The key is usually a
secure hash (22.6 Secure Hashes) of a passphrase. The use of a single key for multiple stations makes changing the key, or
revoking the key for a particular user, difficult.
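Concretely, the standard derivation (used, for example, by the wpa_passphrase tool) stretches the passphrase with PBKDF2-HMAC-SHA1, salted with the SSID; a sketch in Python:

import hashlib

def wpa2_psk(passphrase, ssid):
    # PSK = PBKDF2-HMAC-SHA1(passphrase, salt=SSID, 4096 iterations, 32 bytes)
    return hashlib.pbkdf2_hmac('sha1', passphrase.encode(), ssid.encode(), 4096, 32)

print(wpa2_psk('snorri-snorri', 'linksys').hex())   # hypothetical passphrase and SSID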
In 2018, the Wi-Fi Alliance introduced WPA3, intended to fix a host of accumulated issues. Perhaps the most important change is that
WPA3-Personal switches from the WPA2 four-way handshake to the SAE mutual-password-authentication mechanism, 22.8.2
Simultaneous Authentication of Equals. We return to WPA3 below, at 3.7.5.3 WPA3.

3.7.5.1 WPA2 Four-way handshake


In any secure Wi-Fi authentication protocol, the station must authenticate to the access point and the access point must authenticate
to the station; without the latter part, stations might inadvertently connect to rogue access points, which would then likely gain at
least partial access to private data. This bidirectional authentication is achieved through the so-called four-way handshake, which
also generates a session key, known as the pairwise transient key or PTK, that is independent of the master key. Compromise of
the PTK should not allow an attacker to determine the master key. To further improve security, the PTK is used to generate the
temporal key, TK, used to encrypt data messages, a separate message-signing key used in the MIC code, below, and some
management-packet keys.
In WPA2-Personal, the master key is the pre-shared key (PSK); in WPA2-Enterprise, below, the master key is the negotiated
“pairwise master key”, or PMK. The four-way handshake begins immediately after association and, for WPA2-Enterprise, the
selection of the PMK. None of the four packets that are part of the handshake are encrypted.
Both station and access point begin by each selecting a random string, called a nonce, typically 32 bytes long. In the diagram
below, the access point (authenticator) has chosen ANonce and the station (supplicant) has chosen SNonce. The PTK will be
a secure hash of the master key, both nonces, and both MAC addresses. The first packet of the four-way handshake is sent by the
access point to the station, and contains its nonce, unencrypted. This packet also contains a replay counter, RC; the access point
assigns these sequentially and the station echoes them back.

Four-way WPA2 handshake (RC = replay counter):

1. Access Point → Station: ANonce, RC=r        (the station can now compute the PTK)
2. Station → Access Point: SNonce, RC=r, MIC   (the access point can now compute the PTK)
3. Access Point → Station: RC=r+1, GTK, MIC
4. Station → Access Point: ACK, RC=r+1, MIC

Both sides then install the PTK.
At this point the station has enough information to compute the PTK; in the second message of the handshake it now sends its own
nonce to the access point. The nonce is again sent in the clear, but this second message also includes a digital signature. This
signature is sometimes called a Message Integrity Code, or MIC, and in the 802.11i standard is officially named Michael. It is
calculated in a manner similar to the HMAC mechanism of 22.6.1 Secure Hashes and Authentication, and uses its own key derived
from the PTK.
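For concreteness, here is a sketch of the PTK computation (the 802.11i key-expansion PRF built on HMAC-SHA1, with field lengths simplified); note how both nonces and both MAC addresses enter the hash, sorted so that the two sides compute identical results:

import hmac, hashlib

def derive_ptk(pmk, anonce, snonce, ap_mac, sta_mac, nbytes=48):
    # PTK = PRF(PMK, "Pairwise key expansion",
    #           min(MACs) || max(MACs) || min(nonces) || max(nonces))
    data = (min(ap_mac, sta_mac) + max(ap_mac, sta_mac) +
            min(anonce, snonce) + max(anonce, snonce))
    ptk, counter = b'', 0
    while len(ptk) < nbytes:
        msg = b'Pairwise key expansion' + b'\x00' + data + bytes([counter])
        ptk += hmac.new(pmk, msg, hashlib.sha1).digest()
        counter += 1
    # In practice the PTK is split into the MIC key (KCK), a key-encryption
    # key (KEK), and the temporal key TK used to encrypt data.
    return ptk[:nbytes]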
Upon receipt of the station’s nonce, the access point too is able to compute the PTK. With the PTK now in hand, the access point
verifies the attached signature. If it checks out, that proves to the access point that the station did in fact know the master key, as a
valid signature could not have been constructed without it. The station has now authenticated itself to the access point.
For the third stage of the handshake, the access point, now also in possession of the PTK, sends a signed message to the station.
The replay counter is incremented, and an optional group temporal key, GTK, may be included for encrypting non-unicast
messages. If the GTK is included, it is encrypted with the PTK, though the entire message is not encrypted. When this third
message is received and verified, the access point has authenticated itself to the station. The fourth and final step is simply an
acknowledgment from the client.
Four-way-handshake packets are sent in the EAPOL format, described in the following section. This format can be used to identify
the handshake packets in WireShark scans.
One significant vulnerability of the four-way handshake when WPA2-Personal is used is that if an eavesdropper records the
messages, then it can attempt an offline brute-force attack on the key. Different values of the passphrase used to generate the PSK
can be tried until the MIC values computed for the second and third packets match the values in the corresponding recorded
packets. At this point the attacker can not only authenticate to the network, but can also decrypt packets. This attack is harder with
WPA2-Enterprise, as each user has a different key.
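In outline, the offline attack is a simple loop (a sketch reusing the wpa2_psk and derive_ptk functions sketched earlier; the MIC computation is simplified here to HMAC-SHA1 truncated to 16 bytes):

import hmac, hashlib

def crack(ssid, anonce, snonce, ap_mac, sta_mac, eapol2_body, eapol2_mic, wordlist):
    # Everything here except the passphrase was recorded off the air.
    for guess in wordlist:
        ptk = derive_ptk(wpa2_psk(guess, ssid), anonce, snonce, ap_mac, sta_mac)
        kck = ptk[:16]                       # the MIC key is the first 16 bytes
        mic = hmac.new(kck, eapol2_body, hashlib.sha1).digest()[:16]
        if mic == eapol2_mic:
            return guess                     # passphrase found
    return None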
Other WPA2-Personal stations on the same network can also eavesdrop, given that all stations share the same PSK, and that the
PTK is generated from the PSK and information transmitted without encryption. The Diffie-Hellman-Merkle key-exchange
mechanism, 22.8 Diffie-Hellman-Merkle Exchange, would avoid this difficulty; keys produced this way are not easily determined
by an eavesdropper, even one with inside information about master keys. However, this was not used, in part because WPA needed
to be rushed into service after the failure of WEP.

3.7.5.1.1 KRACK Attack


The purpose of the replay counter, RC in the diagram above, is to prevent an attacker from reusing an old handshake packet.
Despite this effort, replayed or regenerated instances of the third handshake packet can sometimes be used to seriously weaken the
underlying encryption. The attack, known as the Key Reinstallation Attack, or KRACK, is documented in [VP17]. The attack has
several variations, some of which address a particular implementation’s interpretation of the IEEE standard, and some of which
address other Wi-Fi keys (eg the group temporal key) and key handshakes (eg the handshake used by 3.7.4.3 Roaming). We
consider only the most straightforward form here.
The ciphers used by WPA2 are all “stream” ciphers (22.7.4 Stream Ciphers), meaning that, for each packet, the key is used to
generate a keystream of pseudorandom bits, the same length as the packet; the packet is then XORed with this keystream to encrypt
it. It is essential for this scheme’s security that the keystreams of different packets are unrelated; to achieve this, the keystream
algorithm incorporates an encryption nonce, initially 1 and incremented for each successive packet.

The core observation of KRACK is that, whenever the station installs or reinstalls the PTK, it also resets this encryption nonce to 1.
This has the effect of resetting the keystream, so that, for a while, each new packet will be encrypted with exactly the same
keystream as an earlier packet.
This key reinstallation at the station side occurs whenever an instance of the third handshake packet arrives. Because of the
possibility of lost packets, the handshake protocol must allow duplicates of any packet.
The basic strategy of KRACK is now to force key reinstallation, by arranging for either the access point or the attacker to deliver
duplicates of the third handshake packet to the station.
In order to interfere with packet delivery, the attacker must be able to block and delay packets from the access point to the station,
and be able to send its own packets to the station. The easiest way to accomplish this is for the attacker to be set up as a “clone” of
the real access point, with the same MAC address, but operating on a different Wi-Fi channel. The attacker receives messages from
the real access point on the original channel, and is able to selectively retransmit them to the station on the new channel. This can
be described as a channel-based man-in-the-middle attack; cf 22.9.3 Trust and the Man in the Middle. Alternatively, the attacker
may also be able to selectively jam messages from the access point.
If the attacker can block the fourth handshake packet, from station to access point, then the access point will eventually time out
and retransmit a duplicate third packet, complete with properly updated replay counter. The attacker can delay the duplicate third
packet, if desired, in order to prolong the interval of keystream reuse. The station’s response to this duplicate third packet will be
encrypted, but the attacker can usually generate a forged, unencrypted version.
Forcing reuse of the keystream does not automatically break the encryption. However, in many cases the plaintext of a few packets
can be guessed by context, and hence, by XORing, the keystream used to encrypt the packet can be determined. This allows trivial
decryption of any later packet encrypted with the same keystream.
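The XOR arithmetic behind this is worth seeing once (a sketch; random bytes stand in for the reused keystream, and the plaintexts are hypothetical):

import os

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

keystream = os.urandom(32)                   # the keystream reused after a nonce reset
c1 = xor(b'GET /index.html HTTP/1.1\r\n', keystream)   # plaintext guessable from context
c2 = xor(b'password=hunter2&user=bob', keystream)      # a later, secret packet

recovered = xor(c1, b'GET /index.html HTTP/1.1\r\n')   # XOR out the guess: the keystream
print(xor(c2, recovered))                    # b'password=hunter2&user=bob'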
Other possibilities depend on the cipher. When the TKIP cipher is used, a vulnerability in the MIC algorithm may allow
determination of the key used in the MIC; this in turn would allow the attacker to inject new packets, properly signed, into the
connection. These new packets can be encrypted with one of the broken keystreams. This strategy does not work with AES
(CCMP) encryption.
The KRACK vulnerability was fixed in wpa_supplicant by disallowing reinstallation of the same key. That is, if a
retransmission of the third handshake packet is received, it is ignored; the encryption nonce is not reset.

3.7.5.2 WPA2-Enterprise
The WPA2-Enterprise alternative allows each station to have its own separate key. In fact, it largely separates the encryption
mechanisms from the Wi-Fi protocols, allowing sites great freedom in choosing the former. Despite the “enterprise” in the name, it
is also well suited for smaller sites. WPA2-Enterprise is based rather closely on the 802.1X framework, which supports arbitrary
authentication protocols as plug-in modules.
In principle, the only improvement WPA2-Enterprise offers over WPA2-Personal is the ability to assign individual Wi-Fi
passwords. In practice, this is an enormously important feature. It prevents, for example, one user from easily decrypting packets
sent by another user.
The keys are all held by a single common system known as the authentication server, usually unrelated to the access point. The
client node (that is, the Wi-Fi station) is known as the supplicant, and the access point is known as the authenticator.
To begin the authentication process, the supplicant contacts the authenticator using the Extensible Authentication Protocol, or EAP,
with what amounts to a request to authenticate to that access point. EAP is a generic message framework meant to support multiple
specific types of authentication; see RFC 3748 and RFC 5247. The EAP request is forwarded to the authentication server, which
may exchange (via the authenticator) several challenge/response messages with the supplicant. No secret credentials should be sent
in the clear.
EAP is usually used in conjunction with the RADIUS (Remote Authentication Dial-In User Service) protocol (RFC 2865), which
is a specific (but flexible) authentication-server protocol. WPA2-Enterprise is sometimes known as 802.1X mode, EAP mode or
RADIUS mode (though WPA2-Personal is also based on 802.1X, and uses EAP in its four-way handshake).
EAP communication takes place before the supplicant is given an IP address; thus, a mechanism must be provided to support
exchange of EAP packets between supplicant and authenticator. This mechanism is known as EAPOL, for EAP Over LAN. EAP
messages between the authenticator and the authentication server, on the other hand, can travel via IP; in fact, sites may choose to have the authentication server hosted remotely. Specific protocols using the EAP/RADIUS framework often use packet formats
other than EAPOL, but EAPOL will be used in the concluding four-way handshake.
Once the authentication server (eg RADIUS server) is set up, specific per-user authentication methods can be entered. This can
amount to ⟨username,password⟩ pairs (below), or some form of security certificate, or sometimes both. The authentication server
will generally allow different encryption protocols to be used for different supplicants, thus allowing for the possibility that there is
not a common protocol supported by all stations.
In WPA2-Enterprise, the access point no longer needs to know anything about what authentication protocol is actually used; it is
simply the middleman forwarding EAP packets between the supplicant and the authentication server. In particular, the access point
does not need to support any specific authentication protocol. The access point allows the supplicant to connect to the network once
it receives permission to do so from the authentication server.
At the end of the authentication process, the supplicant and the authentication server will, as part of that process, also have
established a shared secret. In WPA2-Enterprise terminology this is known as the pairwise master key or PMK. The
authentication server then communicates the PMK securely to the access point (using any standard protocol; see 22.10 SSH and
TLS). The next step is for the supplicant and the access point to negotiate their session key. This is done using the four-way-
handshake mechanism of the previous section, with the PMK as the master key. The resultant PTK is, as with WPA2-Personal,
used as the session key.
WPA2-Enterprise authentication typically does require that the access point have an IP address, in order to be able to contact the
authentication server. An access point using WPA2-Personal authentication does not need an IP address, though it may have one
simply to enable configuration.

3.7.5.2.1 Enabling WPA2-Enterprise


Configuring a Wi-Fi network to use WPA2-Enterprise authentication is relatively straightforward, as long as an authentication
server running RADIUS is available. We here give an outline of setting up WPA2-Enterprise authentication using FreeRADIUS
(version 2.1.12, 2018). We want to enable per-user passwords, but not per-user certificates. Passwords will be stored on the server
using SHA-1 hashing (22.6 Secure Hashes). This is not necessarily strong enough for production use; see 22.6.2 Password
Hashes for other options. Because passwords will be hashed, the client will have to communicate the actual password to the
authentication server; authentication methods such as those in 22.6.3 CHAP are not an option.
The first step is to set up the access point. This is generally quite straightforward; WPA2-Enterprise is supported even on
inexpensive access points. After selecting the option to enable WPA2-Enterprise security, we will need to enter the IP address of
the authentication server, and also a “shared secret” password for authenticating messages between the access point and the server
(see 22.6.1 Secure Hashes and Authentication for message-authentication techniques).
Configuration of the RADIUS server is a bit more complex, as RADIUS and EAP are both quite general; both were developed
long before 802.1X, and both are used in many other settings as well. Because we have decided to use hashed passwords – which
implies the client station will send the plaintext password to the authentication server – we will need to use an authentication
method that creates an encrypted tunnel. The Protected EAP method is well-suited here; it encrypts its traffic using TLS (22.10.2
TLS, though here without TCP). (There is also an EAP TLS method, using TLS directly and traditionally requiring client-side
certificates, and a TTLS method, for Tunneled TLS.)
Within the PEAP encrypted tunnel, we want to use plaintext password authentication. Here we want the Password Authentication
Protocol, PAP, which basically just asks for the username and password. FreeRADIUS does not allow PAP to run directly within
PEAP, so we interpose the Generic Token Card protocol, GTC. (There is no “token card” device anywhere in sight, but GTC is
indeed quite generic.)
We probably also have to tell the RADIUS server the IP address of the access point. The access point here must have an IP address,
specifically for this communication.
We enable all these things by editing the eap.conf file so as to contain the following entries:

default_eap_type = peap

...

peap {
default_eap_type = gtc
...
}

...

gtc {
auth_type = PAP
...
}

The next step is to create a (username, hashed_password) credential pair on the server. To keep things simple, we will store
credentials in the users file. The username will be “alice”, with password “snorri”. Per the FreeRADIUS rules, we need to
convert the password to its SHA-1 hash, encoded using base64. There are several ways to do this; we will here make use of the
OpenSSL command library:

echo -n "snorri" | openssl dgst -binary -sha1 | openssl base64

This returns the string 7E6FbhrN2TYOkrBti+8W8weC2W8= which we then enter into the users file as follows:

alice SHA1-Password := "7E6FbhrN2TYOkrBti+8W8weC2W8="

Other options include Cleartext-Password , MD5-Password and SSHA1-Password , with the latter being for
salted passwords (which are recommended).
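The same hash can also be computed with just the Python standard library, as a cross-check on the OpenSSL command above:

import base64, hashlib

print(base64.b64encode(hashlib.sha1(b'snorri').digest()).decode())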
With this approach, Alice will have difficulty changing her password, unless she is administrator of the authentication server. This
is not necessarily worse than WPA2-Personal, where Alice shares her password with other users. However, if we want to support
user-controlled password changing, we can configure the RADIUS server to look for the (username, hashed_password) credentials
in a database instead of the users file. It is then relatively straightforward to create a web-based interface for allowing users to
change their passwords.
Now, finally, we try to connect. Any 802.1X client should ask for the username and password, before communication with the
authentication server begins. Some may also ask for a preferred authentication method (though our RADIUS server here is only
offering one), an optional certificate (which we are not using), or an “anonymous identity”, which is a way for a client to specify a
particular authentication server if there are several. If all goes well, connection should be immediate. If not, FreeRADIUS has an
authentication-testing tool, and copious debugging output.

3.7.5.3 WPA3
In 2018 the Wi-Fi Alliance introduced WPA3, a replacement for WPA2. The biggest change is that, when both parties are WPA3-
aware, the WPA2 four-way handshake is replaced with SAE, 22.8.2 Simultaneous Authentication of Equals. The advantage of SAE
here is that an eavesdropper can get nowhere with an offline, dictionary-based, brute-force password attack; recall from the end of
3.7.5.1 WPA2 Four-way handshake that WPA2 is quite vulnerable in this regard. An attacker can still attempt an online brute-force
attack on WPA3, eg by parking a van within Wi-Fi range and trying one password after another, but this is slow.
Another consequence of SAE is forward secrecy (22.9.2 Forward Secrecy). This means that if an attacker obtains the encryption
key for one session, it will not help decrypt older (or newer) sessions. In fact, even if an attacker obtains the master password, it
will not be able to obtain any session keys (although the attacker will be able to connect to the network). Under WPA2, if an
attacker obtains the PMK, then all session keys can be calculated from the nonce values exchanged in the four-way handshake.
As with WPA2, WPA3 requires that both the station and the access point maintain the password cleartext (or at least the key
derived from the password). Because each side must authenticate to the other, it is hard to see how this could be otherwise.

WPA3 encrypts even connections to “open” access points, through what is called Opportunistic Wireless Encryption; see RFC
8110. WPA3 also introduces longer key lengths, and adds some new ciphers.
Although it is not strictly part of WPA3, the EasyConnect feature was announced at the same time. This allows easier connection of
devices that lack screens or keyboards, which makes entering a password difficult. The EasyConnect device should come with a
QR code; scanning the code allows the device to be connected.
Finally, WPA3 contains an official fix to the KRACK attack.

3.7.5.4 Encryption Coverage


Originally, encryption covered only the data packets. A common attack involved forging management packets, eg to force stations
to disassociate from their access point. Sometimes this was done continuously as a denial-of-service attack; it might also be done to
force a station to reassociate and thus reveal a hidden SSID, or to reveal key information to enable a brute-force decryption attack.
The 2009 IEEE 802.11w amendment introduced the option for a station and access point to negotiate management frame
protection, which encrypts (and digitally signs) essential management packets exchanged after the authentication phase is
completed. This includes those station-to-access-point packets requesting deauthentication or disassociation, effectively preventing
the above attacks. However, management frame protection is (as of 2015) seldom enabled by default by consumer-grade Wi-Fi
access points, even when data encryption is in effect.

3.7.6 Wi-Fi Monitoring


Again depending on one's driver, it is sometimes possible to monitor all Wi-Fi traffic on a given channel. Special tools exist for this,
including aircrack-ng and kismet, but often plain WireShark will suffice if one can get the underlying driver into so-called
“monitor” mode. On Linux systems the command iwconfig wlan0 mode monitor should do this (where wlan0 is
the name of the wireless network interface). It may be necessary to first kill other processes that have the wlan0 interface open,
eg with service NetworkManager stop. It may also be necessary to bring the interface down, with
ifconfig wlan0 down, in which case the interface needs to be brought back up after entering monitor mode. Finally, the
receive channel can be set with, eg, iwconfig wlan0 channel 6. (On some systems the interface name may change after
the transition to monitor mode.)
After the mode and channel are set, Wireshark will report the 802.11 management-frame headers, and also the so-called radiotap
header containing information about the transmission data rate, channel, and received signal strength.
One useful experiment is to begin monitoring and then to power up a Wi-Fi enabled device. The WireShark display filter
wlan.addr == device-MAC-address helps focus on the relevant packets (or, better yet, the capture filter ether host
device-MAC-address). The WireShark screenshot below is an example.

[WireShark screenshot.] In it, we see node SamsungE_03:3f:ad broadcast a probe request, which is answered by the access point Cisco-Li_d1:24:40. The next
two packets represent the open-authentication process, followed by two packets representing the association process. The last four
packets, of type EAPOL , represent the WPA2-Personal four-way authentication handshake.

3.7.7 Wi-Fi Polling Mode


Wi-Fi also includes a “polled” mechanism, where one station (the Access Point) determines which stations are allowed to send.
While it is not often used, it has the potential to greatly reduce collisions, or even eliminate them entirely. This mechanism is
known as “Point Coordination Function”, or PCF, versus the collision-oriented mechanism which is then known as “Distributed
Coordination Function”. The PCF name refers to the fact that in this mode it is the Access Point that is in charge of coordinating
which stations get to send when.
The PCF option offers the potential for regular traffic to receive improved throughput due to fewer collisions. However, it is often
seen as intended for real-time Wi-Fi traffic, such as voice calls over Wi-Fi.

The idea behind PCF is to schedule, at regular intervals, a contention-free period, or CFP. During this period, the Access Point
may
- send Data packets to any receiver
- send Poll packets to any receiver, allowing that receiver to reply with its own data packet
- send a combination of the two above (not necessarily to the same receiver)
- send management packets, including a special packet marking the end of the CFP
None of these operations can result in a collision (unless an unrelated but overlapping Wi-Fi domain is involved).
Stations receiving data from the Access Point send the usual ACK after a SIFS interval. A data packet from the Access Point
addressed to station B may also carry, piggybacked in the Wi-Fi header, a Poll request to another station C; this saves a
transmission. Polled stations that send data will receive an ACK from the Access Point; this ACK may be combined in the same
packet with the Poll request to the next station.
At the end of the CFP, the regular “contention period” or CP resumes, with the usual CSMA/CA strategy. The time interval
between the start times of consecutive CFP periods is typically 100 ms, short enough to allow some real-time traffic to be
supported.
During the CFP, all stations normally wait only the Short IFS, SIFS, between transmissions. This works because normally there is
only one station designated to respond: the Access Point or the polled station. However, if a station is polled and has nothing to
send, the Access Point waits for time interval PIFS (PCF Inter-Frame Spacing), of length midway between SIFS and IFS above
(our previous IFS should now really be known as DIFS, for DCF IFS). At the expiration of the PIFS, any non-Access-Point station
that happens to be unaware of the CFP will continue to wait the full DIFS, and thus will not transmit. An example of such a CFP-
unaware station might be one that is part of an entirely different but overlapping Wi-Fi network.
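The spacing relationships are simple; with the classic 802.11b values (assumed here; exact values vary by PHY), each inter-frame space is one slot time longer than the last:

# Typical 802.11b timings, in microseconds:
SIFS = 10                  # ACKs and CFP responses
SLOT = 20
PIFS = SIFS + SLOT         # 30: the Access Point, during the CFP
DIFS = SIFS + 2 * SLOT     # 50: ordinary DCF stations

assert SIFS < PIFS < DIFS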
The Access Point generally maintains a polling list of stations that wish to be polled during the CFP. Stations request inclusion on
this list by an indication when they associate or (more likely) reassociate to the Access Point. A polled station with nothing to send
simply remains quiet.
PCF mode is not supported by many lower-end Wi-Fi routers, and often goes unused even when it is available. Note that PCF mode
is collision-free, so long as no other Wi-Fi access points are active and within range. While the standard has some provisions for
attempting to deal with the presence of other Wi-Fi networks, these provisions are somewhat imperfect; at a minimum, they are not
always supported by other access points. The end result is that polling is not quite as useful as it might be.

3.7.8 MANETs
The MANET acronym stands for mobile ad hoc network; in practice, the term generally applies to ad hoc wireless networks of
sufficient complexity that some internal routing mechanism is needed to enable full connectivity. A mesh network in the sense of
3.7.4.4 Mesh Networks qualifies as a MANET, though MANETs also include networks with much less centralized control, and in
which the routing nodes may be highly mobile. MANETs are also potentially much larger, with some designs intended to handle
many hundreds of routing nodes, while a typical Wi-Fi mesh network may have only a handful of access points. While MANETs
can be built with any wireless mechanism, we will assume here that Wi-Fi is used.
MANET nodes communicate by radio signals with a finite range, as in the diagram below.

[Figure: Typical MANET in which the radio range for each node is represented by a circle around that node. A can reach G either by the route A—B—C—E—G or by A—B—D—F—G.]

Each node’s radio range is represented by a circle centered about that node. In general, two MANET nodes may be able to
communicate only by relaying packets through intermediate nodes, as is the case for nodes A and G in the diagram above. Finding
the optimal route through those intermediate nodes is a significant problem.

MANETs for the People


In the early years of MANETs, many designs focused on a decentralized, communitarian approach, eg wireless community
networks. During the 2010 Arab Spring, MANETs were often proposed (in conjunction with a few users having satellite-
Internet access) as a way to bypass government censorship of the Internet. Fast forward to 2018, and much press discussion of
“mesh networks” is oriented towards those with exceptionally large private residences. Nothing endures but change.

In the field, the radio range of each node may not be very circular at all, due to among other things signal reflection and blocking
from obstructions. An additional complication arises when the nodes (or even just obstructions) are moving in real time (hence the
“mobile” of MANET); this means that a working route may stop working a short time later. For this reason, and others, routing
within MANETs is a good deal more complex than routing in an Ethernet. A switched Ethernet, for example, is required to be loop-
free, so there is never a choice among multiple alternative routes.
Note that, without successful LAN-layer routing, a MANET does not have full node-to-node connectivity and thus does not meet
the definition of a LAN given in 1.9 LANs and Ethernet. With either LAN-layer or IP-layer routing, one or more MANET nodes
may serve as gateways to the Internet.
Note also that MANETs in general do not support broadcast or multicast, unless the forwarding of broadcast and multicast
messages throughout the MANET is built in to the routing mechanism. This can complicate the operation of IPv4 and IPv6
networks, even assuming that the MANET routing mechanism replaces the need for broadcast/multicast protocols like IPv4’s ARP
(7.9 Address Resolution Protocol: ARP) and IPv6’s Neighbor Discovery (8.6 Neighbor Discovery) that otherwise play important
roles in local packet delivery. For example, the common IPv4 address-assignment mechanism we will describe in 7.10 Dynamic
Host Configuration Protocol (DHCP) relies on broadcast and so often needs adaptation. Similarly, IPv6 relies on multicast for
several ancillary services, including address assignment (8.7.3 DHCPv6) and duplicate address detection (8.7.1 Duplicate Address
Detection).
MANETs are simplest when all the nodes are under common, coordinated management, as in the mesh Wi-Fi described above. Life
is much more complicated when nodes are individually owned, and each owner wishes to place limits on the amount of “transit
traffic” – traffic passing through the owner’s node – that is acceptable. Yet this is often the situation faced by schemes to offer Wi-
Fi-based community Internet access.
Finally, we observe that while MANETs are of great theoretical interest, their practical impact has been modest; they are almost
unknown, for example, in corporate environments, beyond the mesh networks of 3.7.4.4 Mesh Networks. They appear most useful
in emergency situations, rural settings, and settings where the conventional infrastructure network has failed or been disabled.

3.7.8.1 Routing in MANETs


Routing in MANETs can be done either at the LAN layer, using physical addresses, or at the IP layer with some minor bending
(below) of the rules.
Either way, nodes must find out about the existence of other nodes, and appropriate routes must then be selected. Route selection
can use any of the mechanisms we describe later in 9 Routing-Update Algorithms.
Routing at the LAN layer is much like routing by Ethernet switches; each node will construct an appropriate forwarding table.
Unlike Ethernet, however, there may be multiple paths to a destination, direct connectivity between any particular pair of nodes
may come and go, and negotiation may be required even to determine which MANET nodes will serve as forwarders.
Routing at the IP layer involves the same issues, but at least IP-layer routing-update algorithms have always been able to handle
multiple paths. There are some minor issues, however. When we initially presented IP forwarding in 1.10 IP - Internet Protocol, we
assumed that routers made their decisions by looking only at the network prefix of the address; if another node had the same
network prefix it was assumed to be reachable directly via the LAN. This model usually fails badly in MANETs, where direct
reachability has nothing to do with addresses. At least within the MANET, then, a modified forwarding algorithm must be used
where every address is looked up in the forwarding table. One simple way to implement this is to have the forwarding tables
contain only host-specific entries, as discussed in 3.1 Virtual Private Networks.
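To make this concrete, here is a minimal sketch in Python of such a host-specific forwarding table; the addresses and next-hop choices are made up for illustration. Every destination is looked up exactly, with no prefix matching, and a lookup miss would in practice trigger the MANET routing mechanism rather than an error.

    # Host-specific forwarding: one table entry per known destination host.
    forwarding_table = {
        "10.0.0.7":  "10.0.0.3",    # reachable by relaying via 10.0.0.3
        "10.0.0.9":  "10.0.0.9",    # a direct neighbor: next hop is the host itself
        "10.0.0.12": "10.0.0.3",
    }

    def next_hop(dest):
        # No network-prefix matching: in a MANET, sharing a prefix with
        # another node says nothing about direct reachability.
        if dest in forwarding_table:
            return forwarding_table[dest]
        raise KeyError("no route to " + dest + "; invoke route discovery")

    print(next_hop("10.0.0.7"))     # 10.0.0.3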

Multiple routing algorithms have been proposed for MANETs. Performance of a given algorithm may depend on the following
factors:
The size of the network
How many nodes have agreed to serve as routers
The degree of node mobility, especially of routing-node mobility if applicable
Whether the nodes (especially routing nodes) are under common administration, and thus may agree to defer their own
transmission interests for the common good
Per-node storage and power availability

3.8: WiMAX and LTE
WiMAX and LTE are both wireless network technologies suitable for data connections to mobile (and sometimes stationary)
devices.
WiMAX is an IEEE standard, 802.16; its original name is WirelessMAN (for Metropolitan Area Network), and this name appears
intermittently in the IEEE standards. In its earlier versions it was intended for stationary subscribers (802.16d), but was later
expanded to support mobile subscribers (802.16e). The stationary-subscriber version is often used to provide residential Internet
connectivity, in both urban and rural areas.
LTE (the acronym itself stands for Long Term Evolution) is a product of the mobile telecom world; it was designed for mobile
subscribers from the beginning. Its official name – at least for its radio protocols – is Evolved UTRA, or E-UTRA, where UTRA
in turn stands for UMTS Terrestrial Radio Access. UMTS stands for Universal Mobile Telecommunications System, a core mobile-
device data-network mechanism with standards dating from the year 2000.
4G Capacity
A medium-level wireless data plan often comes with a 5 GB monthly cap. At the 100 Mbps 4G data rate, that allotment can be
downloaded in under seven minutes. Data rate isn’t everything.
Both LTE and the mobile version of WiMAX are often marketed as fourth generation (or 4G) networking technology. The ITU
has a specific definition for 4G developed in 2008, officially named IMT-Advanced and including a 100 Mbps download rate to
moving devices and a 1 Gbps download rate to more-or-less-stationary devices. Neither WiMAX nor LTE quite qualified
technically, but to marketers that was no impediment. In any event, in December 2010 the ITU issued a statement in which it
“recognized that [the term 4G], while undefined, may also be applied to the forerunners of these technologies, LTE and WiMax”.
So-called Advanced LTE and WiMAX2 are true IMT-Advanced protocols.
As in 3.6.4 Band Width we will use the term “data rate” for what is commonly called “bandwidth” to avoid confusion with the
radio-specific meaning of the latter term.
WiMAX can use unlicensed frequencies, like Wi-Fi, but its primary use is over licensed radio spectrum; LTE is used almost
exclusively over licensed spectrum.
WiMAX and LTE both support a number of options for the width of the frequency band; the wider the band, the higher the data
rate. Downlink (base station to subscriber) data rates can be well over 100 Mbps (uplink rates are usually smaller). Most LTE bands
are either in the range 700-900 MHz or are above 1700 MHz; the lower frequencies tend to be better at penetrating trees and walls.
Like Wi-Fi, WiMAX and LTE subscriber stations connect to a central access point. The WiMAX standard prefers the term base
station which we will use henceforth for both protocols; LTE officially prefers the term “evolved NodeB” or eNB.
The coverage radius for LTE and mobile-subscriber WiMAX might be one to ten kilometers, versus less (sometimes much less)
than 100 meters for Wi-Fi. Stationary-subscriber WiMAX can operate on a larger scale; the coverage radius can be several tens of
kilometers. As distances increase, the data rate is reduced.
Large-radius base stations are typically mounted in towers; smaller-radius base-stations, generally used only in areas densely
populated with subscribers, may use lower antennas integrated discretely into the local architecture. Subscriber stations are not
expected to be able to hear other stations; they interact only with the base station.

3.8.1 Uplink Scheduling


As distances increase, the subscriber-to-base RTT becomes non-negligible. At 10 kilometers, this RTT is 66 µsec, based on the
speed of light of about 300 m/µsec. At 100 Mbps this is enough time to send 800 bytes, making it a priority to reduce the number
of RTTs. To this end, it is no longer practical to use Wi-Fi-style collisions to resolve access contention; it is not even practical to
use the Wi-Fi PCF mode of 3.7.7 Wi-Fi Polling Mode because polling requires additional RTTs. Instead, WiMAX and LTE rely on
base-station-regulated scheduling of transmissions.
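As a quick check of the arithmetic here (a throwaway computation, using the numbers above):

    C = 300.0                          # speed of light, meters/µsec
    rtt_usec = 2 * 10_000 / C          # 10 km each way: ≈ 66.7 µsec
    bits = 100e6 * rtt_usec * 1e-6     # bits sendable at 100 Mbps in one RTT
    print(rtt_usec, bits/8)            # ≈ 66.7 µsec, ≈ 833 bytes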
The base station has no difficulty scheduling downlink transmissions, from base to subscriber: the base station simply sends the
packets sequentially (or in parallel on different sets of subcarriers if OFDM is used). If beamforming MISO antennas are used, or
multiple physically directional antennas, the base station will take this into account.

It is the uplink transmissions – from subscriber to base – that are more complicated to coordinate. Once a subscriber station
completes the network entry process to connect to a base station (3.8.3 Network Entry), it is assigned regular transmission slots,
including times and frequencies. These transmission slots may vary in size over time; the base station may regularly issue new
transmission schedules. Each subscriber station is told in effect that it may transmit on its assigned frequencies starting at an
assigned time and for an assigned length; LTE lengths start at 1 ms and WiMAX lengths at 2 ms. The station synchronizes its clock
with that of the base station as part of the network entry process.
Each subscriber station is scheduled to transmit so that one transmission finishes arriving at the base station just before the next
station’s same-frequency transmission begins arriving. Only minimal “guard intervals” need be included between consecutive
transmissions. Two (or more) consecutive uplink transmissions may in fact be “in the air” simultaneously, as far-away stations need
to begin transmitting early so their signals will arrive at the base station at the expected time.

[Figure: three packets in transit from stations Sta1, Sta2 and Sta3 toward the base station. The packets propagate outwards from the stations at the speed of light, like ripples, each spatially confined between two concentric circles. The three packets arrive at Base sequentially and without overlap.]

The diagram above illustrates this for stations separated by relatively large physical distances (as may be typical for long-range
WiMAX). This strategy for uplink scheduling eliminates the full RTT that Wi-Fi polling mode (3.7.7 Wi-Fi Polling Mode) entails.
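In sketch form, the timing arithmetic might look like the following toy model; the distances, slot length and guard interval are all made up, and real schedules also assign frequencies. Each station is given an arrival time at the base, and begins transmitting early by its one-way propagation delay, known from ranging (3.8.2 Ranging).

    C = 300.0                     # propagation speed, meters/µsec
    SLOT = 1000.0                 # transmission length, µsec
    GUARD = 5.0                   # guard interval between arrivals, µsec

    stations = {"Sta1": 2000.0, "Sta2": 9000.0, "Sta3": 5000.0}   # distances, m

    arrival = 100.0               # first slot begins arriving at the base at t=100
    for name, dist in stations.items():
        start = arrival - dist/C          # transmit early by the one-way delay
        print(f"{name}: transmit at t={start:.1f} µsec; arrival begins t={arrival:.1f}")
        arrival += SLOT + GUARD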
Scheduled timeslots may be periodic (as would be appropriate for voice) or may occur at varying intervals. Quality-of-Service
requests may also enter into the schedule; LTE focuses on end-to-end QoS while WiMAX focuses on subscriber-to-base QoS.
When a station has data to send, it may include in its next scheduled transmission a request for a longer transmission interval; if the
request is granted, the station may send its data (or at least some of its data) in its next scheduled transmission slot. When a station
is done transmitting, its timeslot may shrink back to the minimum, and may be scheduled less frequently as well, but it does not
disappear. Stations without data to send remain connected to the base station by sending “empty” messages during these slots.

3.8.2 Ranging
The uplink scheduling of the previous section requires that each subscriber station know the distance to the base station. If a
subscriber station is to transmit so that its message arrives at the base station at a certain time, it must actually begin transmission
early by an amount equal to the one-way station-to-base propagation delay. This distance/delay measurement process is called
ranging.
Ranging can be accomplished through any RTT measurement. Any base-station delay in replying, once a subscriber message is
received, simply needs to be subtracted from the total RTT. Of course, that base-station delay needs also to be communicated back
to the subscriber.
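In sketch form (with hypothetical numbers), the subscriber’s side of this calculation is simply:

    # Ranging: derive the one-way propagation delay from a measured RTT.
    # reply_delay is the base station's processing delay, which it reports.
    def one_way_delay_usec(measured_rtt_usec, reply_delay_usec):
        return (measured_rtt_usec - reply_delay_usec) / 2

    d = one_way_delay_usec(80.0, 13.3)   # hypothetical measurements
    print(d)   # 33.35 µsec: begin transmitting this far ahead of the slot time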
The distance to the base station is used not only for the subscriber station’s transmission timing, but also to determine its power
level; signals from each subscriber station, no matter where located, should arrive at the base station with about the same power.

3.8.3 Network Entry


The scheduling process eliminates the potential for collisions between normal data transmissions. But there remains the issue of
initial entry to the network. If a handoff is involved, the new base station can be informed by the old base station, and send an
appropriate schedule to the moving subscriber station. But if the subscriber station was just powered on, or is arriving from an area

without LTE/WiMAX coverage, potential transmission collisions are unavoidable. Fortunately, network entry is infrequent, and so
collisions are even less frequent.
A subscriber station begins the network-entry connection process to a base station by listening for the base station’s transmissions;
these message streams contain regular management messages containing, among other things, information about available data
rates in each direction. Also included in the base station’s message stream is information about when network-entry attempts can be
made.
In WiMAX these entry-attempt timeslots are called ranging intervals; the subscriber station waits for one of these intervals and
sends a “range-request” message to the base station. These ranging intervals are open to all stations attempting network entry, and
if another station transmits at the same time there will be a collision. An Ethernet/Wi-Fi-like exponential-backoff process is used if
a collision does occur.
In LTE the entry process is known as RACH, for Random Access CHannel. The base station designates certain 1 ms timeslots for
network entry. During one of these slots an entry-seeking subscriber chooses at random one of up to 64 predetermined random
access preambles (some preambles may be reserved for a second, contention-free form of RACH), and transmits it. The 1-ms
timeslot corresponds to 300 kilometers, much larger than any LTE cell, so the fact that the subscriber does not yet know its distance
to the base does not matter.
The preambles are mathematically “orthogonal”, in such a way that as long as no two RACH-participating subscribers choose the
same preamble, the base station can decode overlapping preambles and thus receive the set of all preambles transmitted during the
RACH timeslot. The base station then sends a reply, listing the preambles received and, in effect, an initial schedule indexed by
preamble of when each newly entering subscriber station can transmit actual data. This reply is sent to a special temporary
multicast address known as a radio network temporary identifier, or RNTI, as the base station does not yet know the actual identity
of any new subscriber. Those identities are learned as the new subscribers transmit to the base station according to this initial
schedule.
A collision occurs when two LTE subscriber stations have the misfortune of choosing the same preamble in the same RACH
timeslot, in which case the chosen preamble will not appear in the initial schedule transmitted by the base station. As for WiMAX,
collisions are rare because network entry is rare. Subscribers experiencing a collision try again during the next RACH timeslot,
choosing at random a new preamble.
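How rare is rare? The chance that k stations entering in the same RACH timeslot include two that choose the same preamble is the classic birthday-problem computation; the sketch below is a back-of-the-envelope estimate, not anything from the LTE standard.

    # Probability that k entrants choosing uniformly among n=64 preambles
    # do not all choose distinct preambles (the birthday problem).
    def rach_collision_prob(k, n=64):
        p_distinct = 1.0
        for i in range(k):
            p_distinct *= (n - i) / n
        return 1 - p_distinct

    for k in (2, 3, 5):
        print(k, round(rach_collision_prob(k), 3))
    # 2 → 0.016, 3 → 0.046, 5 → 0.148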
For both WiMAX and LTE, network entry is the only time when collisions can occur; afterwards, all subscriber-station
transmissions are scheduled by the base station.
If there is no collision, each subscriber station is able to use the base station’s initial-response transmission to make its first ranging
measurement. Subscribers must have a ranging measurement in hand before they can send any scheduled transmission.

3.8.4 Mobility
There are some significant differences between stationary and mobile subscribers. First, mobile subscribers will likely expect some
sort of handoff from one base station to another as the subscriber moves out of range of the first. Second, moving subscribers mean
that the base-to-subscriber ranging information may change rapidly; see exercise 7.0. If the subscriber does not update its ranging
information often enough, it may transmit too early or too late. If the subscriber is moving fast enough, the Doppler effect may also
alter frequencies.

3.9: Fixed Wireless
This category includes all wireless-service-provider systems where the subscriber’s location does not change. Often, but not
always, the subscriber will have an outdoor antenna for improved reception and range. Fixed-wireless systems can involve relay
through satellites, or can be terrestrial.

3.9.1 Terrestrial Wireless


Terrestrial wireless – also called terrestrial broadband or fixed-wireless broadband – involves direct (non-satellite) radio
communication between subscribers and a central access point. Access points are usually tower-mounted and serve multiple
subscribers, though single-subscriber point-to-point “microwave links” also exist. A multi-subscriber access point may serve an
area with radius up to several tens of miles, depending on the technology, though more common ranges are under ten miles.
WiMAX 802.16d is one form of terrestrial wireless, but there are several others. Frequencies may be either licensed or unlicensed.
Unlicensed frequency bands are available at around 900 MHz, 2.4 GHz, and 5 GHz. Nominally all three bands require that line-of-
sight transmission be used, though that requirement becomes stricter as the frequency increases. Lower frequencies tend to be
better at “seeing” through trees and other obstructions.

Trees vs Signal

Photo of the author attempting to improve his 2.4 GHz terrestrial-wireless signal via tree trimming.

Terrestrial fixed wireless was originally popularized for rural areas, where residential density is too low for economical cable
connections. However, some fixed-wireless ISPs now operate in urban areas, often using WiMAX. One advantage of terrestrial
fixed-wireless in remote areas is that each antenna covers a much smaller geographical area than a satellite, generally meaning that
there is more data bandwidth available per user and the cost per megabyte is much lower.
Outdoor subscriber antennas often use a parabolic dish to improve reception; sizes range from 10 to 50 cm in diameter. The size of
the dish may depend on the distance to the central tower.
While there are standardized fixed-wireless systems, such as WiMAX, there are also a number of proprietary alternatives, including
systems from Trango and Canopy. Fixed-wireless systems might, in fact, be considered one of the last bastions of proprietary LAN
protocols. This lack of standardization is due to a variety of factors; two primary ones are the relatively modest overall demand for
this service and the fact that most antennas need to be professionally installed by the ISP to ensure that they are “properly
mounted, aligned, grounded and protected from lightning”.

3.9.2 Satellite Internet


An extreme case of fixed wireless is satellite Internet, in which signals pass through a satellite in geosynchronous orbit (35,786
km above the earth’s surface). Residential customers have parabolic antennas typically from 70 to 100 cm in diameter, larger than
those used for terrestrial wireless but smaller than the dish antennas used at access points. The geosynchronous satellite orbit means

that the antennas need to be pointed only once, at installation. Transmitter power is typically 1-2 watts, remarkably low for a signal
that travels 35,786 km.
The primary problem associated with satellite Internet is very long RTTs. The speed-of-light round-trip propagation delay is
about 500 ms to which must be added queuing delays for the often-backlogged access point (my own personal experience
suggested that RTTs of close to 1,000 ms were the norm). These long delays affect real-time traffic such as VoIP and gaming, but as
we shall see in 14.11 The Satellite-Link TCP Problem bulk TCP transfers also perform poorly with very long RTTs. To provide
partial compensation for the TCP issue, many satellite ISPs provide some sort of “acceleration” for bulk downloads: a web page,
for example, would be downloaded rapidly by the access point and streamed to the satellite and back down to the user via a
proprietary mechanism. Acceleration, however, cannot help interactive connections such as VPNs.
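The 500 ms figure is straightforward to verify: one trip through the satellite to the access point and back involves four traversals of the earth-to-satellite distance.

    ALTITUDE_KM = 35_786          # geosynchronous altitude
    C_KM_PER_MS = 300             # speed of light
    print(4 * ALTITUDE_KM / C_KM_PER_MS)   # ≈ 477 ms, before any queuing delay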
Another common feature of satellite Internet is a low daily utilization cap, typically in the hundreds of megabytes. Utilization caps
are directly tied to the cost of maintaining satellites, but also to the fact that one satellite covers a great deal of ground, and so its
available capacity is shared by a large number of users.
The delay issues associated with satellite Internet would go away if satellites were in so-called low-earth orbits, a few hundred km
above the earth. RTTs would then be comparable with terrestrial Internet. Fixed-direction antennas could no longer be used. A large
number of satellites would need to be launched to provide 24-hour coverage even at one location. To date (2016), such a network of
low-earth satellites has been proposed, but not yet launched.

3.10: Epilog and Exercises
Along with a few niche protocols, we have focused primarily here on wireless and on virtual circuits. Wireless, of course, is
enormously important: it is the enabler for mobile devices, and has largely replaced traditional Ethernet for home and office
workstations.
While it is sometimes tempting (in the IP world at least) to write off ATM as a niche technology, virtual circuits are a serious
conceptual alternative to datagram forwarding. As we shall see in 20 Quality of Service, IP has problems handling real-time traffic,
and virtual circuits offer a solution. The Internet has so far embraced only small steps towards virtual circuits (such as MPLS, 20.12
Multi-Protocol Label Switching (MPLS)), but they remain a tantalizing strategy.

3.11 Exercises
Exercises are given fractional (floating point) numbers, to allow for interpolation of new exercises. Exercise 4.5 is distinct, for
example, from exercises 4.0 and 5.0. Exercises marked with a ♢ have solutions or hints at 24.3 Solutions for Other LANs.
1.0. Suppose remote host A uses a VPN connection to connect to host B, with IP address 200.0.0.7. A’s normal Internet connection
is via device eth0 with IP address 12.1.2.3; A’s VPN connection is via device ppp0 with IP address 10.0.0.44. Whenever A
wants to send a packet via ppp0, it is encapsulated and forwarded over the connection to B at 200.0.0.7.
(a). Suppose A’s IP forwarding table is set up so that all traffic to 200.0.0.7 uses eth0 and all traffic to anywhere else uses
ppp0. What happens if an intruder M attempts to open a connection to A at 12.1.2.3? What route will packets from A to M
take?
(b). Suppose A’s IP forwarding table is (mis)configured so that all outbound traffic uses ppp0. Describe what will happen when
A tries to send a packet.
2.0. Suppose remote host A wishes to use a TCP-based VPN connection to connect to host B, with IP address 200.0.0.7. However,
the VPN software is not available for host A. Host A is, however, able to run that software on a virtual machine V hosted by A; A
and V have respective IP addresses 10.0.0.1 and 10.0.0.2 on the virtual network connecting them. V reaches the outside world
through network address translation (7.7 Network Address Translation), with A acting as V’s NAT router. When V runs the VPN
software, it forwards packets addressed to B the usual way, through A using NAT. Traffic to any other destination it encapsulates
over the VPN.
Can A configure its IP forwarding table so that it can make use of the VPN? If not, why not? If so, how? (If you prefer, you may
assume V is a physical host connecting to a second interface on A; A still acts as V’s NAT router.)
3.0. Token Bus was a proprietary Ethernet-based network. It worked like Token Ring in that a small token packet was sent from
one station to the next in agreed-upon order, and a station could transmit only when it had just received the token.
(a). If the data rate is 10 Mbps and the token is 64 bytes long (the 10-Mbps Ethernet minimum packet size), what is the average
wait to receive the token on an idle network with 40 stations? (The average number of stations the token must pass through is 40/2
= 20.) Ignore the propagation delay and the gap Ethernet requires between packets.
(b)♢. Sketch a protocol by which stations can sort themselves out to decide the order of token transmission; that is, an order of the
stations S0 … Sn-1 where station Si sends the token to station S(i+1) mod n.
4.0. A seemingly important part of the IEEE 802.11 Wi-Fi standard is that stations do not transmit when another station is
transmitting; this is meant to reduce collisions. And yet the standard states “transmission of the ACK frame shall commence after a
SIFS period, without regard to the busy/idle state of the medium”; that is, the ACK sender does not listen first for an idle network.
Give a scenario in which the transmission of an ACK while the medium is not idle does not result in a collision! That is, station A
has just finished transmitting a packet to station C, but before C can begin sending its ACK, another station B starts transmitting.
Hint: this is another example of the hidden-node problem, 3.7.1.4 Hidden-Node Problem, with station C again the “middle” station.
Recall also that simultaneous transmission results in a collision only if some node fails to be able to read either signal as a result.
(Also note that, if C does not send its ACK, despite B, the packet just sent from A has to all intents and purposes been lost.)
4.5.♢ Give an example of a three-sender hidden-node collision (3.7.1.4 Hidden-Node Problem); that is, three nodes A, B and C, no
two of which can see one another, where all can reach a fourth node D. Can you do this for more than three sending nodes?

5.0. Suppose the average contention interval in a Wi-Fi network (802.11g) is 64 SlotTimes. The average packet size is 1 kB, and
the data rate is 54 Mbps. At that data rate, it takes about (8×1024)/54 = 151 µsec to transmit a packet.
(a). How long is the average contention interval, in µsec?
(b)♢. What fraction of the total potential bandwidth is lost to contention? (See 2.1.11 Analysis of Classic Ethernet for a similar
example).
6.0. WiMAX and LTE subscriber stations are not expected to hear one another at all. For Wi-Fi non-access-point stations in an
infrastructure (access-point) setting, on the other hand, listening to other non-access-point transmissions is encouraged.
(a). List some ways in which Wi-Fi non-access-point stations in an infrastructure (access-point) network do sometimes respond to
packets sent by other non-access-point stations. The responses need not be in the form of transmissions.
(b). Explain why Wi-Fi stations cannot be required to respond as in part (a).
7.0. Suppose WiMAX subscriber stations can be moving, at speeds of up to 33 meters/sec (the maximum allowed under 802.16e).
(a). How much earlier (or later) can one subscriber packet arrive? Assume that the ranging process updates the station’s
propagation delay once a minute. The speed of light is about 300 meters/µsec.
(b). With 5000 senders per second, how much time out of each second must be spent on “guard intervals” accommodating the
early/late arrivals above? You will need to double the time from part (a), as the base station cannot tell whether the signal from a
moving subscriber will arrive earlier or later.
8.0. [SM90] contained a proposal for sending IP packets over ATM as N cells as in AAL-5, followed by one cell containing the
XOR of all the previous cells. This way, the receiver can recover from the loss of any one cell. Suppose N=20 here; with the SM90
mechanism, each packet would require 21 cells to transmit; that is, we always send 5% more. Suppose the cell loss-rate is p
(presumably very small). If we send 20 cells without the SM90 mechanism, we have a probability of about 20p that some cell
will be lost, and we will have to retransmit the entire 20 again. This gives an average retransmission amount of about 20p extra
packets. For what value of p do the with-SM90 and the without-SM90 approaches involve about the same total number of cell
transmissions?
9.0. In the example in 3.4 Virtual Circuits, give the VCI table for switch S5.
10.0. Suppose we have the following network:

A───S1────────S2───B
     │         │
     │         │
     │         │
C───S3────────S4───D

The virtual-circuit switching tables are below. Ports are identified by the node at the other end. Identify all the connections. Give
the path for each connection and the VCI on each link of the path.
Switch S1:

    VCIin  portin  VCIout  portout
      1      A        2      S3
      2      A        2      S2
      3      A        3      S2

Switch S2:

    VCIin  portin  VCIout  portout
      2      S4       1      B
      2      S1       3      S4
      3      S1       4      S4

Switch S3:

    VCIin  portin  VCIout  portout
      2      S1       2      S4
      3      S4       2      C

Switch S4:

    VCIin  portin  VCIout  portout
      2      S3       2      S2
      3      S2       3      S3
      4      S2       1      D

10.5♢ We have the same network as the previous exercise:

A───S1────────S2───B
     │         │
     │         │
     │         │
C───S3────────S4───D

The virtual-circuit switching tables are below. Ports are identified by the node at the other end. Identify all the connections. Give
the path for each connection and the VCI on each link of the path.
Switch S1:

    VCIin  portin  VCIout  portout
      1      A        2      S2
      3      S3       2      A

Switch S2:

    VCIin  portin  VCIout  portout
      2      S1       3      S4
      1      B        2      S4

Switch S3:

    VCIin  portin  VCIout  portout
      2      S1       2      S4
      1      S4       3      S1

Switch S4:

    VCIin  portin  VCIout  portout
      3      S2       2      D
      2      S2       1      S3

11.0. Suppose we have the following network:

A───S1────────S2───B
     │         │
     │         │
     │         │
C───S3────────S4───D

Give virtual-circuit switching tables for the following connections. Route via a shortest path.
(a). A–D
(b). C–B, via S4
(c). B–D
(d). A–D, via whichever of S2 or S3 was not used in part (a)
12.0. Below is a set of switches S1 through S4. Define VCI-table entries so the virtual circuit from A to B follows the path

A ⟶ S1 ⟶ S2 ⟶ S4 ⟶ S3 ⟶ S1 ⟶ S2 ⟶ S4 ⟶ S3 ⟶ B
That is, each switch is visited twice.

A──S1─────S2
    │      │
    │      │
B──S3─────S4

13.0. In this exercise we outline the two-ray ground model of wireless transmission in which the signal power is inversely
proportional to the fourth power of the distance, rather than following the usual inverse-square law. Some familiarity with
trigonometric (or complex-exponential) manipulations is necessary.
Suppose the signal near the transmitter is A sin(2πft), where A is the amplitude, f the frequency and t the time. Signal power is
proportional to A². At distance r≥1, the amplitude is reduced by a factor of 1/r (so the power is reduced by 1/r²) and the signal is
delayed by a time r/c, where c is the speed of light, giving

    (A/r) sin(2πf(t − r/c)) = (A/r) sin(2πft − 2πλr)    (3.10.1)

where λ = f/c is the spatial frequency, the reciprocal of the wavelength.


The received signal is the superposition of the line-of-sight signal path and its reflection from the ground, as in the following
diagram:

[Figure: sender and receiver at equal heights h above the ground, separated by distance r. The line-of-sight path has length r; the ground-reflected path, touching the ground at the midpoint (horizontal distance r/2 from each end), has length rʹ.]

Sender and receiver are shown at equal heights above the ground, for simplicity. We assume 100% ground reflectivity (this is
reasonable for very shallow angles). The phase of the ground signal is reversed 180° by the reflection, and then is delayed slightly
more by the slightly longer path.
(a). Find a formula for the length of the reflected-signal path rʹ, in terms of r and h. Eliminate the square root, using the
approximation (1+x)^(1/2) ≃ 1 + x/2 for small x. (You will need to factor r out of the square-root expression for rʹ first.)

(b). Simplify the difference (because of the 180° reflection phase-reversal) of the line-of-sight and reflected-signal paths. Use the
approximation

    sin(X) − sin(Y) = 2 sin((X−Y)/2) cos((X+Y)/2) ≃ (X−Y) cos((X+Y)/2) ≃ (X−Y) cos(X)    (3.10.2)

for X ≃ Y (or else use complex exponentials, noting sin(X) is the real part of −i·e^(iX)). You may assume rʹ−r is smaller than the
wavelength c/f. Start with

    (A/r) sin(2πft − 2πλr) − (A/rʹ) sin(2πft − 2πλrʹ)    (3.10.3)

It helps to isolate the r→rʹ change to one subexpression at a time, by adding and subtracting the identical middle terms:

    ((A/r) sin(2πft − 2πλr) − (A/rʹ) sin(2πft − 2πλr)) + ((A/rʹ) sin(2πft − 2πλr) − (A/rʹ) sin(2πft − 2πλrʹ))
      = (A/r − A/rʹ) sin(2πft − 2πλr) + (A/rʹ) (sin(2πft − 2πλr) − sin(2πft − 2πλrʹ))

(c). Show that the approximate amplitude of this difference is proportional to 1/r², making the relative power proportional to 1/r⁴.
14.0. In the four-way handshake of 3.7.5 Wi-Fi Security, suppose station B (for Bad) records the successful handshake of station A
and the access point. A then leaves the network, and B attempts a replay attack: B uses A’s packets in the handshake. At exactly
what point does the handshake break down?

CHAPTER OVERVIEW

4: Links
At the lowest (logical) level, network links look like serial lines. In this chapter we address how packet structures are built on
top of serial lines, via encoding and framing. Encoding determines how bits and bytes are represented on a serial line; framing
allows the receiver to identify the beginnings and endings of packets. We then conclude with the high-speed serial lines offered by
the telecommunications industry, T-carrier and SONET, upon which almost all long-haul point-to-point links that tie the Internet
together are based.
4.1: Encoding and Framing
4.2: Time-Division Multiplexing
4.E: Links (Exercises)

4.1: Encoding and Framing
A typical serial line is ultimately a stream of bits, not bytes. How do we identify byte boundaries? This is made slightly more
complicated by the fact that, beneath the logical level of the serial line, we generally have to avoid transmitting long runs of
identical bits, because the receiver may simply lose count; this is the clock synchronization problem (sometimes called the clock
recovery problem). This means that, one way or another, we cannot always just send the desired bits sequentially; for example,
extra bits are often inserted to break up long runs. Exactly how we do this is the encoding mechanism.
Once we have settled the transmission of bits, the next step is to determine how the receiver identifies the start of each new packet.
Ethernet packets are separated by physical gaps, but for most other link mechanisms packets are sent end-to-end, with no breaks.
How we tell when one packet stops and the next begins is the framing problem.
To summarize:
encoding: correctly recognizing all the bits in a stream
framing: correctly recognizing packet boundaries
These are related, though not the same.
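One classic mechanism, HDLC’s bit stuffing, addresses both problems at once: the sender inserts a 0-bit after every run of five consecutive 1-bits, so the special flag pattern 01111110 can never appear within a frame and can safely mark frame boundaries; long runs of 1s (though not of 0s) are broken up in the bargain. A minimal sketch, representing bits as '0'/'1' characters for clarity:

    def stuff(bits):
        # After five consecutive 1-bits, insert a 0-bit.
        out, run = [], 0
        for b in bits:
            out.append(b)
            run = run + 1 if b == '1' else 0
            if run == 5:
                out.append('0')      # the stuffed bit
                run = 0
        return ''.join(out)

    def unstuff(bits):
        # Remove the 0-bit that must follow five consecutive 1-bits.
        out, run, i = [], 0, 0
        while i < len(bits):
            b = bits[i]
            out.append(b)
            run = run + 1 if b == '1' else 0
            if run == 5:
                i += 1               # skip the stuffed 0
                run = 0
            i += 1
        return ''.join(out)

    data = '0111111101111101'
    assert unstuff(stuff(data)) == data
    print(stuff(data))               # 011111011011111001: no run of six 1s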
For long (multi-kilometer) electrical serial lines, in addition to the clock-related serial-line requirements we also want the average
voltage to be zero; that is, we want no DC component. We will mostly concern ourselves here, however, only with lines short
enough for this not to be a major concern.

4.2: Time-Division Multiplexing
Classical circuit switching means a separate wire for each connection. This is still in common use for residential telephone
connections: each subscriber has a dedicated wire to the Central Office. But a separate physical line for each connection is not a
solution that scales well.
Once upon a time it was not uncommon to link computers with serial lines, rather than packet networks. This was most often done
for file transfers, but telnet logins were also done this way. The problem with this approach is that the line had to be dedicated to
one application (or one user) at a time.
Packet switching naturally implements multiplexing (sharing) on links; the demultiplexer is the destination address. Port numbers
allow demultiplexing of multiple streams to same destination host.
There are other ways for multiple channels to share a single wire. One approach is frequency-division multiplexing, or putting
each channel on a different carrier frequency. Analog cable television did this. Some fiber-optic protocols also do this, calling it
wavelength-division multiplexing.
But perhaps the most pervasive alternative to packets is the voice telephone system’s time division multiplexing, or TDM,
sometimes prefixed with the adjective synchronous. The idea is that we decide on a number of channels, N, and the length of a
timeslice, T, and allow each sender to send over the channel for time T, with the senders taking turns in round-robin style. Each
sender gets to send for time T at regular intervals of NT, thus receiving 1/N of the total bandwidth. The timeslices consume no
bandwidth on headers or addresses, although sometimes there is a small amount of space dedicated to maintaining synchronization
between the two endpoints. Here is a diagram of sending with N=8:

... A B C D E F G H A B C D E F G H A B ...

Time-Division Multiplexing

Note, however, that if a sender has nothing to send, its timeslice cannot be used by another sender. Because so much data traffic is
bursty, involving considerable idle periods, TDM has traditionally been rejected for data networks.
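The round-robin sharing, and the waste from idle slots, can be seen in a toy sketch; the byte-per-slot framing here is invented for illustration (real TDM systems such as the telephone system’s T-carrier interleave at fixed positions within a synchronized frame).

    # Synchronous TDM: senders take fixed turns; an idle sender's timeslice
    # is transmitted as padding and cannot be reassigned to anyone else.
    def tdm_frames(channels, slot_bytes=1, pad=b'\x00'):
        streams = [bytearray(c) for c in channels]
        while any(streams):
            frame = bytearray()
            for s in streams:
                chunk = s[:slot_bytes]
                del s[:slot_bytes]
                frame += chunk.ljust(slot_bytes, pad)   # idle slot: padding
            yield bytes(frame)

    for frame in tdm_frames([b'AAAA', b'BB', b'CCCCCC']):
        print(frame)
    # b'ABC' b'ABC' b'A\x00C' b'A\x00C' b'\x00\x00C' b'\x00\x00C'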

4.E: Links (Exercises)

5: Packets
In this chapter we address a few abstract questions about packets, and take a close look at transmission times. We also consider
how big packets should be, and how to detect transmission errors. These issues are independent of any particular set of protocols.

5.1 Packet Delay


There are several contributing sources to the delay encountered in transmitting a packet. On a LAN, the most significant is usually
what we will call bandwidth delay: the time needed for a sender to get the packet onto the wire. This is simply the packet size
divided by the bandwidth, after everything has been converted to common units (either all bits or all bytes). For a 1500-byte packet
on 100 Mbps Ethernet, the bandwidth delay is 12,000 bits / (100 bits/µsec) = 120 µsec.
There is also propagation delay, relating to the propagation of the bits at the speed of light (for the transmission medium in
question). This delay is the distance divided by the speed of light; for 1,000 m of Ethernet cable, with a signal propagation speed of
about 230 m/µsec, the propagation delay is about 4.3 µsec. That is, if we start transmitting the 1500 byte packet of the previous
paragraph at time T=0, then the first bit arrives at a destination 1,000 m away at T = 4.3 µsec, and the last bit is transmitted at 120
µsec, and the last bit arrives at T = 124.3 µsec.
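Both delays are simple divisions; a quick sketch reproducing the numbers above:

    packet_bits = 1500 * 8                    # 1500-byte packet
    bw_delay    = packet_bits / 100e6         # at 100 Mbps: 1.2e-4 s = 120 µsec
    prop_delay  = 1000 / 230e6                # 1000 m at 230 m/µsec: ≈ 4.3 µsec
    print(bw_delay * 1e6, prop_delay * 1e6)   # 120.0 µsec, ≈ 4.35 µsec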

Minimizing Delay
Back in the last century, gamers were sometimes known to take advantage of players with slow (as in dialup) links; an
opponent could be eliminated literally before he or she could respond. As an updated take on this, some financial-trading firms
have set up microwave-relay links between trading centers, say New York and Chicago, in order to reduce delay. In
computerized trading, milliseconds count. A direct line of sight from New York to Chicago – which we round off to 1200 km –
takes about 4 ms in air, where signals propagate at essentially the speed of light c = 300 km/ms. But fiber is slower; even an
absolutely straight run would take 6 ms at glass fiber’s propagation speed of 200 km/ms. In the presence of high-speed trading,
this 2 ms savings is of considerable financial significance.

Bandwidth delay, in other words, tends to dominate within a LAN.


But as networks get larger, propagation delay begins to dominate. This also happens as networks get faster: bandwidth delay goes
down, but propagation delay remains unchanged.
An important difference between bandwidth delay and propagation delay is that bandwidth delay is proportional to the amount of
data sent while propagation delay is not. If we send two packets back-to-back, then the bandwidth delay is doubled but the
propagation delay counts only once.
The introduction of switches leads to store-and-forward delay, that is, the time spent reading in the entire packet before any of it
can be retransmitted. Store-and-forward delay can also be viewed as an additional bandwidth delay for the second link.
Finally, a switch may or may not also introduce queuing delay; this will often depend on competing traffic. We will look at this in
more detail in 14 Dynamics of TCP, but for now note that a steady queuing delay (eg due to a more-or-less constant average queue
utilization) looks to each sender more like propagation delay than bandwidth delay, in that if two packets are sent back-to-back and
arrive that way at the queue, then the pair will experience only a single queuing delay.

5.1.1 Delay examples


Case 1: A──────B
Propagation delay is 40 µsec
Bandwidth is 1 byte/µsec (1 MB/sec, 8 Mbit/sec)
Packet size is 200 bytes (200 µsec bandwidth delay)
Then the total one-way transmit time for one packet is 240 µsec = 200 µsec + 40 µsec. To send two back-to-back packets, the time
rises to 440 µsec: we add one more bandwidth delay, but not another propagation delay.
Case 2: A──────────────────B

Like the previous example except that the propagation delay is increased to 4 ms
The total transmit time for one packet is now 4200 µsec = 200 µsec + 4000 µsec. For two packets it is 4400 µsec.
Case 3: A──────R──────B

We now have two links, each with propagation delay 40 µsec; bandwidth and
packet size as in Case 1.
The total transmit time for one 200-byte packet is now 480 µsec = 240 + 240. There are two propagation delays of 40 µsec each; A
introduces a bandwidth delay of 200 µsec and R introduces a store-and-forward delay (or second bandwidth delay) of 200 µsec.
Case 4: A──────R──────B

The same as 3, but with data sent as two 100-byte packets


The total transmit time is now 380 µsec = 3x100 + 2x40. There are still two propagation delays, but there is only 3/4 as much
bandwidth delay because the transmission of the first 100 bytes on the second link overlaps with the transmission of the second 100
bytes on the first link.

[Ladder diagrams for Case 1 (one link), Case 3 (two links) and Case 4 (two half-sized packets); bandwidth delay 200 µsec, per-link propagation delay 40 µsec. In Case 1 the packet finishes arriving at T=240; in Case 3 it finishes arriving at R at T=240 and at B at T=480; in Case 4 the two packets finish arriving at B at T=280 and T=380.]

These ladder diagrams represent the full transmission; a snapshot state of the transmission at any one instant can be obtained by
drawing a horizontal line. In the middle, case 3, diagram, for example, at no instant are both links active. Note that sending two
smaller packets is faster than one large packet. We expand on this important point below.
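For a path of k identical store-and-forward links, each with per-packet bandwidth delay B and propagation delay P, the pattern generalizes: sending the data as n packets takes (n + k − 1)×B + k×P, since only the first packet pays the full store-and-forward cost at every switch and the rest follow one bandwidth delay behind. A two-line check against Cases 3 and 4:

    def transfer_time(n_packets, bw_delay, k_links, prop_delay):
        # (n + k - 1) bandwidth delays plus k propagation delays, in µsec
        return (n_packets + k_links - 1) * bw_delay + k_links * prop_delay

    print(transfer_time(1, 200, 2, 40))    # Case 3: 480 µsec
    print(transfer_time(2, 100, 2, 40))    # Case 4: 380 µsec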
Now let us consider the situation when the propagation delay is the most significant component. The cross-continental US
roundtrip delay is typically around 50-100 ms (propagation speed 200 km/ms in cable, 5,000-10,000 km cable route, or about 3-
6000 miles); we will use 100 ms in the examples here. At a bandwidth of 1.0 Mbps, 100ms is about 12 kB, or eight full-sized
Ethernet packets. At this bandwidth, we would have four packets and four returning ACKs strung out along the path. At 1.0 Gbit/s,
in 100ms we can send 12,000 kB, or about 8,000 Ethernet packets, before the first ACK returns.
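Again as a quick check of the arithmetic, assuming 1500-byte packets:

    RTT = 0.100                                # seconds
    for bps in (1.0e6, 1.0e9):
        in_flight = bps * RTT / 8              # bytes sendable in one RTT
        print(f"{bps/1e6:g} Mbps: {in_flight/1e3:g} kB, "
              f"{in_flight/1500:.0f} full-sized Ethernet packets")
    # 1 Mbps: 12.5 kB, 8 packets;  1000 Mbps: 12500 kB, 8333 packets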
At most non-LAN scales, the delay is typically simplified to the round-trip time, or RTT: the time between sending a packet and
receiving a response.
Different delay scenarios have implications for protocols: if a network is bandwidth-limited then protocols are easier to design.
Extra RTTs do not cost much, so we can build in a considerable amount of back-and-forth exchange. However, if a network is
delay-limited, the protocol designer must focus on minimizing extra RTTs. As an extreme case, consider wireless transmission to
the moon (2.6 sec RTT), or to Jupiter (1 hour RTT).
At my home I formerly had satellite Internet service, which had a roundtrip propagation delay of ~600 ms. This is remarkably high
when compared to purely terrestrial links.
When dealing with reasonably high-bandwidth “large-scale” networks (eg the Internet), to good approximation most of the non-
queuing delay is propagation, and so bandwidth and total delay are effectively independent. Only when propagation delay is small
are the two interrelated. Because propagation delay dominates at this scale, we can often make simplifications when diagramming.
In the illustration below, A sends a data packet to B and receives a small ACK in return. In (a), we show the data packet traversing
several switches; in (b) we show the data packet as if it were sent along one long unswitched link, and in (c) we introduce the

idealization that bandwidth delay (and thus the width of the packet line) no longer matters. (Most later ladder diagrams in this book
are of this type.)
[Three ladder diagrams of long-distance packet transmission A→B with a returning ACK B→A: (a) through switches, (b) over a long cable, (c) idealized, with the bandwidth delay, and thus the width of the packet line, no longer shown.]

CHAPTER OVERVIEW

6: Abstract Sliding Windows


In this chapter we take a general look at how to build reliable data-transport layers on top of unreliable lower layers. This is
achieved through a retransmit-on-timeout policy; that is, if a packet is transmitted and there is no acknowledgment received
during the timeout interval then the packet is resent. As a class, protocols where one side implements retransmit-on-timeout are
known as ARQ protocols, for Automatic Repeat reQuest.
In addition to reliability, we also want to keep as many packets in transit as the network can support. The strategy used to achieve
this is known as sliding windows. It turns out that the sliding-windows algorithm is also the key to managing congestion; we return
to this in 13 TCP Reno and Congestion Management.
The End-to-End principle, 12.1 The End-to-End Principle, suggests that trying to achieve a reliable transport layer by building
reliability into a lower layer is a misguided approach; that is, implementing reliability at the endpoints of a connection – as is
described here – is in fact the correct mechanism.
6.1: Building Reliable Transport - Stop-and-Wait
6.2: Sliding Windows
6.3: Linear Bottlenecks
6.4: Epilog and Exercises

6.1: Building Reliable Transport - Stop-and-Wait
Retransmit-on-timeout generally requires sequence numbering for the packets, though if a network path is guaranteed not to reorder
packets then it is safe to allow the sequence numbers to wrap around surprisingly quickly (for stop-and-wait, a single-bit sequence
number will work; see exercise 8.5). However, as the no-reordering hypothesis does not apply to the Internet at large, we will
assume conventional numbering. Data[N] will be the Nth data packet, acknowledged by ACK[N].
In the stop-and-wait version of retransmit-on-timeout, the sender sends only one outstanding packet at a time. If there is no
response, the packet may be retransmitted, but the sender does not send Data[N+1] until it has received ACK[N]. Of course, the
receiving side will not send ACK[N] until it has received Data[N]; each side has only one packet in play at a time. In the absence
of packet loss, this leads to the following:

[Ladder diagram “Stop and Wait”: the sender sends Data[1]; the receiver replies with ACK[1]; then Data[2]/ACK[2], Data[3]/ACK[3] and Data[4]/ACK[4] follow, strictly alternating.]

6.1.1 Packet Loss


Lost packets, however, are a reality. The left half of the diagram below illustrates a lost Data packet, where the sender is the host
sending Data and the Receiver is the host sending ACKs. The receiver is not aware of the loss; it sees Data[N] as simply slow to
arrive.
[Two ladder diagrams. Left, “Lost Data”: Data[N] is lost in transit; after a timeout the sender retransmits Data[N], which is then ACKed. Right, “Lost ACK”: Data[N] arrives but ACK[N] is lost; after a timeout the sender retransmits Data[N], and the receiver, seeing a duplicate, retransmits ACK[N].]

The right half of the diagram, by comparison, illustrates the case of a lost ACK. The receiver has received a duplicate Data[N]. We
have assumed here that the receiver has implemented a retransmit-on-duplicate strategy, and so its response upon receipt of the
duplicate Data[N] is to retransmit ACK[N].

As a final example, note that it is possible for ACK[N] to have been delayed (or, similarly, for the first Data[N] to have been
delayed) longer than the timeout interval. Not every packet that times out is actually lost!
[Ladder diagram “Late ACK”: the first ACK[N] is delayed past the timeout; the sender retransmits Data[N], eventually receives the first ACK[N] and sends Data[N+1], and then, while expecting ACK[N+1], receives the second, late ACK[N].]

In this case we see that, after sending Data[N], receiving a delayed ACK[N] (rather than the expected ACK[N+1]) must be
considered a normal event.
In principle, either side can implement retransmit-on-timeout if nothing is received. Either side can also implement retransmit-on-
duplicate; this was done by the receiver in the second example above but not by the sender in the third example (the sender
received a second ACK[N] but did not retransmit Data[N+1]).
At least one side must implement retransmit-on-timeout; otherwise a lost packet leads to deadlock as the sender and the receiver
both wait forever. The other side must implement at least one of retransmit-on-duplicate or retransmit-on-timeout; usually the
former alone. If both sides implement retransmit-on-timeout with different timeout values, generally the protocol will still work.
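As an illustration, here is a minimal sketch of the sender’s side of stop-and-wait with retransmit-on-timeout, using UDP as a stand-in unreliable lower layer. The port number and the textual packet format are invented, a matching receiver is assumed, and sequence-number wraparound is ignored.

    import socket

    def send_reliably(packets, peer=("localhost", 5432), timeout=1.0):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.settimeout(timeout)
        for N, data in enumerate(packets, start=1):
            msg = f"{N}:".encode() + data
            acked = False
            while not acked:
                sock.sendto(msg, peer)             # (re)transmit Data[N]
                try:
                    while True:
                        reply, _ = sock.recvfrom(1500)
                        if reply == f"ACK:{N}".encode():
                            acked = True           # got ACK[N]; on to Data[N+1]
                            break
                        # otherwise a late or duplicate ACK: ignore it
                except socket.timeout:
                    pass                           # timeout: loop and retransmit

    send_reliably([b"hello", b"world"])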

6.1.2 Sorcerer’s Apprentice Bug


Sorcerer’s Apprentice
The Sorcerer’s Apprentice bug is named for the legend in which the apprentice casts a spell on a broom to carry water, one
bucket at a time. When the basin is full, the apprentice chops the broom in half, only to find both halves carrying water. An
animated version of this appears in Disney’s _Fantasia_, set to the music of Paul Dukas
[en.Wikipedia.org/wiki/The_So...ntice_(Dukas)]. I used to post a YouTube link here to the video, but Disney has blocked it. It
may still be findable online, though. Mickey Mouse chops the broom about five and a half minutes in from the start of the
music.

A strange thing happens if one side implements retransmit-on-timeout but both sides implement retransmit-on-duplicate, as can
happen if the implementer takes the naive view that retransmitting on duplicates is “safer”; the moral here is that too much
redundancy can be the Wrong Thing. Let us imagine that an implementation uses this strategy (with the sender retransmitting on
timeouts), and that the initial ACK[3] is delayed until after Data[3] is retransmitted on timeout. In the following diagram, the only
packet retransmitted due to timeout is the second Data[3]; all the other duplications are due to the bilateral retransmit-on-duplicate
strategy.

[Ladder diagram of the Sorcerer’s Apprentice bug. The first ACK[3] is delayed past the timeout, so the sender retransmits Data[3]; from then on every packet is sent twice: each of the two Data[3]s elicits its own ACK[3], each ACK[3] elicits a Data[4], and so on, with the first transmissions and the second transmissions proceeding in parallel.]

All packets are sent twice from Data[3] on. The transfer completes normally, but takes double the normal bandwidth. The usual fix
is to have one side (usually the sender) retransmit on timeout only. TCP does this; see 12.19 TCP Timeout and Retransmission. See
also exercise 1.5.

6.1.3 Flow Control


Stop-and-wait also provides a simple form of flow control to prevent data from arriving at the receiver faster than it can be
handled. Assuming the time needed to process a received packet is less than one RTT, the stop-and-wait mechanism will prevent
data from arriving too fast. If the processing time is slightly larger than RTT, all the receiver has to do is to wait to send ACK[N]
until Data[N] has not only arrived but also been processed, and the receiver is ready for Data[N+1].
For modest per-packet processing delays this works quite well, but if the processing delays are long it introduces a new problem:
Data[N] may time out and be retransmitted even though it has successfully been received; the receiver cannot send an ACK until it
has finished processing. One approach is to have two kinds of ACKs: ACKWAIT[N] meaning that Data[N] has arrived but the
receiver is not yet ready for Data[N+1], and ACKGO[N] meaning that the sender may now send Data[N+1]. The receiver will send
ACKWAIT[N] when Data[N] arrives, and ACKGO[N] when it is done processing it.
Presumably we want the sender not to time out and retransmit Data[N] after ACKWAIT[N] is received, as a retransmission would be
unnecessary. This introduces a new problem: if the subsequent ACKGO[N] is lost and neither side times out, the connection is
deadlocked. The sender is waiting for ACKGO[N], which is lost, and the receiver is waiting for Data[N+1], which the sender will
not send until the lost ACKGO[N] arrives. One solution is for the receiver to switch to a timeout model, perhaps until Data[N+1] is
received.
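
Here is a sketch of how the sender side of this two-ACK scheme might look in Python. The message names ACKWAIT and ACKGO follow the discussion above, but the callback interface and the timeout value are illustrative assumptions, not part of any defined protocol; per the discussion, the receiver is presumed to retransmit a lost ACKGO[N] on its own timeout.

    # Hypothetical sender loop for the ACKWAIT/ACKGO scheme sketched above.
    # send(N) transmits Data[N]; recv(timeout) returns an (acktype, number)
    # pair, or None on timeout.
    def send_one(N, send, recv, timeout=2.0):
        send(N)                      # transmit Data[N]
        arrived = False              # True once ACKWAIT[N] is seen
        while True:
            ack = recv(timeout)
            if ack == ("ACKGO", N):
                return               # receiver is ready for Data[N+1]
            if ack == ("ACKWAIT", N):
                arrived = True       # Data[N] got there; stop retransmitting
            if ack is None and not arrived:
                send(N)              # retransmit-on-timeout, pre-ACKWAIT only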
TCP has a fix to the flow-control problem involving sender-side polling; see 12.17 TCP Flow Control.

6.2: Sliding Windows
Stop-and-wait is reliable but it is not very efficient (unless the path involves neither intermediate switches nor significant
propagation delay; that is, the path involves a single LAN link). Most links along a multi-hop stop-and-wait path will be idle most
of the time. During a file transfer, ideally we would like zero idleness (at least along the slowest link; see 6.3 Linear Bottlenecks).
We can improve overall throughput by allowing the sender to continue to transmit, sending Data[N+1] (and beyond) without
waiting for ACK[N]. We cannot, however, allow the sender to get too far ahead of the returning ACKs. Packets sent too fast, as we
shall see, simply end up waiting in queues, or, worse, dropped from queues. If the links of the network have sufficient bandwidth,
packets may also be dropped at the receiving end.
Now that, say, Data[3] and Data[4] may be simultaneously in transit, we have to revisit what ACK[4] means: does it mean that the
receiver has received only Data[4], or does it mean both Data[3] and Data[4] have arrived? We will assume the latter, that is, ACKs
are cumulative: ACK[N] cannot be sent until Data[K] has arrived for all K≤N. With this understanding, if ACK[3] is lost then a
later-arriving ACK[4] makes up for it; without cumulative ACKs, if ACK[3] is lost the only recovery is to retransmit Data[3].
The sender picks a window size, winsize. The basic idea of sliding windows is that the sender is allowed to send this many packets
before waiting for an ACK. More specifically, the sender keeps a state variable last_ACKed, representing the last packet for which
it has received an ACK from the other end; if data packets are numbered starting from 1 then initially last_ACKed = 0.

Window Size
In this chapter we will assume winsize does not change. TCP, however, varies winsize up and down with the goal of making it
as large as possible without introducing congestion; we will return to this in 13 TCP Reno and Congestion Management.

At any instant, the sender may send packets numbered last_ACKed + 1 through last_ACKed + winsize; this packet range is known
as the window. Generally, if the first link in the path is not the slowest one, the sender will most of the time have sent all these.
If ACK[N] arrives with N > last_ACKed (typically N = last_ACKed+1), then the window slides forward; we set last_ACKed = N.
This also advances the upper edge of the window, and frees the sender to send more packets. For example, with winsize = 4 and
last_ACKed = 10, the window is [11,12,13,14]. If ACK[11] arrives, the window slides forward to [12,13,14,15], freeing the sender
to send Data[15]. If instead ACK[13] arrives, then the window slides forward to [14,15,16,17] (recall that ACKs are cumulative),
and three more packets become eligible to be sent. If there is no packet reordering and no packet losses (and every packet is
ACKed individually) then the window will slide forward in units of one packet at a time; the next arriving ACK will always be
ACK[last_ACKed+1].
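
The window arithmetic just described is easy to restate in code. Below is a minimal Python sketch of the sender-side bookkeeping, tracking packet numbers only (no actual transmission); it reproduces the winsize = 4 example just given.

    # Sender-side window bookkeeping only.
    winsize = 4
    last_ACKed = 10                        # current window is [11,12,13,14]

    def window():
        return list(range(last_ACKed + 1, last_ACKed + winsize + 1))

    def ack_arrives(N):
        """Cumulative ACK[N]: slide forward, return newly sendable packets."""
        global last_ACKed
        if N <= last_ACKed:
            return []                      # old or duplicate ACK: ignore
        newly_eligible = list(range(last_ACKed + winsize + 1, N + winsize + 1))
        last_ACKed = N
        return newly_eligible

    print(window())                        # [11, 12, 13, 14]
    print(ack_arrives(13))                 # [15, 16, 17]: three more may be sent
    print(window())                        # [14, 15, 16, 17]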
Note that the rate at which ACKs are returned will always be exactly equal to the rate at which the slowest link is delivering
packets. That is, if the slowest link (the “bottleneck” link) is delivering a packet every 50 ms, then the receiver will receive those
packets every 50 ms and the ACKs will return at a rate of one every 50 ms. Thus, new packets will be sent at an average rate
exactly matching the delivery rate; this is the sliding-windows self-clocking property. Self-clocking has the effect of reducing
congestion by automatically reducing the sender’s rate whenever the available fraction of the bottleneck bandwidth is reduced.
Below is a video of sliding windows in action, with winsize = 5. (A link is here [vimeo.com/150452468], if the embedded video
does not display properly, which will certainly be the case with non-html formats.) The nodes are labeled 0, 1 and 2. The second
link, 1–2, has a capacity of five packets in transit either way, so one “flight” (windowful) of five packets can exactly fill this link.
The 0–1 link has a capacity of one packet in transit either way. The video was prepared using the network animator, “nam”,
described further in 16 Network Simulations: ns-2.

sliding windows, with fixed window size of 5.
The first flight of five data packets leaves node 0 just after T=0, and leaves node 1 at around T=1 (in video time). Subsequent
flights are spaced about seven seconds apart. The tiny packets moving leftwards from node 2 to node 0 represent ACKs; at the very
beginning of the video one can see five returning ACKs from the previous windowful. At any moment (except those instants where
packets have just been received) there are in principle five packets in transit, either being transmitted on a link as data, or being
transmitted as an ACK, or sitting in a queue (this last does not happen in this video). Due to occasional video artifacts, in some
frames not all the ACK packets are visible.

6.2.1 Bandwidth × Delay


As indicated previously (5.1 Packet Delay), the bandwidth × RTT product represents the amount of data that can be sent before the
first response is received. It plays a large role in the analysis of transport protocols. In the literature the bandwidth×delay product is
often abbreviated BDP.
The bandwidth × RTT product is generally the optimum value for the window size. There is, however, one catch: if a sender
chooses winsize larger than this, then the RTT simply grows – due to queuing delays – to the point that bandwidth × RTT matches
the chosen winsize. That is, a connection’s own traffic can inflate RTTactual to well above RTTnoLoad; see 6.3.1.3 Case 3: winsize =
6 below for an example. For this reason, a sender is often more interested in bandwidth × RTTnoLoad, or, at the very least, the RTT
before the sender’s own packets had begun contributing to congestion.
We will sometimes refer to the bandwidth × RTTnoLoad product as the transit capacity of the route. As will become clearer below,
a winsize smaller than this means underutilization of the network, while a larger winsize means each packet spends time waiting in
a queue somewhere.
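
As a concrete illustration (the numbers here are chosen only for the example), the transit capacity is a one-line computation:

    # Transit capacity = bandwidth × RTT_noLoad; illustrative numbers only.
    bandwidth = 1000                # packets/sec on the bottleneck link
    rtt_noload = 0.100              # seconds
    print(bandwidth * rtt_noload)   # 100.0 packets: a smaller winsize leaves
                                    # the path underutilized; a larger one
                                    # just lengthens a queue somewhere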
Below are simplified diagrams for sliding windows with window sizes of 1, 4 and 6, each with a path bandwidth of 6 packets/RTT
(so bandwidth × RTT = 6 packets). The diagram shows the initial packets sent as a burst; these then would be spread out as they
pass through the bottleneck link so that, after the first burst, packet spacing is uniform. (Real sliding-windows protocols such as
TCP generally attempt to avoid such initial bursts.)

[Three ladder diagrams: sliding windows with winsize = 1, 4 and 6, on a path of bandwidth 6 packets/RTT. Each arriving ACK[N] permits transmission of Data[N+winsize]; with winsize = 1 the path is mostly idle, while with winsize = 6 it is kept continuously busy. Caption: Sliding Windows, bandwidth 6 packets/RTT.]

With winsize=1 we send 1 packet per RTT; with winsize=4 we always average 4 packets per RTT. To put this another way, the
three window sizes lead to bottleneck-link utilizations of 1/6, 4/6 and 6/6 = 100%, respectively.
While it is tempting to envision setting winsize to bandwidth × RTT, in practice this can be complicated; neither bandwidth nor
RTT is constant. Available bandwidth can fluctuate in the presence of competing traffic. As for RTT, if a sender sets winsize too
large then the RTT is simply inflated to the point that bandwidth × RTT matches winsize; that is, a connection’s own traffic can
inflate RTTactual to well above RTTnoLoad. This happens even in the absence of competing traffic.

6.2.2 The Receiver Side


Perhaps surprisingly, sliding windows can work pretty well with the receiver assuming that winsize=1, even if the sender is in fact
using a much larger value. Each of the receivers in the diagrams above receives Data[N] and responds with ACK[N]; the only
difference with the larger sender winsize is that the Data[N] arrive faster.
If we are using the sliding-windows algorithm over single links, we may assume packets are never reordered, and a receiver
winsize of 1 works quite well. Once switches are introduced, however, life becomes more complicated (though some links may do
link-level sliding-windows for per-link throughput optimization).
If packet reordering is a possibility, it is common for the receiver to use the same winsize as the sender. This means that the
receiver must be prepared to buffer a full window full of packets. If the window is [11,12,13,14,15,16], for example, and Data[11]
is delayed, then the receiver may have to buffer Data[12] through Data[16].
Like the sender, the receiver will also maintain the state variable last_ACKed, though it will not be completely synchronized with
the sender’s version. At any instant, the receiver is willing to accept Data[last_ACKed+1] through Data[last_ACKed+winsize]. For
any but the first of these, the receiver must buffer the arriving packet. If Data[last_ACKed+1] arrives, then the receiver should
consult its buffers and send back the largest cumulative ACK it can for the data received; for example, if the window is [11-16] and
Data[12], Data[13] and Data[15] are in the buffers, then on arrival of Data[11] the correct response is ACK[13]. Data[11] fills the
“gap”, and the receiver has now received everything up through Data[13]. The new receive window is [14-19], and as soon as the
ACK[13] reaches the sender that will be the new send window as well.
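
Here is a small Python sketch of this receiver-side bookkeeping, reproducing the example just given. The early-arrival buffer is represented simply as a set of packet numbers; payloads and window-bounds checks are omitted.

    # Receiver-side cumulative-ACK bookkeeping.
    last_ACKed = 10
    early_arrivals = {12, 13, 15}          # arrived out of order, buffered

    def data_arrives(N):
        """Record Data[N]; return the cumulative ACK number to send."""
        global last_ACKed
        early_arrivals.add(N)
        while last_ACKed + 1 in early_arrivals:   # deliver any filled-in gap
            early_arrivals.remove(last_ACKed + 1)
            last_ACKed += 1                       # Data[last_ACKed] delivered
        return last_ACKed

    print(data_arrives(11))     # 13: Data[11] fills the gap, so ACK[13] is sent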

6.2.3 Loss Recovery Under Sliding Windows
Suppose winsize = 4 and packet 5 is lost. It is quite possible that packets 6, 7, and 8 may have been received. However, the only
(cumulative) acknowledgment that can be sent back is ACK[4]; the sender does not know how much of the windowful made it
through. Because of the possibility that only Data[5] (or more generally Data[last_ACKed+1]) is lost, and because losses are
usually associated with congestion, when we most especially do not wish to overburden the network, the sender will usually
retransmit only the first lost packet, eg packet 5. If packets 6, 7, and 8 were also lost, then after retransmission of Data[5] the sender
will receive ACK[5], and can assume that Data[6] now needs to be sent. However, if packets 6-8 did make it through, then after
retransmission the sender will receive back ACK[8], and so will know 6-8 do not need retransmission and that the next packet to
send is Data[9].
Normally Data[6] through Data[8] would time out shortly after Data[5] times out. After the first timeout, however, sliding windows
protocols generally suppress further timeout/retransmission responses until recovery is more-or-less complete.
Once a full timeout has occurred, usually the sliding-windows process itself has ground to a halt, in that there are usually no
packets remaining in flight. This is sometimes described as pipeline drain. After recovery, the sliding-windows process will have
to start up again. Most implementations of TCP, as we shall see later, implement a mechanism (“fast recovery”) for early detection
of packet loss, before the pipeline has fully drained.

6.3: Linear Bottlenecks
Consider the simple network path shown below, with bandwidths shown in packets/ms. The minimum bandwidth, or path
bandwidth, is 3 packets/ms.

A ──10 pkts/ms── R1 ──6 pkts/ms── R2 ──3 pkts/ms── R3 ──3 pkts/ms── R4 ──8 pkts/ms── B

The slow links are R2–R3 and R3–R4. We will refer to the slowest link as the bottleneck link; if there are (as here) ties for the
slowest link, then the first such link is the bottleneck. The bottleneck link is where the queue will form. If traffic is sent at a rate of
4 packets/ms from A to B, it will pile up in an ever-increasing queue at R2. Traffic will not pile up at R3; it arrives at R3 at the
same rate at which it departs.
Furthermore, if sliding windows is used (rather than a fixed-rate sender), traffic will eventually not queue up at any router other
than R2: data cannot reach B faster than the 3 packets/ms rate, and so B will not return ACKs faster than this rate, and so A will
eventually not send data faster than this rate. At this 3 packets/ms rate, traffic will not pile up at R1 (or R3 or R4).
There is a significant advantage in speaking in terms of winsize rather than transmission rate. If A sends to B at any rate greater
than 3 packets/ms, then the situation is unstable as the bottleneck queue grows without bound and there is no convergence to a
steady state. There is no analogous instability, however, if A uses sliding windows, even if the winsize chosen is quite large
(although a large-enough winsize will overflow the bottleneck queue). If a sender specifies a sending window size rather than a
rate, then the network will converge to a steady state in relatively short order; if a queue develops it will be steadily replenished at
the same rate that packets depart, and so will be of fixed size.

6.3.1 Simple fixed-window-size analysis


We will analyze the effect of window size on overall throughput and on RTT. Consider the following network path, with
bandwidths now labeled in packets/second.

A ──infinitely fast── R1 ──1 pkt/sec── R2 ──1 pkt/sec── R3 ──1 pkt/sec── R4 ──1 pkt/sec── B

We will assume that in the backward B⟶A direction, all connections are infinitely fast, meaning zero delay; this is often a good
approximation because ACK packets are what travel in that direction and they are negligibly small. In the A⟶B direction, we will
assume that the A⟶R1 link is infinitely fast, but the other four each have a bandwidth of 1 packet/second (and no propagation-
delay component). This makes the R1⟶R2 link the bottleneck link; any queue will now form at R1. The “path bandwidth” is 1
packet/second, and the RTT is 4 seconds.
As a roughly equivalent alternative example, we might use the following:

C ──infinitely fast── S1 ──1 pkt/sec── S2 ──1 pkt/sec── D

with the following assumptions: the C–S1 link is infinitely fast (zero delay), S1⟶S2 and S2⟶D each take 1.0 sec bandwidth
delay (so two packets take 2.0 sec, per link, etc), and ACKs also have a 1.0 sec bandwidth delay in the reverse direction.
In both scenarios, if we send one packet, it takes 4.0 seconds for the ACK to return, on an idle network. This means that the no-load
delay, RTTnoLoad, is 4.0 seconds.
(These models will change significantly if we replace the 1 packet/sec bandwidth delay with a 1-second propagation delay; in the
former case, 2 packets take 2 seconds, while in the latter, 2 packets take 1 second. See exercise 4.0.)
We assume a single connection is made; ie there is no competition. Bandwidth × delay here is 4 packets (1 packet/sec × 4 sec RTT).

6.3.1.1 Case 1: winsize = 2
In this case winsize < bandwidth×delay (where delay = RTT). The table below shows what is sent by A and each of R1-R4 for each
second. Every packet is acknowledged 4 seconds after it is sent; that is, RTTactual = 4 sec, equal to RTTnoLoad; this will remain true
as the winsize changes by small amounts (eg to 1 or 3). Throughput is proportional to winsize: when winsize = 2, throughput is 2
packets in 4 seconds, or 2/4 = 1/2 packet/sec. During each second, two of the routers R1-R4 are idle. The overall path will have less
than 100% utilization.

T | A sends | R1 queues | R1 sends | R2 sends | R3 sends | R4 sends | B ACKs
--|---------|-----------|----------|----------|----------|----------|-------
0 | 1,2     | 2         | 1        |          |          |          |
1 |         |           | 2        | 1        |          |          |
2 |         |           |          | 2        | 1        |          |
3 |         |           |          |          | 2        | 1        |
4 | 3       |           | 3        |          |          | 2        | 1
5 | 4       |           | 4        | 3        |          |          | 2
6 |         |           |          | 4        | 3        |          |
7 |         |           |          |          | 4        | 3        |
8 | 5       |           | 5        |          |          | 4        | 3
9 | 6       |           | 6        | 5        |          |          | 4

Note the brief pile-up at R1 (the bottleneck link!) on startup. However, in the steady state, there is no queuing. Real sliding-
windows protocols generally have some way of minimizing this “initial pileup”.

6.3.1.2 Case 2: winsize = 4


When winsize=4, at each second all four slow links are busy. There is again an initial burst leading to a brief surge in the queue;
RTTactual for Data[4] is 7 seconds. However, RTTactual for every subsequent packet is 4 seconds, and there are no queuing delays
(and nothing in the queue) after T=2. The steady-state connection throughput is 4 packets in 4 seconds, ie 1 packet/second. Note
that overall path throughput now equals the bottleneck-link bandwidth, so this is the best possible throughput.

T | A sends | R1 queues | R1 sends | R2 sends | R3 sends | R4 sends | B ACKs
--|---------|-----------|----------|----------|----------|----------|-------
0 | 1,2,3,4 | 2,3,4     | 1        |          |          |          |
1 |         | 3,4       | 2        | 1        |          |          |
2 |         | 4         | 3        | 2        | 1        |          |
3 |         |           | 4        | 3        | 2        | 1        |
4 | 5       |           | 5        | 4        | 3        | 2        | 1
5 | 6       |           | 6        | 5        | 4        | 3        | 2
6 | 7       |           | 7        | 6        | 5        | 4        | 3
7 | 8       |           | 8        | 7        | 6        | 5        | 4
8 | 9       |           | 9        | 8        | 7        | 6        | 5

At T=4, R1 has just finished sending Data[4] as Data[5] arrives from A; R1 can begin sending packet 5 immediately. No queue will
develop.
Case 2 is the “congestion knee” of Chiu and Jain [CJ89], defined here in 1.7 Congestion.

6.3.1.3 Case 3: winsize = 6
T  | A sends     | R1 queues | R1 sends | R2 sends | R3 sends | R4 sends | B ACKs
---|-------------|-----------|----------|----------|----------|----------|-------
0  | 1,2,3,4,5,6 | 2,3,4,5,6 | 1        |          |          |          |
1  |             | 3,4,5,6   | 2        | 1        |          |          |
2  |             | 4,5,6     | 3        | 2        | 1        |          |
3  |             | 5,6       | 4        | 3        | 2        | 1        |
4  | 7           | 6,7       | 5        | 4        | 3        | 2        | 1
5  | 8           | 7,8       | 6        | 5        | 4        | 3        | 2
6  | 9           | 8,9       | 7        | 6        | 5        | 4        | 3
7  | 10          | 9,10      | 8        | 7        | 6        | 5        | 4
8  | 11          | 10,11     | 9        | 8        | 7        | 6        | 5
9  | 12          | 11,12     | 10       | 9        | 8        | 7        | 6
10 | 13          | 12,13     | 11       | 10       | 9        | 8        | 7

Note that packet 7 is sent at T=4 and the acknowledgment is received at T=10, for an RTT of 6.0 seconds. All later packets have
the same RTTactual. That is, the RTT has risen from RTTnoLoad = 4 seconds to 6 seconds. Note that we continue to send one
windowful each RTT; that is, the throughput is still winsize/RTT, but RTT is now 6 seconds.
One might initially conjecture that if winsize is greater than the bandwidth×RTTnoLoad product, then the entire window cannot be in
transit at one time. In fact this is not the case; the sender does usually have the entire window sent and in transit, but RTT has been
inflated so it appears to the sender that winsize equals the bandwidth×RTT product.
In general, whenever winsize > bandwidth×RTTnoLoad, what happens is that the extra packets pile up at a router somewhere along
the path (specifically, at the router in front of the bottleneck link). RTTactual is inflated by queuing delay to winsize/bandwidth,
where bandwidth is that of the bottleneck link; this means winsize = bandwidth×RTTactual. Total throughput is equal to that
bandwidth. Of the 6 seconds of RTTactual in the example here, a packet spends 4 of those seconds being transmitted on one link or
another because RTTnoLoad=4. The other two seconds, therefore, must be spent in a queue; there is no other place for packets to wait.
Looking at the table, we see that each second there are indeed two packets in the queue at R1.
If the bottleneck link is the very first link, packets may begin returning before the sender has sent the entire windowful. In this case
we may argue that the full windowful has at least been queued by the sender, and thus has in this sense been “sent”. Suppose the
network, for example, is

A ──1 pkt/sec── R1 ──1 pkt/sec── R2 ──1 pkt/sec── R3 ──1 pkt/sec── B

where, as before, each link transports 1 packet/sec from A to B and is infinitely fast in the reverse direction. Then, if A sets winsize
= 6, a queue of 2 packets will form at A.

6.3.2 RTT Calculations


We can make some quantitative observations of sliding windows behavior, and about queue utilization. First, we note that
RTTnoLoad is the physical “travel” time (subject to the limitations addressed in 5.2 Packet Delay Variability); any time in excess of
RTTnoLoad is spent waiting in a queue somewhere. Therefore, the following holds regardless of competing traffic, and even for
individual packets:

1. queue_time = RTTactual − RTTnoLoad

When the bottleneck link is saturated, that is, is always busy, the number of packets actually in transit (not queued) somewhere
along the path will always be bandwidth × RTTnoLoad.
Second, we always send one windowful per actual RTT, assuming no losses and each packet is individually acknowledged. This is
perhaps best understood by consulting the diagrams above, but here is a simple non-visual argument: if we send Data[N] at time
TD, and ACK[N] arrives at time TA, then RTT = TA − TD, by definition. At time TA the sender is allowed to send Data[N+winsize],
so during the RTT interval TD ≤ T < TA the sender must have sent Data[N] through Data[N+winsize-1]; that is, winsize many
packets in time RTT. Therefore (whether or not there is competing traffic) we always have

2. throughput = winsize/RTTactual
where “throughput” is the rate at which the connection is sending packets.
This relationship holds even if winsize or the bottleneck bandwidth changes suddenly, though in that case RTTactual might change
from one packet to the next, and the throughput here must be seen as a measurement averaged over the RTT of one specific packet.
If the sender doubles its winsize, those extra packets will immediately end up in a queue somewhere (perhaps a queue at the sender
itself, though this is why in examples it is often clearer if the first link has infinite bandwidth so as to prevent this). If the bottleneck
bandwidth is cut in half without changing winsize, eventually the RTT must rise due to queuing. See exercise 12.0.
In the sliding windows steady state, where throughput and RTTactual are reasonably constant, the average number of packets in the
queue is just throughput×queue_time (where throughput is measured in packets/sec):

3. queue_usage = throughput × (RTTactual − RTTnoLoad)


= winsize × (1 − RTTnoLoad/RTTactual)
To give a little more detail making the averaging perhaps clearer, each packet spends time (RTTactual − RTTnoLoad) in the queue,
from equation 1 above. The total time spent by a windowful of packets is winsize × (RTTactual − RTTnoLoad), and dividing this by
RTTactual thus gives the average number of packets in the queue over the RTT interval in question.
In the presence of competing traffic, the throughput referred to above is simply the connection’s current share of the total
bandwidth. It is the value we get if we measure the rate of returning ACKs. If there is no competing traffic and winsize is below the
congestion knee – winsize < bandwidth × RTTnoLoad – then winsize is the limiting factor in throughput. Finally, if there is no
competition and winsize ≥ bandwidth × RTTnoLoad then the connection is using 100% of the capacity of the bottleneck link and
throughput is equal to the bottleneck-link physical bandwidth. To put this another way,

4. RTTactual = winsize/bottleneck_bandwidth

queue_usage = winsize − bandwidth × RTTnoLoad


Dividing the first equation by RTTnoLoad, and noting that bandwidth × RTTnoLoad = winsize - queue_usage = transit_capacity, we
get

5. RTTactual/RTTnoLoad = winsize/transit_capacity
= (transit_capacity + queue_usage) / transit_capacity
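
These relationships are simple enough to encode directly. The following Python sketch combines equations 2-4 for the single-connection (no-competition) case, and reproduces the numbers of 6.3.1.3 Case 3 (winsize 6, bottleneck bandwidth 1 packet/sec, RTTnoLoad 4 sec).

    # Steady-state sliding-windows relations, single connection.
    def steady_state(winsize, bandwidth, rtt_noload):
        transit_capacity = bandwidth * rtt_noload
        if winsize <= transit_capacity:
            rtt_actual = rtt_noload                 # below the knee: no queuing
        else:
            rtt_actual = winsize / bandwidth        # equation 4
        throughput = winsize / rtt_actual           # equation 2
        queue_usage = throughput * (rtt_actual - rtt_noload)   # equation 3
        return rtt_actual, throughput, queue_usage

    print(steady_state(6, 1, 4))   # (6.0, 1.0, 2.0): RTT 6 sec, two packets queued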

Regardless of the value of winsize, in the steady state the sender never sends faster than the bottleneck bandwidth. This is because
the bottleneck bandwidth determines the rate of packets arriving at the far end, which in turn determines the rate of ACKs arriving
back at the sender, which in turn determines the continued sending rate. This illustrates the self-clocking nature of sliding windows.
We will return in 14 Dynamics of TCP to the issue of bandwidth in the presence of competing traffic. For now, suppose a sliding-
windows sender has winsize > bandwidth × RTTnoLoad, leading as above to a fixed amount of queue usage, and no competition.
Then another connection starts up and competes for the bottleneck link. The first connection’s effective bandwidth will thus
decrease. This means that bandwidth × RTTnoLoad will decrease, and hence the connection’s queue usage will increase.

6.3.3 Graphs at the Congestion Knee
Consider the following graphs of winsize versus
1. throughput
2. delay
3. queue utilization
[Three graphs of winsize versus throughput, delay, and queue utilization. In each, a vertical dashed line marks winsize = bandwidth × no-load delay: throughput rises linearly and then levels off; delay is flat and then rises linearly; queue utilization is zero and then rises linearly.]

The critical winsize value is equal to bandwidth × RTTnoLoad; this is known as the congestion knee. For winsize below this, we
have:
throughput is proportional to winsize
delay is constant
queue utilization in the steady state is zero
For winsize larger than the knee, we have
throughput is constant (equal to the bottleneck bandwidth)
delay increases linearly with winsize
queue utilization increases linearly with winsize
Ideally, winsize will be at the critical knee. However, the exact value varies with time: available bandwidth changes due to the
starting and stopping of competing traffic, and RTT changes due to queuing. Standard TCP makes an effort to stay well above the
knee much of the time, presumably on the theory that maximizing throughput is more important than minimizing queue use.
The power of a connection is defined to be throughput/RTT. For sliding windows below the knee, RTT is constant and power is
proportional to the window size. For sliding windows above the knee, throughput is constant and delay is proportional to winsize;
power is thus proportional to 1/winsize. Here is a graph, akin to those above, of winsize versus power:

[Graph of power versus winsize: power rises with winsize up to the knee, then falls off in proportion to 1/winsize.]
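
The two regimes can be checked numerically; the quick computation below uses the Case 3 network of 6.3.1 (bottleneck bandwidth 1 packet/sec, RTTnoLoad = 4 sec), for which the knee is at winsize = 4, together with the RTT relations derived above.

    # Power = throughput/RTT as a function of winsize.
    bandwidth, rtt_noload = 1.0, 4.0
    for w in range(1, 9):
        rtt = max(rtt_noload, w / bandwidth)   # RTT inflates above the knee
        throughput = w / rtt
        print(w, round(throughput / rtt, 4))
    # power climbs to 0.25 at winsize = 4 (the knee), then falls as 1/winsize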

6.3.4 Simple Packet-Based Sliding-Windows Implementation


Here is a pseudocode outline of the receiver side of a sliding-windows implementation, ignoring lost packets and timeouts. We
abbreviate as follows:

W: winsize
LA: last_ACKed
Thus, the next packet expected is LA+1 and the window is [LA+1, …, LA+W]. We have a data structure EarlyArrivals in which we
can place packets that cannot yet be delivered to the receiving application.
Upon arrival of Data[M]:

if M≤LA or M>LA+W, ignore the packet
if M>LA+1, put the packet into EarlyArrivals
if M==LA+1:
    deliver the packet (that is, Data[LA+1]) to the application
    LA = LA+1 (slide window forward by 1)
    while (Data[LA+1] is in EarlyArrivals) {
        output Data[LA+1]
        LA = LA+1
    }
    send ACK[LA]
A possible implementation of EarlyArrivals is as an array of packet objects, of size W. We always put packet Data[M] into position
M % W.
At any point between packet arrivals, Data[LA+1] is not in EarlyArrivals, but some later packets may be present.
For the sender side, we begin by sending a full windowful of packets Data[1] through Data[W], and setting LA=0. When ACK[M]
arrives, LA<M≤LA+W, the window slides forward from [LA+1…LA+W] to [M+1…M+W], and we are now allowed to send
Data[LA+W+1] through Data[M+W]. The simplest case is M=LA+1.
Upon arrival of ACK[M]:

if M≤LA or M>LA+W, ignore the packet
otherwise:
    set K = LA+W+1, the first packet just above the old window
    set LA = M, just below the bottom of the new window
    for (i=K; i≤LA+W; i++) send Data[i]
Note that new ACKs may arrive while we are in the loop at the last line. We assume here that the sender stolidly sends what it may
send and only after that does it start to process additional arriving ACKs. Some implementations may take a more asynchronous
approach, perhaps with one thread processing arriving ACKs and incrementing LA and another thread sending everything it is
allowed to send.
To add support for timeout and retransmission, each transmitted packet would need to be stored, together with the time it was sent.
Periodically this collection of stored packets must then be scanned, looking for packets for which send_time +
timeout_interval ≤ current_time; those packets get retransmitted. When a packet Data[N] is acknowledged
(perhaps by an ACK[M] for M>N), it can be deleted.
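
For concreteness, here is one way the receiver side might be rendered as runnable Python. This is a sketch only, with transmission stubbed out (delivering a packet and sending an ACK just append to lists); it uses the fixed-size EarlyArrivals array and the M % W indexing described above.

    # Runnable sketch of the receiver-side pseudocode; W and LA as in the text.
    W = 4                                   # winsize
    LA = 0                                  # last_ACKed
    early = [None] * W                      # EarlyArrivals, slot M % W
    delivered, acks_sent = [], []

    def data_arrives(M, payload):
        global LA
        if M <= LA or M > LA + W:           # outside the window: ignore
            return
        if M > LA + 1:                      # early arrival: buffer it
            early[M % W] = payload
            return
        delivered.append(payload)           # M == LA+1: deliver in order
        LA += 1
        while early[(LA + 1) % W] is not None:      # drain filled-in gaps
            delivered.append(early[(LA + 1) % W])
            early[(LA + 1) % W] = None
            LA += 1
        acks_sent.append(LA)                # send cumulative ACK[LA]

    for M in (2, 3, 1, 5, 4):               # Data[1] delayed; [4] and [5] swapped
        data_arrives(M, "Data[%d]" % M)
    print(acks_sent)                        # [3, 5]
    print(delivered)                        # Data[1] through Data[5], in order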

6.4: Epilog and Exercises

This completes our discussion of the sliding-windows algorithm in the abstract setting. We will return to concrete
implementations of this in 11.4.1 TFTP and the Sorcerer (stop-and-wait) and in 12.14 TCP Sliding Windows; the latter is one of the
most important mechanisms on the Internet.

Exercises
Exercises are given fractional (floating point) numbers, to allow for interpolation of new exercises. Exercise 1.5 is distinct, for
example, from exercises 1.0 and 2.0. Exercises marked with a ♢ have solutions or hints at 24.6 Solutions for Sliding Windows.
1.0 Sketch a ladder diagram for stop-and-wait if Data[3] is lost the first time it is sent, assuming no sender timeout (but the sender
retransmits on duplicate), and a receiver timeout of 2 seconds. Continue the diagram to the point where Data[4] is successfully
transmitted. Assume an RTT of 1 second.
1.5 Re-draw the Sorcerer’s Apprentice diagram of 6.1.2 Sorcerer’s Apprentice Bug, assuming the sender now does not retransmit
on duplicates, though the receiver still does. ACK[3] is, as before, delayed until the sender retransmits Data[3].
2.0 Suppose a stop-and-wait receiver has an implementation flaw. When Data[1] arrives, ACK[1] and ACK[2] are sent, separated
by a brief interval; after that, the receiver transmits ACK[N+1] when Data[N] arrives, rather than the correct ACK[N].
(a). Draw a diagram, including at least three RTTs.
(b). What is the average throughput, in data packets per RTT? (For normal stop-and-wait, the average throughput is 1.)
(c). Is there anything the sender can do to detect this receiver behavior before the final packet, assuming the sender must respond to
each ACK as soon as it arrives?
2.5♢ Consider the alternative model of 6.3.1 Simple fixed-window-size analysis:

C ──infinitely fast── S1 ──1 pkt/sec── S2 ──1 pkt/sec── D

(a). Using the formulas of 6.3.2 RTT Calculations, calculate the steady-state queue usage for a window size of 6.
(b). Again for a window size of 6, create a table like those in 6.3.1 Simple fixed-window-size analysis up through T=8 seconds.
3.0 Create a table as in 6.3.1 Simple fixed-window-size analysis for the original A───R1───R2───R3───R4───B network
with winsize = 8. As in the text examples, assume 1 packet/sec bandwidth delay for the R1⟶R2, R2⟶R3, R3⟶R4 and
R4⟶B links. The A–R1 link and all reverse links (from B to A) are infinitely fast. Carry out the table for 10 seconds.
4.0 Create a table as in 6.3.1 Simple fixed-window-size analysis for a network A───R1───R2───B. The A–R1 link is infinitely
fast; the R1–R2 and R2–B each have a 1-second propagation delay, in each direction, and zero bandwidth delay (that is, one
packet takes 1.0 sec to travel from R1 to R2; two packets also take 1.0 sec to travel from R1 to R2). Assume winsize=6. Carry out
the table for 8 seconds. Note that with zero bandwidth delay, multiple packets sent together will remain together until the
destination; propagation delay behaves very differently from bandwidth delay!
5.0 Suppose RTTnoLoad = 4 seconds and the bottleneck bandwidth is 1 packet every 2 seconds.
(a). What window size is needed to remain just at the knee of congestion?
(b). Suppose winsize=6. What is the eventual value of RTTactual?
(c). Again with winsize=6, how many packets are in the queue at the steady state?
6.0 Create a table as in 6.3.1 Simple fixed-window-size analysis for a network A───R1───R2───R3───B. The A–R1 link is
infinitely fast. The R1–R2 and R3–B links have a bandwidth delay of 1 packet/second with no additional propagation delay. The
R2–R3 link has a bandwidth delay of 1 packet / 2 seconds, and no propagation delay. The reverse B⟶A direction (for ACKs) is
infinitely fast. Assume winsize = 6.
(a). Carry out the table for 10 seconds. Note that you will need to show the queue for both R1 and R2.
(b). Continue the table at least partially until T=18, in sufficient detail that you can verify that RTTactual for packet 8 is as calculated
in exercise 5.0. To do this you will need more than 10 packets, but fewer than 16; the use of hex labels A, B, C for packets 10, 11,
12 is a convenient notation.
Hint: The column for “R2 sends” (or, more accurately, “R2 is in the process of sending”) should look like this:

T | R2 sends
--|---------
1 | 1
2 | 1
3 | 2
4 | 2
5 | 3
6 | 3
… | …

7.0 Argue that, if A sends to B using sliding windows, and in the path from A to B the slowest link is not the first link out of A,
then eventually A will have the entire window outstanding (except at the instant just after each new ACK comes in).
7.5♢ Suppose RTTnoLoad is 100 ms and the available bandwidth is 1,000 packets/sec. Sliding windows is used.
(a). What is the transit capacity for the connection?
(b). If RTTactual rises to 130 ms (due to use of a larger winsize), how many packets are in a queue at any one time?
(c). If winsize increases by 50, what is RTTactual?
8.0 Suppose RTTnoLoad is 50 ms and the available bandwidth is 2,000 packets/sec. Sliding windows is used for transmission.
(a). What window size is needed to remain just at the knee of congestion?
(b). If RTTactual rises to 60 ms (due to use of a larger winsize), how many packets are in a queue at any one time?
(c). What value of winsize would lead to RTTactual = 60 ms?
(d). What value of winsize would make RTTactual rise to 100 ms?
8.5 Suppose stop-and-wait is used (winsize=1), and assume that while packets may be lost, they are never reordered (that is, if two
packets P1 and P2 are sent in that order, and both arrive, then they arrive in that order). Show that at the point the receiver is
waiting for Data[N], the only two packet arrivals possible are Data[N] and Data[N-1]. (A consequence is that, in the absence of
reordering, stop-and-wait can make do with 1-bit packet sequence numbers.) Hint: if the receiver is waiting for Data[N], it must
have just received Data[N-1] and sent ACK[N-1]. Also, once the sender has sent Data[N], it will never transmit a Data[K] with
K<N.
✰9.0♢ Suppose winsize=4 in a sliding-windows connection, and assume that while packets may be lost, they are never reordered
(that is, if two packets P1 and P2 are sent in that order, and both arrive, then they arrive in that order). Show that if Data[8] is in the
receiver’s window (meaning that everything up through Data[4] has been received and acknowledged), then it is no longer possible
for even a late Data[0] to arrive at the receiver. (A consequence of the general principle here is that – in the absence of reordering –
we can replace the packet sequence number with (sequence_number) mod (2×winsize+1) without ambiguity.)
10.0 Suppose winsize=4 in a sliding-windows connection, and assume as in the previous exercise that while packets may be lost,
they are never reordered. Give an example in which Data[8] is in the receiver’s window (so the receiver has presumably sent
ACK[4]), and yet Data[1] legitimately arrives. (Thus, the late-packet bound in the previous exercise is the best possible.)
11.0 Suppose the network is A───R1───R2───B, where the A–R1 link is infinitely fast and the R1–R2 link has a bandwidth of
1 packet/second each way, for an RTTnoLoad of 2 seconds. Suppose also that A begins sending with winsize = 6. By the analysis in
6.3.1.3 Case 3: winsize = 6, RTT should rise to winsize/bandwidth = 6 seconds. Give the RTTs of the first eight packets. How long
does it take for RTT to rise to 6 seconds?
12.0 In this exercise we look at the relationship between bottleneck bandwidth and winsize/RTTactual when the former changes
suddenly. Suppose the network is as follows

A───R1───R2───R3───B
The A──R1 link is infinitely fast. The R1→R2 link has a 1 packet/sec bandwidth delay in the R1→R2 direction. The remaining
links R2→R3 and R3→B have a 1 sec bandwidth delay in the direction indicated. ACK packets, being small, travel instantaneously
from B back to A.

A sends to B using a winsize of three. Three packets P0, P1 and P2 are sent at times T=0, T=1 and T=2 respectively.
At T=3, P0 arrives at B. At this instant the R1→R2 bandwidth is suddenly halved to 1 packet / 2 sec; P3 is transmitted at T=3 and
arrives at R2 at T=5. It will arrive at B at T=7.
(a). Complete the following table of packet arrival times
T  | A sends | R1’s queue | R1 sends | R2 sends | R3 sends | B recvs/ACKs
---|---------|------------|----------|----------|----------|-------------
2  | P2      |            | P2       | P1       | P0       |
3  | P3      |            | P3       | P2       | P1       | P0
4  | P4      | P4         | P3 cont  |          | P2       | P1
5  | P5      | P5         | P4       | P3       |          | P2
6  |         | P5         | P4 cont  |          | P3       |
7  | P6      |            |          |          |          | P3
9  | P7      |            |          |          |          |
10 |         |            |          |          |          |
11 | P8      |            |          |          |          |

(b). For each of P2, P3, P4 and P5, calculate the throughput given by winsize/RTT over the course of that packet’s round trip. Obtain
each packet’s RTT from the completed table above.
(c). Once the steady state is reached in which RTTactual = 6, how much time does each packet spend in transit? How much time
does each packet spend in R1’s queue?

CHAPTER OVERVIEW

7: IP version 4
There are multiple LAN protocols below the IP layer and multiple transport protocols above, but IP itself stands alone. The
Internet is essentially the IP Internet. If you want to run your own LAN protocol somewhere, or if you want to run your own
transport protocol, the Internet backbone will still work just fine for you. But if you want to change the IP layer, you will encounter
difficulty. (Just talk to the IPv6 people, or the IP-multicasting or IP-reservations groups.)
7.1: Prelude to IP version 4
7.2: The IPv4 Header
7.3: Interfaces
7.4: Special Addresses
7.5: Fragmentation
7.6: The Classless IP Delivery Algorithm
7.7: IPv4 Subnets
7.8: Network Address Translation
7.9: DNS
7.10: Address Resolution Protocol - ARP
7.11: Dynamic Host Configuration Protocol (DHCP)
7.12: Internet Control Message Protocol
7.13: Unnumbered Interfaces
7.14: Mobile IP
7.15: Epilog and Exercises
Index

7.1: Prelude to IP version 4
Currently the Internet uses (mostly, but no longer quite exclusively) IP version 4, with its 32-bit address size. As the Internet has
run out of new large blocks of IPv4 addresses (1.10 IP - Internet Protocol), there is increasing pressure to convert to IPv6, with its
128-bit address size. Progress has been slow, however, and delaying tactics such as IPv4-address markets and NAT (7.7 Network
Address Translation) – by which multiple hosts can share a single public IPv4 address – have allowed IPv4 to continue. Aside from
the major change in address structure, there are relatively few differences in the routing models of IPv4 and IPv6. We will study
IPv4 in this chapter and IPv6 in the following; at points where the IPv4/IPv6 difference doesn’t much matter we will simply write
“IP”.
IPv4 (and IPv6) is, in effect, a universal routing and addressing protocol. Routing and addressing are developed together; every
node has an IP address and every router knows how to handle IP addresses. IP was originally seen as a way to interconnect
multiple LANs, but it may make more sense now to view IP as a virtual LAN overlaying all the physical LANs.
A crucial aspect of IP is its scalability. As the Internet has grown to ~10^9 hosts, the forwarding tables are not much larger than 10^5
(perhaps now 10^5.5). Ethernet, in comparison, scales poorly.
Furthermore, IP, unlike Ethernet, offers excellent support for multiple redundant links. If the network below were an IP network,
each node would communicate with each immediate neighbor via their shared direct link. If, on the other hand, this were an
Ethernet network with the spanning-tree algorithm, then one of the four links would simply be disabled completely.
A───B
│   │
C───D

The IP network service model is to act like a giant LAN. That is, there are no acknowledgments; delivery is generally described as
best-effort. This design choice is perhaps surprising, but it has also been quite fruitful.
If you want to provide a universal service for delivering any packet anywhere, what else do you need besides routing and
addressing? Every network (LAN) needs to be able to carry any packet. The protocols spell out the use of octets (bytes), so the only
possible compatibility issue is that a packet is too large for a given network. IPv4 handles this by supporting fragmentation: a
network may break a too-large packet up into units it can transport successfully. While IPv4 fragmentation is inefficient and
clumsy, it does guarantee that any packet can potentially be delivered to any node. (Note, however, that IPv6 has given up on
universal fragmentation; 8.5.4 IPv6 Fragment Header.)

7.2: The IPv4 Header
The IPv4 Header needs to contain the following information:
destination and source addresses
indication of ipv4 versus ipv6
a Time To Live (TTL) value, to prevent infinite routing loops
a field indicating what comes next in the packet (eg TCP vs UDP)
fields supporting fragmentation and reassembly.
The header is organized as a series of 32-bit words as follows:
0               8               16              24              32
+-------+-------+-----------+---+-------------------------------+
|Version|  IHL  |  DS field |ECN|         Total Length          |
+-------+-------+-----------+---+-------------------------------+
|         Identification        | Flags |   Fragment Offset     |
+-------------------------------+-------+-----------------------+
| Time to Live  |   Protocol    |        Header Checksum        |
+---------------+---------------+-------------------------------+
|                         Source Address                        |
+---------------------------------------------------------------+
|                      Destination Address                      |
+---------------------------------------------------------------+
|           IPv4 Options (0-10 rows)            |    Padding    |
+-----------------------------------------------+---------------+

The IPv4 header, and basics of IPv4 protocol operation, were originally defined in RFC 791
[https://tools.ietf.org/html/rfc791.html]; some minor changes have since occurred. Most of these changes were documented in RFC
1122 [https://tools.ietf.org/html/rfc1122.html], though the DS field was defined in RFC 2474
[https://tools.ietf.org/html/rfc2474.html] and the ECN bits were first proposed in RFC 2481
[https://tools.ietf.org/html/rfc2481.html].
The Version field is, for IPv4, the number 4: 0100. The IHL field represents the total IPv4 Header Length, in 32-bit words; an IPv4
header can thus be at most 15 words long. The base header takes up five words, so the IPv4 Options can consist of at most ten
words. If one looks at IPv4 packets using a packet-capture tool that displays the packets in hex, the first byte will most often be
0x45.
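
As a concrete illustration, here is a Python sketch that unpacks the fixed 20-byte header according to the layout above. The sample header bytes at the end are fabricated for the example (10.0.0.1 to 10.0.0.2, TTL 64, protocol TCP, checksum left zero).

    # Unpack the fixed 20-byte IPv4 header (no options), per the layout above.
    import socket
    import struct

    def parse_ipv4_header(pkt):
        (vhl, dsecn, total_len, ident, flags_frag,
         ttl, proto, cksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", pkt[:20])
        return {
            "version": vhl >> 4,            # high nibble: 4
            "ihl": vhl & 0xF,               # header length in 32-bit words
            "total_length": total_len,
            "ttl": ttl,
            "protocol": proto,              # eg 6 = TCP, 17 = UDP
            "src": socket.inet_ntoa(src),
            "dst": socket.inet_ntoa(dst),
        }

    hdr = bytes.fromhex(
        "4500" "0028"       # version 4, IHL 5; DS/ECN 0; total length 40
        "0000" "4000"       # identification 0; flags DF, fragment offset 0
        "4006" "0000"       # TTL 64, protocol 6 (TCP); checksum placeholder
        "0a000001"          # source 10.0.0.1
        "0a000002")         # destination 10.0.0.2
    print(parse_ipv4_header(hdr))   # version 4, ihl 5, ttl 64, 10.0.0.1 -> 10.0.0.2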
The Differentiated Services (DS) field is used by the Differentiated Services suite to specify preferential handling for designated
packets, eg those involved in VoIP or other real-time protocols. The Explicit Congestion Notification bits are there to allow
routers experiencing congestion to mark packets, thus indicating to the sender that the transmission rate should be reduced. We will
address these in 14.8.3 Explicit Congestion Notification (ECN). These two fields together replace the old 8-bit Type of Service
field.
The Total Length field is present because an IPv4 packet may be smaller than the minimum LAN packet size (see Exercise 1) or
larger than the maximum (if the IPv4 packet has been fragmented over several LAN packets). The IPv4 packet length, in other
words, cannot be inferred from the LAN-level packet size. Because the Total Length field is 16 bits, the maximum IPv4 packet size
is 2^16 bytes. This is probably much too large, even if fragmentation were not something to be avoided (though see IPv6
“jumbograms” in 8.5.1 Hop-by-Hop Options Header).
The second word of the header is devoted to fragmentation, discussed below.
The Time-to-Live (TTL) field is decremented by 1 at each router; if it reaches 0, the packet is discarded. A typical initial value is
64; it must be larger than the total number of hops in the path. In most cases, a value of 32 would work. The TTL field is there to
prevent routing loops – always a serious problem should they occur – from consuming resources indefinitely. Later we will look at
various IP routing-table update protocols and how they minimize the risk of routing loops; they do not, however, eliminate it. By
comparison, Ethernet headers have no TTL field, but Ethernet also disallows cycles in the underlying topology.
The Protocol field contains a value to identify the contents of the packet body. A few of the more common values are
1: an ICMP packet, 7.11 Internet Control Message Protocol
4: an encapsulated IPv4 packet, 7.13.1 IP-in-IP Encapsulation
6: a TCP packet
17: a UDP packet
41: an encapsulated IPv6 packet, 8.13 IPv6 Connectivity via Tunneling

50: an Encapsulating Security Payload, 22.11 IPsec
A list of assigned protocol numbers is maintained by the IANA [www.iana.org/assignments/prot...numbers.xhtml].
The Header Checksum field is the “Internet checksum” applied to the header only, not the body. Its only purpose is to allow the
discarding of packets with corrupted headers. When the TTL value is decremented the router must update the header checksum.
This can be done “algebraically” by adding a 1 in the correct place to compensate, but it is not hard simply to re-sum the 8
halfwords of the average header. The header checksum must also be updated when an IPv4 packet header is rewritten by a NAT
router.
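
The checksum algorithm itself is short: the header is treated as a sequence of 16-bit big-endian words, which are summed with the carries folded back in ("end-around carry"), and the result complemented. A sketch, assuming an even-length header:

    # Internet checksum over a bytes object of even length.
    def internet_checksum(data):
        total = 0
        for i in range(0, len(data), 2):
            total += (data[i] << 8) | data[i + 1]     # next 16-bit word
        while total >> 16:
            total = (total & 0xFFFF) + (total >> 16)  # end-around carry
        return ~total & 0xFFFF

To generate a checksum, the checksum field is zeroed and the result of internet_checksum() is stored there; to verify a received header, the sum is taken with the checksum field intact, in which case internet_checksum() returns 0 if the header is uncorrupted.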
The Source and Destination Address fields contain, of course, the IPv4 addresses. These would normally be updated only by NAT
firewalls.
The source-address field is supposed to be the sender’s IPv4 address, but hardly any ISP checks that traffic they send out has a
source address matching one of their customers, despite the call to do so in RFC 2827 [https://tools.ietf.org/html/rfc2827.html]. As
a result, IP spoofing – the sending of IP packets with a faked source address – is straightforward. For some examples, see 12.10.1
ISNs and spoofing, and SYN flooding at 12.3 TCP Connection Establishment.
IP-address spoofing also facilitates an all-too-common IP-layer denial-of-service attack in which a server is flooded with a huge
volume of traffic so as to reduce the bandwidth available to legitimate traffic to a trickle. This flooding traffic typically originates
from a large number of compromised machines. Without spoofing, even a lengthy list of sources can be blocked, but, with
spoofing, this becomes quite difficult.
One IPv4 option is the Record Route option, in which routers are to insert their own IPv4 address into the IPv4 header option area.
Unfortunately, with only ten words available, there is not enough space to record most longer routes (but see 7.11.1 Traceroute and
Time Exceeded, below). The Timestamp option is related; intermediate routers are requested to mark packets with their address
and a local timestamp (to save space, the option can request only timestamps). There is room for only four ⟨address,timestamp⟩
pairs, but addresses can be prespecified; that is, the sender can include up to four IPv4 addresses and only those routers will fill in
a timestamp.
Another option, now deprecated as a security risk, is to support source routing. The sender would insert into the IPv4 header option
area a list of IPv4 addresses; the packet would be routed to pass through each of those IPv4 addresses in turn. With strict source
routing, the IPv4 addresses had to represent adjacent neighbors; no router could be used if its IPv4 address were not on the list.
With loose source routing, the listed addresses did not have to represent adjacent neighbors and ordinary IPv4 routing was used to
get from one listed IPv4 address to the next. Both forms are essentially never used, again for security reasons: if a packet has been
source-routed, it may have been routed outside of the at-least-somewhat trusted zone of the Internet backbone.
Finally, the IPv4 header was carefully laid out with memory alignment in mind. The 4-byte address fields are aligned on 4-byte
boundaries, and the 2-byte fields are aligned on 2-byte boundaries. All this was once considered important enough that incoming
packets were stored following two bytes of padding at the head of their containing buffer, so the IPv4 header, starting after the 14-
byte Ethernet header, would be aligned on a 4-byte boundary. Today, however, the architectures for which this sort of alignment
mattered have mostly faded away; alignment is a non-issue for ARM [en.Wikipedia.org/wiki/ARM_architecture] and Intel x86
[en.Wikipedia.org/wiki/X86] processors.

7.3: Interfaces
IP addresses (both IPv4 and IPv6) are, strictly speaking, assigned not to hosts or nodes, but to interfaces. In the most common
case, where each node has a single LAN interface, this is a distinction without a difference. In a room full of workstations each
with a single Ethernet interface eth0 (or perhaps Ethernet adapter Local Area Connection), we might as well
view the IP address assigned to the interface as assigned to the workstation itself.
Each of those workstations, however, likely also has a loopback interface (at least conceptually), providing a way to deliver IP
packets to other processes on the same machine. On many systems, the name “localhost” resolves to the IPv4 loopback address
127.0.0.1 (the IPv6 address ::1 is also used); see 7.3 Special Addresses. Delivering packets to the loopback interface is simply a
form of interprocess communication; a functionally similar alternative is named pipes [en.Wikipedia.org/wiki/Named_pipe].
Loopback delivery avoids the need to use the LAN at all, or even the need to have a LAN. For simple client/server testing, it is
often convenient to have both client and server on the same machine, in which case the loopback interface is a convenient (and
fast) standin for a “real” network interface. On unix-based machines the loopback interface represents a genuine logical interface,
commonly named lo. On Windows systems the “interface” may not represent an actual operating-system entity, but this is of
practical concern only to those interested in “sniffing” all loopback traffic; packets sent to the loopback address are still delivered
as expected.
Workstations often have special other interfaces as well. Most recent versions of Microsoft Windows have a Teredo Tunneling
pseudo-interface and an Automatic Tunneling pseudo-interface; these are both intended (when activated) to support IPv6
connectivity when the local ISP supports only IPv4. The Teredo protocol is documented in RFC 4380
[https://tools.ietf.org/html/rfc4380.html].
When VPN connections are created, as in 3.1 Virtual Private Networks, each end of the logical connection typically terminates at a
virtual interface (one of these is labeled tun0 in the diagram of 3.1 Virtual Private Networks). These virtual interfaces appear, to
the systems involved, to be attached to a point-to-point link that leads to the other end.
When a computer hosts a virtual machine, there is almost always a virtual network to connect the host and virtual systems. The host
will have a virtual interface to connect to the virtual network. The host may act as a NAT router for the virtual machine, “hiding”
that virtual machine behind its own IP address, or it may act as an Ethernet switch, in which case the virtual machine will need an
additional public IP address.

What’s My IP Address?
This simple-seeming question is in fact not very easy to answer, if by “my IP address” one means the IP address assigned to
the interface that connects directly to the Internet. One strategy is to find the address of the default router, and then iterate
through all interfaces (eg with the Java NetworkInterface class) to find an IP address with a matching network prefix;
a Python3 example of this approach appears in 18.5.1 Multicast Programming. Unfortunately, finding the default router (to
identify the primary interface) is hard to do in an OS-independent way, and even then this approach can fail if the Wi-Fi and
Ethernet interfaces both are assigned IP addresses on the same network, but only one is actually connected.
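
One widely used workaround, shown below in Python, is to connect() a UDP socket to some outside address: no packet is actually sent, but the operating system must commit to an outgoing interface, and getsockname() then reveals that interface’s address. The destination 8.8.8.8 here is arbitrary; any routable external address would do.

    # Discover the IP address of the interface used to reach the Internet.
    import socket

    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.connect(("8.8.8.8", 53))      # UDP connect(): picks the route, sends nothing
    print(s.getsockname()[0])       # eg 192.168.1.37
    s.close()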

Routers always have at least two interfaces on two separate IP networks. Generally this means a separate IP address for each
interface, though some point-to-point interfaces can be used without being assigned any IP address (7.12 Unnumbered Interfaces).

7.3.1 Multihomed hosts


A non-router host with multiple non-loopback network interfaces is often said to be multihomed. Many laptops, for example, have
both an Ethernet interface and a Wi-Fi interface. Both of these can be used simultaneously, with different IP addresses assigned to
each. On residential networks the two interfaces will often be on the same IP network (eg the same bridged Wi-Fi/Ethernet LAN);
at more security-conscious sites the Ethernet and Wi-Fi interfaces are often on quite different IP networks (though see 7.9.5 ARP
and multihomed hosts).
Multiple physical interfaces are not actually needed here; it is usually possible to assign multiple IP addresses to a single interface.
Sometimes this is done to allow two IP networks (two distinct prefixes) to share a single physical LAN; in this case the interface
would be assigned one IP address for each IP network. Other times a single interface is assigned multiple IP addresses on the same

IP network; this is often done so that one physical machine can act as a server (eg a web server) for multiple distinct IP addresses
corresponding to multiple distinct domain names.
Multihoming raises some issues with packets addressed to one interface, A, with IP address AIP, but which arrive via another
interface, B, with IP address BIP. Strictly speaking, such arriving packets should be discarded unless the host is promoted to
functioning as a router. In practice, however, the strict interpretation often causes problems; a typical user understanding is that the
IP address AIP should work to reach the host even if the physical connection is to interface B. A related issue is whether the host
receiving such a packet addressed to AIP on interface B is allowed to send its reply with source address AIP, even though the reply
must be sent via interface B.
RFC 1122 [https://tools.ietf.org/html/rfc1122.html], §3.3.4, defines two alternatives here:
The Strong End-System model: IP addresses – incoming and outbound – must match the physical interface.
The Weak End-System model: A match is not required: interface B can accept packets addressed to AIP, and send packets with
source address AIP.
Linux systems generally use the weak model by default. See also 7.9.5 ARP and multihomed hosts.
While it is important to be at least vaguely aware of the special cases that multihoming presents, we emphasize again that in most
ordinary contexts each end-user workstation has one IP address that corresponds to a LAN connection.

7.4: Special Addresses
A few IPv4 addresses represent special cases.
While the standard IPv4 loopback address is 127.0.0.1, any IPv4 address beginning with 127 can serve as a loopback address.
Logically they all represent the current host. Most hosts are configured to resolve the name “localhost” to 127.0.0.1. However, any
loopback address – eg 127.255.37.59 – should work, eg with ping. For an example using 127.0.1.0, see 7.8 DNS.
Private addresses are IPv4 addresses intended only for site internal use, eg either behind a NAT firewall or intended to have no
Internet connectivity at all. If a packet shows up at any non-private router (eg at an ISP router), with a private IPv4 address as either
source or destination address, the packet should be dropped. Three standard private-address blocks have been defined:
10.0.0.0/8
172.16.0.0/12
192.168.0.0/16
The last block is the one from which addresses are most commonly allocated by DHCP servers (7.10.1 NAT, DHCP and the Small
Office) built into NAT routers.
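Here is a short Python sketch (ours, using only the standard ipaddress module) of the membership test a router might use to drop packets with private source or destination addresses:

    import ipaddress

    # the three standard private-address blocks listed above
    RFC1918 = [ipaddress.ip_network(p) for p in
               ('10.0.0.0/8', '172.16.0.0/12', '192.168.0.0/16')]

    def is_private(addr):
        # True if addr lies in one of the private blocks; a non-private
        # router should drop packets with such a source or destination
        ip = ipaddress.ip_address(addr)
        return any(ip in net for net in RFC1918)

    print(is_private('192.168.1.44'))    # True
    print(is_private('147.126.1.1'))     # False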
Broadcast addresses are a special form of IPv4 address intended to be used in conjunction with LAN-layer broadcast. The most
common forms are “broadcast to this network”, consisting of all 1-bits, and “broadcast to network D”, consisting of D’s network-
address bits followed by all 1-bits for the host bits. If you try to send a packet to the broadcast address of a remote network D, the
odds are that some router involved will refuse to forward it, and the odds are even higher that, once the packet arrives at a router
actually on network D, that router will refuse to broadcast it. Even addressing a broadcast to one’s own network will fail if the
underlying LAN does not support LAN-level broadcast (eg ATM).
The highly influential early Unix implementation Berkeley 4.2 BSD used 0-bits for the broadcast bits, instead of 1’s. As a result, to
this day host bits cannot be all 1-bits or all 0-bits in order to avoid confusion with the IPv4 broadcast address. One consequence of
this is that a Class C network has 254 usable host addresses, not 256.
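To make the "broadcast to network D" form concrete: the broadcast address is the network address with the host bits set to all 1-bits, which Python's standard ipaddress module can compute directly (the example network is arbitrary):

    import ipaddress

    net = ipaddress.ip_network('147.126.1.0/24')
    print(net.broadcast_address)           # 147.126.1.255: host bits all 1s

    # equivalently, OR the network bits with the complement of the mask
    bcast = int(net.network_address) | (~int(net.netmask) & 0xffffffff)
    print(ipaddress.ip_address(bcast))     # 147.126.1.255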

7.4.1 Multicast addresses


Finally, IPv4 multicast addresses remain as the last remnant of the Class A/B/C strategy: multicast addresses are Class D, with
first byte beginning 1110 (meaning that the first byte is, in decimal, 224-239). Multicasting means delivering to a specified set of
addresses, preferably by some mechanism more efficient than sending to each address individually. A reasonable goal of multicast
would be that no more than one copy of the multicast packet traverses any given link.
Support for IPv4 multicast requires considerable participation by the backbone routers involved. For example, if hosts A, B and C
each connect to different interfaces of router R1, and A wishes to send a multicast packet to B and C, then it is up to R1 to receive
the packet, figure out that B and C are the intended recipients, and forward the packet twice, once for B’s interface and once for
C’s. R1 must also keep track of what hosts have joined the multicast group and what hosts have left. Due to this degree of router
participation, backbone router support for multicasting has not been entirely forthcoming. A discussion of IPv4 multicasting
appears in 20 Quality of Service.

7.5: Fragmentation
If you are trying to interconnect two LANs (as IP does), what else might be needed besides Routing and Addressing? IPv4 (and
IPv6) explicitly assumes all packets are composed of 8-bit bytes (something not universally true in the early days of IP; to this day
the RFCs refer to “octets” to emphasize this requirement). IP also defines bit-order within a byte, and it is left to the networking
hardware to translate properly. Neither byte size nor bit order, therefore, can interfere with packet forwarding.
There is one more feature IPv4 must provide, however, if the goal is universal connectivity: it must accommodate networks for
which the maximum packet size, or Maximum Transfer Unit, MTU, is smaller than the packet that needs forwarding. Otherwise,
if we were using IPv4 to join Token Ring (MTU = 4kB, at least originally) to Ethernet (MTU = 1500B), the token-ring packets
might be too large to deliver to the Ethernet side, or to traverse an Ethernet backbone en route to another Token Ring. (Token Ring,
in its day, did commonly offer a configuration option to allow Ethernet interoperability.)
So, IPv4 must support fragmentation, and thus also reassembly. There are two potential strategies here: per-link fragmentation and
reassembly, where the reassembly is done at the opposite end of the link (as in ATM), and path fragmentation and reassembly,
where reassembly is done at the far end of the path. The latter approach is what is taken by IPv4, partly because intermediate
routers are too busy to do reassembly (this is as true today as it was in 1981 when RFC 791 [https://tools.ietf.org/html/rfc791.html]
was published), partly because there is no absolute guarantee that all fragments will go to the same next-hop router, and partly
because IPv4 fragmentation has always been seen as the strategy of last resort.
An IPv4 sender is supposed to use a different value for the IDENT field for different packets, at least up until the field wraps
around. When an IPv4 datagram is fragmented, the fragments keep the same IDENT field, so this field in effect indicates which
fragments belong to the same packet.
After fragmentation, the Fragment Offset field marks the start position of the data portion of this fragment within the data portion
of the original IPv4 packet. Note that the start position can be a number up to 2¹⁶, the maximum IPv4 packet length, but the
FragOffset field has only 13 bits. This is handled by requiring the data portions of fragments to have sizes a multiple of 8 (three
bits), and left-shifting the FragOffset value by 3 bits before using it.
As an example, consider the following network, where MTUs are excluding the LAN header:

A ───(MTU 1500)─── R1 ───(MTU 1000)─── R2 ───(MTU 400)─── R3 ───(MTU 1500)─── B

Suppose A addresses a packet of 1500 bytes to B, and sends it via the LAN to the first router R1. The packet contains 20 bytes of
IPv4 header and 1480 of data.
R1 fragments the original packet into two packets of sizes 20+976 = 996 and 20+504=544. Having 980 bytes of payload in the first
fragment would fit, but violates the rule that the sizes of the data portions be divisible by 8. The first fragment packet has
FragOffset = 0; the second has FragOffset = 976.
R2 refragments the first fragment into three packets as follows:
first: size = 20+376=396, FragOffset = 0
second: size = 20+376=396, FragOffset = 376
third: size = 20+224 = 244 (note 376+376+224=976), FragOffset = 752.
R2 refragments the second fragment into two:
first: size = 20+376 = 396, FragOffset = 976+0 = 976
second: size = 20+128 = 148, FragOffset = 976+376=1352
R3 then sends the fragments on to B, without reassembly.
Note that it would have been slightly more efficient to have fragmented into four fragments of sizes 376, 376, 376, and 352 in the
beginning. Note also that the packet format is designed to handle fragments of different sizes easily. The algorithm is based on
multiple fragmentation with reassembly only at the final destination.
Each fragment has its IPv4-header Total Length field set to the length of that fragment.
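The fragmentation arithmetic above is easy to mechanize. Here is a short Python sketch (ours, not from any RFC) that computes the ⟨offset, size⟩ pairs of the data portions for a given MTU; running it once per link reproduces the R1 and R2 numbers above.

    HDR = 20   # bytes of IPv4 header per fragment (no options)

    def fragment(offset, datalen, mtu):
        """Fragment one packet (or fragment) whose data portion starts at
        the given byte offset; returns a list of (offset, datalen) pairs.
        Non-final data sizes are rounded down to a multiple of 8, since
        the 13-bit FragOffset field stores offset>>3."""
        frags = []
        maxdata = (mtu - HDR) & ~7          # largest multiple of 8 that fits
        while datalen > maxdata:
            frags.append((offset, maxdata))
            offset += maxdata
            datalen -= maxdata
        frags.append((offset, datalen))     # final fragment needs no rounding
        return frags

    print(fragment(0, 1480, 1000))    # R1: [(0, 976), (976, 504)]
    print(fragment(0, 976, 400))      # R2: [(0, 376), (376, 376), (752, 224)]
    print(fragment(976, 504, 400))    # R2: [(976, 376), (1352, 128)]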
We have not yet discussed the three flag bits. The first bit is reserved, and must be 0. The second bit is the Don’t Fragment, or DF,
bit. If it is set to 1 by the sender then a router must not fragment the packet and must drop it instead; see 12.13 Path MTU
Discovery for an application of this. The third bit is set to 1 for all fragments except the final one (this bit is thus set to 0 if no
fragmentation has occurred). The third bit tells the receiver where the fragments stop.
The receiver must take the arriving fragments and reassemble them into a whole packet. The fragments may not arrive in order –
unlike in ATM networks – and may have unrelated packets interspersed. The reassembler must identify when different arriving
packets are fragments of the same original, and must figure out how to reassemble the fragments in the correct order; both these
problems were essentially trivial for ATM.
Fragments are considered to belong to the same packet if they have the same IDENT field and also the same source and destination
addresses and same protocol.
As all fragment sizes are a multiple of 8 bytes, the receiver can keep track of whether all fragments have been received with a
bitmap in which each bit represents one 8-byte fragment chunk. A 1 kB packet could have up to 128 such chunks; the bitmap
would thus be 16 bytes.
If a fragment arrives that is part of a new (and fragmented) packet, a buffer is allocated. While the receiver cannot know the final
size of the buffer, it can usually make a reasonable guess. Because of the FragOffset field, the fragment can then be stored in the
buffer in the appropriate position. A new bitmap is also allocated, and a reassembly timer is started.
As subsequent fragments arrive, not necessarily in order, they too can be placed in the proper buffer in the proper position, and the
appropriate bits in the bitmap are set to 1.
If the bitmap shows that all fragments have arrived, the packet is sent on up as a completed IPv4 packet. If, on the other hand, the
reassembly timer expires, then all the pieces received so far are discarded.
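Here is a minimal sketch of the receiver's bookkeeping; a Python set of 8-byte chunk numbers stands in for the bitmap, and the buffer, timer and packet parsing are omitted:

    class Reassembly:
        # one instance per (source, destination, protocol, IDENT) tuple
        def __init__(self):
            self.chunks = set()    # 8-byte chunk numbers received so far
            self.total = None      # total data length, known once the final fragment arrives

        def add(self, fragoffset, datalen, more_fragments):
            first = fragoffset // 8
            count = (datalen + 7) // 8      # only the final fragment may be a non-multiple of 8
            self.chunks.update(range(first, first + count))
            if not more_fragments:          # third flag bit 0: this is the final fragment
                self.total = fragoffset + datalen

        def complete(self):
            return self.total is not None and len(self.chunks) == (self.total + 7) // 8

Feeding in the five fragments from R2 above, in any order, makes complete() return True once all 185 chunks of the original 1480 data bytes are accounted for.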
TCP connections usually engage in Path MTU Discovery, and figure out the largest packet size they can send that will not entail
fragmentation (12.13 Path MTU Discovery). But it is not unusual, for example, for UDP protocols to use fragmentation, especially
over the short haul. In the Network File System (NFS) protocol, for example, UDP is used to carry 8 kB disk blocks. These are
often sent as a single 8+ kB IPv4 packet, fragmented over Ethernet to five full packets and a fraction. Fragmentation works
reasonably well here because most of the time the packets do not leave the Ethernet they started on. Note that this is an example of
fragmentation done by the sender, not by an intermediate router.
Finally, any given IP link may provide its own link-layer fragmentation and reassembly; we saw in 3.5.1 ATM Segmentation and
Reassembly that ATM does just this. Such link-layer mechanisms are, however, generally invisible to the IP layer.

7.6: The Classless IP Delivery Algorithm
Recall from Chapter 1 that any IPv4 address can be divided into a net portion IPnet and a host portion IPhost; the division point was
determined by whether the IPv4 address was a Class A, a Class B, or a Class C. We also indicated in Chapter 1 that the division
point was not always so clear-cut; we now present the delivery algorithm, for both hosts and routers, that does not assume a
globally predeclared division point of the input IPv4 address into net and host portions. We will, for the time being, punt on the
question of forwarding-table lookup and assume there is a lookup() method available that, when given a destination address,
returns the next_hop neighbor.
Instead of class-based divisions, we will assume that each of the IPv4 addresses assigned to a node’s interfaces is configured with
an associated length of the network prefix; following the slash notation of 1.10 IP - Internet Protocol, if B is an address and the
prefix length is k = kB then the prefix itself is B/k. As usual, an ordinary host may have only one IP interface, while a router will
always have multiple interfaces.
Let D be the given IPv4 destination address; we want to decide if D is local or nonlocal. The host or router involved may have
multiple IP interfaces, but for each interface the length of the network portion of the address will be known. For each network
address B/k assigned to one of the host’s interfaces, we compare the first k bits of B and D; that is, we ask if D matches B/k.
If one of these comparisons yields a match, delivery is local; the host delivers the packet to its final destination via the LAN
connected to the corresponding interface. This means looking up the LAN address of the destination, if applicable, and sending
the packet to that destination via the interface.
If there is no match, delivery is nonlocal, and the host passes D to the lookup() routine of the forwarding table and sends
to the associated next_hop (which must represent a physically connected neighbor). It is now up to the lookup() routine to
make any necessary determinations as to how D might be split into Dnet and Dhost; the split cannot be made outside of
lookup() .
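In code, asking whether D matches B/k is just a masked equality test. Here is a sketch, with addresses as Python ipaddress objects (the helper names are ours):

    import ipaddress

    def matches(D, B, k):
        # True if the first k bits of D and B agree
        mask = (0xffffffff << (32 - k)) & 0xffffffff
        return (int(D) & mask) == (int(B) & mask)

    def deliver(D, interfaces, lookup):
        # interfaces: list of (B, k) network prefixes of this node's interfaces
        for B, k in interfaces:
            if matches(D, B, k):
                return ('local', (B, k))    # deliver via this interface's LAN
        return ('nonlocal', lookup(D))      # forward to the next_hop neighbor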
The forwarding table is, abstractly, a set of network addresses – now also with lengths – each of the form B/k, with an associated
next_hop destination for each. The lookup() routine will, in principle, compare D with each table entry B/k, looking for a
match (that is, equality of the first k = kB bits). As with the local-delivery interfaces check above, the net/host division point (that
is, k) will come from the table entry; it will not be inferred from D or from any other information borne by the packet. There is, in
fact, no place in the IPv4 header to store a net/host division point, and furthermore different routers along the path may use
different values of k with the same destination address D. Routers receive the prefix length /k for a destination B/k as part of the
process by which they receive ⟨destination,next_hop⟩ pairs; see 9 Routing-Update Algorithms.
In 10 Large-Scale IP Routing we will see that in some cases multiple matches in the forwarding table may exist, eg 147.0.0.0/8 and
147.126.0.0/16. The longest-match rule will be introduced for such cases to pick the best match.
Here is a simple example for a router with immediate neighbors A-E:

destination     next_hop
10.3.0.0/16     A
10.4.1.0/24     B
10.4.2.0/24     C
10.4.3.0/24     D
10.3.37.0/24    E

The IPv4 addresses 10.3.67.101 and 10.3.59.131 both route to A. The addresses 10.4.1.101, 10.4.2.157 and 10.4.3.233 route to B,
C and D respectively. Finally, 10.3.37.103 matches both A and E, but the E match is longer so the packet is routed that way.
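Here is a sketch of a lookup() implementing the longest-match rule over the table above, using the standard ipaddress module; note that each /k comes from the table entry, never from D itself:

    import ipaddress

    TABLE = [(ipaddress.ip_network(p), nh) for p, nh in (
        ('10.3.0.0/16', 'A'), ('10.4.1.0/24', 'B'), ('10.4.2.0/24', 'C'),
        ('10.4.3.0/24', 'D'), ('10.3.37.0/24', 'E'))]

    def lookup(D):
        D = ipaddress.ip_address(D)
        best = None
        for net, next_hop in TABLE:
            if D in net and (best is None or net.prefixlen > best[0].prefixlen):
                best = (net, next_hop)
        return best[1] if best else None    # None: fall back to any default entry

    print(lookup('10.3.67.101'))     # A
    print(lookup('10.3.37.103'))     # E: the /24 match beats the /16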
The forwarding table may also contain a default entry for the next_hop, which it may return in cases when the destination D does
not match any known network. We take the view here that returning such a default entry is a valid result of the routing-table
lookup() operation, rather than a third option to the algorithm above; one approach is for the default entry to be the next_hop
corresponding to the destination 0.0.0.0/0, which does indeed match everything (use of this would definitely require the above
longest-match rule, though).

Default routes are hugely important in keeping leaf forwarding tables small. Even backbone routers sometimes expend
considerable effort to keep the network address prefixes in their forwarding tables as short as possible, through consolidation.
At a site with a single ISP and with no Internet customers (that is, which is not itself an ISP for others), the top-level forwarding
table usually has a single external route: its default route to its ISP. If a site has more than one ISP, however, the top-level
forwarding table can expand in a hurry. For example, Internet2 [en.Wikipedia.org/wiki/Internet2] is a consortium of research sites
with very-high-bandwidth internal interconnections, acting as a sort of “parallel Internet”. Before Internet2, Loyola’s top-level
forwarding table had the usual single external default route. After Internet2, we in effect had a second ISP and had to divide traffic
between the commercial ISP and the Internet2 ISP. The default route still pointed to the commercial ISP, but the top-level
forwarding table now had to have an entry for every individual Internet2 site, so that traffic to any of these sites would be
forwarded via the Internet2 ISP. See exercise 5.0.
Routers may also be configured to allow passing quality-of-service information to the lookup() method, as mentioned in
Chapter 1, to support different routing paths for different kinds of traffic (eg bulk file-transfer versus real-time).
For a modest exception to the local-delivery rule described here, see below in 7.12 Unnumbered Interfaces.

7.7: IPv4 Subnets
Subnets were the first step away from Class A/B/C routing: a large network (eg a class A or B) could be divided into smaller IPv4
networks called subnets. Consider, for example, a typical Class B network such as Loyola University’s (originally 147.126.0.0/16);
the underlying assumption is that any packet can be delivered via the underlying LAN to any internal host. This would require a
rather large LAN, and would require that a single physical LAN be used throughout the site. What if our site has more than one
physical LAN? Or is really too big for one physical LAN? It did not take long for the IP world to run into this problem.
Subnets were first proposed in RFC 917 [https://tools.ietf.org/html/rfc917.html], and became official with RFC 950
[https://tools.ietf.org/html/rfc950.html].
Getting a separate IPv4 network prefix for each subnet is bad for routers: the backbone forwarding tables now must have an entry
for every subnet instead of just for every site. What is needed is a way for a site to appear to the outside world as a single IP
network, but for further IP-layer routing to be supported inside the site. This is what subnets accomplish.
Subnets introduce hierarchical routing: first we route to the primary network, then inside that site we route to the subnet, and
finally the last hop delivers to the host.
Routing with subnets involves in effect moving the IPnet division line rightward. (Later, when we consider CIDR, we will see the
complementary case of moving the division line to the left.) For now, observe that moving the line rightward within a site does not
affect the outside world at all; outside routers are not even aware of site-internal subnetting.
In the following diagram, the outside world directs traffic addressed to 147.126.0.0/16 to the router R. Internally, however, the site
is divided into subnets. The idea is that traffic from 147.126.1.0/24 to 147.126.2.0/24 is routed, not switched; the two LANs
involved may not even be compatible. Most of the subnets shown are of size /24, meaning that the third byte of the IPv4 address
has become part of the network portion of the subnet’s address; one /20 subnet is also shown. RFC 950
[https://tools.ietf.org/html/rfc950.html] would have disallowed the subnet with third byte 0, but having 0 for the subnet bits
generally does work.

[Figure: the Internet connects to the site's entry router R, whose interfaces lead to the internal subnets 147.126.0.0/24, 147.126.1.0/24 (on which host A sits), 147.126.3.0/24 and 147.126.16.0/20; a second internal router R2 joins subnets 147.126.0.0/24 and 147.126.2.0/24.]

What we want is for the internal routing to be based on the extended network prefixes shown, while externally continuing to use
only the single routing entry for 147.126.0.0/16.
To implement subnets, we divide the site’s IPv4 network into some combination of physical LANs – the subnets –, and assign each
a subnet address: an IPv4 network address which has the site’s IPv4 network address as prefix. To put this more concretely,
suppose the site’s IPv4 network address is A, and consists of n network bits (so the site address may be written with the slash
notation as A/n); in the diagram above, A/n = 147.126.0.0/16. A subnet address is an IPv4 network address B/k such that:

The address B/k is within the site: the first n bits of B are the same as A/n’s
B/k extends A/n: k≥n
An example B/k in the diagram above is 147.126.1.0/24. (There is a slight simplification here in that subnet addresses do not
absolutely have to be prefixes; see below.)
We now have to figure out how packets will be routed to the correct subnet. For incoming packets we could set up some proprietary
protocol at the entry router to handle this. However, the more complicated situation is all those existing internal hosts that, under
the class A/B/C strategy, would still believe they can deliver via the LAN to any site host, when in fact they can now only do that
for hosts on their own subnet. We need a more general solution.
We proceed as follows. For each subnet address B/k, we create a subnet mask for B consisting of k 1-bits followed by enough 0-
bits to make a total of 32. We then make sure that every host and router in the site knows the subnet mask for every one of its
interfaces. Hosts usually find their subnet mask the same way they find their IP address (by static configuration if necessary, but
more likely via DHCP, below).
Hosts and routers now apply the IP delivery algorithm of the previous section, with the proviso that, if a subnet mask for an
interface is present, then the subnet mask is used to determine the number of address bits rather than the Class A/B/C mechanism.
That is, we determine whether a packet addressed to destination D is deliverable locally via an interface with subnet address B/k
and corresponding mask M by comparing D&M with B&M, where & represents bitwise AND; if the two match, the packet is
local. This will generally involve a match of more bits than if we used the Class A/B/C strategy to determine the network portion of
addresses D and B.
As stated previously, given an address D with no other context, we will not be able to determine the network/host division point in
general (eg for outbound packets). However, that division point is not in fact what we need. All that is needed is a way to tell if a
given destination host address D belongs to the current subnet, say B; that is, we need to compare the first k bits of D and B where
k is the (known) length of B.
In the diagram above, the subnet mask for the /24 subnets would be 255.255.255.0; bitwise ANDing any IPv4 address with the
mask is the same as extracting the first 24 bits of the IPv4 address, that is, the subnet portion. The mask for the /20 subnet would be
255.255.240.0 (240 in binary is 1111 0000).
In the diagram above none of the subnets overlaps or conflicts: the subnets 147.126.0.0/24 and 147.126.1.0/24 are disjoint. It takes
a little more effort to realize that 147.126.16.0/20 does not overlap with the others, but note that an IPv4 address matches this
network prefix only if the first four bits of the third byte are 0001, so the third byte itself ranges from decimal 16 = binary
0001 0000 to decimal 31 = binary 0001 1111.
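These disjointness claims can be checked mechanically; a quick sketch with the standard ipaddress module:

    import ipaddress

    slash20 = ipaddress.ip_network('147.126.16.0/20')
    for third in (0, 1, 2, 3):
        sub = ipaddress.ip_network('147.126.%d.0/24' % third)
        print(sub, slash20.overlaps(sub))    # False for each of the /24 subnets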
Note also that if host A = 147.126.0.1 wishes to send to destination D = 147.126.1.1, and A is not subnet-aware, then delivery will
fail: A will infer that the interface is a Class B, and therefore compare the first two bytes of A and D, and, finding a match, will
attempt direct LAN delivery. But direct delivery is now likely impossible, as the subnets are not joined by a switch. Only with the
subnet mask will A realize that its network is 147.126.0.0/24 while D’s is 147.126.1.0/24 and that these are not the same. A would
still be able to send packets to its own subnet. In fact A would still be able to send packets to the outside world: it would realize that
the destination in that case does not match 147.126.0.0/16 and will thus forward to its router. Hosts on other subnets would be the
only unreachable ones.
Properly, the subnet address is the entire prefix, eg 147.126.65.0/24. However, it is often convenient to identify the subnet address
with just those bits that represent the extension of the site IPv4-network address; we might thus say casually that the subnet address
here is 65.
The class-based IP-address strategy allowed any host anywhere on the Internet to properly separate any address into its net and host
portions. With subnets, this division point is now allowed to vary; for example, the address 147.126.65.48 divides into 147.126 |
65.48 outside of Loyola, but into 147.126.65 | 48 inside. This means that the net-host division is no longer an absolute property of
addresses, but rather something that depends on where the packet is on its journey.
Technically, we also need the requirement that given any two subnet addresses of different, disjoint subnets, neither is a proper
prefix of the other. This guarantees that if A is an IP address and B is a subnet address with mask M (so B = B&M), then A&M = B
implies A does not match any other subnet. Regardless of the net/host division rules, we cannot possibly allow subnet
147.126.16.0/20 to represent one LAN while 147.126.16.0/24 represents another; the second subnet address block is a subset of the
first. (We can, and sometimes do, allow the first LAN to correspond to everything in 147.126.16.0/20 that is not also in
147.126.16.0/24; this is the longest-match rule.)
The strategy above is actually a slight simplification of what the subnet mechanism actually allows: subnet address bits do not in
fact have to be contiguous, and masks do not have to be a series of 1-bits followed by 0-bits. The mask can be any bit-mask; the
subnet address bits are by definition those where there is a 1 in the mask bits. For example, we could at a Class-B site use the
fourth byte as the subnet address, and the third byte as the host address. The subnet mask would then be 255.255.0.255. While this
generality was once sometimes useful in dealing with “legacy” IPv4 addresses that could not easily be changed, life is simpler
when the subnet bits precede the host bits.

7.7.1 Subnet Example
As an example of having different subnet masks on different interfaces, let us consider the division of a class-C network into
subnets of size 70, 40, 25, and 20. The subnet addresses will of necessity have different lengths, as there is not room for four
subnets each able to hold 70 hosts.
A: size 70
B: size 40
C: size 25
D: size 20
Because of the different subnet-address lengths, division of a local IPv4 address LA into net versus host on subnets cannot be done
in isolation, without looking at the host bits. However, that division is not in fact what we need. All that is needed is a way to tell if
the local address LA belongs to a given subnet, say B; that is, we need to compare the first n bits of LA and B, where n is the length
of B’s subnet mask. We do this by comparing LA&M to B&M, where M is the mask corresponding to n. LA&M is not necessarily
the same as LAnet, if LA actually belongs to one of the other subnets. However, if LA&M = B&M, then LA must belong to subnet B,
in which case LA&M is in fact LAnet.
We will assume that the site’s IPv4 network address is 200.0.0.0/24. The first three bytes of each subnet address must match
200.0.0. Only some of the bits of the fourth byte will be part of the subnet address, so we will switch to binary for the last byte, and
use both the /n notation (for total number of subnet bits) and also add a vertical bar | to mark the separation between subnet and
host.
Example: 200.0.0.10 | 00 0000 / 26
Note that this means that the 0-bit following the 1-bit in the fourth byte is “significant” in that for a subnet to match, it must match
this 0-bit exactly. The remaining six 0-bits are part of the host portion.
To allocate our four subnet addresses above, we start by figuring out just how many host bits we need in each subnet. Subnet sizes
are always powers of 2, so we round up the subnets to the appropriate size. For subnet A, this means we need 7 host bits to
accommodate 2⁷ = 128 hosts, and so we have a single bit in the fourth byte to devote to the subnet address. Similarly, for B we will
need 6 host bits and will have 2 subnet bits, and for C and D we will need 5 host bits each and will have 8-5=3 subnet bits.
We now start choosing non-overlapping subnet addresses. We have one bit in the fourth byte to choose for A’s subnet; rather
arbitrarily, let us choose this bit to be 1. This means that every other subnet address must have a 0 in the first bit position of the
fourth byte, or we would have ambiguity.
Now for B’s subnet address. We have two bits to work with, and the first bit must be 0. Let us choose the second bit to be 0 as well.
If the fourth byte begins 00, the packet is part of subnet B, and the subnet addresses for C and D must therefore not begin 00.
Finally, we choose subnet addresses for C and D to be 010 and 011, respectively. We thus have

subnet   size   address bits in 4th byte   host bits in 4th byte   decimal range
A        128    1                          7                       128-255
B         64    00                         6                       0-63
C         32    010                        5                       64-95
D         32    011                        5                       96-127

As desired, none of the subnet addresses in the third column is a prefix of any other subnet address.
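In modern slash notation the four subnets are 200.0.0.128/25 (A), 200.0.0.0/26 (B), 200.0.0.64/27 (C) and 200.0.0.96/27 (D); a short Python sketch confirms the sizes and decimal ranges in the table:

    import ipaddress

    for name, prefix in (('A', '200.0.0.128/25'), ('B', '200.0.0.0/26'),
                         ('C', '200.0.0.64/27'), ('D', '200.0.0.96/27')):
        net = ipaddress.ip_network(prefix)
        print(name, net.num_addresses, net[0], '-', net[-1])

    # A 128 200.0.0.128 - 200.0.0.255
    # B 64 200.0.0.0 - 200.0.0.63
    # C 32 200.0.0.64 - 200.0.0.95
    # D 32 200.0.0.96 - 200.0.0.127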
The end result of all of this is that routing is now hierarchical: we route on the site IP address to get to a site, and then route on the
subnet address within the site.

7.7.2 Links between subnets
Suppose the Loyola CS department subnet (147.126.65.0/24) and a department at some other site, we will say 147.100.100.0/24,
install a private link. How does this affect routing?
Each department router would add an entry for the other subnet, routing along the private link. Traffic addressed to the other subnet
would take the private link. All other traffic would go to the default router. Traffic from the remote department to 147.126.64.0/24
would take the long route, and Loyola traffic to 147.100.101.0/24 would take the long route.

Subnet anecdote
A long time ago I was responsible for two hosts, abel and borel. One day I was informed that machines in computer lab 1 at the
other end of campus could not reach borel, though they could reach abel. Machines in lab 2, adjacent to lab 1, however, could
reach both borel and abel just fine. What was the difference?
It turned out that borel had a bad (/16 instead of /24) subnet mask, and so it was attempting local delivery to the labs. This
should have meant it could reach neither of the labs, as both labs were on a different subnet from my machines; I was still
perplexed. After considerably more investigation, it turned out that between abel/borel and the lab building was a bridge-
router: a hybrid device that properly routed subnet traffic at the IP layer, but which also forwarded Ethernet packets directly,
the latter feature apparently for the purpose of backwards compatibility. Lab 2 was connected directly to the bridge-router and
thus appeared to be on the same LAN as borel, despite the apparently different subnet; lab 1 was connected to its own router
R1 which in turn connected to the bridge-router. Lab 1 was thus, at the LAN level, isolated from abel and borel.
Moral 1: Switching and routing are both great ideas, alone. But switching at one layer mixed with routing at another is not.
Moral 2: Test thoroughly! The reason the problem wasn’t noticed earlier was that previously borel communicated only with
other hosts on its own subnet and with hosts outside the university entirely. Both of these worked with the bad subnet mask;
it was different-subnet local hosts that were the problem.

How would nearby subnets at either endpoint decide whether to use the private link? Classical link-state or distance-vector theory
(9 Routing-Update Algorithms) requires that they be able to compare the private-link route with the going-around-the-long-way
route. But this requires a global picture of relative routing costs, which, as we shall see, almost certainly does not exist. The two
departments are in different routing domains; if neighboring subnets at either end want to use the private link, then manual
configuration is likely the only option.

7.7.3 Subnets versus Switching


A frequent network design question is whether to have many small subnets or to instead have just a few (or even only one) larger
subnet. With multiple small subnets, IP routing would be used to interconnect them; the use of larger subnets would replace much
of that routing with LAN-layer communication, likely Ethernet switching. Debates on this route-versus-switch question have gone
back and forth in the networking community, with one aphorism summarizing a common view:

Switch when you can, route when you must


This aphorism reflects the idea that switching is faster, cheaper and easier to configure, and that subnet boundaries should be drawn
only where “necessary”.
Ethernet switching equipment is indeed generally cheaper than routing equipment, for the same overall level of features and
reliability. And traditional switching requires relatively little configuration, while to implement subnets not only must the subnets
be created by hand but one must also set up and configure the routing-update protocols. However, the price difference between
switching and routing is not always significant in the big picture, and the configuration involved is often straightforward.
Somewhere along the way, however, switching has acquired a reputation – often deserved – for being faster than routing. It is true
that routers have more to do than switches: they must decrement TTL, update the header checksum, and attach a new LAN header.
But these things are relatively minor: a larger reason many routers are slower than switches may simply be that they are inevitably
asked to serve as firewalls. This means “deep inspection” of every packet, eg comparing every packet to each of a large number of
firewall rules. The firewall may also be asked to keep track of connection state. All this drives down the forwarding rate, as
measured in packets-per-second.

Traditional switching scales remarkably well, but it does have limitations. First, broadcast packets must be forwarded throughout a
switched network; they do not, however, pass to different subnets. Second, LAN networks do not like redundant links (that is,
loops); while one can rely on the spanning-tree algorithm to eliminate these, that algorithm too becomes less efficient at larger
scales.
The rise of software-defined networking (2.8 Software-Defined Networking) has blurred the distinction between routing and
switching. The term “Layer 3 switch” is sometimes used to describe routers that in effect do not support all the usual firewall bells
and whistles. These are often SDN Ethernet switches (2.8 Software-Defined Networking) that are making forwarding decisions
based on the contents of the IP header. Such streamlined switch/routers may also be able to do most of the hard work in specialized
hardware, another source of speedup.
But SDN can do much more than IP-layer forwarding, by taking advantage of site-specific layout information. One application, of
a switch hierarchy for traffic entering a datacenter, appears in 2.8.1 OpenFlow Switches. Other SDN applications include enabling
Ethernet topologies with loops, offloading large-volume flows to alternative paths, and implementing policy-based routing as in 9.6
Routing on Other Attributes. Some SDN solutions involve site-specific programming, but others work more-or-less out of the box.
Locations with switch-versus-route issues are likely to turn increasingly to SDN in the future.

7.8: Network Address Translation
What do you do if your ISP assigns to you a single IPv4 address and you have two computers? The solution is Network Address
Translation, or NAT. NAT’s ability to “multiplex” an arbitrarily large number of individual hosts behind a single IPv4 address (or
small number of addresses) makes it an important tool in the conservation of IPv4 addresses. It also, however, enables an important
form of firewall-based security. It is documented in RFC 3022 [https://tools.ietf.org/html/rfc3022.html], where this is called NAPT,
or Network Address Port Translation.
The basic idea is that, instead of assigning each host at a site a publicly visible IPv4 address, just one such address is assigned to a
special device known as a NAT router. A NAT router sold for residential or small-office use is commonly simply called a “router”,
or (somewhat more precisely) a “residential gateway”. One side of the NAT router connects to the Internet; the other connects to
the site’s internal network. Hosts on the internal network are assigned private IP addresses (7.4 Special Addresses), typically of the
form 192.168.x.y or 10.x.y.z. Connections to internal hosts that originate in the outside world are banned. When an internal
machine wants to connect to the outside, the NAT router intercepts the connection, and forwards the connection’s packets after
rewriting the source address to make it appear they came from the NAT router’s own IP address, shown below as 200.1.2.37.

[Figure: the Internet connects to the NAT router at its public address 200.1.2.37; the router's inside interface, 10.0.0.1, attaches to the internal LAN, on which sit host A at 10.0.0.10, host B at 10.0.0.11 and host C at 10.0.0.12.]

The remote machine responds, sending its responses to the NAT router’s public IPv4 address. The NAT router remembers the
connection, having stored the connection information in a special forwarding table, and forwards the data to the correct internal
host, rewriting the destination-address field of the incoming packets.
The NAT forwarding table also includes port numbers. That way, if two internal hosts attempt to connect to the same external host,
the NAT router can tell which packets belong to which. For example, suppose internal hosts A and B each connect from port 3000
to port 80 on external hosts S and T, respectively. Here is what the NAT forwarding table might look like. No columns for the NAT
router’s own IPv4 addresses are needed; we shall let NR denote the router’s external address.

remote host   remote port   outside source port   inside host   inside port
S             80            3000                  A             3000
T             80            3000                  B             3000

A packet to S from ⟨A,3000⟩ would be rewritten so that the source was ⟨NR,3000⟩. A packet from ⟨S,80⟩ addressed to ⟨NR,3000⟩
would be rewritten and forwarded to ⟨A,3000⟩. Similarly, a packet from ⟨T,80⟩ addressed to ⟨NR,3000⟩ would be rewritten and
forwarded to ⟨B,3000⟩; the NAT table takes into account the source host and port as well as the destination.
Sometimes it is necessary for the NAT router to rewrite the internal-side port number as well; this happens if two internal hosts
want to connect, each from the same port, to the same external host and port. For example, suppose B now opens a connection to
⟨S,80⟩, also from inside port 3000. This time the NAT router must remap the port number, because that is the only way to
distinguish between packets from ⟨S,80⟩ back to A and to B. With B’s second connection’s internal port remapped from 3000 to
3001, the new table is

remote host   remote port   outside source port   inside host   inside port
S             80            3000                  A             3000
T             80            3000                  B             3000
S             80            3001                  B             3000
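As a data structure, the forwarding table amounts to a pair of dictionaries, with a fresh outside port allocated only on collision. The sketch below is our simplification; a real NAT implementation would also track the protocol (TCP versus UDP) and connection state:

    class NatTable:
        def __init__(self):
            self.out = {}    # (inside host, inside port, remote, rport) -> outside port
            self.back = {}   # (remote, rport, outside port) -> (inside host, inside port)

        def outbound(self, inside, iport, remote, rport):
            key = (inside, iport, remote, rport)
            if key not in self.out:
                oport = iport                          # keep the inside port if possible
                while (remote, rport, oport) in self.back:
                    oport += 1                         # collision: remap, as with B above
                self.out[key] = oport
                self.back[(remote, rport, oport)] = (inside, iport)
            return self.out[key]                       # rewrite source to (NR, oport)

        def inbound(self, remote, rport, oport):
            # rewrite the destination, or None: drop unsolicited inbound packets
            return self.back.get((remote, rport, oport))

    nat = NatTable()
    nat.outbound('A', 3000, 'S', 80)       # uses outside port 3000
    nat.outbound('B', 3000, 'T', 80)       # also 3000; different remote host
    nat.outbound('B', 3000, 'S', 80)       # remapped to 3001
    print(nat.inbound('S', 80, 3001))      # ('B', 3000)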

The NAT router does not create TCP connections between itself and the external hosts; it simply forwards packets (with rewriting).
The connection endpoints are still the external hosts S and T and the internal hosts A and B. However, NR might very well monitor
the TCP connections to know when they have closed, and so can be removed from the table. For UDP connections, NAT routers
typically remove the forwarding entry after some period of inactivity; see 11 UDP Transport, exercise 14.0.
NAT still works for some traffic without port numbers, such as network pings, though the above table is then not quite the whole
story. See 7.11 Internet Control Message Protocol.
Done properly, NAT improves the security of a site, by making it impossible for an external host to probe or connect to any of the
internal hosts. While this firewall feature is of great importance, essentially the same effect can be achieved without address
translation, and with public IPv4 addresses for all internal hosts, by having the router refuse to forward incoming packets that are
not part of existing connections. The router still needs to maintain a table like the NAT table above, in order to recognize such
packets. The address translation itself, in other words, is not the source of the firewall security. That said, it is hard for a NAT router
to “fail open”; ie to fail in a way that lets outside connections in. It is much easier for a non-NAT firewall to fail open.
For the common residential form of NAT router, see 7.10.1 NAT, DHCP and the Small Office.

7.8.1 NAT Problems


A NAT router’s refusal to allow inbound connections is a source of occasional frustration. We illustrate some of these frustrations
here, using Voice-over-IP (VoIP) and the call-setup protocol SIP (RFC 3261 [https://tools.ietf.org/html/rfc3261.html]). The basic
strategy is that each phone is associated with a remote phone server. These phone servers, because they have to be able to accept
incoming connections from anywhere, must not be behind NAT routers. The phones themselves, however, usually will be:

[Figure: phone1 sits behind NAT1 and phone2 behind NAT2; the two NAT routers, and also Server1 and Server2, connect directly to the Internet.]
For phone1 to call phone2, phone1 first contacts Server1, which then contacts Server2. So far, all is well. The final step is for
Server2 to contact phone2, which, however, cannot be done normally as NAT2 allows no inbound connections.
One common solution is for phone2 to maintain a persistent connection to Server2 (and ditto for phone1 and Server1). By having
these persistent phone-to-server connections, we can arrange for the phone to ring on incoming calls.
A second issue, somewhat particular to the SIP protocol, is that it is common for server and phone to prefer to use UDP port
5060 at both ends. For a single internal phone, it is likely that port 5060 will pass through without remapping, so the phone will
appear to be connecting from the desired port 5060. However, if there are two phones inside (not shown above), one of them will
appear to be connecting to the server from an alternative port. The solution here is to have the server tolerate such port remapping.
VoIP systems run into a much more serious problem with NAT, however. Once the call between phone1 and phone2 is set up, the
servers would prefer to step out of the loop, and have the phones exchange voice packets directly. The SIP protocol was designed to
handle this by having each phone report to its respective server the UDP socket (⟨IP address,port⟩ pair) it intends to use for the
voice exchange; the servers then report these phone sockets to each other, and from there to the opposite phones. This socket
information is rendered incorrect by NAT, however, certainly the IP address and quite likely the port as well. If only one of the
phones is behind a NAT firewall, it can initiate the voice connection to the other phone, but the other phone will see the voice
packets arriving from a different socket than promised and will likely not recognize them as part of the call. If both phones are
behind NAT firewalls, they will not be able to connect directly to one another at all. The common solution is for the VoIP server of
a phone behind a NAT firewall to remain in the communications path, forwarding packets to its hidden partner. This works, but
represents an unwanted server workload.
If a site wants to make it possible to allow external connections to hosts behind a NAT router or other firewall, one option is
tunneling. This is the creation of a “virtual LAN link” that runs on top of a TCP connection between the end user and one of the
site’s servers; the end user can thus appear to be on one of the organization’s internal LANs; see 3.1 Virtual Private Networks.
Another option is to “open up” a specific port: in essence, a static NAT-table entry is made connecting a specific port on the NAT
router to a specific internal host and port (usually the same port). For example, all UDP packets to port 5060 on the NAT router
might be forwarded to port 5060 on internal host A, even in the absence of any prior packet exchange. Gamers creating peer-to-
peer game connections must also usually engage in some port-opening configuration. The Port Control Protocol (RFC 6887
[https://tools.ietf.org/html/rfc6887.html]) is sometimes used for this.
NAT routers work very well when the communications model is of client-side TCP connections, originating from the inside and
with public outside servers as destination. The NAT model works less well for peer-to-peer networking, as with the gamers above,
where two computers, each behind a different NAT router, wish to establish a connection. Most NAT routers provide at least
limited support for “opening” access to a given internal ⟨host,port⟩ socket, by creating a semi-permanent forwarding-table entry.
See also 12.24 Exercises, exercise 2.5.
NAT routers also often have trouble with UDP protocols, due to the tendency for such protocols to have the public server reply
from a different port than the one originally contacted. For example, if host A behind a NAT router attempts to use TFTP (11.2
Trivial File Transport Protocol, TFTP), and sends a packet to port 69 of public server C, then C is likely to reply from some new
port, say 3000, and this reply is likely to be dropped by the NAT router as there will be no entry there yet for traffic from ⟨C,3000⟩.

7.8.2 Middleboxes
Firewalls and NAT routers are sometimes classed as middleboxes: network devices that block, throttle or modify traffic beyond
what is necessary for basic forwarding. Middleboxes play a very important role in network security, but they sometimes (as here
with VoIP) break things. The word “middlebox” (versus “router” or “firewall”) usually has a pejorative connotation; middleboxes
have, in some circles, acquired a rather negative reputation.
NAT routers’ interference with VoIP, above, is a direct consequence of their function: NAT handles connections from inside to
outside quite well, but the NAT mechanism offers no support for connections from one inside to another inside. Sometimes,
however, middleboxes block traffic when there is no technical reason to do so, simply because correct behavior has not been widely
implemented. As an example, the SCTP protocol, 12.22.2 SCTP, has seen very limited use despite some putative advantages over
TCP, largely due to lack of NAT-router support. SCTP cannot be used by residential users because the middleboxes have not kept
up.
A third category of middlebox-related problems is overzealous blocking in the name of security. SCTP runs into this problem as
well, though not quite as universally: a few routers simply drop all SCTP packets because they represent an “unknown” – and
therefore suspect – type of traffic. There is a place for this block-by-default approach. If a datacenter firewall blocks all inbound
TCP traffic except to port 80 (the HTTP port), and if SCTP is not being used within the datacenter intentionally, it is hard to argue
against blocking all inbound SCTP traffic. But if the frontline router for home or office users blocks all outbound SCTP traffic,
then the users cannot use SCTP.
A consequence of overzealous blocking is that it becomes much harder to introduce new protocols. If a new protocol is blocked for
even a small fraction of potential users, it is just not worth the effort. See also the discussion at 12.22.4 QUIC Revisited; the design
of QUIC includes several elements to mitigate middlebox problems.
For another example of overzealous blocking by middleboxes, with the added element of spoofed TCP RST packets, see the
sidebar at 14.8.3 Explicit Congestion Notification (ECN).

7.9: DNS
The Domain Name System, DNS, is an essential companion protocol to IPv4 (and IPv6); an overview can be found in RFC 1034
[https://tools.ietf.org/html/rfc1034.html]. It is DNS that permits users the luxury of not needing to remember numeric IP addresses.
Instead of 162.216.18.28, a user can simply enter intronetworks.cs.luc.edu [http://intronetworks.cs.luc.edu], and DNS will take care
of looking up the name and retrieving the corresponding address. DNS also makes it easy to move services from one server to
another with a different IP address; as users will locate the service by DNS name and not by IP address, they do not need to be
notified.
While DNS supports a wide variety of queries, for the moment we will focus on queries for IPv4 addresses, or so-called A
records. The AAAA record type is used for IPv6 addresses, and, internally, the NS record type is used to identify the “name
servers” that answer DNS queries.
While a workstation can use TCP/IP without DNS, users would have an almost impossible time finding anything, and so the core
startup configuration of an Internet-connected workstation almost always includes the IP address of its DNS server (see 7.10
Dynamic Host Configuration Protocol (DHCP) below for how startup configurations are often assigned).
Most DNS traffic today is over UDP, although a TCP option exists. Due to the much larger response sizes, TCP is often necessary
for DNSSEC (22.12 DNSSEC).
DNS is distributed, meaning that each domain is responsible for maintaining its own DNS servers to translate names to addresses.
(DNS, in fact, is a classic example of a highly distributed database where each node maintains a relatively small amount of data.) It
is hierarchical as well; for the DNS name intronetworks.cs.luc.edu the levels of the hierarchy are
edu: the top-level domain (TLD) for educational institutions in the US
luc: Loyola University Chicago
cs: The Loyola Computer Science Department
intronetworks: a hostname associated to a specific IP address
The hierarchy of DNS names (that is, the set of all names and suffixes of names) forms a tree, but it is not only leaf nodes that
represent individual hosts. In the example above, domain names luc.edu [http://luc.edu] and cs.luc.edu [http://cs.luc.edu] happen to
be valid hostnames as well.
The DNS hierarchy is in a great many cases not very deep, particularly for DNS names assigned to commercial websites. Such
domain names are often simply the company name (or a variant of it) followed by the top-level domain (often .com ). Still,
internally most organizations have many individually named behind-the-scenes servers with three-level (or more) domain names;
sometimes some of these can be identified by viewing the source of the web page and searching it for domain names.
Top-level domains are assigned by ICANN [www.icann.org/]. The original top-level domains were seven three-letter domains –
.com , .net , .org , .int , .edu , .mil and .gov – and the two-letter country-code domains (eg .us, .ca, .mx).
Now there are hundreds of non-country top-level domains, such as .aero , .biz , .info , and, apparently, .wtf
[en.Wikipedia.org/wiki/.wtf]. Domain names (and subdomain names) can also contain unicode characters, so as to support national
alphabets. Some top-level domains are generic, meaning anyone can apply for a subdomain although there may be qualifying
criteria. Other top-level domains are sponsored, meaning the sponsoring organization determines who can be assigned a
subdomain, and so the qualifying criteria can be a little more arbitrary.
ICANN still must approve all new top-level domains. Applications are accepted only during specific intervals; the application fee
for the 2012 interval was US$185,000. The actual leasing of domain names to companies and individuals is done by organizations
known as domain registrars who work under contract with ICANN.
The full tree of all DNS names and prefixes is divided administratively into zones: a zone is an independently managed subtree,
minus any sub-subtrees that have been placed – by delegation – into their own zone. Each zone has its own root DNS name that is a
suffix of every DNS name in the zone. For example, the luc.edu zone contains most of Loyola’s DNS names, but
cs.luc.edu has been spun off into its own zone. A zone cannot be the disjoint union of two subtrees; that is,
cs.luc.edu and math.luc.edu must be two distinct zones, unless both remain part of their parent zone.
A zone can define DNS names more than one level deep. For example, the luc.edu zone can define records for the
luc.edu name itself, for names with one additional level such as www.luc.edu , and for names with two additional levels
such as www.cs.luc.edu . That said, it is common for each zone to handle only one additional level, and to create subzones
for deeper levels.
Each zone has its own authoritative nameservers for the zone, which are charged with maintaining the records – known as
resource records, or RRs – for that zone. Each zone must have at least two nameservers, for redundancy. IPv4 addresses are stored
as so-called A records, for Address. Information about how to find sub-zones is stored as NS records, for Name Server. Additional
resource-record types are discussed at 7.8.2 Other DNS Records. An authoritative nameserver need not be part of the organization
that manages the zone, and a single server can be the authoritative nameserver for multiple unrelated zones. For example, many
domain registrars maintain single nameservers that handle DNS queries for all their domain customers who do not wish to set up
their own nameservers.
The root nameservers handle the zone that is the root of the DNS tree; that is, that is represented by the DNS name that is the
empty string. As of 2019, there are thirteen of them. The root nameservers contain only NS records, identifying the nameservers for
all the immediate subzones. Each top-level domain is its own such subzone. The IP addresses of the root nameservers are widely
distributed. Their DNS names (which are only of use if some DNS lookup mechanism is already present) are
a.root-servers.net through m.root-servers.net . These names today correspond not to individual machines but
to clusters of up to hundreds of servers.
We can now put together a first draft of a DNS lookup algorithm. To find the IP address of intronetworks.cs.luc.edu
[http://intronetworks.cs.luc.edu], a host first contacts a root nameserver (at a known address) to find the nameserver for the edu
zone; this involves the retrieval of an NS record. The edu nameserver is then queried to find the nameserver for the
luc.edu zone, which in turn supplies the NS record giving the address of the cs.luc.edu zone. This last has an A record
for the actual host. (This example is carried out in detail below.)
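The draft algorithm can be tried out directly with the third-party dnspython package. The sketch below is ours and is deliberately minimal: it ignores CNAMEs, truncation and referrals without glue, and hard-codes the address of a.root-servers.net:

    import dns.message, dns.query, dns.rdatatype

    def iterative_lookup(name, server='198.41.0.4'):   # a.root-servers.net
        while True:
            query = dns.message.make_query(name, dns.rdatatype.A)
            resp = dns.query.udp(query, server, timeout=3)
            for rrset in resp.answer:
                if rrset.rdtype == dns.rdatatype.A:
                    return rrset[0].address             # found the A record
            # otherwise a referral: use a "glue" A record from the additional
            # section as the next, more specific, nameserver to ask
            for rrset in resp.additional:
                if rrset.rdtype == dns.rdatatype.A:
                    server = rrset[0].address
                    break
            else:
                return None    # no glue; a real resolver would resolve the NS name

    print(iterative_lookup('intronetworks.cs.luc.edu'))

Run against the root, this asks in turn a root server, an edu server, a luc.edu server and finally a cs.luc.edu server, following NS referrals at each step.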

DNS Policing

It is sometimes suggested that if a site is engaged in illegal activity or copyright infringement, such as thepiratebay.se
[thepiratebay.se/], its domain name should be seized. The problem with this strategy is that it is straightforward for users to set
up “nonstandard” nameservers (for example the Gnu Name System, GNS [gnunet.org/gns]) that continue to list the banned
site.

The level-by-level strategy above has a defect: it would send much too much traffic to the root nameservers. Instead, there exists a great number
of local and semi-local “DNS servers” that we will call resolvers (though, confusingly, these are sometimes also known as
“nameservers” or, more precisely, non-authoritative nameservers). A resolver is a host charged with looking up DNS names on
behalf of a user or set of users, and returning corresponding addresses; for this reason they are sometimes called recursive
nameservers (we return to recursive DNS lookups below).
Most ISPs and companies provide a resolver to handle the DNS needs of their customers and employees; we will refer to these as
site resolvers. The IP addresses of these site resolvers are generally supplied via DHCP options (7.10 Dynamic Host Configuration
Protocol (DHCP)); such resolvers are thus the default choice for DNS services.
Sometimes, however, users elect to use a DNS resolver not provided by their ISP or company; there are a number of public DNS
servers (that is, resolvers) available. Such resolvers generally serve much larger areas. Common choices include OpenDNS
[en.Wikipedia.org/wiki/OpenDNS], Google DNS [en.Wikipedia.org/wiki/Google_Public_DNS] (at 8.8.8.8), Cloudflare
[blog.cloudflare.com/dns-resolver-1-1-1-1/] (at 1.1.1.1) and the Gnu Name System mentioned in the sidebar above, though there
are many others. Searching for “public DNS server” turns up lists of them.
One advantage of using a public DNS server is that your local ISP can no longer track your DNS queries. On the other hand, the
public server now can, so this becomes a matter of which you trust more (or less).
Some public DNS servers provide additional services, such as automatically filtering out domain names associated with security
risks, or content inappropriate for young users. Sometimes there is a fee for this service.
A resolver uses the level-by-level algorithm above as a fallback, but also keeps a large cache of all the domain names (and other
record types) that have been requested. A lifetime for each cache entry is provided by that entry’s authoritative nameserver; these
lifetimes are typically on the order of several days. Every DNS record has a TTL (time-to-live) value representing its maximum
cache lifetime.
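
As a toy sketch of this caching, keyed by ⟨name,type⟩, with each entry’s expiry set from the TTL supplied in the authoritative answer (real resolvers also cache negative answers and bound the total cache size):

import time

cache = {}    # (name, rtype) -> (records, expiry)

def cache_get(name, rtype):
    entry = cache.get((name, rtype))
    if entry and entry[1] > time.time():
        return entry[0]
    return None          # missing or expired: fall back to the full lookup

def cache_put(name, rtype, records, ttl):
    cache[(name, rtype)] = (records, time.time() + ttl)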

If I send a query to Loyola’s site resolver for google.com , it is almost certainly in the cache. If I send a query for the
misspelling googel.com [http://googel.com], this may not be in the cache, but the .com top-level nameserver almost certainly is
in the cache. From that nameserver my local resolver finds the nameserver for the googel.com zone, and from that finds the
IP address of the googel.com host.
Applications almost always invoke DNS through library calls, such as Java’s InetAddress.getByName() . The library
forwards the query to the system-designated resolver (though browsers sometimes offer other DNS options; see 22.12.4 DNS over
HTTPS). We will return to DNS library calls in 11.1.3.3 The Client and 12.6.1 The TCP Client.
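
In Python the corresponding standard-library calls are socket.gethostbyname() and the more general socket.getaddrinfo(); both hand the query to the system-designated resolver:

import socket

print(socket.gethostbyname('intronetworks.cs.luc.edu'))      # one IPv4 address
for family, socktype, proto, canon, sockaddr in socket.getaddrinfo('cs.luc.edu', 80):
    print(sockaddr)        # every address the resolver returns, IPv4 and IPv6
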
On unix-based systems, traditionally the IPv4 addresses of the local DNS resolvers were kept in a file /etc/resolv.conf .
Typically this file was updated with the addresses of the current resolvers by DHCP (7.10 Dynamic Host Configuration Protocol
(DHCP)), at the time the system received its IPv4 address. It is possible, though not common, to create special entries in
/etc/resolv.conf so that queries about different domains are sent to different resolvers, or so that single-level hostnames
have a domain name appended to them before lookup. On Windows, similar functionality can be achieved through settings on the
DNS tab within the Network Connections applet.
Recent systems often run a small “stub” resolver locally (eg Linux’s dnsmasq [en.Wikipedia.org/wiki/Dnsmasq]); such resolvers
are sometimes also called DNS forwarders. The entry in /etc/resolv.conf is then an IPv4 address of localhost
(sometimes 127.0.1.1 rather than 127.0.0.1). Such a stub resolver would, of course, still need access to the addresses of site or
public resolvers; sometimes these addresses are provided by static configuration and sometimes by DHCP (7.10 Dynamic Host
Configuration Protocol (DHCP)).
If a system running a stub resolver then runs internal virtual machines, it is usually possible to configure everything so that the
virtual machines can be given an IP address of the host system as their DNS resolver. For example, often virtual machines are
assigned IPv4 addresses on a private subnet and connect to the outside world using NAT (7.7 Network Address Translation). In
such a setting, the virtual machines are given the IPv4 address of the host system interface that connects to the private subnet. It is
then necessary to ensure that, on the host system, the local resolver accepts queries sent not only to the designated loopback address
but also to the host system’s private-subnet address. (Generally, local resolvers do not accept requests arriving from externally
visible addresses.)
When someone submits a query for a nonexistent DNS name, the resolver is supposed to return an error message, technically
known as NXDOMAIN (Non eXistent Domain). Some resolvers, however, have been configured to return the IP address of a
designated web server; this is particularly common for ISP-provided site resolvers. Sometimes the associated web page is meant to
be helpful, and sometimes it presents an offer to buy the domain name from a registrar. Either way, additional advertising may be
displayed. Of course, this is completely useless to users who are trying to contact the domain name in question via a protocol (ssh,
smtp) other than http.
At the DNS protocol layer, a DNS lookup query can be either recursive or non-recursive. If A sends to B a recursive query to
resolve a given DNS name, then B takes over the job until it is finally able to return an answer to A. If the query is non-recursive,
on the other hand, then if B is not an authoritative nameserver for the DNS name in question it returns either a failure notice or an
NS record for the sub-zone that is the next step on the path. Almost all DNS requests from hosts to their site or public resolvers are
recursive.
A basic DNS response consists of an ANSWER section, an AUTHORITY section and, optionally, an ADDITIONAL section.
Generally a response to a lookup of a hostname contains an ANSWER section consisting of a single A record, representing a single
IPv4 address. If a site has multiple servers that are entirely equivalent, however, it is possible to give them all the same hostname
by configuring the authoritative nameserver to return, for the hostname in question, multiple A records listing, in turn, each of the
server IPv4 addresses. This is sometimes known as round-robin DNS. It is a simple form of load balancing; see also 18.9.5
loadbalance31.py. Consecutive queries to the nameserver should return the list of A records in different orders; ideally the same
should also happen with consecutive queries to a local resolver that has the hostname in its cache. It is also common for a single
server, with a single IPv4 address, to be identified by multiple DNS names; see the next section.
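
One way to observe the rotation, for a hypothetical round-robin name balanced.example.com with several A records, is simply to repeat the lookup and watch the order of the returned address list (whether the order actually varies depends on the resolver and its cache):

import socket

for i in range(3):
    name, aliases, addresses = socket.gethostbyname_ex('balanced.example.com')
    print(addresses)       # the full RRset; ideally in a different order each time
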
The response AUTHORITY section contains the DNS names of the authoritative nameservers responsible for the original DNS
name in question. The ADDITIONAL section contains information the sender thinks is related; for example, this section often
contains A records for the authoritative nameservers.
The Tor Project [en.Wikipedia.org/wiki/Tor_(a...mity_network)] uses DNS-like names that end in “.onion”. While these are not
true DNS names in that they are not managed by the DNS hierarchy, they do work as such for Tor users; see RFC 7686
[https://tools.ietf.org/html/rfc7686.html]. These names follow an unusual pattern: the next level of name is an 80-bit hash of the
site’s RSA public key (22.9.1 RSA), converted to sixteen ASCII bytes. For example, 3g2upl4pq6kufc4m.onion is apparently the
Tor address for the search engine duckduckgo.com [https://duckduckgo.com/]. Unlike DuckDuckGo, many sites try different RSA
keys until they find one where at least some initial prefix of the hash looks more or less meaningful; for example,
nytimes2tsqtnxek.onion . Facebook got very lucky [lists.torproject.org/piperma...r/035413.html] in finding an RSA key
whose corresponding Tor address is facebookcorewwwi.onion (though it is sometimes said that fortune is infatuated with
the wealthy). This naming strategy is a form of cryptographically generated addresses; for another example see 8.6.4 Security
and Neighbor Discovery. The advantage of this naming strategy is that you don’t need a certificate authority (22.10.2.1 Certificate
Authorities) to verify a site’s RSA key; the site name does it for you.
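
As a sketch of this derivation, assuming (as in Tor’s original “v2” specification) that the hash is SHA-1 over the DER-encoded public key, the sixteen-character name can be computed as follows; der_pubkey is assumed to hold the key bytes:

import base64, hashlib

def onion_v2_address(der_pubkey):
    digest = hashlib.sha1(der_pubkey).digest()[:10]    # first 80 bits of the hash
    # 80 bits base32-encode to exactly sixteen ASCII characters
    return base64.b32encode(digest).decode('ascii').lower() + '.onion'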

7.8.1 nslookup (and dig)


Let us trace a non-recursive lookup of intronetworks.cs.luc.edu , using the nslookup
[en.Wikipedia.org/wiki/Nslookup] tool. The nslookup tool is time-honored, but also not completely up-to-date, so we also include
examples using the dig [en.Wikipedia.org/wiki/Dig_(command)] utility (supposedly an acronym for “domain Internet groper”).
Lines we type in nslookup ’s interactive mode begin below with the prompt “>”; the shell prompt is “#”. All dig
commands are typed directly at the shell prompt.
The first step is to look up the IP address of the root nameserver a.root-servers.net . We can do this with a regular call to
nslookup or dig , we can look this up in our nameserver’s configuration files, or we can search for it on the Internet. The
address is 198.41.0.4.
We now send our nonrecursive query to this address. The presence of the single hyphen in the nslookup command line below
means that we want to use 198.41.0.4 as the nameserver rather than as the thing to be looked up; dig has places on the
command line for both the nameserver (following the @ ) and the DNS name. For both commands, we use the norecurse
option to send a nonrecursive query.

# nslookup -norecurse - 198.41.0.4


> intronetworks.cs.luc.edu
*** Can't find intronetworks.cs.luc.edu: No answer

# dig @198.41.0.4 intronetworks.cs.luc.edu +norecurse

These fail because by default nslookup and dig ask for an A record. What we want is an NS record: the name of the
nameserver for the next zone down, which we then ask. (We can tell the dig query failed to find an A record because there are zero records in its ANSWER section.)

> set query=ns


> intronetworks.cs.luc.edu
edu nameserver = a.edu-servers.net
...
a.edu-servers.net internet address = 192.5.6.30

# dig @198.41.0.4 intronetworks.cs.luc.edu NS +norecurse


;; AUTHORITY SECTION:
edu. 172800 IN NS b.edu-servers.net.
;; ADDITIONAL SECTION:
b.edu-servers.net. 172800 IN A 192.33.14.30

The full responses in each case are a list of all nameservers for the .edu zone; we list only the first response above. Note that the
full DNS name intronetworks.cs.luc.edu in the query here is not an exact match for the DNS name .edu in the
resource record returned; the latter is a suffix of the former. Some newer resolvers send just the .edu part, to limit the user’s
privacy exposure.
We send the next NS query to a.edu-servers.net (which does appear in the full dig answer).

# nslookup -query=ns -norecurse - 192.5.6.30


> intronetworks.cs.luc.edu
...
Authoritative answers can be found from:
luc.edu nameserver = bcdnswt1.it.luc.edu.
bcdnswt1.it.luc.edu internet address = 147.126.64.64

# dig @192.5.6.30 intronetworks.cs.luc.edu NS +norecurse


;; AUTHORITY SECTION:
luc.edu. 172800 IN NS bcdnswt1.it.luc.edu.
;; ADDITIONAL SECTION:
bcdnswt1.it.luc.edu. 172800 IN A 147.126.64.64

(Again, we show only one of several luc.edu nameservers returned). We continue.

# nslookup -query=ns -norecurse - 147.126.64.64


> intronetworks.cs.luc.edu
...
Authoritative answers can be found from:
cs.luc.edu nameserver = ns1.cs.luc.edu.
ns1.cs.luc.edu internet address = 147.126.2.44

# dig @147.126.64.64 intronetworks.cs.luc.edu NS +norecurse


;; AUTHORITY SECTION:
cs.luc.edu. 86400 IN NS ns1.cs.luc.edu.
;; ADDITIONAL SECTION:
ns1.cs.luc.edu. 86400 IN A 147.126.2.44

We now ask this last nameserver, for the cs.luc.edu zone, for the A record:

# nslookup -query=A -norecurse - 147.126.2.44


> intronetworks.cs.luc.edu
...
intronetworks.cs.luc.edu canonical name = linode1.cs.luc.edu.
Name: linode1.cs.luc.edu
Address: 162.216.18.28

# dig @147.126.2.44 intronetworks.cs.luc.edu A +norecurse


;; ANSWER SECTION:
intronetworks.cs.luc.edu. 300 IN A 162.216.18.28

This is the first time we get an ANSWER section (versus the AUTHORITY section).

Here we get a canonical name, or CNAME, record. The server that hosts intronetworks.cs.luc.edu also hosts several
other websites, with different names; for example, introcs.cs.luc.edu [introcs.cs.luc.edu/] (at least as of 2015). This is known as
virtual hosting. Rather than provide separate A records for each website name, DNS was set up to provide a CNAME record for
each website name pointing to a single physical server name linode1.cs.luc.edu . Only one A record is then needed, for
this server.
The nslookup request for an A record returned instead the CNAME record, together with the A record for that CNAME (this
is the 162.216.18.28 above). This is done for convenience.
Note that the IPv4 address here, 162.216.18.28, is unrelated to Loyola’s own IPv4 address block 147.126.0.0/16. The server
linode1.cs.luc.edu is managed by an external provider; there is no connection between the DNS name hierarchy and the
IP address hierarchy.
Finally, if we look up both www.cs.luc.edu and cs.luc.edu , we see they resolve to the same address. The use of
www as a hostname for a domain’s webserver is often considered unnecessary and old-fashioned; many users prefer the shorter,
“naked” domain name, eg cs.luc.edu .
It might be tempting to create a CNAME record for the naked domain, cs.luc.edu , pointing to the full hostname
www.cs.luc.edu . However, RFC 1034 [https://tools.ietf.org/html/rfc1034.html] does not allow this:

If a CNAME RR is present at a node, no other data should be present; this ensures that the data for a canonical name and its aliases cannot be different.
There are, however, several other DNS data records for cs.luc.edu : an NS record (above), a SOA, or Start of Authority,
record containing various administrative data such as the expiration time, and an MX record, discussed in the following section. All
this makes www.cs.luc.edu and cs.luc.edu ineluctably quite different. RFC 1034
[https://tools.ietf.org/html/rfc1034.html] adds, “this rule also insures that a cached CNAME can be used without checking with an
authoritative server for other RR types.”
A better way to create a naked-domain record, at least from the perspective of DNS, is to give it its own A record. This does mean
that, if the webserver address changes, there are now two DNS records that need to be updated, but this is manageable.
Recently ANAME records have been proposed to handle this issue; an ANAME is like a limited CNAME not subject to the RFC
1034 [https://tools.ietf.org/html/rfc1034.html] restriction above. An ANAME record for a naked domain, pointing to another
hostname rather than to an address, is legal. See the Internet draft draft-hunt-dnsop-aname [https://tools.ietf.org/html/draft-hunt-dnsop-
aname]. Some large CDNs (1.12.2 Content-Distribution Networks) already implement similar DNS tweaks internally. This does
not require end-user awareness; the user requests an A record and the ANAME is resolved at the CDN side.
Finally, there is also an argument, at least when HTTP (web) traffic is involved, that the www not be deprecated, and that the naked
domain should instead be redirected, at the HTTP layer, to the full hostname. This simplifies some issues; for example, you now
have only one website, rather than two. You no longer have to be concerned with the fact that HTTP cookies with and without the
“www” are different. And some CDNs may not be able to handle website failover to another server if the naked domain is reached
via an A record. But none of these are DNS issues.

7.8.2 Other DNS Records


Besides address lookups, DNS also supports a few other kinds of searches. The best known is probably reverse DNS, which takes
an IP address and returns a name. This is slightly complicated by the fact that one IP address may be associated with multiple DNS
names. What DNS does in this case is to return the canonical name, or CNAME; a given address can have only one CNAME.
Given an IPv4 address, say 147.126.1.230, the idea is to reverse it and append to it the suffix in-addr.arpa .

230.1.126.147.in-addr.arpa

There is a DNS name hierarchy for names of this form, with zones and authoritative servers. If all this has been configured – which
it often is not, especially for user workstations – a request for the PTR record corresponding to the above should return a DNS
hostname. In the case above, the name luc.edu [http://luc.edu] is returned (at least as of 2018).
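
In Python, constructing the reverse name and performing the PTR lookup look like this (the lookup succeeds only where the reverse zone has been configured):

import socket

addr = '147.126.1.230'
rev = '.'.join(reversed(addr.split('.'))) + '.in-addr.arpa'
print(rev)                             # 230.1.126.147.in-addr.arpa
print(socket.gethostbyaddr(addr)[0])   # the PTR lookup itself; luc.edu, per the text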

PTR records are the only DNS records to have an entirely separate hierarchy; other DNS types fit into the “standard” hierarchy.
For example, DNS also supports MX, or Mail eXchange, records, meant to map a domain name (which might not correspond to
any hostname, and, if it does, is more likely to correspond to the name of a web server) to the hostname of a server that accepts
email on behalf of the domain. In effect this allows an organization’s domain name, eg luc.edu , to represent both a web server
and, at a different IP address, an email server. MX records can even represent a set of IP addresses that accept email.
DNS has from the beginning supported TXT records, for arbitrary text strings. The email Sender Policy Framework (RFC 7208
[https://tools.ietf.org/html/rfc7208.html]) was developed to make it harder for email senders to pretend to be a domain they are not;
this involves inserting so-called SPF records as DNS TXT records (or as substrings of TXT records, if TXT is also being used for
something else).
For example, a DNS query for TXT records of google.com (not gmail.com!) might yield (2018)

google.com text = "docusign=05958488-4752-4ef2-95eb-aa7ba8a3bd0e"


google.com text = "v=spf1 include:_spf.google.com ~all"

The SPF system is interested in only the second record; the “v=spf1” specifies the SPF version. This second record tells us to look
up _spf.google.com . That lookup returns

text = "v=spf1 include:_netblocks.google.com include:_netblocks2.google.com include:_n

Lookup of _netblocks.google.com then returns

text = "v=spf1 ip4:64.233.160.0/19 ip4:66.102.0.0/20 ip4:66.249.80.0/20 ip4:72.14.192

If a host connects to an email server, and declares that it is delivering mail from someone at google.com, then the host’s IP address
should occur in the list above, or in one of the other included lists. If it does not, there is a good chance the email represents spam.
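
As an illustration of how such records might be consumed, here is a rough sketch, using the third-party dnspython package, that gathers a domain’s ip4: blocks (following include:) and tests a hypothetical sender address against them. A real SPF checker must also handle the a, mx and redirect= mechanisms and the qualifiers such as ~all.

import ipaddress
import dns.resolver                    # third-party dnspython

def spf_ip4_networks(domain):
    nets = []
    for rdata in dns.resolver.resolve(domain, 'TXT'):
        txt = b''.join(rdata.strings).decode()
        if not txt.startswith('v=spf1'):
            continue                   # not the SPF record
        for term in txt.split():
            if term.startswith('ip4:'):
                nets.append(ipaddress.ip_network(term[4:]))
            elif term.startswith('include:'):
                nets.extend(spf_ip4_networks(term[8:]))
    return nets

sender = ipaddress.ip_address('64.233.160.5')    # hypothetical connecting host
print(any(sender in net for net in spf_ip4_networks('google.com')))
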
Each DNS record (or “resource record”) has a name (eg cs.luc.edu ) and a type (eg A or AAAA or NS or MX ). Given
a name and type, the set of matching resource records is known as the RRset for that name and type (technically there is also a
“class”, but the class of all the DNS records we are interested in is IN , for Internet). When a nameserver responds to a DNS
query, what is returned (in the Answer section) is always an entire RRset: the RRset of all resource records matching the name and
type contained in the original query.
In many cases, RRsets have a single member, because many hosts have a single IPv4 address. However, this is not universal. We
saw above the example of a single DNS name having multiple A records when round-robin DNS is used. A single DNS name
might also have separate A records for the host’s public and private IPv4 addresses. And perhaps most MX-record (Mail eXchange)
RRsets have multiple entries, as organizations often prefer, for redundancy, to have more than one server that can receive email.

7.8.3 DNS Cache Poisoning


The classic DNS security failure, known as cache poisoning, occurs when an attacker has been able to convince a DNS resolver
that the address of, say, www.example.com is something other than what it really is. A successful attack means the attacker can
direct traffic meant for www.example.com to the attacker’s own, malicious site.
The most basic cache-poisoning strategy is to send a stream of DNS reply packets to the resolver which declare that the IP address
of www.example.com is the attacker’s chosen IP address. The source IP address of these packets should be spoofed to be that of the
example.com authoritative nameserver; such spoofing is relatively easy using UDP. Most of these reply packets will be ignored, but
the hope is that one will arrive shortly after the resolver has sent a DNS request to the example.com authoritative nameserver,
and interprets the spoofed reply packet as a legitimate reply.
To prevent this, DNS requests contain a 16-bit ID field; the DNS response must echo this back. The response must also come from
the correct port. This leaves the attacker to guess 32 bits in all, but often the ID field (and even more often the port) can be guessed
based on past history.

Another approach requires the attacker to wait for the target resolver to issue a legitimate request to the attacker’s site,
attacker.com . The attacker then piggybacks in the ADDITIONAL section of the reply message an A record for
example.com pointing to the attacker’s chosen bad IP address for this site. The hope is that the receiving resolver will place
these A records from the ADDITIONAL section into its cache without verifying them further and without noticing they are
completely unrelated. Once upon a time, such DNS resolver behavior was common.
Most newer DNS resolvers carefully validate the replies: the ID field must match, the source port must match, and any received
DNS records in the ADDITIONAL section must match, at a minimum, the DNS zone of the request. Additionally, the request ID
field and source port should be chosen pseudorandomly in a secure fashion. For additional vulnerabilities, see RFC 3833
[https://tools.ietf.org/html/rfc3833.html].
The central risk in cache poisoning is that a resolver can be tricked into supplying users with invalid DNS records. A closely
related risk is that an attacker can achieve the same result by spoofing an authoritative nameserver. Both of these risks can be
mitigated through the use of the DNS security extensions, known as DNSSEC. Because DNSSEC makes use of public-key
signatures, we defer coverage to 22.12 DNSSEC.

7.8.4 DNS and CDNs


DNS is often pressed into service by CDNs (1.12.2 Content-Distribution Networks) to identify their closest “edge” server to a
given user. Typically this involves the use of geoDNS, a slightly nonstandard variation of DNS. When a DNS query comes in to
one of the CDN’s authoritative nameservers, that server
1. looks up the approximate location of the client (10.4.4 IP Geolocation)
2. determines the closest edge server to that location
3. replies with the IP address of that closest edge server
This works reasonably well most of the time. However, the requesting client is essentially never the end user; rather, it is the DNS
resolver being used by the user. Typically such resolvers are the site resolvers provided by the user’s ISP or organization, and are
physically quite close to the user; in this case, the edge server identified above will be close to the user as well. However, when a
user has chosen a (likely remote) public DNS resolver, as above, the IP address returned for the CDN edge server will be close to
the DNS resolver but likely far from optimal for the end user.
One solution to this last problem is addressed by RFC 7871 [https://tools.ietf.org/html/rfc7871.html], which allows DNS resolvers
to include the IP address of the client in the request sent to the authoritative nameserver. For privacy reasons, usually only a prefix
of the user’s IP address is included, perhaps /24. Even so, the user’s privacy is at least partly compromised. For this reason, RFC 7871
[https://tools.ietf.org/html/rfc7871.html] recommends that the feature be disabled by default, and only enabled after careful analysis
of the tradeoffs.
A user who is concerned about the privacy issue can – in theory – configure their own DNS software to include this RFC 7871
[https://tools.ietf.org/html/rfc7871.html] option with a zero-length prefix of the user’s IP address, which conveys no address
information. The user’s resolver will then not change this to a longer prefix.
Use of this option also means that the DNS resolver receiving a user query about a given hostname can no longer simply return a
cached answer from a previous lookup of the hostname. Instead, the resolver needs to cache separately each ⟨hostname,prefix⟩ pair
it handles, where the prefix is the prefix of the user’s IP address forwarded to the authoritative nameserver. This has the potential to
increase the cache size by several orders of magnitude, which may thereby enable some cache-overflow attacks.

7.10: Address Resolution Protocol - ARP
If a host or router A finds that the destination IP address D_IP matches the network address of one of its interfaces, it is to
deliver the packet via the LAN (probably Ethernet). This means looking up the LAN address (MAC address) D_LAN corresponding
to D_IP. How does it do this?
One approach would be via a special server, but the spirit of early IPv4 development was to avoid such servers, for both cost and
reliability issues. Instead, the Address Resolution Protocol (ARP) is used. This is our first protocol that takes advantage of the
existence of LAN-level broadcast; on LANs without physical broadcast (such as ATM), some other mechanism (usually involving
a server) must be used.
The basic idea of ARP is that the host A sends out a broadcast ARP query or “who-has D_IP?” request, which includes A’s own IPv4
and LAN addresses. All hosts on the LAN receive this message. The host for whom the message is intended, D, will recognize that
it should reply, and will return an ARP reply or “is-at” message containing D_LAN. Because the original request contained A_LAN, D’s
response can be sent directly to A, that is, unicast.

[Diagram: hosts A, B, C and D on one LAN. A broadcasts “who-has D”; D replies to A, unicast, and includes its LAN address.]

Additionally, all hosts maintain an ARP cache, consisting of ⟨IPv4,LAN⟩ address pairs for other hosts on the network. After the
exchange above, A has ⟨D_IP,D_LAN⟩ in its table; anticipating that A will soon send it a packet to which it needs to respond, D also
puts ⟨A_IP,A_LAN⟩ into its cache.
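
The who-has/is-at exchange can be performed directly with the third-party scapy package (root privileges required); the target IP address below is hypothetical:

from scapy.all import ARP, Ether, srp

def arp_resolve(ip):
    # broadcast a who-has query; return the LAN address from the is-at reply
    answered, _ = srp(Ether(dst='ff:ff:ff:ff:ff:ff') / ARP(pdst=ip),
                      timeout=2, verbose=False)
    for query, reply in answered:
        return reply[ARP].hwsrc
    return None

print(arp_resolve('10.0.0.5'))
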
ARP-cache entries eventually expire. The timeout interval used to be on the order of 10 minutes, but Linux systems now use a
much smaller timeout (~30 seconds observed in 2012). Somewhere along the line, and probably related to this shortened timeout
interval, repeat ARP queries about a timed-out entry are first sent unicast, not broadcast, to the previous Ethernet address on
record. This cuts down on the total amount of broadcast traffic; LAN broadcasts are, of course, still needed for new hosts. The ARP
cache on a Linux system can be examined with the command ip -s neigh ; the corresponding Windows command is arp -a .
The above protocol is sufficient, but there is one further point. When A sends its broadcast “who-has D?” ARP query, all other
hosts C check their own cache for an entry for A. If there is such an entry (that is, if A_IP is found there), then the value for A_LAN is
updated with the value taken from the ARP message; if there is no pre-existing entry then no action is taken. This update process
serves to avoid stale ARP-cache entries, which can arise if a host has had its Ethernet interface replaced. (USB Ethernet
interfaces, in particular, can be replaced very quickly.)
ARP is quite an efficient mechanism for bridging the gap between IPv4 and LAN addresses. Nodes generally find out neighboring
IPv4 addresses through higher-level protocols, and ARP then quickly fills in the missing LAN address. However, in some
Software-Defined Networking (2.8 Software-Defined Networking) environments, the LAN switches and/or the LAN controller
may have knowledge about IPv4/LAN address correspondences, potentially making ARP superfluous. The LAN (Ethernet)
switching network might in principle even know exactly how to route via the LAN to a given IPv4 address, potentially making
LAN addresses unnecessary. At such a point, ARP may become an inconvenience. For an example of a situation in which it is
necessary to work around ARP, see 18.9.5 loadbalance31.py.

7.9.1 ARP Finer Points


Most hosts today implement self-ARP, or gratuitous ARP, on startup (or wakeup): when station A starts up it sends out an ARP
query for itself: “who-has A?”. Two things are gained from this: first, all stations that had A in their cache are now updated with A’s
most current A_LAN address, in case there was a change, and second, if an answer is received, then presumably some other host on
the network has the same IPv4 address as A.
Self-ARP is thus the traditional IPv4 mechanism for duplicate address detection. Unfortunately, it does not always work as well
as might be hoped; often only a single self-ARP query is sent, and if a reply is received then frequently the only response is to log
an error message; the host may even continue using the duplicate address! If the duplicate address was received via DHCP, below,
then the host is supposed to notify its DHCP server of the error and request a different IPv4 address.
RFC 5227 [https://tools.ietf.org/html/rfc5227.html] has defined an improved mechanism known as Address Conflict Detection, or
ACD. A host using ACD sends out three ARP queries for its new IPv4 address, spaced over a few seconds and leaving the ARP
field for the sender’s IPv4 address filled with zeroes. This last step means that any other host with that IPv4 address in its cache
will ignore the packet, rather than update its cache. If the original host receives no replies, it then sends out two more ARP queries
for its new address, this time with the ARP field for the sender’s IPv4 address filled in with the new address; this is the stage at
which other hosts on the network will make any necessary cache updates. Finally, ACD requires that hosts that do detect a
duplicate address must discontinue using it.
It is also possible for other stations to answer an ARP query on behalf of the actual destination D; this is called proxy ARP. An
early common scenario for this was when host C on a LAN had a modem connected to a serial port. In theory a host D dialing in to
this modem should be on a different subnet, but that requires allocation of a new subnet. Instead, many sites chose a simpler
arrangement. A host that dialed in to C’s serial port might be assigned IP address D_IP, from the same subnet as C. C would be
configured to route packets to D; that is, packets arriving from the serial line would be forwarded to the LAN interface, and packets
sent to C_LAN addressed to D_IP would be forwarded to D. But we also have to handle ARP, and as D is not actually on the LAN it
will not receive broadcast ARP queries. Instead, C would be configured to answer on behalf of D, replying with ⟨D_IP,C_LAN⟩. This
generally worked quite well.
Proxy ARP is also used in Mobile IP, for the so-called “home agent” to intercept traffic addressed to the “home address” of a
mobile device and then forward it (eg via tunneling) to that device. See 7.13 Mobile IP.
One delicate aspect of the ARP protocol is that stations are required to respond to a broadcast query. In the absence of proxies this
theoretically should not create problems: there should be only one respondent. However, there were anecdotes from the Elder Days
of networking when a broadcast ARP query would trigger an avalanche of responses. The protocol-design moral here is that
determining who is to respond to a broadcast message should be done with great care. (RFC 1122
[https://tools.ietf.org/html/rfc1122.html] section 3.2.2 addresses this same point in the context of responding to broadcast ICMP
messages.)
ARP-query implementations also need to include a timeout and some queues, so that queries can be resent if lost and so that a burst
of packets does not lead to a burst of queries. A naive ARP algorithm without these might be:

To send a packet to destination D_IP, see if D_IP is in the ARP cache. If it is, address the
packet to D_LAN; if not, send an ARP query for D
To see the problem with this approach, imagine that a 32 kB packet arrives at the IP layer, to be sent over Ethernet. It will be
fragmented into 22 fragments (assuming an Ethernet MTU of 1500 bytes), all sent at once. The naive algorithm above will likely
send an ARP query for each of these. What we need instead is something like the following:

To send a packet to destination D_IP:

If D_IP is in the ARP cache, send to D_LAN and return
If not, see if an ARP query for D_IP is pending;
if it is, put the current packet in a queue for D.
If there is no pending ARP query for D_IP, start one,
again putting the current packet in the (new) queue for D

We also need:

If an ARP query for some C_IP times out, resend it (up to a point)
If an ARP query for C_IP is answered, send off any packets in C’s queue
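
Here is a minimal single-threaded Python sketch of this queue-based logic; send_frame() and send_arp_query() are hypothetical stand-ins for the actual link-layer operations, and the timeout/retransmission bookkeeping is left out:

arp_cache = {}        # D_IP -> D_LAN
pending = {}          # D_IP -> list of packets awaiting an ARP reply

def ip_send(dest_ip, packet):
    if dest_ip in arp_cache:
        send_frame(arp_cache[dest_ip], packet)     # usual case: cache hit
    elif dest_ip in pending:
        pending[dest_ip].append(packet)            # query already outstanding
    else:
        pending[dest_ip] = [packet]                # start a new query
        send_arp_query(dest_ip)

def arp_reply_received(ip, lan_addr):
    arp_cache[ip] = lan_addr
    for packet in pending.pop(ip, []):             # flush the queue for this IP
        send_frame(lan_addr, packet)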

7.9.2 ARP Security
Suppose A wants to log in to secure server S, using a password. How can B (for Bad) impersonate S?
Here is an ARP-based strategy, sometimes known as ARP Spoofing. First, B makes sure the real S is down, either by waiting until
scheduled downtime or by launching a denial-of-service attack against S.
When A tries to connect, it will begin with an ARP “who-has S?”. All B has to do is answer, “S is-at B”. There is a trivial way to
do this: B simply needs to set its own IP address to that of S.
A will connect, and may be convinced to give its password to B. B now simply responds with something plausible like “backup in
progress; try later”, and meanwhile uses A’s credentials against the real S.
This works even if the communications channel A uses is encrypted! If A is using the SSH protocol (22.10.1 SSH), then A will get
a message that the other side’s key has changed (B will present its own SSH key, not S’s). Unfortunately, many users (and even
some IT departments) do not recognize this as a serious problem. Some organizations – especially schools and universities – use
personal workstations with “frozen” configuration, so that the filesystem is reset to its original state on every reboot. Such systems
may be resistant to viruses, but in these environments the user at A will always get a message to the effect that S’s credentials are
not known.

7.9.3 ARP Failover


Suppose you have two front-line servers, A and B (B for Backup), and you want B to be able to step in if A freezes. There are a
number of ways of achieving this, but one of the simplest is known as ARP Failover. First, we set A_IP = B_IP, but for the time being
B does not use the network so this duplication is not a problem. Then, once B gets the message that A is down, it sends out an ARP
query for A_IP, including B_LAN as the source LAN address. The gateway router, which previously would have had ⟨A_IP,A_LAN⟩ in its
ARP cache, updates this to ⟨A_IP,B_LAN⟩, and packets that had formerly been sent to A will now go to B. As long as B is trafficking
in stateless operations (eg html), B can pick up right where A left off.

7.9.4 Detecting Sniffers


Finally, there is an interesting use of ARP to detect Ethernet password sniffers (generally not quite the issue it once was, due to
encryption and switching). To find out if a particular host A is in promiscuous mode, send an ARP “who-has A?” query. Address it
not to the broadcast Ethernet address, though, but to some nonexistent Ethernet address.
If promiscuous mode is off, A’s network interface will ignore the packet. But if promiscuous mode is on, A’s network interface will
pass the ARP request to A itself, which is likely then to answer it.
Alas, Linux kernels reject at the ARP-software level ARP queries to physical Ethernet addresses other than their own. However,
they do respond to faked Ethernet multicast addresses, such as ff:ff:ff:00:00:00 or ff:ff:ff:ff:ff:fe.

7.9.5 ARP and multihomed hosts


If host A has two interfaces iface1 and iface2 on the same LAN, with respective IP addresses A1 and A2, then it is
common for the two to be used interchangeably. Traffic addressed to A1 may be received via iface2 and vice-versa, and traffic
from A1 may be sent via iface2 . In 7.2.1 Multihomed hosts this is described as the weak end-system model; the idea is that
we should think of the IP addresses A1 and A2 as bound to A rather than to their respective interfaces.
In support of this model, ARP can usually be configured (in fact this is often the default) so that ARP requests for either IP address
and received by either interface may be answered with either physical address. Usually all requests are answered with the physical
address of the preferred (ie faster) interface.
As an example, suppose A has an Ethernet interface eth0 with IP address 10.0.0.2 and a faster Wi-Fi interface wlan0 with
IP address 10.0.0.3 (although Wi-Fi interfaces are not always faster). In this setting, an ARP request “who-has 10.0.0.2” would be
answered with wlan0 ’s physical address, and so all traffic to A, to either IP address, would arrive via wlan0 . The eth0
interface would go essentially unused. Similarly, though not due to ARP, traffic sent by A with source address 10.0.0.2 might
depart via wlan0 .

On Linux systems this situation is adjusted by changing arp_ignore and arp_announce in /proc/sys/net/ipv4/conf/all .

7.11: Dynamic Host Configuration Protocol (DHCP)
DHCP is the most common mechanism by which hosts are assigned their IPv4 addresses. DHCP started as a protocol known as
Reverse ARP (RARP), which evolved into BOOTP and then into its present form. It is documented in RFC 2131
[https://tools.ietf.org/html/rfc2131.html]. Recall that ARP is based on the idea of someone broadcasting an ARP query for a host,
containing the host’s IPv4 address, and the host answering it with its LAN address. DHCP involves a host, at startup, broadcasting
a query containing its own LAN address, and having a server reply telling the host what IPv4 address is assigned to it, hence the
“Reverse ARP” name.
The DHCP response message is also likely to carry, piggybacked onto it, several other essential startup options. Unlike the IPv4
address, these additional network parameters usually do not depend on the specific host that has sent the DHCP query; they are
likely constant for the subnet or even the site. In all, a typical DHCP message includes the following:
IPv4 address
subnet mask
default router
DNS Server
These four items are a standard minimal network configuration; in practical terms, hosts cannot function properly without them.
Most DHCP implementations support the piggybacking of the latter three above, and a wide variety of other configuration values,
onto the server responses.

Default Routers and DHCP


If you lose your default router, you cannot communicate. Here is something that used to happen to me, courtesy of DHCP:
1. I am connected to the Internet via Ethernet, and my default router is via my Ethernet interface
2. I connect to my institution’s wireless network.
3. Their DHCP server sends me a new default router on the wireless network. However, this default router will only allow
access to a tiny private network, because I have neglected to complete the “Wi-Fi network registration” process.
4. I therefore disconnect from the wireless network, and my wireless-interface default router goes away. However, my system
does not automatically revert to my Ethernet default-router entry; DHCP does not work that way. As a result, I will have no
router at all until the next scheduled DHCP lease renegotiation, and must fix things manually.

The DHCP server has a range of IPv4 addresses to hand out, and maintains a database of which IPv4 address has been assigned to
which LAN address. Reservations can either be permanent or dynamic; if the latter, hosts typically renew their DHCP reservation
periodically (typically one to several times a day).
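
A toy sketch of such a lease table, with a hypothetical address pool; a real server must also reclaim expired leases and carry out the full DISCOVER/OFFER/REQUEST/ACK exchange:

import time

LEASE_TIME = 86400                                   # one day
pool = ['10.1.1.%d' % i for i in range(100, 200)]    # addresses to hand out
leases = {}                                          # LAN (MAC) address -> (IPv4, expiry)

def dhcp_assign(mac):
    if mac in leases:                  # renewing host keeps its address
        ip = leases[mac][0]
    else:
        ip = pool.pop()
    leases[mac] = (ip, time.time() + LEASE_TIME)
    return ip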

7.10.1 NAT, DHCP and the Small Office


If you have a large network, with multiple subnets, a certain amount of manual configuration is inevitable. What about, however, a
home or small office, with a single line from an ISP? A combination of NAT (7.7 Network Address Translation) and DHCP has
made autoconfiguration close to a reality.
The typical home/small-office “router” is in fact a NAT router (7.7 Network Address Translation) coupled with an Ethernet switch,
and usually also coupled with a Wi-Fi access point and a DHCP server. In this section, we will use the term “NAT router” to refer
to this whole package. One specially designated port, the external port, connects to the ISP’s line, and uses DHCP as a client to
obtain an IPv4 address for that port. The other, internal, ports are connected together by an Ethernet switch; these ports as a group
are connected to the external port using NAT translation. If wireless is supported, the wireless side is connected directly to the
internal ports.
Isolated from the Internet, the internal ports can thus be assigned an arbitrary non-public IPv4 address block, eg 192.168.0.0/24.
The NAT router typically contains a DHCP server, usually enabled by default, that will hand out IPv4 addresses to everything
connecting from the internal side.
Generally this works seamlessly. However, if a second NAT router is also connected to the network (sometimes attempted to extend
Wi-Fi range, in lieu of a commercial Wi-Fi repeater), one then has two operating DHCP servers on the same subnet. This often
results in chaos, though it is easily fixed by disabling one of the DHCP servers.
While omnipresent DHCP servers have made IPv4 autoconfiguration work “out of the box” in many cases, in the era in which IPv4
was designed the need for such servers would have been seen as a significant drawback in terms of expense and reliability. IPv6
has an autoconfiguration strategy (8.7.2 Stateless Autoconfiguration (SLAAC)) that does not require DHCP, though DHCPv6 may
well end up displacing it.

7.10.2 DHCP and Routers


It is often desired, for larger sites, to have only one or two DHCP servers, but to have them support multiple subnets. Classical
DHCP relies on broadcast, which isn’t forwarded by routers, and even if it were, the DHCP server would have no way of knowing
on what subnet the host in question was actually located.
This is generally addressed by DHCP Relay (sometimes still known by the older name BOOTP Relay). The router (or, sometimes,
some other node on the subnet) receives the DHCP broadcast message from a host, and notes the subnet address of the arrival
interface. The router then relays the DHCP request, together with this subnet address, to the designated DHCP Server; this relayed
message is sent directly (unicast), not broadcast. Because the subnet address is included, the DHCP server can figure out the correct
IPv4 address to assign.
This feature has to be specially enabled on the router.
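
On a Linux-based router, for example, the ISC dhcrelay utility provides this; the interface name and server address below are hypothetical:

# dhcrelay -i eth1 10.2.0.5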

7.12: Internet Control Message Protocol
The Internet Control Message Protocol, or ICMP, is a protocol for sending IP-layer error and status messages; it is defined in RFC
792 [https://tools.ietf.org/html/rfc792.html]. ICMP is, like IP, host-to-host, and so ICMP messages are never delivered to a specific
port, even if they are sent in response to an error related to something sent from that port. In other words, individual UDP and TCP
connections do not receive ICMP messages, even when it would be helpful to get them.
ICMP messages are identified by an 8-bit type field, followed by an 8-bit subtype, or code. Here are the more common ICMP
types, with subtypes listed in the description.

Type                          Description

Echo Request                  ping queries
Echo Reply                    ping responses
Destination Unreachable       Destination network unreachable
                              Destination host unreachable
                              Destination port unreachable
                              Fragmentation required but DF flag set
                              Network administratively prohibited
Source Quench                 Congestion control
Redirect Message              Redirect datagram for the network
                              Redirect datagram for the host
                              Redirect for TOS and network
                              Redirect for TOS and host
Router Solicitation           Router discovery/selection/solicitation
Time Exceeded                 TTL expired in transit
                              Fragment reassembly time exceeded
Bad IP Header or Parameter    Pointer indicates the error
                              Missing a required option
                              Bad length
Timestamp, Timestamp Reply    Like ping , but requesting a timestamp from the destination

The Echo and Timestamp formats are queries, sent by one host to another. Most of the others are error messages, sent by a
router to the sender of the offending packet. Error-message formats contain the IP header and next 8 bytes of the packet in question;
the 8 bytes will contain the TCP or UDP port numbers. Redirect and Router Solicitation messages are informational, but follow the
error-message format. Query formats contain a 16-bit Query Identifier, assigned by the query sender and echoed back by the query
responder.

ping Packet Size

The author once had to diagnose a problem where pings were almost 100% successful, and yet file transfers failed
immediately; this could have been the result of either a network fault or a file-transfer application fault. The problem turned
out to be a failed network device with a very high bit-error rate: 1500-byte file-transfer packets were frequently corrupted, but
ping packets, with a default size of 32-64 bytes, were mostly unaffected. If the bit-error rate is such that 1500-byte packets
have a 50% success rate, 50-byte packets can be expected to have a 98% (≃ 0.5^(1/30)) success rate. Setting the ping packet size
to a larger value made it immediately clear that the network, and not the file-transfer application, was at fault.
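
The arithmetic behind that 98% figure, treating each byte as surviving independently so that the per-byte success rate p satisfies p^1500 = 0.5:

success_50 = 0.5 ** (50 / 1500)     # = 0.5 ** (1/30), ie p**50
print(round(success_50, 3))         # 0.977, ie about 98%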

ICMP is perhaps best known for Echo Request/Reply, on which the ping tool (1.14 Some Useful Utilities) is based. Ping
remains very useful for network troubleshooting: if you can ping a host, then the network is reachable, and any problems are higher
up the protocol chain. Unfortunately, ping replies are often blocked by many firewalls, on the theory that revealing even the
existence of computers is a security risk. While this may sometimes be an appropriate decision, it does significantly impair the
utility of ping.
Ping can be asked to include IP timestamps (7.1 The IPv4 Header) on Linux systems with the -T option, and on Windows with
-s .
Source Quench was used to signal that congestion had been encountered. A router that dropped a packet due to congestion
was encouraged to send ICMP Source Quench to the originating host. Generally the TCP layer would handle these appropriately
(by reducing the overall sending rate), but UDP applications never receive them. ICMP Source Quench did not quite work out as
intended, and was formally deprecated by RFC 6633 [https://tools.ietf.org/html/rfc6633.html]. (Routers can inform TCP
connections of impending congestion by using the ECN bits.)
The Destination Unreachable type has a large number of subtypes:
Network unreachable: some router had no entry for forwarding the packet, and no default route
Host unreachable: the packet reached a router that was on the same LAN as the host, but the host failed to respond to ARP
queries
Port unreachable: the packet was sent to a UDP port on a given host, but that port was not open. TCP, on the other hand, deals
with this situation by replying to the connecting endpoint with a reset packet. Unfortunately, the UDP Port Unreachable
message is sent to the host, not to the application on that host that sent the undeliverable packet, and so is close to useless as a
practical way for applications to be informed when packets cannot be delivered.
Fragmentation required but DF flag set: a packet arrived at a router and was too big to be forwarded without fragmentation.
However, the Don’t Fragment bit in the IPv4 header was set, forbidding fragmentation.
Administratively Prohibited: this is sent by a router that knows it can reach the network in question, but has been configured to
drop the packet and send back Administratively Prohibited messages. A router can also be configured to blackhole messages:
to drop the packet and send back nothing.
In 12.13 Path MTU Discovery we will see how TCP uses the ICMP message Fragmentation required but DF flag set as part of
Path MTU Discovery, the process of finding the largest packet that can be sent to a specific destination without fragmentation.
The basic idea is that we set the DF bit on some of the packets we send; if we get back this message, that packet was too big.
Some sites and firewalls block ICMP packets in addition to Echo Request/Reply, and for some messages one can get away with this
with relatively few consequences. However, blocking Fragmentation required but DF flag set has the potential to severely affect
TCP connections, depending on how Path MTU Discovery is implemented, and thus is not recommended. If ICMP filtering is
contemplated, it is best to base block/allow decisions on the ICMP type, or even on the type and code. For example, most firewalls
support rule sets of the form “allow ICMP destination-unreachable; block all other ICMP”.
The Timestamp option works something like Echo Request/Reply, but the receiver includes its own local timestamp for the arrival
time, with millisecond accuracy. See also the IP Timestamp option, 7.1 The IPv4 Header, which appears to be more frequently
used.
The type/code message format makes it easy to add new ICMP types. Over the years, a significant number of additional such types
have been defined; a complete list [www.iana.org/assignments/icmp...ameters.xhtml] is maintained by the IANA. Several of these
later ICMP types were seldom used and eventually deprecated, many by RFC 6918 [https://tools.ietf.org/html/rfc6918.html].
ICMP packets are usually forwarded correctly through NAT routers, though due to the absence of port numbers the router must do
a little more work. RFC 3022 [https://tools.ietf.org/html/rfc3022.html] and RFC 5508 [https://tools.ietf.org/html/rfc5508.html]
address this. For ICMP queries, like ping, the ICMP Query Identifier field can be used to recognize the returning response. ICMP
error messages are a little trickier, because there is no direct connection between the inbound error message and any of the previous
outbound non-ICMP packets that triggered the response. However, the headers of the packet that triggered the ICMP error message
are embedded in the body of the ICMP message. The NAT router can look at those embedded headers to determine how to forward
the ICMP message (the NAT router must also rewrite the addresses of those embedded headers).

7.11.1 Traceroute and Time Exceeded


The traceroute program uses ICMP Time Exceeded messages. A packet is sent to the destination (often UDP to an unused port),
with the TTL set to 1. The first router the packet reaches decrements the TTL to 0, drops it, and returns an ICMP Time Exceeded
message. The sender now knows the first router on the chain. The second packet is sent with TTL set to 2, and the second router on
the path will be the one to return ICMP Time Exceeded. This continues until finally the remote host returns something, likely
ICMP Port Unreachable.
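
Here is a rough sketch of that probing loop using the third-party scapy package (root privileges required); a real traceroute also sends several probes per TTL value and reports round-trip times:

from scapy.all import ICMP, IP, UDP, sr1

def traceroute(dest, max_hops=30):
    for ttl in range(1, max_hops + 1):
        reply = sr1(IP(dst=dest, ttl=ttl) / UDP(dport=33434),   # unused port
                    timeout=2, verbose=False)
        if reply is None:
            print(ttl, '***')          # router did not answer
            continue
        print(ttl, reply.src)
        if not (reply.haslayer(ICMP) and reply[ICMP].type == 11):
            break                      # not Time Exceeded: we have arrived
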
For an example of traceroute output, see 1.14 Some Useful Utilities. In that example, the three traceroute probes for the Nth router
are sometimes answered by two or even three different routers; this suggests routers configured to work in parallel rather than route
changes.
Many routers no longer respond with ICMP Time Exceeded messages when they drop packets. For the distance value
corresponding to such a router, traceroute reports *** .
Traceroute assumes the path does not change. This is not always the case, although in practice it is seldom an issue.

Route Efficiency

Once upon a time (~2001), traceroute showed that traffic from my home to the office, both in the Chicago area, went through the
MAE-EAST Internet exchange point, outside of Washington DC. That inefficient route was later fixed. A situation like this is
typically caused by two higher-level providers who did not negotiate sufficient Internet exchange points.
Traceroute to a nonexistent site works up to the point when the packet reaches the Internet “backbone”: the first router which does
not have a default route. At that point the packet is not routed further (and an ICMP Destination Network Unreachable should be
returned).
Traceroute also interacts somewhat oddly with routers using MPLS (see 20.12 Multi-Protocol Label Switching (MPLS)). Such
routers – most likely large-ISP internal routers – may continue to forward the ICMP Time Exceeded message on further towards its
destination before returning it to the sender. As a result, the round-trip time measurements reported may be quite a bit larger than
they should be.

7.11.2 Redirects
Most non-router hosts start up with an IPv4 forwarding table consisting of a single (default) router, discovered along with their
IPv4 address through DHCP. ICMP Redirect messages help hosts learn of other useful routers. Here is a classic example:

[Diagram: A, R1 and R2 on one LAN; destination B lies beyond R2.]

A is configured so that its default router is R1. It addresses a packet to B, and sends it to R1. R1 receives the packet, and forwards it
to R2. However, R1 also notices that R2 and A are on the same network, and so A could have sent the packet to R2 directly. So R1
sends an appropriate ICMP redirect message to A (“Redirect Datagram for the Network”), and A adds a route to B via R2 to its
own forwarding table.

7.11.3 Router Solicitation


These ICMP messages are used by some router protocols to identify immediate neighbors. When we look at routing-update
algorithms in 9 Routing-Update Algorithms, these messages are where the process starts.

7.13: Unnumbered Interfaces
We mentioned in 1.10 IP - Internet Protocol and 7.2 Interfaces that some devices allow the use of point-to-point IP links without
assigning IP addresses to the interfaces at the ends of the link. Such IP interfaces are referred to as unnumbered; they generally
make sense only on routers. It is a firm requirement that the node (ie router) at each endpoint of such a link has at least one other
interface that does have an IP address; otherwise, the node in question would be anonymous, and could not participate in the router-
to-router protocols of 9 Routing-Update Algorithms.
The diagram below shows a link L joining routers R1 and R2, which are connected to subnets 200.0.0.0/24 and 201.1.1.0/24
respectively. The endpoint interfaces of L, both labeled link0 , are unnumbered.

[Diagram: Two LANs joined by an unnumbered link L. Hosts A and B and router R1 (interface eth0, 200.0.0.1) are on subnet 200.0.0.0/24; hosts E and F and router R2 (interface eth0, 201.1.1.1) are on 201.1.1.0/24. R1 and R2 are joined by the link L via their unnumbered link0 interfaces.]

The endpoints of L could always be assigned private IPv4 addresses (7.3 Special Addresses), such as 10.0.0.1 and 10.0.0.2. To do
this we would need to create a subnet; because the host bits cannot be all 0’s or all 1’s, the minimum subnet size is four (eg
10.0.0.0/30). Furthermore, the routing protocols to be introduced in 9 Routing-Update Algorithms will distribute information about
the subnet throughout the organization or “routing domain”, meaning care must be taken to ensure that each link’s subnet is unique.
Use of unnumbered links avoids this.
If R1 were to originate a packet to be sent to (or forwarded via) R2, the standard strategy is for it to treat its link0 interface as
if it shared the IP address of its Ethernet interface eth0 , that is, 200.0.0.1; R2 would do likewise. This still leaves R1 and R2
violating the IP local-delivery rule of 7.5 The Classless IP Delivery Algorithm; R1 is expected to deliver packets via local delivery
to 201.1.1.1 but has no interface that is assigned an IP address on the destination subnet 201.1.1.0/24. The necessary dispensation,
however, is granted by RFC 1812 [https://tools.ietf.org/html/rfc1812.html]. All that is necessary by way of configuration is that R1
be told R2 is a directly connected neighbor reachable via its link0 interface. On Linux systems this might be done with the
ip route command on R1 as follows:
ip route
The Linux ip route command illustrated here was tested on a virtual point-to-point link created with ssh and pppd ;
the link interface name was in fact ppp0 . While the command appeared to work as advertised, it was only possible to create the
link if endpoint IP addresses were assigned at the time of creation; these were then removed with ip route del and then re-
assigned with the command shown here.

ip route add 201.1.1.1 dev link0


Because L is a point-to-point link, there is no destination LAN address and thus no ARP query.

This page titled 7.13: Unnumbered Interfaces is shared under a CC BY-NC-ND license and was authored, remixed, and/or curated by Peter Lars
Dordal.

7.13.1 https://eng.libretexts.org/@go/page/11151
7.14: Mobile IP
In the original IPv4 model, there was a strong if implicit assumption that each IP host would stay put. One role of an IPv4 address
is simply as a unique endpoint identifier, but another role is as a locator: some prefix of the address (eg the network part, in the
class-A/B/C strategy, or the provider prefix) represents something about where the host is physically located. Thus, if a host moves
far enough, it may need a new address.
When laptops are moved from site to site, it is common for them to receive a new IP address at each location, eg via DHCP as the
laptop connects to the local Wi-Fi. But what if we wish to support devices like smartphones that may remain active and
communicating while moving for thousands of miles? Changing IP addresses requires changing TCP connections; life (and
application development) might be simpler if a device had a single, unchanging IP address.
One option, commonly used with smartphones connected to some so-called “3G” networks, is to treat the phone’s data network as a
giant wireless LAN. The phone’s IP address need not change as it moves within this LAN, and it is up to the phone provider to
figure out how to manage LAN-level routing, much as is done in 3.7.4.3 Roaming.
But Mobile IP is another option, documented in RFC 5944 [https://tools.ietf.org/html/rfc5944.html]. In this scheme, a mobile host
has a permanent home address and, while roaming about, will also have a temporary care-of address, which changes from place
to place. The care-of address might be, for example, an IP address assigned by a local Wi-Fi network, and which in the absence of
Mobile IP would be the IP address for the mobile host. (This kind of care-of address is known as “co-located”; the care-of address
can also be associated with some other device – known as a foreign agent – in the vicinity of the mobile host.) The goal of Mobile
IP is to make sure that the mobile host is always reachable via its home address.
To maintain connectivity to the home address, a Mobile IP host needs to have a home agent back on the home network; the job of
the home agent is to maintain an IP tunnel that always connects to the device’s current care-of address. Packets arriving at the home
network addressed to the home address will be forwarded to the mobile device over this tunnel by the home agent. Similarly, if the
mobile device wishes to send packets from its home address – that is, with the home address as IP source address – it can use the
tunnel to forward the packet to the home agent.
The home agent may use proxy ARP (7.9.1 ARP Finer Points) to declare itself to be the appropriate destination on the home LAN
for packets addressed to the home (IP) address; it is then straightforward for the home agent to forward the packets.
An agent discovery process is used for the mobile host to decide whether it is mobile or not; if it is, it then needs to notify its home
agent of its current care-of address.

7.13.1 IP-in-IP Encapsulation


There are several forms of packet encapsulation that can be used for Mobile IP tunneling, but the default one is IP-in-IP
encapsulation, defined in RFC 2003 [https://tools.ietf.org/html/rfc2003.html]. In this process, the entire original IP packet (with
header addressed to the home address) is used as data for a new IP packet, with a new IP header (the “outer” header) addressed to
the care-of address.

Original
Data
IP header

Original packet

Original
Outer header Data
IP header

Encapsulated packet with outer header

A value of 4 in the outer-IP-header Protocol field indicates that IPv4-in-IPv4 tunneling is being used, so the receiver knows
to forward the packet on using the information in the inner header. The MTU of the tunnel will be the original MTU of the path to
the care-of address, minus the size of the outer header. A very similar mechanism is used for IPv6-in-IPv4 encapsulation (that is,
with IPv6 in the inner packet), except that the outer IPv4 Protocol field value is now 41. See 8.13 IPv6 Connectivity via
Tunneling.

7.14.1 https://eng.libretexts.org/@go/page/11152
IP-in-IP encapsulation presents some difficulties for NAT routers. If two hosts A and B behind a NAT router send out encapsulated
packets, the packets may differ only in the source IP address. The NAT router, upon receiving responses, doesn’t know whether to
forward them to A or to B. One partial solution is for the NAT router to support only one inside host sending encapsulated packets.
If the NAT router knew that encapsulation was being used for Mobile IP, it might look at the home address in the inner header to
determine the correct home agent to which to deliver the packet, but this is a big assumption. A fuller solution is outlined in RFC
3519 [https://tools.ietf.org/html/rfc3519.html].

This page titled 7.14: Mobile IP is shared under a CC BY-NC-ND license and was authored, remixed, and/or curated by Peter Lars Dordal.

7.14.2 https://eng.libretexts.org/@go/page/11152
7.15: Epilog and Exercises

At this point we have concluded the basic mechanics of IPv4. Still to come is a discussion of how IP routers build their
forwarding tables. This turns out to be a complex topic, divided into routing within single organizations and ISPs – 9 Routing-
Update Algorithms – and routing between organizations – 10 Large-Scale IP Routing.
But before that, in the next chapter, we compare IPv4 with IPv6, now twenty years old but still seeing limited adoption. The biggest
issue fixed by IPv6 is IPv4’s lack of address space, but there are also several other less dramatic improvements.

Exercises
Exercises are given fractional (floating point) numbers, to allow for interpolation of new exercises. Exercise 6.5 is distinct, for
example, from exercises 6.0 and 7.0. Exercises marked with a ♢ have solutions or hints at 24.7 Solutions for IPv4.
1.0. Suppose an Ethernet packet represents a TCP acknowledgment; that is, the packet contains an IPv4 header and a 20-byte TCP
header but nothing else. Is the IPv4 packet here smaller than the Ethernet minimum-packet size, and, if so, by how much?
2.0. How can a receiving host tell if an arriving IPv4 packet is unfragmented? Hint: such a packet will be both the “first fragment”
and the “last fragment”; how are these two states marked in the IPv4 header?
3.0. How long will it take the IDENT field of the IPv4 header to wrap around, if the sender host A sends a stream of packets to host
B as fast as possible? Assume the packet size is 1500 bytes and the bandwidth is 600 Mbps.
4.0. The following diagram has routers A, B, C, D and E; E is the “border router” connecting the site to the Internet. All router-to-
router connections are via Ethernet-LAN /24 subnets with addresses of the form 200.0.x. Give forwarding tables for each of A♢,
B, C and D. Each table should include each of the listed subnets and also a default entry that routes traffic toward router E.
Directly connected subnets may be listed with a next_hop of “direct”.

200.0.5────A────200.0.6────B────200.0.7────D────200.0.8────E────Internet

200.0.9

C

200.0.10

5.0. (This exercise is an attempt at modeling Internet-2 routing.) Suppose sites S1 … Sn each have a single connection to the
standard Internet, and each site Si has a single IPv4 address block Ai. Each site’s connection to the Internet is through a single
router Ri; each Ri’s default route points towards the standard Internet. The sites also maintain a separate, higher-speed network
among themselves; each site has a single link to this separate network, also through Ri. Describe what the forwarding tables on
each Ri will have to look like so that traffic from one Si to another will always use the separate higher-speed network.
6.0. For each IPv4 network prefix given (with length), identify which of the subsequent IPv4 addresses are part of the same subnet.
(a). 10.0.130.0/23: 10.0.130.23, 10.0.129.1, 10.0.131.12, 10.0.132.7
(b). 10.0.132.0/22: 10.0.130.23, 10.0.135.1, 10.0.134.12, 10.0.136.7
(c). 10.0.64.0/18: 10.0.65.13, 10.0.32.4, 10.0.127.3, 10.0.128.4
(d).♢ 10.0.168.0/21: 10.0.166.1, 10.0.170.3, 10.0.174.5, 10.0.177.7
(e). 10.0.0.64/26: 10.0.0.125, 10.0.0.66, 10.0.0.130, 10.0.0.62
6.5. Convert the following subnet masks to /k notation, and vice-versa:
(a).♢ 255.255.240.0
(b). 255.255.248.0
(c). 255.255.255.192
(d).♢ /20
(e). /22

7.15.1 https://eng.libretexts.org/@go/page/11153
(f). /27
7.0. Suppose that the subnet bits below for the following five subnets A-E all come from the beginning of the fourth byte of the
IPv4 address; that is, these are subnets of a /24 block.
A: 00
B: 01
C: 110
D: 111
E: 1010
(a). What are the sizes of each subnet, and the corresponding decimal ranges? Count the addresses with host bits all 0’s or with host
bits all 1’s as part of the subnet.
(b). How many IPv4 addresses in the class-C block do not belong to any of the subnets A, B, C, D and E?
8.0. In 7.9 Address Resolution Protocol: ARP it was stated that, in newer implementations, “repeat ARP queries about a timed out
entry are first sent unicast”, in order to reduce broadcast traffic. Suppose multiple unicast repeat-ARP queries for host A’s IP
address fail, but a followup broadcast query for A’s address succeeds. What probably changed at host A?
9.0. Suppose A broadcasts an ARP query “who-has B?”, receives B’s response, and proceeds to send B a regular IPv4 packet. If B
now wishes to reply, why is it likely that A will already be present in B’s ARP cache? Identify a circumstance under which this can
fail.
10.0. Suppose A broadcasts an ARP request “who-has B”, but inadvertently lists the physical address of another machine C instead
of its own (that is, A’s ARP query has IPsrc = A, but LANsrc = C). What will happen? Will A receive a reply? Will any other hosts
on the LAN be able to send to A? What entries will be made in the ARP caches on A, B and C?
11.0. Suppose host A connects to the Internet via Wi-Fi. The default router is RW. Host A now begins exchanging packets with a
remote host B: A sends to B, B replies, etc. The exact form of the connection does not matter, except that TCP may not work.
(a). You now plug in A’s Ethernet cable. The Ethernet port is assumed to be on a different subnet from the Wi-Fi (so that the strong
and weak end-system models of 7.9.5 ARP and multihomed hosts do not play a role here). Assume A automatically selects the new
Ethernet connection as its default route, with router RE. What happens to the original connection to A? Can packets still travel back
and forth? Does the return address used for either direction change?
(b). You now disconnect A’s Wi-Fi interface, leaving the Ethernet interface connected. What happens now to the connection to B?
Hint: to what IP address are the packets from B being sent?
See also 9 Routing-Update Algorithms, 9 Routing-Update Algorithms exercise 13.0, and 12 TCP Transport, exercise 13.0.

This page titled 7.15: Epilog and Exercises is shared under a CC BY-NC-ND license and was authored, remixed, and/or curated by Peter Lars
Dordal.

7.15.2 https://eng.libretexts.org/@go/page/11153
CHAPTER OVERVIEW

8: IP version 6
What has been learned from experience with IPv4? First and foremost, more than 32 bits are needed for addresses; the primary
motive in developing IPv6 was the specter of running out of IPv4 addresses (something which, at the highest level, has already
happened; see the discussion at the end of 1.10 IP - Internet Protocol). Another important issue is that IPv4 requires (or used to
require) a modest amount of effort at configuration; IPv6 was supposed to improve this.
8.1: Prelude to IP version 6
8.2: The IPv6 Header
8.3: IPv6 Addresses
8.4: Network Prefixes
8.5: IPv6 Multicast
8.6: IPv6 Extension Headers
8.7: Neighbor Discovery
8.8: IPv6 Host Address Assignment
8.9: Globally Exposed Addresses
8.10: ICMPv6
8.11: IPv6 Subnets
8.12: Using IPv6 and IPv4 Together
8.13: IPv6 Examples Without a Router
8.14: IPv6 Connectivity via Tunneling
8.15: IPv6-to-IPv4 Connectivity
8.16: Epilog and Exercises
Index

This page titled 8: IP version 6 is shared under a CC BY-NC-ND license and was authored, remixed, and/or curated by Peter Lars Dordal.

1
8.1: Prelude to IP version 6
By 1990 the IETF was actively interested in proposals to replace IPv4. A working group for the so-called “IP next generation”, or
IPng, was created in 1993 to select the new version; RFC 1550 [https://tools.ietf.org/html/rfc1550.html] was this group’s formal
solicitation of proposals. In July 1994 the IPng directors voted to accept a modified version of the “Simple Internet Protocol”, or
SIP (unrelated to the Session Initiation Protocol, 20.11.4 RTP and VoIP), as the basis for IPv6. The first IPv6 specifications,
released in 1995, were RFC 1883 [https://tools.ietf.org/html/rfc1883.html] (now RFC 2460
[https://tools.ietf.org/html/rfc2460.html], with updates) for the basic protocol, and RFC 1884
[https://tools.ietf.org/html/rfc1884.html] (now RFC 4291 [https://tools.ietf.org/html/rfc4291.html], again with updates) for the
addressing architecture.
SIP addresses were originally 64 bits in length, but in the month leading up to adoption as the basis for IPv6 this was increased to
128. 64 bits would probably have been enough, but the problem is less the actual number than the simplicity with which addresses
can be allocated; the more bits, the easier this becomes, as sites can be given relatively large address blocks without fear of waste.
A secondary consideration in the 64-to-128 leap was the potential to accommodate now-obsolete CLNP addresses (1.15 IETF and
OSI), which were up to 160 bits in length, but compressible.
IPv6 has to some extent returned to the idea of a fixed division between network and host portions; for most IPv6 addresses, the
first 64 bits is the network prefix (including any subnet portion) and the remaining 64 bits represents the host portion. The rule as
spelled out in RFC 2460 [https://tools.ietf.org/html/rfc2460.html], in 1998, was that the 64/64 split would apply to all addresses
except those beginning with the bits 000; those addresses were then held in reserve in the unlikely event that the 64/64 split ran into
problems in the future. This was a change from 1995, when RFC 1884 [https://tools.ietf.org/html/rfc1884.html] envisioned 48-bit
host portions and 80-bit prefixes.
While the IETF occasionally revisits the issue, at the present time the 64/64 split seems here to stay; for discussion and
justification, see 8.10.1 Subnets and /64 and RFC 7421 [https://tools.ietf.org/html/rfc7421.html]. The 64/64 split is not automatic,
however; there is no default prefix length as there was in the Class A/B/C IPv4 scheme. Thus, it is misleading to think of IPv6 as a
return to something like IPv4’s classful addressing scheme. Router advertisements must always include the prefix length, and,
when assigning IPv6 addresses manually, the /64 prefix length must be specified explicitly; see 8.12.3 Manual address
configuration.
High-level routing, however, can, as in IPv4, be done on prefixes of any length (usually that means lengths shorter than /64).
Routing can also be done on different prefix lengths at different points of the network.
IPv6 is now twenty years old, and yet usage as of 2015 remains quite modest. However, the shortage in IPv4 addresses has begun
to loom ominously; IPv6 adoption rates may rise quickly if IPv4 addresses begin to climb in price.

This page titled 8.1: Prelude to IP version 6 is shared under a CC BY-NC-ND license and was authored, remixed, and/or curated by Peter Lars
Dordal.

8.1.1 https://eng.libretexts.org/@go/page/11189
8.2: The IPv6 Header
The IPv6 fixed header is pictured below; at 40 bytes, it is twice the size of the IPv4 header. The fixed header is intended to support
only what every packet needs: there is no support for fragmentation, no header checksum, and no option fields. However, the
concept of extension headers has been introduced to support some of these as options; some IPv6 extension headers are described
in 8.5 IPv6 Extension Headers. Whatever header comes next is identified by the Next Header field, much like the IPv4 Protocol
field. Some other fixed-header fields have also been renamed from their IPv4 analogues: the IPv4 TTL is now the IPv6 Hop_Limit
(still decremented by each router with the packet discarded when it reaches 0), and the IPv4 DS field has become the IPv6 Traffic
Class.
0 16 32

Version Traffic Class Flow Label

Payload Length Next Header Hop Limit

Source Address

Destination Address

The Flow Label is new. RFC 2460 [https://tools.ietf.org/html/rfc2460.html] states that it

may be used by a source to label sequences of packets for which it requests special
handling by the IPv6 routers, such as non-default quality of service or “real-time”
service.
Senders not actually taking advantage of any quality-of-service options are supposed to set the Flow Label to zero.
When used, the Flow Label represents a sender-computed hash of the source and destination addresses, and perhaps the traffic
class. Routers can use this field as a way to look up quickly any priority or reservation state for the packet. All packets belonging to
the same flow should have the same Routing Extension header, 8.5.3 Routing Header. The Flow Label will in general not include
any information about the source and destination port numbers, except that only some of the connections between a pair of hosts
may make use of this field.
A flow, as the term is used here, is one-way; the return traffic belongs to a different flow. Historically, the term “flow” has also
been used at various other scales: a single bidirectional TCP connection, multiple related TCP connections, or even all traffic from
a particular subnet (eg the “computer-lab flow”).

This page titled 8.2: The IPv6 Header is shared under a CC BY-NC-ND license and was authored, remixed, and/or curated by Peter Lars Dordal.

8.2.1 https://eng.libretexts.org/@go/page/11179
8.3: IPv6 Addresses
IPv6 addresses are written in eight groups of four hex digits, with a-f preferred over A-F (RFC 5952
[https://tools.ietf.org/html/rfc5952.html]). The groups are separated by colons, and have leading 0’s removed, eg

fedc:13:1654:310:fedc:bc37:61:3210
If an address contains a long run of 0’s – for example, if the IPv6 address had an embedded IPv4 address – then when writing the
address the string “::” should be used to represent however many blocks of 0000 as are needed to create an address of the correct
length; to avoid ambiguity this can be used only once. Also, embedded IPv4 addresses may continue to use the “.” separator:

::ffff:147.126.65.141
The above is an example of one standard IPv6 format for representing IPv4 addresses (see 8.11 Using IPv6 and IPv4 Together). 48
bits are explicitly displayed; the :: means these are prefixed by 80 0-bits.
The IPv6 loopback address is ::1 (that is, 127 0-bits followed by a 1-bit).
Network address prefixes may be written with the “/” notation, as in IPv4:

12ab:0:0:cd30::/60
RFC 3513 [https://tools.ietf.org/html/rfc3513.html] suggested that initial IPv6 unicast-address allocation be initially limited to
addresses beginning with the bits 001, that is, the 2000::/3 block (20 in binary is 0010 0000).
Generally speaking, IPv6 addresses consist of a 64-bit network prefix (perhaps including subnet bits) followed by a 64-bit
“interface identifier”. See 8.3 Network Prefixes and 8.2.1 Interface identifiers.
IPv6 addresses all have an associated scope, defined in RFC 4007 [https://tools.ietf.org/html/rfc4007.html]. The scope of a unicast
address is either global, meaning it is intended to be globally routable, or link-local, meaning that it will only work with directly
connected neighbors (8.2.2 Link-local addresses). The loopback address is considered to have link-local scope. A few more scope
levels are available for multicast addresses, eg “site-local” (RFC 4291 [https://tools.ietf.org/html/rfc4291.html]). The scope of an
IPv6 address is implicitly coded within the first 64 bits; addresses in the 2000::/3 block above, for example, have global scope.
Packets with local-scope addresses (eg link-local addresses) for either the destination or the source cannot be routed (the latter
because a reply would be impossible).
Although addresses in the “unique local address” category of 8.3 Network Prefixes officially have global scope, in a practical sense
they still behave as if they had the now-officially-deprecated “site-local scope”.

8.2.1 Interface identifiers


As mentioned earlier, most IPv6 addresses can be divided into a 64-bit network prefix and a 64-bit “host” portion, the latter
corresponding to the “host” bits of an IPv4 address. These host-portion bits are known officially as the interface identifier; the
change in terminology reflects the understanding that all IP addresses attach to interfaces rather than to hosts.
The original plan for the interface identifier was to derive it in most cases from the LAN address, though the interface identifier can
also be set administratively. Given a 48-bit Ethernet address, the interface identifier based on it was to be formed by inserting
0xfffe between the first three bytes and the last three bytes, to get 64 bits in all. The seventh bit of the first byte (the Ethernet
“universal/local” flag) was then set to 1. The result of this process is officially known as the Modified EUI-64 Identifier, where
EUI stands for Extended Unique Identifier; details can be found in RFC 4291 [https://tools.ietf.org/html/rfc4291.html]. As an
example, for a host with Ethernet address 00:a0:cc:24:b0:e4, the EUI-64 identifier would be 02a0:ccff:fe24:b0e4 (the leading 00
becomes 02 when the seventh bit is turned on). At the time the EUI-64 format was proposed, it was widely expected that Ethernet
MAC addresses would eventually become 64 bits in length.
EUI-64 interface identifiers turn out to introduce a major privacy concern: no matter where a (portable) host connects to the
Internet – home or work or airport or Internet cafe – such an interface identifier always remains the same, and thus serves as a
permanent host fingerprint. As a result, EUI-64 identifiers are now discouraged for personal workstations and mobile devices.

8.3.1 https://eng.libretexts.org/@go/page/11178
(Some fixed-location hosts continue to use EUI-64 interface identifiers, or, alternatively, administratively assigned interface
identifiers.)
RFC 7217 [https://tools.ietf.org/html/rfc7217.html] proposes an alternative: the interface identifier is a secure hash (22.6 Secure
Hashes) of a “Net_Iface” parameter, the 64-bit IPv6 address prefix, and a host-specific secret key (a couple other parameters are
also thrown into the mix, but they need not concern us here). The “Net_Iface” parameter can be the interface’s MAC address, but
can also be the interface’s “name”, eg eth0 . Interface identifiers created this way change from connection point to connection
point (because the prefix changes), do not reveal the Ethernet address, and are randomly scattered (because of the key, if nothing
else) through the 264-sized interface-identifier space. The last feature makes probing for IPv6 addresses effectively impossible; see
exercise 6.0.
Interface identifiers as in the previous paragraph do not change unless the prefix changes, which normally happens only if the host
is moved to a new network. In 8.7.2.1 SLAAC privacy we will see that interface identifiers are often changed at regular intervals,
for privacy reasons.
Finally, interface identifiers are often centrally assigned, using DHCPv6 (8.7.3 DHCPv6).
Remote probing for IPv6 addresses based on EUI-64 identifiers is much easier than for those based on RFC-7217 identifiers, as the
former are not very random. If an attacker can guess the hardware vendor, and thus the first three bytes of the Ethernet address
(2.1.3 Ethernet Address Internal Structure), there are only 224 possibilities, down from 264. As the last three bytes are often
assigned in serial order, considerable further narrowing of the search space may be possible. While it may amount to security
through obscurity [en.Wikipedia.org/wiki/Securi...ugh_obscurity], keeping internal global IPv6 addresses hidden is often of
practical importance.
Additional discussion of host-scanning in IPv6 networks can be found in RFC 7707 [https://tools.ietf.org/html/rfc7707.html] and
draft-ietf-opsec-ipv6-host-scanning-06 [https://tools.ietf.org/html/draft-ie...t-scanning-06].

8.2.2 Link-local addresses


IPv6 defines link-local addresses, with so-called link-local scope, intended to be used only on a single LAN and never routed.
These begin with the 64-bit link-local prefix consisting of the ten bits 1111 1110 10 followed by 54 more zero bits; that is,
fe80::/64. The remaining 64 bits are the interface identifier for the link interface in question, above. The EUI-64 link-local address
of the machine in the previous section with Ethernet address 00:a0:cc:24:b0:e4 is thus fe80::2a0:ccff:fe24:b0e4.
The main applications of link-local addresses are as a “bootstrap” address for global-address autoconfiguration (8.7.2 Stateless
Autoconfiguration (SLAAC)), and as an optional permanent address for routers. IPv6 routers often communicate with neighboring
routers via their link-local addresses, with the understanding that these do not change when global addresses (or subnet
configurations) change (RFC 4861 [https://tools.ietf.org/html/rfc4861.html] §6.2.8). If EUI-64 interface identifiers are used then
the link-local address does change whenever the Ethernet hardware is replaced. However, if RFC 7217
[https://tools.ietf.org/html/rfc7217.html] interface identifiers are used and that mechanism’s “Net_Iface” parameter represents the
interface name rather than its physical address, the link-local address can be constant for the life of the host. (When RFC 7217 is
used to generate link-local addresses, the “prefix” hash parameter is the link-local prefix fe80::/64.)
A consequence of identifying routers to their neighbors by their link-local addresses is that it is often possible to configure routers
so they do not even have global-scope addresses; for forwarding traffic and for exchanging routing-update messages, link-local
addresses are sufficient. Similarly, many ordinary hosts forward packets to their default router using the latter’s link-local address.
We will return to router addressing in 8.13.2 Setting up a router and 8.13.2.1 A second router.
For non-Ethernet-like interfaces, eg tunnel interfaces, there may be no natural candidate for the interface identifier, in which case a
link-local address may be assigned manually, with the low-order 64 bits chosen to be unique for the link in question.
When sending to a link-local address, one must separately supply somewhere the link’s “zone identifier”, often by appending a
string containing the interface name to the IPv6 address, eg fe80::f00d:cafe%eth0. See 8.12.1 ping6 and 8.12.2 TCP connections
using link-local addresses for examples of such use of link-local addresses.
IPv4 also has true link-local addresses, defined in RFC 3927 [https://tools.ietf.org/html/rfc3927.html], though they are rarely used;
such addresses are in the 169.254.0.0/16 block (not to be confused with the 192.168.0.0/16 private-address block). Other than
these, IPv4 addresses always implicitly identify the link subnet by virtue of the network prefix.

8.3.2 https://eng.libretexts.org/@go/page/11178
Once the link-local address is created, it must pass the duplicate-address detection test before being used; see 8.7.1 Duplicate
Address Detection.

8.2.3 Anycast addresses


IPv6 also introduced anycast addresses. An anycast address might be assigned to each of a set of routers (in addition to each
router’s own unicast addresses); a packet addressed to this anycast address would be delivered to only one member of this set. Note
that this is quite different from multicast addresses; a packet addressed to the latter is delivered to every member of the set.
It is up to the local routing infrastructure to decide which member of the anycast group would receive the packet; normally it would
be sent to the “closest” member. This allows hosts to send to any of a set of routers, rather than to their designated individual
default router.
Anycast addresses are not marked as such, and a node sending to such an address need not be aware of its anycast status. Addresses
are anycast simply because the routers involved have been configured to recognize them as such.
IPv4 anycast exists also, but in a more limited form (10.6.8 BGP and Anycast); generally routers are configured much more
indirectly (eg through BGP).

This page titled 8.3: IPv6 Addresses is shared under a CC BY-NC-ND license and was authored, remixed, and/or curated by Peter Lars Dordal.

8.3.3 https://eng.libretexts.org/@go/page/11178
8.4: Network Prefixes
We have been assuming that an IPv6 address, at least as seen by a host, is composed of a 64-bit network prefix and a 64-bit
interface identifier. As of 2015 this remains a requirement; RFC 4291 [https://tools.ietf.org/html/rfc4291.html] (IPv6 Addressing
Architecture) states:

For all unicast addresses, except those that start with the binary value 000, Interface IDs
are required to be 64 bits long….
This /64 requirement is occasionally revisited by the IETF, but is unlikely to change for mainstream IPv6 traffic. This firm 64/64
split is a departure from IPv4, where the host/subnet division point has depended, since the development of subnets, on local
configuration.
Note that while the net/interface (net/host) division point is fixed, routers may still use CIDR (10.1 Classless Internet Domain
Routing: CIDR) and may still base forwarding decisions on prefixes shorter than /64.
As of 2015, all allocations for globally routable IPv6 prefixes are part of the 2000::/3 block.
IPv6 also defines a variety of specialized network prefixes, including the link-local prefix and prefixes for anycast and multicast
addresses. For example, as we saw earlier, the prefix ::ffff:0:0/96 identifies IPv6 addresses with embedded IPv4 addresses.
The most important class of 64-bit network prefixes, however, are those supplied by a provider or other address-numbering entity,
and which represent the first half of globally routable IPv6 addresses. These are the prefixes that will be visible to the outside
world.
IPv6 customers will typically be assigned a relatively large block of addresses, eg /48 or /56. The former allows 64−48 = 16 bits for
local “subnet” specification within a 64-bit network prefix; the latter allows 8 subnet bits. These subnet bits are – as in IPv4 –
supplied through router configuration; see 8.10 IPv6 Subnets. The closest IPv6 analogue to the IPv4 subnet mask is that all network
prefixes are supplied to hosts with an associated length, although that length will almost always be 64 bits.
Many sites will have only a single externally visible address block. However, some sites may be multihomed and thus have
multiple independent address blocks.
Sites may also have private unique local address prefixes, corresponding to IPv4 private address blocks like 192.168.0.0/16 and
10.0.0.0/8. They are officially called Unique Local Unicast Addresses and are defined in RFC 4193
[https://tools.ietf.org/html/rfc4193.html]; these replace an earlier site-local address plan (and official site-local scope) formally
deprecated in RFC 3879 [https://tools.ietf.org/html/rfc3879.html] (though unique-local addresses are sometimes still informally
referred to as site-local).
The first 8 bits of a unique-local prefix are 1111 1101 (fd00::/8). The related prefix 1111 1100 (fc00::/8) is reserved for future use;
the two together may be consolidated as fc00::/7. The last 16 bits of a 64-bit unique-local prefix represent the subnet ID, and are
assigned either administratively or via autoconfiguration. The 40 bits in between, from bit 8 up to bit 48, represent the Global ID.
A site is to set the Global ID to a pseudorandom value.
The resultant unique-local prefix is “almost certainly” globally unique (and is considered to have global scope in the sense of 8.2
IPv6 Addresses), although it is not supposed to be routed off a site. Furthermore, a site would generally not admit any packets from
the outside world addressed to a destination with the Global ID as prefix. One rationale for choosing unique Global IDs for each
site is to accommodate potential later mergers of organizations without the need for renumbering; this has been a chronic problem
for sites using private IPv4 address blocks. Another justification is to accommodate VPN connections from other sites. For
example, if I use IPv4 block 10.0.0.0/8 at home, and connect using VPN to a site also using 10.0.0.0/8, it is possible that my printer
will have the same IPv4 address as their application server.

This page titled 8.4: Network Prefixes is shared under a CC BY-NC-ND license and was authored, remixed, and/or curated by Peter Lars Dordal.

8.4.1 https://eng.libretexts.org/@go/page/11177
8.5: IPv6 Multicast
IPv6 has moved away from LAN-layer broadcast, instead providing a wide range of LAN-layer multicast groups. (Note that LAN-
layer multicast is often straightforward; it is general IP-layer multicast (20.5 Global IP Multicast) that is problematic. See 2.1.2
Ethernet Multicast for the Ethernet implementation.) This switch to multicast is intended to limit broadcast traffic in general,
though many switches still propagate LAN multicast traffic everywhere, like broadcast.
An IPv6 multicast address is one beginning with the eight bits 1111 1111 (ff00::/8); numerous specific such addresses, and even
classes of addresses, have been defined. For actual delivery, IPv6 multicast addresses correspond to LAN-layer (eg Ethernet)
multicast addresses through a well-defined static correspondence; specifically, if x, y, z and w are the last four bytes of the IPv6
multicast address, in hex, then the corresponding Ethernet multicast address is 33:33:x:y:z:w (RFC 2464
[https://tools.ietf.org/html/rfc2464.html]). A typical IPv6 host will need to join (that is, subscribe to) several Ethernet multicast
groups.
The IPv6 multicast address with the broadest scope is all-nodes, with address ff02::1; the corresponding Ethernet multicast address
is 33:33:00:00:00:01. This essentially corresponds to IPv4’s LAN broadcast, though the use of LAN multicast here means that non-
IPv6 hosts should not see packets sent to this address. Another important IPv6 multicast address is ff02::2, the all-routers address.
This is meant to be used to reach all routers, and routers only; ordinary hosts do not subscribe.
Generally speaking, IPv6 nodes on Ethernets send LAN-layer Multicast Listener Discovery (MLD) messages to multicast groups
they wish to start using; these messages allow multicast-aware Ethernet switches to optimize forwarding so that only those hosts
that have subscribed to the multicast group in question will receive the messages. Otherwise switches are supposed to treat
multicast like broadcast; worse, some switches may simply fail to forward multicast packets to destinations that have not explicitly
opted to join the group.

This page titled 8.5: IPv6 Multicast is shared under a CC BY-NC-ND license and was authored, remixed, and/or curated by Peter Lars Dordal.

8.5.1 https://eng.libretexts.org/@go/page/11176
8.6: IPv6 Extension Headers
In IPv4, the IP header contained a Protocol field to identify the next header; usually UDP or TCP. All IPv4 options were contained
in the IP header itself. IPv6 has replaced this with a scheme for allowing an arbitrary chain of supplemental IPv6 headers. The IPv6
Next Header field can indicate that the following header is UDP or TCP, but can also indicate one of several IPv6 options. These
optional, or extension, headers include:
Hop-by-Hop options header
Destination options header
Routing header
Fragment header
Authentication header
Mobility header
Encapsulated Security Payload header
These extension headers must be processed in order; the recommended order for inclusion is as above. Most of them are intended
for processing only at the destination host; the hop-by-hop and routing headers are exceptions.

8.5.1 Hop-by-Hop Options Header


This consists of a set of ⟨type,value⟩ pairs which are intended to be processed by each router on the path. A tag in the type field
indicates what a router should do if it does not understand the option: drop the packet, or continue processing the rest of the
options. The only Hop-by-Hop options provided by RFC 2460 [https://tools.ietf.org/html/rfc2460.html] were for padding, so as to
set the alignment of later headers.
RFC 2675 [https://tools.ietf.org/html/rfc2675.html] later defined a Hop-by-Hop option to support IPv6 jumbograms: datagrams
larger than 65,535 bytes. The need for such large packets remains unclear, in light of 5.3 Packet Size. IPv6 jumbograms are not
meant to be used if the underlying LAN does not have an MTU larger than 65,535 bytes; the LAN world is not currently moving in
this direction.
Because Hop-by-Hop Options headers must be processed by each router encountered, they have the potential to overburden the
Internet routing system. As a result, RFC 6564 [https://tools.ietf.org/html/rfc6564.html] strongly discourages new Hop-by-Hop
Option headers, unless examination at every hop is essential.

8.5.2 Destination Options Header


This is very similar to the Hop-by-Hop Options header. It again consists of a set of ⟨type,value⟩ pairs, and the original RFC 2460
[https://tools.ietf.org/html/rfc2460.html] specification only defined options for padding. The Destination header is intended to be
processed at the destination, before turning over the packet to the transport layer.
Since RFC 2460 [https://tools.ietf.org/html/rfc2460.html], a few more Destination Options header types have been defined, though
none is in common use. RFC 2473 [https://tools.ietf.org/html/rfc2473.html] defined a Destination Options header to limit the
nesting of tunnels, called the Tunnel Encapsulation Limit. RFC 6275 [https://tools.ietf.org/html/rfc6275.html] defines a
Destination Options header for use in Mobile IPv6. RFC 6553 [https://tools.ietf.org/html/rfc6553.html], on the Routing Protocol
for Low-Power and Lossy Networks, or RPL, has defined a Destination (and Hop-by-Hop) Options type for carrying RPL data.
A complete list of Option Types for Hop-by-Hop Option and Destination Option headers can be found at
www.iana.org/assignments/ipv6-parameters [http://www.iana.org/assignments/ipv6...-parameters-2]; in accordance with RFC 2780
[https://tools.ietf.org/html/rfc2780.html].

8.5.3 Routing Header


The original, or Type 0, Routing header contained a list of IPv6 addresses through which the packet should be routed. These did not
have to be contiguous. If the list to be visited en route to destination D was ⟨R1,R2,…,Rn⟩, then this option header contained
⟨R2,R3,…,Rn,D⟩ with R1 as the initial destination address; R1 then would update this header to ⟨R1,R3,…,Rn,D⟩ (that is, the old

8.6.1 https://eng.libretexts.org/@go/page/11175
destination R1 and the current next-router R2 were swapped), and would send the packet on to R2. This was to continue on until
Rn addressed the packet to the final destination D. The header contained a Segments Left pointer indicating the next address to be
processed, incremented at each Ri. When the packet arrived at D the Routing Header would contain the routing list ⟨R1,R3,…,Rn⟩.
This is, in general principle, very much like IPv4 Loose Source routing. Note, however, that routers between the listed routers R1…
Rn did not need to examine this header; they processed the packet based only on its current destination address.
This form of routing header was deprecated by RFC 5095 [https://tools.ietf.org/html/rfc5095.html], due to concerns about a traffic-
amplification attack. An attacker could send off a packet with a routing header containing an alternating list of just two routers
⟨R1,R2,R1,R2,…,R1,R2,D⟩; this would generate substantial traffic on the R1–R2 link. RFC 6275
[https://tools.ietf.org/html/rfc6275.html] and RFC 6554 [https://tools.ietf.org/html/rfc6554.html] define more limited routing
headers. RFC 6275 [https://tools.ietf.org/html/rfc6275.html] defines a quite limited routing header to be used for IPv6 mobility
(and also defines the IPv6 Mobility header). The RFC 6554 [https://tools.ietf.org/html/rfc6554.html] routing header used for RPL,
mentioned above, has the same basic form as the Type 0 header described above, but its use is limited to specific low-power
routing domains.

8.5.4 IPv6 Fragment Header


IPv6 supports limited IPv4-style fragmentation via the Fragment Header. This header contains a 13-bit Fragment Offset field,
which contains – as in IPv4 – the 13 high-order bits of the actual 16-bit offset of the fragment. This header also contains a 32-bit
Identification field; all fragments of the same packet must carry the same value in this field.
IPv6 fragmentation is done only by the original sender; routers along the way are not allowed to fragment or re-fragment a packet.
Sender fragmentation would occur if, for example, the sender had an 8 kB IPv6 packet to send via UDP, and needed to fragment it
to accommodate the 1500-byte Ethernet MTU.
If a packet needs to be fragmented, the sender first identifies the unfragmentable part, consisting of the IPv6 fixed header and any
extension headers that must accompany each fragment (these would include Hop-by-Hop and Routing headers). These
unfragmentable headers are then attached to each fragment.
IPv6 also requires that every link on the Internet have an MTU of at least 1280 bytes beyond the LAN header; link-layer
fragmentation and reassembly can be used to meet this MTU requirement (which is what ATM links (3.5 Asynchronous Transfer
Mode: ATM) carrying IP traffic do).
Generally speaking, fragmentation should be avoided at the application layer when possible. UDP-based applications that attempt
to transmit filesystem-sized (usually 8 kB) blocks of data remain persistent users of fragmentation.

8.5.5 General Extension-Header Issues


In the IPv4 world, many middleboxes (7.7.2 Middleboxes) examine not just the destination address but also the TCP port numbers;
firewalls, for example, do this routinely to block all traffic except to a designated list of ports. In the IPv6 world, a middlebox may
have difficulty finding the TCP header, as it must traverse a possibly lengthy list of extension headers. Worse, some of these
extension headers may be newer than the middlebox, and thus unrecognized. Some middleboxes would simply drop packets with
unrecognized extension headers, making the introduction of new such headers problematic.
RFC 6564 [https://tools.ietf.org/html/rfc6564.html] addresses this by requiring that all future extension headers use a common
“type-length-value” format: the first byte indicates the extension-header’s type and the second byte indicates its length. This
facilitiates rapid traversal of the extension-header chain. A few older extension headers – for example the Encapsulating Security
Payload header of RFC 4303 [https://tools.ietf.org/html/rfc4303.html] – do not follow this rule; middleboxes must treat these as
special cases.
RFC 2460 [https://tools.ietf.org/html/rfc2460.html] states

With one exception [that is, Hop-by-Hop headers], extension headers are not examined or
processed by any node along a packet’s delivery path, until the packet reaches the node
(or each of the set of nodes, in the case of multicast) identified in the Destination Address
field of the IPv6 header.

8.6.2 https://eng.libretexts.org/@go/page/11175
Nonetheless, sometimes intermediate nodes do attempt to add extension headers. This can break Path MTU Discovery (12.13 Path
MTU Discovery), as the sender no longer controls the total packet size.
RFC 7045 [https://tools.ietf.org/html/rfc7045.html] attempts to promulgate some general rules for the real-world handling of
extension headers. For example, it states that, while routers are allowed to drop packets with certain extension headers, they may
not do this simply because those headers are unrecognized. Also, routers may ignore Hop-by-Hop Option headers, or else process
packets with such headers via a slower queue.

This page titled 8.6: IPv6 Extension Headers is shared under a CC BY-NC-ND license and was authored, remixed, and/or curated by Peter Lars
Dordal.

8.6.3 https://eng.libretexts.org/@go/page/11175
8.7: Neighbor Discovery
IPv6 Neighbor Discovery, or ND, is a set of related protocols that replaces several IPv4 tools, most notably ARP, ICMP redirects
and most non-address-assignment parts of DHCP. The messages exchanged in ND are part of the ICMPv6 framework, 8.9
ICMPv6. The original specification for ND is in RFC 2461 [https://tools.ietf.org/html/rfc2461.html], later updated by RFC 4861
[https://tools.ietf.org/html/rfc4861.html]. ND provides the following services:
Finding the local router(s) [8.6.1 Router Discovery]
Finding the set of network address prefixes that can be reached via local delivery (IPv6 allows there to be more than one) [8.6.2
Prefix Discovery]
Finding a local host’s LAN address, given its IPv6 address [8.6.3 Neighbor Solicitation]
Detecting duplicate IPv6 addresses [8.7.1 Duplicate Address Detection]
Determining that some neighbors are now unreachable

8.6.1 Router Discovery


IPv6 routers periodically send Router Advertisement (RA) packets to the all-nodes multicast group. Ordinary hosts wanting to
know what router to use can wait for one of these periodic multicasts, or can request an RA packet immediately by sending a
Router Solicitation request to the all-routers multicast group. Router Advertisement packets serve to identify the routers; this
process is sometimes called Router Discovery. In IPv4, by comparison, the address of the default router is usually piggybacked
onto the DHCP response message (7.10 Dynamic Host Configuration Protocol (DHCP)).
These RA packets, in addition to identifying the routers, also contain a list of all network address prefixes in use on the LAN. This
is “prefix discovery”, described in the following section. To a first approximation on a simple network, prefix discovery supplies
the network portion of the IPv6 address; on IPv4 networks, DHCP usually supplies the entire IPv4 address.
RA packets may contain other important information about the LAN as well, such as an agreed-on MTU.
These IPv6 router messages represent a change from IPv4, in which routers need not send anything besides forwarded packets. To
become an IPv4 router, a node need only have IPv4 forwarding enabled in its kernel; it is then up to DHCP (or the equivalent) to
inform neighboring nodes of the router. IPv6 puts the responsibility for this notification on the router itself: for a node to become an
IPv6 router, in addition to forwarding packets, it “MUST” (RFC 4294 [https://tools.ietf.org/html/rfc4294.html]) also run software
to support Router Advertisement. Despite this mandate, however, the RA mechanism does not play a role in the forwarding process
itself; an IPv6 network can run without Router Advertisements if every node is, for example, manually configured to know where
the routers are and to know which neighbors are on-link. (We emphasize that manual configuration like this scales very poorly.)
On Linux systems, the Router Advertisement agent is most often the radvd [en.Wikipedia.org/wiki/Radvd] daemon. See 8.13 IPv6
Connectivity via Tunneling below.

8.6.2 Prefix Discovery


Closely related to Router Discovery is the Prefix Discovery process by which hosts learn what IPv6 network-address prefixes,
above, are valid on the network. It is also where hosts learn which prefixes are considered to be local to the host’s LAN, and thus
reachable at the LAN layer instead of requiring router assistance for delivery. IPv6, in other words, does not limit determination of
whether delivery is local to the IPv4 mechanism of having a node check a destination address against each of the network-address
prefixes assigned to the node’s interfaces.
Even IPv4 allows two IPv4 network prefixes to share the same LAN (eg a private one 10.1.2.0/24 and a public one
147.126.65.0/24), but a consequence of IPv4 routing is that two such LAN-sharing subnets can only reach one another via a router
on the LAN, even though they should in principle be able to communicate directly. IPv6 drops this restriction.
The Router Advertisement packets sent by the router should contain a complete list of valid network-address prefixes, as the Prefix
Information option. In simple cases this list may contain a single globally routable 64-bit prefix corresponding to the LAN subnet.
If a particular LAN is part of multiple (overlapping) physical subnets, the prefix list will contain an entry for each subnet; these 64-
bit prefixes will themselves likely share a common site-wide prefix of length N<64. For multihomed sites the prefix list may

8.7.1 https://eng.libretexts.org/@go/page/11174
contain multiple unrelated prefixes corresponding to the different address blocks. Finally, site-specific “unique local” IPv6 address
prefixes may also be included.
Each prefix will have an associated lifetime; nodes receiving a prefix from an RA packet are to use it only for the duration of this
lifetime. On expiration (and likely much sooner) a node must obtain a newer RA packet with a newer prefix list. The rationale for
inclusion of the prefix lifetime is ultimately to allow sites to easily renumber; that is, to change providers and switch to a new
network-address prefix provided by a new router. Each prefix is also tagged with a bit indicating whether it can be used for
autoconfiguration, as in 8.7.2 Stateless Autoconfiguration (SLAAC) below.
Each prefix also comes with a flag indicating whether the prefix is on-link. If set, then every node receiving that prefix is supposed
to be on the same LAN. Nodes assume that to reach a neighbor sharing the same on-link address prefix, Neighbor Solicitation is to
be used to find the neighbor’s LAN address. If a neighbor shares an off-link prefix, a router must be used. The IPv4 equivalent of
two nodes sharing the same on-link prefix is sharing the same subnet prefix. For an example of subnets with prefix-discovery
information, see 8.10 IPv6 Subnets.
Routers advertise off-link prefixes only in special cases; this would mean that a node is part of a subnet but cannot reach other
members of the subnet directly. This may apply in some wireless settings, eg MANETs (3.7.8 MANETs) where some nodes on the
same subnet are out of range of one another. It may also apply when using IPv6 Mobility (7.13 Mobile IP, RFC 3775
[https://tools.ietf.org/html/rfc3775.html]).

8.6.3 Neighbor Solicitation


Neighbor Solicitation messages are the IPv6 analogues of IPv4 ARP requests. These are essentially queries of the form “who has
IPv6 address X?” While ARP requests were broadcast, IPv6 Neighbor Solicitation messages are sent to the solicited-node
multicast address, which at the LAN layer usually represents a rather small multicast group. This address is ff02::0001:x.y.z.w,
where x, y, z and w are the low-order 32 bits of the IPv6 address the sender is trying to look up. Each IPv6 host on the LAN will
need to subscribe to all the solicited-node multicast addresses corresponding to its own IPv6 addresses (normally this is not too
many).
Neighbor Solicitation messages are repeated regularly, but followup verifications are initially sent to the unicast LAN address on
file (this is common practice with ARP implementations, but is optional). Unlike with ARP, other hosts on the LAN are not
expected to eavesdrop on the initial Neighbor Solicitation message. The target host’s response to a Neighbor Solicitation message
is called Neighbor Advertisement; a host may also send these unsolicited if it believes its LAN address may have changed.
The analogue of Proxy ARP is still permitted, in that a node may send Neighbor Advertisements on behalf of another. The most
likely reason for this is that the node receiving proxy services is a “mobile” host temporarily remote from the home LAN. Neighbor
Advertisements sent as proxies have a flag to indicate that, if the real target does speak up, the proxy advertisement should be
ignored.
Once a node (host or router) has discovered a neighbor’s LAN address through Neighbor Solicitation, it continues to monitor the
neighbor’s continued reachability.
Neighbor Solicitation also includes Neighbor Unreachability Detection. Each node (host or router) continues to monitor its known
neighbors; reachability can be inferred either from ongoing IPv6 traffic exchanges or from Neighbor Advertisement responses. If a
node detects that a neighboring host has become unreachable, the original node may retry the multicast Neighbor Solicitation
process, in case the neighbor’s LAN address has simply changed. If a node detects that a neighboring router has become
unreachable, it attempts to find an alternative path.
Finally, IPv4 ICMP Redirect messages have also been moved in IPv6 to the Neighbor Discovery protocol. These allow a router to
tell a host that another router is better positioned to handle traffic to a given destination.

8.6.4 Security and Neighbor Discovery


In the protocols outlined above, received ND messages are trusted; this can lead to problems with nodes pretending to be things
they are not. Here are two examples:

8.7.2 https://eng.libretexts.org/@go/page/11174
A host can pretend to be a router simply by sending out Router Advertisements; such a host can thus capture traffic from its
neighbors, and even send it on – perhaps selectively – to the real router.
A host can pretend to be another host, in the IPv6 analog of ARP spoofing (7.9.2 ARP Security). If host A sends out a Neighbor
Solicitation for host B, nothing prevents host C from sending out a Neighbor Advertisement claiming to be B (after previously
joining the appropriate multicast group).
These two attacks can have the goal either of eavesdropping or of denial of service; there are also purely denial-of-service attacks.
For example, host C can answer host B’s DAD queries (below at 8.7.1 Duplicate Address Detection) by claiming that the IPv6
address in question is indeed in use, preventing B from ever acquiring an IPv6 address. A good summary of these and other attacks
can be found in RFC 3756 [https://tools.ietf.org/html/rfc3756.html].
These attacks, it is worth noting, can only be launched by nodes on the same LAN; they cannot be launched remotely. While this
reduces the risk, though, it does not eliminate it. Sites that allow anyone to connect, such as Internet cafés, run the highest risk, but
even in a setting in which all workstations are “locked down”, a node compromised by a virus may be able to disrupt the network.
RFC 4861 [https://tools.ietf.org/html/rfc4861.html] suggested that, at sites concerned about these kinds of attacks, hosts might use
the IPv6 Authentication Header or the Encapsulated Security Payload Header to supply digital signatures for ND packets (see 22.11
IPsec). If a node is configured to require such checks, then most ND-based attacks can be prevented. Unfortunately, RFC 4861
[https://tools.ietf.org/html/rfc4861.html] offered no suggestions beyond static configuration, which scales poorly and also rather
completely undermines the goal of autoconfiguration.
A more flexible alternative is Secure Neighbor Discovery, or SEND, specified in RFC 3971
[https://tools.ietf.org/html/rfc3971.html]. This uses public-key encryption (22.9 Public-Key Encryption) to validate ND messages;
for the remainder of this section, some familiarity with the material at 22.9 Public-Key Encryption may be necessary. Each message
is digitally signed by the sender, using the sender’s private key; the recipient can validate the message using the sender’s
corresponding public key. In principle this makes it impossible for one message sender to pretend to be another sender.
In practice, the problem is that public keys by themselves guarantee (if not compromised) only that the sender of a message is the
same entity that previously sent messages using that key. In the second bulleted example above, in which C sends an ND message
falsely claiming to be B, straightforward applications of public keys would prevent this if the original host A had previously heard
from B, and trusted that sender to be the real B. But in general A would not know which of B or C was the real B. A cannot trust
whichever host it heard from first, as it is indeed possible that C started its deception with A’s very first query for B, beating B to
the punch.
A common solution to this identity-guarantee problem is to create some form of “public-key infrastructure” such as certificate
authorities, as in 22.10.2.1 Certificate Authorities. In this setting, every node is configured to trust messages signed by the
certificate authority; that authority is then configured to vouch for the identities of other nodes whenever this is necessary for
secure operation. SEND implements its own version of certificate authorities; these are known as trust anchors. These would be
configured to guarantee the identities of all routers, and perhaps hosts. The details are somewhat simpler than the mechanism
outlined in 22.10.2.1 Certificate Authorities, as the anchors and routers are under common authority. When trust anchors are used,
each host needs to be configured with a list of their addresses.
SEND also supports a simpler public-key validation mechanism known as cryptographically generated addresses, or CGAs
(RFC 3972 [https://tools.ietf.org/html/rfc3972.html]). These are IPv6 interface identifiers that are secure hashes (22.6 Secure
Hashes) of the host’s public key (and a few other non-secret parameters). CGAs are an alternative to the interface-identifier
mechanisms discussed in 8.2.1 Interface identifiers. The .onion names used by Tor hidden services are based on a similar hash-of-public-key idea.
The use of CGAs makes it impossible for host C to successfully claim to be host B: only B will have the public key that hashes to
B’s address and the matching private key. If C attempts to send to A a neighbor advertisement claiming to be B, then C can sign the
message with its own private key, but the hash of the corresponding public key will not match the interface-identifier portion of B’s
address. Similarly, in the DAD scenario, if C attempts to tell B that B’s newly selected CGA address is already in use, then again C
won’t have a key matching that address, and B will ignore the report.
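The check itself is simple to sketch. The following toy Python version (with hypothetical names, and far simpler than the real RFC 3972 computation, which involves additional parameters and a security level) shows its general shape:

import hashlib

# Toy CGA check: the claimed interface identifier must equal a hash of the
# sender's public key (real CGAs hash several other non-secret parameters too).
def cga_iid(pubkey, modifier):
    return hashlib.sha256(modifier + pubkey).digest()[:8]

def claim_is_valid(iid, pubkey, modifier):
    return cga_iid(pubkey, modifier) == iid

Host C gains nothing by presenting its own public key: the hash would not match B’s interface identifier, and without B’s private key C cannot sign messages under B’s public key.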
In general, CGA addresses allow recipients of a message to verify that the source address is the “owner” of the associated public
key, without any need for a public-key infrastructure (22.9.3 Trust and the Man in the Middle). C can still pretend to be a router,
using its own CGA address, because router addresses are not known by the requester beforehand. However, it is easier to protect
routers using trust anchors as there are fewer of them.

SEND relies on the fact that finding a second input hashing to a given 64-bit CGA is infeasible, as in general this would take about 2^64 tries. An IPv4 analog would be impossible, as the host portion of an IPv4 address does not have enough bits to prevent finding hash collisions via brute force. For example, if the host portion of the address has ten bits, it would take C only about 2^10 tries (by tweaking the supplemental hash parameters) until it found a match for B’s CGA.
SEND has seen very little use in the IPv6 world, partly because IPv6 itself has seen such slow adoption, but also because of the
perception that the vulnerabilities SEND protects against are difficult to exploit.
RA-guard is a simpler mechanism to achieve ND security, but one that requires considerable support from the LAN layer.
Outlined in RFC 6105 [https://tools.ietf.org/html/rfc6105.html], it requires that each host connect directly to a switch; that is,
there must be no shared-media Ethernet. The switches must also be fairly smart; it must be possible to configure them to know
which ports connect to routers rather than hosts, and, in addition, it must be possible to configure them to block Router
Advertisements from host ports that are not router ports. This is quite effective at preventing a host from pretending to be a router,
and, while it assumes that the switches can do a significant amount of packet inspection, that is in fact a fairly common Ethernet
switch feature. If Wi-Fi is involved, it does require that access points (which are a kind of switch) be able to block Router
Advertisements; this isn’t quite as commonly available. In determining which switch ports are connected to routers, RFC 6105
[https://tools.ietf.org/html/rfc6105.html] suggests that there might be a brief initial learning period, during which all switch ports
connecting to a device that claims to be a router are considered, permanently, to be router ports.

8.8: IPv6 Host Address Assignment
IPv6 provides two competing ways for hosts to obtain their full IP addresses. One is DHCPv6, based on IPv4’s DHCP (7.10
Dynamic Host Configuration Protocol (DHCP)), in which the entire address is handed out by a DHCPv6 server. The other is
StateLess Address AutoConfiguration, or SLAAC, in which the interface-identifier part of the address is generated locally, and
the network prefix is obtained via prefix discovery. The original idea behind SLAAC was to support complete plug-and-play
network setup: hosts on an isolated LAN could talk to one another out of the box, and if a router was introduced connecting the
LAN to the Internet, then hosts would be able to determine unique, routable addresses from information available from the router.
In the early days of IPv6 development, in fact, DHCPv6 may have been intended only for address assignments to routers and
servers, with SLAAC meant for “ordinary” hosts. In that era, it was still common for IPv4 addresses to be assigned “statically”, via
per-host configuration files. RFC 4862 [https://tools.ietf.org/html/rfc4862.html] states that SLAAC is to be used when “a site is not
particularly concerned with the exact addresses hosts use, so long as they are unique and properly routable.”
SLAAC and DHCPv6 evolved to some degree in parallel. While SLAAC solves the autoconfiguration problem quite neatly, at this
point DHCPv6 solves it just as effectively, and provides for greater administrative control. For this reason, SLAAC may end up
less widely deployed. On the other hand, SLAAC gives hosts greater control over their IPv6 addresses, and so may end up offering
hosts a greater degree of privacy by allowing endpoint management of the use of private and temporary addresses (below).
When a host first begins the Neighbor Discovery process, it receives a Router Advertisement packet. In this packet are two special
bits: the M (managed) bit and the O (other configuration) bit. The M bit is set to indicate that DHCPv6 is available on the network
for address assignment. The O bit is set to indicate that DHCPv6 is able to provide additional configuration information (eg the
name of the DNS server) to hosts that are using SLAAC to obtain their addresses. In addition, each individual prefix in the RA
packet has an A bit, which when set indicates that the associated prefix may be used with SLAAC.

8.7.1 Duplicate Address Detection


Whenever an IPv6 host obtains a unicast address – a link-local address, an address created via SLAAC, an address received via
DHCPv6 or a manually configured address – it goes through a duplicate-address detection (DAD) process. The host sends one or
more Neighbor Solicitation messages (that is, like an ARP query), as in 8.6 Neighbor Discovery, asking if any other host has this
address. If anyone answers, then the address is a duplicate. As with IPv4 ACD (7.9.1 ARP Finer Points), but not as with the
original IPv4 self-ARP, the source-IP-address field of this NS message is set to a special “unspecified” value; this allows other
hosts to recognize it as a DAD query.
Because this NS process may take some time, and because addresses are in fact almost always unique, RFC 4429
[https://tools.ietf.org/html/rfc4429.html] defines an optimistic DAD mechanism. This allows limited use of an address before the
DAD process completes; in the meantime, the address is marked as “optimistic”.
Outside the optimistic-DAD interval, a host is not allowed to use an IPv6 address if the DAD process has failed. RFC 4862
[https://tools.ietf.org/html/rfc4862.html] in fact goes further: if a host with an established address receives a DAD query for that
address, indicating that some other host wants to use that address, then the original host should discontinue use of the address.
If the DAD process fails for an address based on an EUI-64 identifier, then some other node has the same Ethernet address and you
have bigger problems than just finding a working IPv6 address. If the DAD process fails for an address constructed with the RFC
7217 [https://tools.ietf.org/html/rfc7217.html] mechanism, 8.2.1 Interface identifiers, the host is able to generate a new interface
identifier and try again. A counter for the number of DAD attempts is included in the hash that calculates the interface identifier;
incrementing this counter results in an entirely new identifier.
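A sketch of this style of computation in Python (parameter choices hypothetical; the real RFC 7217 computation also includes a Network_ID and uses a specified pseudorandom function):

import hashlib

# RFC 7217-style stable interface identifier: a hash of the prefix, the
# interface, a DAD counter and a per-host secret key. Incrementing the
# counter after a DAD failure yields an entirely new identifier.
def stable_iid(prefix, net_iface, dad_counter, secret):
    data = prefix.encode() + net_iface.encode() + dad_counter.to_bytes(4, 'big') + secret
    return int.from_bytes(hashlib.sha256(data).digest()[:8], 'big')

print(hex(stable_iid('2001:db8:0:1', 'eth0', 0, b'per-host secret')))
print(hex(stable_iid('2001:db8:0:1', 'eth0', 1, b'per-host secret')))   # after a DAD failure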
While DAD works quite well on Ethernet-like networks with true LAN-layer multicast, it may be inefficient on, say, MANETs
(3.7.8 MANETs), as distant hosts may receive the DAD Neighbor Solicitation message only after some delay, or even not at all.
Work continues on the development of improvements to DAD for such networks.

8.7.2 Stateless Autoconfiguration (SLAAC)


To obtain an address via SLAAC, defined in RFC 4862 [https://tools.ietf.org/html/rfc4862.html], the first step for a host is to
generate its link-local address (above, 8.2.2 Link-local addresses), appending the standard 64-bit link-local prefix fe80::/64 to its
interface identifier (8.2.1 Interface identifiers). The latter is likely derived from the host’s LAN address using either EUI-64 or the
RFC 7217 [https://tools.ietf.org/html/rfc7217.html] mechanism; the important point is that it is available without network
involvement.
The host must then ensure that its newly configured link-local address is in fact unique; it uses DAD (above) to verify this.
Assuming no duplicate is found, then at this point the host can talk to any other hosts on the same LAN, eg to figure out where the
printers are.
The next step is to see if there is a router available. The host may send a Router Solicitation (RS) message to the all-routers
multicast address. A router – if present – should answer with a Router Advertisement (RA) message that also contains a Prefix
Information option; that is, a list of IPv6 network-address prefixes (8.6.2 Prefix Discovery).
As mentioned earlier, the RA message will mark with a flag those prefixes eligible for use with SLAAC; if no prefixes are so
marked, then SLAAC should not be used. All prefixes will also be marked with a lifetime, indicating how long the host may
continue to use the prefix. Once the prefix expires, the host must obtain a new one via a new RA message.
The host chooses an appropriate prefix, stores the prefix-lifetime information, and appends the prefix to the front of its interface
identifier to create what should now be a routable address. The address so formed must now be verified through the DAD
mechanism above.
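The composition step is mechanical: the 64-bit interface identifier is OR-ed into the advertised /64 prefix. A quick Python sketch, with a hypothetical prefix and an EUI-64-style identifier:

import ipaddress

# Form a SLAAC address from an advertised /64 prefix and a 64-bit interface identifier.
def slaac_address(prefix, iid):
    net = ipaddress.IPv6Network(prefix)
    assert net.prefixlen == 64
    return ipaddress.IPv6Address(int(net.network_address) | iid)

print(slaac_address('2001:db8:0:1::/64', 0x02A0CCFFFE24B0E4))
# 2001:db8:0:1:2a0:ccff:fe24:b0e4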
In the era of EUI-64 interface identifiers, it would in principle have been possible for the receiver of a packet to extract the sender’s
LAN address from the interface-identifier portion of the sender’s SLAAC-generated IPv6 address. This in turn would allow
bypassing the Neighbor Solicitation process to look up the sender’s LAN address. This was never actually permitted, however,
even before the privacy options below, as there is no way to be certain that a received address was in fact generated via SLAAC.
With RFC 7217 [https://tools.ietf.org/html/rfc7217.html]-based interface identifiers, LAN-address extraction is no longer even
potentially an option.
A host using SLAAC may receive multiple network prefixes, and thus generate for itself multiple addresses. RFC 6724
[https://tools.ietf.org/html/rfc6724.html] defines a process for a host to determine, when it wishes to connect to destination address
D, which of its own multiple addresses to use. For example, if D is a unique-local address, not globally visible, then the host will
likely want to choose a source address that is also unique-local. RFC 6724 [https://tools.ietf.org/html/rfc6724.html] also includes
mechanisms to allow a host with a permanent public address (possibly corresponding to a DNS entry, but just as possibly formed
directly from an interface identifier) to prefer alternative “temporary” or “privacy” addresses for outbound connections. Finally,
RFC 6724 [https://tools.ietf.org/html/rfc6724.html] also defines the sorting order for multiple addresses representing the same
destination; see 8.11 Using IPv6 and IPv4 Together.
At the end of the SLAAC process, the host knows its IPv6 address (or set of addresses) and its default router. In IPv4, these would
have been learned through DHCP along with the identity of the host’s DNS server; one concern with SLAAC is that it originally
did not provide a way for a host to find its DNS server. One strategy is to fall back on DHCPv6 for this. However, RFC 6106
[https://tools.ietf.org/html/rfc6106.html] now defines a process by which IPv6 routers can include DNS-server information in the
RA packets they send to hosts as part of the SLAAC process; this completes the final step of autoconfiguration.
How to get DNS names for SLAAC-configured IPv6 hosts into the DNS servers is an entirely separate issue. One approach is
simply not to give DNS names to such hosts. In the NAT-router model for IPv4 autoconfiguration, hosts on the inward side of the
NAT router similarly do not have DNS names (although they are also not reachable directly, while SLAAC IPv6 hosts would be
reachable). If DNS names are needed for hosts, then a site might choose DHCPv6 for address assignment instead of SLAAC. It is
also possible to figure out the addresses SLAAC would use (by identifying the host-identifier bits) and then creating DNS entries
for these hosts. Finally, hosts can also use Dynamic DNS (RFC 2136 [https://tools.ietf.org/html/rfc2136.html]) to update their own
DNS records.

8.7.2.1 SLAAC privacy


A portable host that always uses SLAAC as it moves from network to network and always bases its SLAAC addresses on the EUI-
64 interface identifier (or on any other static interface identifier) will be easy to track: its interface identifier will never change. This
is one reason why the obfuscation mechanism of RFC 7217 [https://tools.ietf.org/html/rfc7217.html] interface identifiers (8.2.1
Interface identifiers) includes the network prefix in the hash: connecting to a new network will then result in a new interface
identifier.

Well before RFC 7217 [https://tools.ietf.org/html/rfc7217.html], however, RFC 4941 [https://tools.ietf.org/html/rfc4941.html]
introduced a set of privacy extensions to SLAAC: optional mechanisms for the generation of alternative interface identifiers,
based as with RFC 7217 on pseudorandom generation using the original LAN-address-based interface identifier as a “seed” value.
RFC 4941 goes further, however, in that it supports regular changes to the interface identifier, to increase the difficulty of tracking
a host over time even if it does not change its network prefix. One first selects a 128-bit secure-hash function F(), eg MD5 (22.6
Secure Hashes). New temporary interface IDs (IIDs) can then be calculated as follows:

    (IID_new, seed_new) = F(seed_old, IID_old)    (8.8.1)

where the left-hand pair represents the two 64-bit halves of the 128-bit return value of F(), and the arguments to F() are concatenated together. (The seventh bit of IID_new must also be set to 0; cf 8.2.1 Interface identifiers, where this bit is set to 1.) This process is privacy-safe even if the initial IID is based on EUI-64.
The probability of two hosts accidentally choosing the same interface identifier in this manner is vanishingly small; the Neighbor
Solicitation mechanism with DAD must, however, still be used to verify that the address is in fact unique within the host’s LAN.
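Here is one way a single step of this iteration might look in Python, using MD5 as the 128-bit hash F as suggested above; this is a sketch of equation (8.8.1), not a production implementation:

import hashlib

# One RFC 4941 iteration: hash the concatenation of the old seed and old IID.
# The first 64-bit half becomes the new IID (with the seventh bit forced to 0)
# and the second half becomes the new seed.
def next_temp_iid(seed, iid):
    digest = hashlib.md5(seed + iid).digest()
    new_iid = bytes([digest[0] & 0xFD]) + digest[1:8]
    return new_iid, digest[8:16]

iid, seed = next_temp_iid(b'\x00' * 8, bytes.fromhex('02a0ccfffe24b0e4'))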
The privacy addresses above are to be used only for connections initiated by the client; to the extent that the host accepts incoming
connections and so needs a “fixed” IPv6 address, the address based on the original EUI-64/RFC-7217 interface identifier should
still be available. As a result, the RFC 7217 mechanism is still important for privacy even if the RFC 4941 mechanism is fully
operational.
RFC 4941 stated that privacy addresses were to be disabled by default, largely because of concerns about frequently changing IP
addresses. These concerns have abated with experience and so privacy addresses are often now automatically enabled. Typical
address lifetimes range from a few hours to 24 hours. Once an address has “expired” it generally remains available but deprecated
for a few temporary-address cycles longer.
DHCPv6 also provides an option for temporary address assignments, again to improve privacy, but one of the potential advantages
of SLAAC is that this process is entirely under the control of the end system.
Regularly (eg every few hours, or less) changing the host portion of an IPv6 address should make external tracking of a host more
difficult, at least if tracking via web-browser cookies is also somehow prevented. However, for a residential “site” with only a
handful of hosts, a considerable degree of tracking may be obtained simply by observing the common 64-bit prefix.
For a general discussion of privacy issues related to IPv6 addressing, see RFC 7721 [https://tools.ietf.org/html/rfc7721.html].

8.7.3 DHCPv6
The job of a DHCPv6 server is to tell an inquiring host its network prefix(es) and also to supply a 64-bit host identifier, very much as an IPv4 DHCP server does. Hosts begin the process by sending a DHCPv6 request to the All_DHCP_Relay_Agents_and_Servers multicast IPv6 address ff02::1:2 (versus the broadcast address for IPv4). As with DHCPv4, the job of a relay agent is to tag a DHCPv6 request with the correct current subnet, and then to forward it to the actual DHCPv6 server. This allows the DHCPv6
server to be on a different subnet from the requester. Note that the use of multicast does nothing to diminish the need for relay
agents. In fact, the All_DHCP_Relay_Agents_and_Servers multicast address scope is limited to the current LAN; relay agents then
forward to the actual DHCPv6 server using the site-scoped address All_DHCP_Servers.
Hosts using SLAAC to obtain their address can still use a special Information-Request form of DHCPv6 to obtain their DNS server
and any other “static” DHCPv6 information.
Clients may ask for temporary addresses. These are identified as such in the “Identity Association” field of the DHCPv6 request.
They are handled much like “permanent” address requests, except that the client may ask for a new temporary address only a short
time later. When the client does so, a different temporary address will be returned; a repeated request for a permanent address, on
the other hand, would usually return the same address as before.
When the DHCPv6 server returns a temporary address, it may of course keep a log of this address. The absence of such a log is one
reason SLAAC may provide a greater degree of privacy. SLAAC also places control of the cryptographic mechanisms for
temporary-address creation in the hands of the end user.
A DHCPv6 response contains a list (perhaps of length 1) of IPv6 addresses. Each separate address has an expiration date. The
client must send a new request before the expiration of any address it is actually using.

In DHCPv4, the host portion of addresses typically comes from “address pools” representing small ranges of integers such as 64-
254; these values are generally allocated consecutively. A DHCPv6 server, on the other hand, should take advantage of the
enormous range (2^64) of possible host portions by allocating values more sparsely, through the use of pseudorandomness. This is to
make it very difficult for an outsider who knows one of a site’s host addresses to guess the addresses of other hosts, cf 8.2.1
Interface identifiers.
The Internet Draft draft-ietf-dhc-stable-privacy-addresses [tools.ietf.org/id/draft-ietf...resses-00.txt] proposes the following
mechanism by which a DHCPv6 server may generate the interface-identifier bits for the addresses it hands out; F() is a secure-hash
function and its arguments are concatenated together:

    F(prefix, client_DUID, IAID, DAD_counter, secret_key)    (8.8.2)

The prefix, DAD_counter and secret_key arguments are as in 8.7.2.1 SLAAC privacy. The client_DUID is the string by which the
client identifies itself to the DHCPv6 server; it may be based on the Ethernet address though other options are possible. The IAID,
or Identity Association identifier, is a client-provided name for this request; different names are used when requesting temporary
versus permanent addresses.
Some older DHCPv6 servers may still allocate interface identifiers in serial order; such obsolete servers might make the SLAAC
approach more attractive.

8.9: Globally Exposed Addresses
Perhaps the most striking difference between a contemporary IPv4 network and an IPv6 network is that on the former, many hosts
are likely to be “hidden” behind a NAT router (7.7 Network Address Translation). On an IPv6 network, on the other hand, every
host may be globally visible to the IPv6 world (though NAT may still be used to allow connectivity to legacy IPv4 servers).
Legacy IPv4 NAT routers provide a measure of privacy and security, along with a certain amount of nuisance. Privacy in IPv6 can be handled, as above, through private or temporary addresses.
The degree of security provided via NAT is entirely due to the fact that all connections must be initiated from the inside; no packet
from the outside is allowed through the NAT firewall unless it is a response to a packet sent from the inside. This feature, however,
can also be implemented via a conventional firewall (IPv4 or IPv6), without address translation. Furthermore, given such a
conventional firewall, it is then straightforward to modify it so as to support limited and regulated connections from the outside
world as desired; an analogous modification of a NAT router is more difficult. (That said, a blanket ban on IPv6 connections from
the outside can prove as frustrating as IPv4 NAT.)
Finally, one of the major reasons for hiding IPv4 addresses is that with IPv4 it is easy to map a /24 subnet by pinging or otherwise
probing each of the 254 possible hosts; such mapping may reveal internal structure. In IPv6 such mapping is meant to be
impractical, as a /64 subnet has 2^64 ≃ 18 quintillion hosts (though see the randomness note in 8.2.1 Interface identifiers). If the low-
order 64 bits of a host’s IPv6 address are chosen with sufficient randomness, finding the host by probing is virtually impossible; see
exercise 6.0.
As for nuisance, NAT has always broken protocols that involve negotiation of new connections (eg TFTP, FTP, or SIP, used by
VoIP); IPv6 should make these much easier to manage.

8.10: ICMPv6
RFC 4443 [https://tools.ietf.org/html/rfc4443.html] defines an updated version of the ICMP protocol for IPv6. As with the IPv4
version, messages are identified by 8-bit type and code (subtype) fields, making it reasonably easy to add new message formats. We
have already seen the ICMP messages that make up Neighbor Discovery (8.6 Neighbor Discovery).
Unlike ICMPv4, ICMPv6 distinguishes between informational and error messages by the first bit of the type field. Unknown
informational messages are simply dropped, while unknown error messages must be handed off, if possible, to the appropriate
upper-layer process. For example, “[UDP] port unreachable” messages are to be delivered to the UDP sender of the undeliverable
packet.
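The classification is visible directly in the type value; a one-line Python illustration:

# RFC 4443: ICMPv6 types 0-127 are error messages, types 128-255 informational.
def is_icmpv6_error(icmp_type):
    return (icmp_type & 0x80) == 0

print(is_icmpv6_error(1))      # True: Destination Unreachable
print(is_icmpv6_error(128))    # False: Echo Request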
ICMPv6 includes an IPv6 version of Echo Request / Echo Reply, upon which the “ping6” command (8.12.1 ping6) is based; unlike
with IPv4, arriving IPv6 echo-reply messages are delivered to the process that generated the corresponding echo request. The base
ICMPv6 specification also includes formats for the error conditions below; this list is somewhat cleaner than the corresponding
ICMPv4 list:
Destination Unreachable
In this case, one of the following numeric codes is returned:
0. No route to destination, returned when a router has no next_hop entry.
1. Communication with destination administratively prohibited, returned when a router has a next_hop entry, but declines to
use it for policy reasons. Codes 5 and 6, below, are special cases of this situation; these more-specific codes are returned when
appropriate.
2. Beyond scope of source address, returned when a router is, for example, asked to route a packet to a global address, but the source address is not globally routable, eg is unique-local. In IPv4, when a host with a private address attempts to connect to a global address,
NAT is almost always involved.
3. Address unreachable, a catchall category for routing failure not covered by any other message. An example is if the packet
was successfully routed to the last_hop router, but Neighbor Discovery failed to find a LAN address corresponding to the IPv6
address.
4. Port unreachable, returned when, as in ICMPv4, the destination host does not have the requested UDP port open.
5. Source address failed ingress/egress policy, see code 1.
6. Reject route to destination, see code 1.
Packet Too Big
This is like ICMPv4’s “Fragmentation Required but DontFragment flag set”; IPv6 however has no router-based fragmentation.
Time Exceeded
This is used for cases where the Hop Limit was exceeded, and also where source-based fragmentation was used and the fragment-
reassembly timer expired.
Parameter Problem
This is used when there is a malformed entry in the IPv6 header, an unrecognized Next Header type, or an unrecognized IPv6
option.

8.9.1 Node Information Messages


ICMPv6 also includes Node Information (NI) Messages, defined in RFC 4620 [https://tools.ietf.org/html/rfc4620.html]. One form
of NI query allows a host to be asked directly for its name; this is accomplished in IPv4 via reverse-DNS lookups (7.8.2 Other DNS
Records). Other NI queries allow a host to be asked for its other IPv6 addresses, or for its IPv4 addresses. Recipients of NI queries
may be configured to refuse to answer.

8.11: IPv6 Subnets
In the IPv4 world, network managers sometimes struggle to divide up a limited address space into a pool of appropriately sized
subnets. In IPv6, this is much simpler: all subnets are of size /64, following the guidelines set out in 8.3 Network Prefixes.
There is one common exception: RFC 6164 [https://tools.ietf.org/html/rfc6164.html] permits the use of 127-bit prefixes at each
end of a point-to-point link. The 128th bit is then 0 at one end and 1 at the other.
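Python’s ipaddress module illustrates the point: a /127 network contains exactly the two endpoint addresses.

import ipaddress

print(list(ipaddress.IPv6Network('2001:db8::/127')))
# [IPv6Address('2001:db8::'), IPv6Address('2001:db8::1')]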
A site receiving from its provider an address prefix of size /56 can assign up to 256 /64 subnets. As with IPv4, the reasons for IPv6
subnetting are to join incompatible LANs, to press intervening routers into service as inter-subnet firewalls, or otherwise to
separate traffic.
The diagram below shows a site with an external prefix of 2001::/62, two routers R1 and R2 with interfaces numbered as shown,
and three internal LANS corresponding to three subnets 2001:0:0:1::/64, 2001:0:0:2::/64 and 2001:0:0:3::/64. The subnet
2001:0:0:0::/64 (2001::/64) is used to connect to the provider.

[Figure: the provider connects via subnet 2001:0:0:0::/64 to interface 0 of R1; interface 1 of R1 connects via subnet 2001:0:0:1::/64 to interface 1 of R2; interfaces 2 and 3 of R2 connect to subnets 2001:0:0:2::/64 and 2001:0:0:3::/64 respectively.]

Interface 0 of R1 would be assigned an address from the /64 block 2001:0:0:0/64, perhaps 2001::2.
R1 will announce over its interface 1 – via router advertisements – that it has a route to ::/0, that is, it has the default route. It will
also advertise via interface 1 the on-link prefix 2001:0:0:1::/64.
R2 will announce via interface 1 its routes to 2001:0:0:2::/64 and 2001:0:0:3::/64. It will also announce the default route on
interfaces 2 and 3. On interface 2 it will advertise the on-link prefix 2001:0:0:2::/64, and on interface 3 the prefix 2001:0:0:3::/64. It
could also, as a backup, advertise prefix 2001:0:0:1::/64 on its interface 1. On each subnet, only the subnet’s on-link prefix is
advertised.
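The four /64 subnets available within the site’s 2001::/62 allocation can be enumerated mechanically; a quick Python check:

import ipaddress

# A /62 external allocation yields exactly four /64 subnets.
for subnet in ipaddress.IPv6Network('2001::/62').subnets(new_prefix=64):
    print(subnet)
# 2001::/64  2001:0:0:1::/64  2001:0:0:2::/64  2001:0:0:3::/64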

8.10.1 Subnets and /64


Fixing the IPv6 division of prefix and host (interface) lengths at 64 bits for each is a compromise. While it does reduce the
maximum number of subnets from 2^128 to 2^64, in practice this is not a realistic concern, as 2^64 is still an enormous number.
By leaving 64 bits for host identifiers, this 64/64 split leaves enough room for the privacy mechanisms of 8.7.2.1 SLAAC privacy
and 8.7.3 DHCPv6 to provide reasonable protection.
Much of the recent motivation for considering divisions other than 64/64 is grounded in concerns about ISP address-allocation
policies. By declaring that users should each receive a /64 allocation, one hope is that users will in fact get enough for several
subnets. Even a residential customer with only, say, two hosts and a router needs more than a single /64 address block, because the
link from ISP to customer needs to be on its own subnet (it could use a 127-bit prefix, as above, but many customers would in fact
have a need for multiple /64 subnets). By requiring /64 for a subnet, the hope is that users will all be allocated, for example,
prefixes of at least /60 (16 subnets) or even /56 (256 subnets).
Even if that hope does not pan out, the 64/64 rule means that every user should at least get a /64 allocation.
On the other hand, if users are given only /64 blocks, and they want to use subnets, then they have to break the 64/64 rule locally.
Perhaps they can create four subnets each with a prefix of length 66 bits, and each with only 62 bits for the host identifier. Wanting
to do that in a standard way would dictate more flexibility in the prefix/host division.
But if the prefix/host division becomes completely arbitrary, there is nothing to stop ISPs from handing out prefixes with lengths of
/80 (leaving 48 host bits) or even /120.
The general hope is that ISPs will not be so stingy with prefix lengths. But with IPv6 adoption still relatively modest, how this will
all work out is not yet clear. In the IPv4 world, users use NAT (7.7 Network Address Translation) to create as many subnets as they
desire. In the IPv6 world, NAT is generally considered to be a bad idea.

Finally, in theory it is possible to squeeze a site with two subnets onto a single /64 by converting the site’s main router to a switch;
all the customer’s hosts now connect on an equal footing to the ISP. But this means making it much harder to use the router as a
firewall, as described in 8.8 Globally Exposed Addresses. For most users, this is too risky.

8.12: Using IPv6 and IPv4 Together
In this section we will assume that IPv6 connectivity exists at a site; if it does not, see 8.13 IPv6 Connectivity via Tunneling.
If IPv6 coexists on a client machine with IPv4, in a so-called dual-stack configuration, which is used? If the client wants to
connect using TCP to an IPv4-only website (or to some other network service), there is no choice. But what if the remote site also
supports both IPv4 and IPv6?
The first step is the DNS lookup, triggered by the application’s call to the appropriate address-lookup library procedure; in the Java
stalk example of 11.1.3.3 The Client we use InetAddress.getByName() . In the C language, address lookup is done with
getaddrinfo() or (the now-deprecated) gethostbyname() . The DNS system on the client then contacts its DNS
resolver and asks for the appropriate address record corresponding to the server name.
For IPv4 addresses, DNS maintains so-called “A” records, for “Address”. The IPv6 equivalent is the “AAAA” record, for “Address
four times longer”. A dual-stack machine usually requests both. The Internet Draft draft-vavrusa-dnsop-aaaa-for-free
[datatracker.ietf.org/doc/dra...aaa-for-free/] proposes that, whenever a DNS server delivers an IPv4 A record, it also includes the
corresponding AAAA record, much as IPv4 CNAME records are sent with piggybacked corresponding A records (7.8.1 nslookup
(and dig)). The DNS requests are sent to the client’s pre-configured DNS-resolver address (probably set via DHCP).

IPv6 and this book


This book is, as of April 2015, available via IPv6. Within the cs.luc.edu DNS zone are defined the following:
intronetworks [http://intronetworks.cs.luc.edu]: both A and AAAA records
intronetworks6 [http://intronetworks6.cs.luc.edu]: AAAA records only
intronetworks4 [http://intronetworks4.cs.luc.edu]: A records only

DNS itself can run over either IPv4 or IPv6. A DNS server (authoritative nameserver or just resolver) using only IPv4 can answer
IPv6 AAAA-record queries, and a DNS server using only IPv6 can answer IPv4 A-record queries. Ideally each nameserver would
eventually support both IPv4 and IPv6 for all queries, though it is common for hosts with newly enabled IPv6 connectivity to
continue to use IPv4-only resolvers. See RFC 4472 [https://tools.ietf.org/html/rfc4472.html] for a discussion of some operational
issues.
Here is an example of DNS requests for A and AAAA records made with the nslookup utility from the command line. (In this
example, the DNS resolver was contacted using IPv4.)

nslookup -query=A facebook.com
Name: facebook.com
Address: 173.252.120.6

nslookup -query=AAAA facebook.com
facebook.com has AAAA address 2a03:2880:2130:cf05:face:b00c:0:1
A few sites have IPv6-only DNS names. If the DNS query returns only an AAAA record, IPv6 must be used. One example in 2015
is ipv6.google.com [http://ipv6.google.com]. In general, however, IPv6-only names such as this are recommended only for
diagnostics and testing. The primary DNS names for IPv4/IPv6 sites should have both types of DNS records, as in the Facebook
example above (and as for google.com [http://google.com]).

Java getByName()
The Java getByName() call may not abide by system-wide RFC 6724 [https://tools.ietf.org/html/rfc6724.html]-style preferences; the Java Networking Properties documentation [docs.oracle.com/javase/8/docs...operties.html] (2015) states that
“the default behavior is to prefer using IPv4 addresses over IPv6 ones”. This can be changed by setting the system property
java.net.preferIPv6Addresses to true , using System.setProperty() .

If the client application uses a library call like Java’s InetAddress.getByName() , which returns a single IP address, the
client will then attempt to connect to the address returned. If an IPv4 address is returned, the connection will use IPv4, and
similarly with IPv6. If an IPv6 address is returned and IPv6 connectivity is not working, then the connection will fail.
For such an application, the DNS resolver library thus effectively makes the IPv4-or-IPv6 decision. RFC 6724
[https://tools.ietf.org/html/rfc6724.html], which we encountered above in 8.7.2 Stateless Autoconfiguration (SLAAC), provides a
configuration mechanism, through a small table of IPv6 prefixes and precedence values such as the following.

prefix             precedence
::1/128                50     IPv6 loopback
::/0                   40     “default” match
2002::/16              30     6to4 address; see sidebar in 8.13 IPv6 Connectivity via Tunneling
::ffff:0:0/96          10     matches embedded IPv4 addresses; see 8.3 Network Prefixes
fc00::/7                3     unique-local plus reserved; see 8.3 Network Prefixes

An address is assigned a precedence by looking it up in the table, using the longest-match rule (10.1 Classless Internet Domain
Routing: CIDR); a list of addresses is then sorted in decreasing order of precedence. There is no entry above for link-local
addresses, but by default they are ranked below global addresses. This can be changed by including the link-local prefix fe80::/64
in the above table and ranking it higher than, say, ::/0.
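A sketch of this longest-match precedence lookup in Python, using the default table above (function and variable names hypothetical):

import ipaddress

TABLE = [('::1/128', 50), ('::/0', 40), ('2002::/16', 30),
         ('::ffff:0:0/96', 10), ('fc00::/7', 3)]

# Find every table prefix containing the address and keep the longest match.
def precedence(addr):
    a = ipaddress.IPv6Address(addr)
    matches = [(ipaddress.IPv6Network(p), prec) for (p, prec) in TABLE
               if a in ipaddress.IPv6Network(p)]
    return max(matches, key=lambda m: m[0].prefixlen)[1]

print(precedence('2600:3c03::f03c:91ff:fe69:f438'))   # 40: plain IPv6
print(precedence('::ffff:162.216.18.28'))             # 10: embedded IPv4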
The default configuration is generally to prefer IPv6 if IPv6 is available; that is, if an interface has an IPv6 address that is (or
should be) globally routable. Given the availability of both IPv6 and IPv4, a preference for IPv6 is implemented by assigning the
prefix ::/0 – matching general IPv6 addresses – a higher precedence than that assigned to the IPv4-specific prefix ::ffff:0:0/96. This
is done in the table above.
Preferring IPv6 does not always work out well, however; many hosts have IPv6 connectivity through tunneling that may be slow,
limited or outright down. The precedence table can be changed to prefer IPv4 over IPv6 by raising the precedence for the prefix
::ffff:0:0/96 to a value higher than that for ::/0. Such system-wide configuration is usually done on Linux hosts by editing
/etc/gai.conf and on Windows via the netsh command; for example,
netsh interface ipv6 show prefixpolicies .
We can see this systemwide IPv4/IPv6 preference in action using OpenSSH [www.openssh.com/] (see 22.10.1 SSH), between two
systems that each support both IPv4 and IPv6 (the remote system here is intronetworks.cs.luc.edu). With the IPv4-matching prefix
precedence set high, connection is automatically via IPv4:

/etc/gai.conf: precedence ::ffff:0:0/96 100
ssh: Connecting to intronetworks.cs.luc.edu [162.216.18.28] ...

With the IPv4-prefix precedence set low, new connections use IPv6:

/etc/gai.conf: precedence ::ffff:0:0/96 10
ssh: Connecting to intronetworks.cs.luc.edu [2600:3c03::f03c:91ff:fe69:f438] ...
Applications can also use a DNS-resolver call that returns a list of all addresses matching a given hostname. (Often this list will
have just two entries, for the IPv4 and IPv6 addresses, though round-robin DNS (7.8 DNS) can make the list much longer.) The C
language getaddrinfo() call returns such a list, as does the Java InetAddress.getAllByName() . The RFC 6724
[https://tools.ietf.org/html/rfc6724.html] preferences then determine the relative order of IPv4 and IPv6 entries in this list.
If an application requests such a list of all addresses, probably the most common strategy is to try each address in turn, according to
the system-provided order. In the example of the previous paragraph, OpenSSH does in fact request a list of addresses, using
getaddrinfo() , but, according to its source code, tries them in order and so usually connects to the first address on the list,
that is, to the one preferred by the RFC 6724 [https://tools.ietf.org/html/rfc6724.html] rules. Alternatively, an application might
implement user-specified configuration preferences to decide between IPv4 and IPv6, though user interest in this tends to be
limited (except, perhaps, by readers of this book).
The “Happy Eyeballs” algorithm, RFC 8305 [https://tools.ietf.org/html/rfc8305.html], offers a more nuanced strategy for deciding
whether an application should connect using IPv4 or IPv6. Initially, the client might try the IPv6 address (that is, will send TCP
SYN to the IPv6 address, 12.3 TCP Connection Establishment). If that connection does not succeed within, say, 250 ms, the client
would try the IPv4 address. 250 ms is barely enough time for the TCP handshake to succeed; it does not allow – and is not meant to
allow – sufficient time for a retransmission. The client falls back to IPv4 well before the failure of IPv6 is certain.
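A rough sequential sketch of this fallback in Python (RFC 8305 itself staggers parallel attempts rather than trying addresses strictly one at a time; the function name and structure here are simplifications):

import socket

def happy_connect(host, port, v6_wait=0.25):
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    v6 = [ai for ai in infos if ai[0] == socket.AF_INET6]
    v4 = [ai for ai in infos if ai[0] == socket.AF_INET]
    for family, stype, proto, _, sockaddr in v6 + v4:
        s = socket.socket(family, stype, proto)
        # give IPv6 only a brief head start; IPv4 gets the normal OS timeout
        s.settimeout(v6_wait if family == socket.AF_INET6 else None)
        try:
            s.connect(sockaddr)
            s.settimeout(None)
            return s
        except OSError:
            s.close()
    raise OSError('could not connect via IPv6 or IPv4')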
IPv6 servers
As of 2015, the list of websites supporting IPv6 was modest, though the number has crept up since then. Some sites, such as
apple.com and microsoft.com, require the “www” prefix for IPv6 availability. Networking providers are more likely to be IPv6-
available. Sprint.com gets an honorable mention for having the shortest IPv6 address I found: 2600::aaaa.
A Happy-Eyeballs client is also encouraged to cache the winning protocol, so for the next connection the client will attempt to use
only the protocol that was successful before. The cache timeout is to be on the order of 10 minutes, so that if IPv6 connectivity
failed and was restored then the client can resume using it with only moderate delay. Unfortunately, if the Happy Eyeballs
mechanism is implemented at the application layer, which is often the case, then the scope of this cache may be limited to the
particular application.
As IPv6 becomes more mainstream, Happy Eyeballs implementations are likely to evolve towards placing greater confidence in the
IPv6 option. One simple change is to increase the time interval during which the client waits for an IPv6 response before giving up
and trying IPv4.
We can test for the Happy Eyeballs mechanism by observing traffic with WireShark. As a first example, we imagine giving our
client host a unique-local IPv6 address (in addition to its automatic link-local address); recall that unique-local addresses are not
globally routable. If we now were to connect to, say, google.com [http://google.com], and monitor the traffic using WireShark, we
would see a DNS AAAA query (IPv6) for “google.com” followed immediately by a DNS A query (IPv4). The subsequent TCP
SYN, however, would be sent only to the IPv4 address: the client host would know that its IPv6 unique-local address is not
routable, and it is not even tried.
Next let us change the IPv6 address for the client host to 2000:dead:beef:cafe::2 , through manual configuration (8.12.3
Manual address configuration), and without providing an actual IPv6 connection. (We also manually specify a fake default router.)
This address is part of the 2000::/3 block, and is supposed to be globally routable.
We now try two connections to google.com , TCP port 80. The first is via the Firefox browser.

We see two DNS queries, AAAA and A, in packets 1-4, followed by the first attempt (highlighted in orange) at T=0.071 to
negotiate a TCP connection via IPv6 by sending a TCP SYN packet (12.3 TCP Connection Establishment) to the google.com
IPv6 address 2607:f8b0:4009:80b::200e. Only 250 ms later, at T=0.321, we see a second DNS A-query (IPv4), followed by an
ultimately successful connection attempt using IPv4 starting at T=0.350. This particular version of Firefox, in other words, has
implemented the Happy Eyeballs dual-stack mechanism.
Now we try the connection using the previously mentioned OpenSSH application, using -p 80 to connect to port 80. (This
example was generated somewhat later; DNS now returns 2607:f8b0:4009:807::1004 as google.com’s IPv6 address.)

We see two DNS queries, AAAA and A, in packets numbered 4 and 6 (pale blue); these are made by the client from its IPv4
address 10.2.5.19. Half a millisecond after the A query returns (packet 7), the client sends a TCP SYN packet to google.com’s IPv6
address; this packet is highlighted in orange. This SYN packet is retransmitted 3 seconds and then 9 seconds later (in black), to no
avail. After 21 seconds, the client gives up on IPv6 and attempts to connect to google.com at its IPv4 address,
173.194.46.105; this connection (in green) is successful. The long delay shows that Happy Eyeballs was not implemented by
OpenSSH, which its source code confirms.
(The host initiating the connections here was running Ubuntu 10.04 LTS, from 2010. The ultimately failing TCP connection gives
up after three tries over only 21 seconds; newer systems make more tries and take much longer before they abandon a connection
attempt.)

8.13: IPv6 Examples Without a Router
In this section we present a few IPv6 experiments that can be done without an IPv6 connection and without even an IPv6 router.
Without a router, we cannot use SLAAC or DHCPv6. We will instead use link-local addresses, which require the specification of
the interface along with the address, and manually configured unique-local (8.3 Network Prefixes) addresses. One practical
problem with link-local addresses is that application documentation describing how to include a specification of the interface is
sometimes sparse.

8.12.1 ping6
The IPv6 analogue of the familiar ping command, used to send ICMPv6 Echo Requests, is ping6 on Linux and Mac
systems and ping -6 on Windows. The ping6 command supports an option to specify the interface; eg -I eth0 ; as noted
above, this is mandatory when sending to link-local addresses. Here are a few ping6 examples:
ping6 ::1: This pings the host’s loopback address; it should always work.
ping6 -I eth0 ff02::1: This pings the all-nodes multicast group on interface eth0 . Here are two of the answers received:
64 bytes from fe80::3e97:eff:fe2c:2beb (this is the host I am pinging from)
64 bytes from fe80::2a0:ccff:fe24:b0e4 (a second Linux host)
Answers were also received from a Windows machine and an Android phone. A VoIP phone – on the same subnet but supporting
IPv4 only – remained mute, despite VoIP’s difficulties with IPv4 NAT that would be avoided with IPv6. In lieu of the interface
option -I eth0 , the “zone-identifier” syntax ping6 ff02::1%eth0 also usually works; see the following section.
ping6 -I eth0 fe80::2a0:ccff:fe24:b0e4: This pings the link-local address of the second Linux host answering the previous query;
again, the %eth0 syntax should also work. The destination interface identifier here uses the now-deprecated EUI-64 format; note
the “ff:fe” in the middle. Also note the flipped seventh bit of the two bytes 02a0; the destination has Ethernet address
00:a0:cc:24:b0:e4.
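The EUI-64 transformation is easy to reproduce; a short Python sketch using the Ethernet address just mentioned:

# EUI-64 interface identifier: insert ff:fe between the two halves of the
# 48-bit Ethernet address and flip the seventh (universal/local) bit.
def eui64_iid(mac):
    b = bytes.fromhex(mac.replace(':', ''))
    return bytes([b[0] ^ 0x02]) + b[1:3] + b'\xff\xfe' + b[3:]

print(eui64_iid('00:a0:cc:24:b0:e4').hex(':', 2))    # 02a0:ccff:fe24:b0e4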

8.12.2 TCP connections using link-local addresses


The next experiment is to create a TCP connection. Some commands, like ping6 above, may provide for a way of specifying the
interface as a command-line option. Failing that, RFC 4007 [https://tools.ietf.org/html/rfc4007.html] defines the concept of a zone
identifier that is appended to the IPv6 address, separated from it by a “%” character, to specify the link involved. On Linux
systems the zone identifier is most often the interface name, eg eth0 or ppp1 . Numeric zone identifiers are also used; in this case the number represents the position of the particular interface in some designated list, and is known as the zone index. On
Windows systems the zone index for an interface can often be inferred from the output of the ipconfig command, which
should include it with each link-local address. The use of zone identifiers is often restricted to literal (numeric) IPv6 addresses,
perhaps because there is little demand for symbolic link-local addresses.
The following link-local address with zone identifier creates an ssh connection to the second Linux host in the example of the
preceding section:

ssh fe80::2a0:ccff:fe24:b0e4%eth0


That the ssh service is listening for IPv6 connections can be verified on that host by netstat -a | grep -i tcp6 . That
the ssh connection actually used IPv6 can be verified by, say, use of a network sniffer like WireShark (for which the filter
expression ipv6 or ip.version == 6 is useful). If the connection fails, but ssh works for IPv4 connections and shows as
listening in the tcp6 list from the netstat command, a firewall-blocked port is a likely suspect.

8.12.3 Manual address configuration


The use of manually configured addresses is also possible, for either global or unique-local (ie not connected to the Internet)
addresses. However, without a router there can be no Prefix Discovery, 8.6.2 Prefix Discovery, and this may create subtle
differences.
The first step is to pick a suitable prefix; in the example below we use the unique-local prefix fd37:beef:cafe::/64 (though this
particular prefix does not meet the randomness rules for unique-local prefixes). We could also use a globally routable prefix, but
here we do not want to mislead any hosts about reachability.
Without a router as a source of Router Advertisements, we need some way to specify both the prefix and the prefix length; the
latter can be thought of as corresponding to the IPv4 subnet mask. One might be forgiven for imagining that the default prefix
length would be /64, given that this is the only prefix length generally allowed (8.3 Network Prefixes), but this is often not the case.
In the commands below, the prefix length is included at the end as the /64. This usage is just slightly peculiar, in that in the IPv4
world the slash notation is most often used only with true prefixes, with all bits zero beyond the slash length. (The Linux ip
command also uses the slash notation in the sense here, to specify an IPv4 subnet mask, eg 10.2.5.37/24. The ifconfig and
Windows netsh commands specify the IPv4 subnet mask the traditional way, eg 255.255.255.0.)
Hosts will usually assume that a prefix configured this way with a length represents an on-link prefix, meaning that neighbors
sharing the prefix are reachable directly via the LAN.
We can now assign the low-order 64 bits manually. On Linux this is done with:
host1: ip -6 address add fd37:beef:cafe::1/64 dev eth0
host2: ip -6 address add fd37:beef:cafe::2/64 dev eth0
Macintosh systems can be configured similarly except the name of the interface is probably en0 rather than eth0 . On
Windows systems, a typical IPv6-address-configuration command is

netsh interface ipv6 add address "Local Area Connection" fd37:beef:cafe::1/64

Now on host1 the command

ssh fd37:beef:cafe::2
should create an ssh connection to host2, again assuming ssh on host2 is listening for IPv6 connections. Because the addresses
here are not link-local, /etc/host entries may be created for them to simplify entry.
Assigning IPv6 addresses manually like this is not recommended, except for experiments.
On a LAN not connected to the Internet and therefore with no actual routing, it is nonetheless possible to start up a Router
Advertisement agent (8.6.1 Router Discovery), such as radvd, with a manually configured /64 prefix. The RA agent will include
this prefix in its advertisements, and reasonably modern hosts will then construct full addresses for themselves from this prefix
using SLAAC. IPv6 can then be used within the LAN. If this is done, the RA agent should also be configured to announce only a
meaningless route, such as ::/128, or else nodes may falsely believe the RA agent is providing full Internet connectivity.

8.14: IPv6 Connectivity via Tunneling
The best option for IPv6 connectivity is native support by one’s ISP. In such a situation one’s router should be sending out Router
Advertisement messages, and from these all the hosts should discover how to reach the IPv6 Internet.
If native IPv6 support is not forthcoming, however, a short-term option is to connect to the IPv6 world using packet tunneling
(less often, some other VPN mechanism is used). RFC 4213 [https://tools.ietf.org/html/rfc4213.html] outlines the common 6in4
strategy of simply attaching an IPv4 header to the front of the IPv6 packet; it is very similar to the IPv4-in-IPv4 encapsulation of
7.13.1 IP-in-IP Encapsulation.
There are several available providers for this service; they can be found by searching for “IPv6 tunnel broker”. Some tunnel
brokers provide this service at no charge.
6in4, 6to4
6in4 tunneling should not be confused with 6to4 tunneling, which uses the same encapsulation as 6in4 but which constructs a site’s
IPv6 prefix by embedding its IPv4 address: a site with IPv4 address 129.3.5.7 gets IPv6 prefix 2002:8103:0507::/48 (129 decimal =
0x81). See RFC 3056 [https://tools.ietf.org/html/rfc3056.html]. There is also a 6over4, RFC 2529
[https://tools.ietf.org/html/rfc2529.html].
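The 6to4 prefix construction takes only a couple of lines of Python; 129.3.5.7 is the address from the example above:

import ipaddress

# 6to4 (RFC 3056): 2002::/16 followed by the 32-bit IPv4 address gives a /48.
def sixto4_prefix(v4):
    v6int = (0x2002 << 112) | (int(ipaddress.IPv4Address(v4)) << 80)
    return ipaddress.IPv6Network((v6int, 48))

print(sixto4_prefix('129.3.5.7'))    # 2002:8103:507::/48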
The basic idea behind 6in4 tunneling is that the tunnel broker allocates you a /64 prefix out of its own address block, and agrees to
create an IPv4 tunnel to you using 6in4 encapsulation. All your IPv6 traffic from the Internet is routed by the tunnel broker to you
via this tunnel; similarly, IPv6 packets from your site reach the outside world using this same tunnel. The tunnel, in other words, is
your link to an IPv6 router.
Generally speaking, the MTU of the tunnel must be at least 20 bytes less than the MTU of the physical interface, to allow space for
the header. At the near end this requires a local configuration change; tunnel brokers often provide a way for users to set the MTU
at the far end. Practical MTU values vary from a mandatory IPv6 minimum of 1280 to the Ethernet maximum of 1500−20 = 1480.
Setting up the tunnel does not involve creating a stateful connection. All that happens is that the tunnel client (ie your endpoint) and
the broker record each other’s IPv4 addresses, and agree to accept encapsulated IPv6 packets from one another provided these two
endpoint addresses are used as source and destination. The tunnel at the client end is represented by an appropriate “virtual network
interface”, eg sit0 or gif0 or IP6Tunnel. Tunnel providers generally supply the basic commands necessary to get the
tunnel interface configured and the MTU set.
Once the tunnel is created, the tunnel interface at the client end must be assigned an IPv6 address and then a (default) route. We
will assume that the /64 prefix for the broker-to-client link is 2001:470:0:10::/64, with the broker at 2001:470:0:10::1 and with the
client to be assigned the address 2001:470:0:10::2. The address and route are set up on the client with the following commands (interface names may vary, and some commands assume the interface represents a point-to-point link).

Linux:

ip addr add 2001:470:0:10::2/64 dev sit1
ip route add ::/0 dev sit1

Mac:

ifconfig gif0 inet6 2001:470:0:10::2 2001:470:0:10::1 prefixlen 128
route -n add -inet6 default 2001:470:0:10::1

Windows:

netsh interface ipv6 add address IP6Tunnel 2001:470:0:10::2
netsh interface ipv6 add route ::/0 IP6Tunnel 2001:470:0:10::1

At this point the tunnel client should have full IPv6 connectivity! To verify this, one can use ping6, or visit IPv6-only versions
of websites (eg intronetworks6.cs.luc.edu [http://intronetworks6.cs.luc.edu]), or visit IPv6-identifying sites such as
IsMyIPv6Working.com [ismyipv6working.com]. Alternatively, one can often install a browser plugin to at least make visible
whether IPv6 is used. Finally, one can use netcat with the -6 option to force IPv6 use, following the HTTP example in
12.6.2 netcat again.
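The same sort of test can be scripted. The following Python sketch restricts name resolution to IPv6 and then attempts a TCP connection; the hostname is the IPv6-only site mentioned above, but any IPv6-only server would do:

import socket

def have_ipv6(host="intronetworks6.cs.luc.edu", port=80):
    try:
        addr = socket.getaddrinfo(host, port, socket.AF_INET6,
                                  socket.SOCK_STREAM)[0][4]
        with socket.create_connection(addr[:2], timeout=5):
            return True
    except OSError:                 # no AAAA record, or no IPv6 route
        return False

print("IPv6 connectivity:", have_ipv6())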

There is one more potential issue. If the tunnel client is behind an IPv4 NAT router, that router must deliver arriving encapsulated
6in4 packets correctly. This can sometimes be a problem; encapsulated 6in4 packets are at some remove from the TCP and UDP
traffic that the usual consumer-grade NAT router is primarily designed to handle. Careful study of the router forwarding settings
may help, but sometimes the only fix is a newer router. A problem is particularly likely if two different inside clients attempt to set
up tunnels simultaneously; see 7.13.1 IP-in-IP Encapsulation.

8.14.1 IPv6 firewalls


It is strongly recommended that an IPv6 host block new inbound connections over its IPv6 interface (eg the tunnel interface),
much as an IPv4 NAT router would do. Exceptions may be added as necessary for essential services (such as ICMPv6). Using the
linux ip6tables firewall command, with IPv6-tunneled interface sit1, this might be done with the following:

ip6tables --append INPUT --in-interface sit1 --protocol icmpv6 --jump ACCEPT
ip6tables --append INPUT --in-interface sit1 --match conntrack --ctstate ESTABLISHED,RELATED --jump ACCEPT
ip6tables --append INPUT --in-interface sit1 --jump DROP

At this point the firewall should be tested by attempting to access inside hosts from the outside. At a minimum, ping6 from the
outside to any global IPv6 address of any inside host should fail if the ICMPv6 exception above is removed (and should succeed if
the ICMPv6 exception is restored). This can be checked by using any of several websites that send pings on request; such sites can
be found by searching for “online ipv6 ping”. There are also a few sites that will run a remote IPv6 TCP port scan; try searching for
“online ipv6 port scan”. See also exercise 7.0.

8.14.2 Setting up a router


The next step, if desired, is to set up the tunnel endpoint as a router, so other hosts at the client site can also enjoy IPv6
connectivity. For this we need a second /64 prefix; we will assume this is 2001:470:0:20::/64 (note this is not an “adjacent” /64; the
two /64 prefixes cannot be merged into a /63). Let R be the tunnel endpoint, with eth0 its LAN interface, and let A be another
host on the LAN.
We will use the linux radvd package as our Router Advertisement agent (8.6.1 Router Discovery). In the radvd.conf file,
we need to say that we want the LAN prefix 2001:470:0:20::/64 advertised as on-link over interface eth0:

interface eth0 {
...
prefix 2001:470:0:20::/64
{
AdvOnLink on; # advertise this prefix as on-link
AdvAutonomous on; # allows SLAAC with this prefix
};
};

If radvd is now started, other LAN hosts (eg A) will automatically get the prefix (and thus a full SLAAC address). Radvd
will automatically share R’s default route (::/0), taking it not from the configuration file but from R’s routing table. (It may still be
necessary to manually configure the IPv6 address of R’s eth0 interface, eg as 2001:470:0:20::1.)
On the author’s version of host A, the IPv6 route is now (with some irrelevant attributes not shown)

default via fe80::2a0:ccff:fe24:b0e4 dev eth0

That is, host A routes to R via the latter’s link-local address, always guaranteed on-link, rather than via the subnet address.
If radvd or its equivalent is not available, the manual approach is to assign R and A each a /64 address:

On host R: ip -6 address add 2001:470:0:20::1/64 dev eth0
On host A: ip -6 address add 2001:470:0:20::2/64 dev eth0
Because of the “/64” here (8.13.3 Manual address configuration), R and A understand that they can reach each other via the LAN,
and do so. Host A also needs to be told of the default route via R:

On host A: ip -6 route add ::/0 via 2001:470:0:20::1 dev eth0


Here we use the subnet address of R, but we could have used R’s link-local address as well.
It is likely that A’s eth0 will also need its MTU configured, so that it matches that of R’s virtual tunnel interface (which, recall,
should be at least 20 bytes less than the MTU of R’s physical outbound interface).

8.14.2.1 A second router


Now let us add a second router R2, as in the diagram below. The R──R2 link is via a separate Ethernet LAN, not a point-to-point
link. The LAN with A is, as above, subnet 2001:470:0:20::/64.
[Figure: tunnel ── R ──(eth0)── Ethernet LAN ──(eth1)── R2 ──(eth2)── LAN with host A, subnet 2001:470:0:20::/64]

In this case, it is R2 that needs to run the Router Advertisement agent (eg radvd). If this were an IPv4 network, the interfaces eth0
and eth1 on the R──R2 link would need IPv4 addresses from some new subnet (though the use of private addresses is an option).
We can’t use unnumbered interfaces (7.12 Unnumbered Interfaces), because the R──R2 connection is not a point-to-point link.
But with IPv6, we can configure the R──R2 routing to use only link-local addresses. Let us assume for mnemonic convenience
these are as follows:

R’s eth0: fe80::ba5e:ba11


R2’s eth1: fe80::dead:beef
R2’s forwarding table will have a default route with next_hop fe80::ba5e:ba11 (R). Similarly, R’s forwarding table will
have an entry for destination subnet 2001:470:0:20::/64 with next_hop fe80::dead:beef (R2). Neither eth0 nor eth1 needs
any other IPv6 address.
R2’s eth2 interface will likely need a global IPv6 address, eg 2001:470:0:20::1 again. Otherwise R2 may not be able to determine
that its eth2 interface is in fact connected to the 2001:470:0:20::/64 subnet.
One advantage of not giving eth0 or eth1 global addresses is that it is then impossible for an outside attacker to reach these
interfaces directly. It also saves on subnets, although one hopes with IPv6 those are not in short supply. All routers at a site are
likely to need, for management purposes, an IP address reachable throughout the site, but this does not have to be globally visible.

8.15: IPv6-to-IPv4 Connectivity
What happens if you switch to IPv6 completely, perhaps because your ISP (or phone provider) has run out of IPv4 addresses?
Some of the time – hopefully more and more of the time – you will only need to talk to IPv6 servers. For example, the DNS names
facebook.com and google.com each correspond to an IPv4 address, but also to an IPv6 address (above). But what do
you do if you want to reach an IPv4-only server? Such servers are expected to continue operating for a long time to come. It is
necessary to have some sort of centralized IPv6-to-IPv4 translator.
An early strategy was NAT-PT (RFC 2766 [https://tools.ietf.org/html/rfc2766.html]). The translator was assigned a /96 prefix. The
IPv6 host would append to this prefix the 32-bit IPv4 address of the destination, and use the resulting address to contact the IPv4
destination. Packets sent to this address would be delivered via IPv6 to the translator, which would translate the IPv6 header into
IPv4 and then send the translated packet on to the IPv4 destination. As in IPv4 NAT (7.7 Network Address Translation), the reverse
translation will typically involve TCP port numbers to resolve ambiguities. This approach requires the IPv6 host to be aware of the
translator, and is limited to TCP and UDP (because of the use of port numbers). Due to these and several other limitations, NAT-PT
was formally deprecated in RFC 4966 [https://tools.ietf.org/html/rfc4966.html].

Do you still have IPv4 service?


As of 2017, several phone providers have switched many of their customers to IPv6 while on their mobile-data networks. The
change can be surprisingly inconspicuous. Connections to IPv4-only services still work just fine, courtesy of NAT64. About
the only way to tell is to look up the phone’s IP address.

The replacement protocol is NAT64, documented in RFC 6146 [https://tools.ietf.org/html/rfc6146.html]. This is also based on
address translation, and, as such, cannot allow connections initiated from IPv4 hosts to IPv6 hosts. It is, however, transparent to
both the IPv6 and IPv4 hosts involved, and is not restricted to TCP (though only TCP, UDP and ICMP are supported by RFC 6146
[https://tools.ietf.org/html/rfc6146.html]). It uses a special DNS variant, DNS64 (RFC 6147
[https://tools.ietf.org/html/rfc6147.html]), as a companion protocol.
To use NAT64, an IPv6 client sends out its ordinary DNS query to find the addresses of the destination server. The DNS resolver
(7.8 DNS) receiving the request must use DNS64. If the destination has only an IPv4 address, then the DNS resolver will return to
the IPv6 client (as an AAAA record) a synthetic IPv6 address consisting of a prefix and the embedded IPv4 address of the server,
much as in NAT-PT above (though multiple prefix-length options exist; see RFC 6052 [https://tools.ietf.org/html/rfc6052.html]).
The prefix belongs to the actual NAT64 translator; any packet addressed to an IPv6 address starting with the prefix will be
delivered to the translator. There is no relationship between the NAT64 translator and the DNS64 resolver beyond the fact that the
former’s prefix is configured into the latter.
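For the simplest option, a /96 prefix, the synthesis is just a matter of placing the 32-bit IPv4 address in the low bits. A Python sketch, using the values of the example below:

import ipaddress

def dns64_synthesize(prefix, ipv4):
    # place the 32-bit IPv4 address in the low 32 bits of the /96 prefix
    p = int(ipaddress.IPv6Address(prefix))
    return str(ipaddress.IPv6Address(p | int(ipaddress.IPv4Address(ipv4))))

print(dns64_synthesize("2000:cafe::", "162.216.18.28"))     # 2000:cafe::a2d8:121c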
The IPv6 client now uses this synthetic IPv6 address to contact the IPv4 server. Its packets will be routed to the NAT64 translator
itself, by virtue of the prefix, much as in NAT-PT. Upon receiving the first packet from the IPv6 client, the NAT64 translator will
assign one of its IPv4 addresses to the new connection. As IPv4 addresses are in short supply, this pool of available IPv4 addresses
may be small, so NAT64 allows one IPv4 address to be used by many IPv6 clients. To this end, the NAT64 translator will also (for
TCP and UDP) establish a port mapping between the incoming IPv6 source port and a port number allocated by the NAT64 to
ensure that traffic is uniquely reversible. As with IPv4 NAT, if two IPv6 clients try to contact the same IPv4 server using the same
source ports, and are assigned the same NAT64 IPv4 address, then one of the clients will have its port number changed.
If an ICMP query is being sent, the Query Identifier is used in lieu of port numbers. To extend NAT64 to new protocols, an
appropriate analog of port numbers must be identified, to allow demultiplexing of multiple connections sharing a single IPv4
address.
After the translation is set up, by creating appropriate table entries, the translated packet is sent on to the IPv4 server address that
was embedded in the synthetic IPv6 address. The source address will be the assigned IPv4 address of the translator, and the source
port will have been rewritten in accordance with the new port mapping. At this point packets can flow freely between the original
IPv6 client and its IPv4 destination, with neither endpoint being aware of the translation (unless the IPv6 client carefully inspects
the synthetic address it receives via DNS64). A timer within the NAT64 translator will delete the association between the IPv6 and
IPv4 addresses if the connection is not used for a while.
As an example, suppose the IPv6 client has address 2000:1234::abba, and is trying to reach intronetworks4.cs.luc.edu at TCP port
80. It contacts its DNS server, which finds no AAAA record but IPv4 address 162.216.18.28 (in hex, a2d8:121c). It takes the prefix

for its NAT64 translator, which we will assume is 2000:cafe::, and returns the synthetic address 2000:cafe::a2d8:121c.
[Figure: the IPv6 client 2000:1234::abba sends an AAAA query for intronetworks4.cs.luc.edu to its DNS64 nameserver, which finds no AAAA record, issues an A query yielding 162.216.18.28, and returns the synthetic address 2000:cafe::a2d8:121c. The client then opens a TCP connection to that address from port 4000; the NAT64 translator (prefix 2000:cafe::/64, IPv4 address 200.0.0.1) forwards it to intronetworks4.cs.luc.edu at 162.216.18.28, from port 4002.]

The translator’s resulting table entry (not all columns shown):

IPv6 addr src      IPv6 port src   IPv4 addr src   IPv4 addr dest   IPv4 port src
2000:1234::abba    4000            200.0.0.1       162.216.18.28    4002

The IPv6 client now tries to connect to 2000:cafe::a2d8:121c, using source port 4000. The first packet arrives at the NAT64
translator, which assigns the connection the outbound IPv4 address of 200.0.0.1, and reassigns the source port on the IPv4 side to
4002. The new IPv4 packet is sent on to 162.216.18.28. The reply from intronetworks4.cs.luc.edu comes back, to ⟨200.0.0.1,4002⟩.
The NAT64 translator looks this up and finds that this corresponds to ⟨2000:1234::abba,4000⟩, and forwards it back to the original
IPv6 client.
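The translator’s bookkeeping amounts to a pair of maps, one per direction. Here is a toy Python sketch of the bindings of this example; a real translator would also keep the expiration timers mentioned above, and per-protocol state:

forward = {}        # (IPv6 source address, port) -> (assigned IPv4 address, port)
reverse = {}        # the inverse map, consulted for arriving replies
next_port = 4002    # next free port on the IPv4 side, as in the example

def outbound(src6, port6, nat_ipv4="200.0.0.1"):
    global next_port
    if (src6, port6) not in forward:
        forward[(src6, port6)] = (nat_ipv4, next_port)
        reverse[(nat_ipv4, next_port)] = (src6, port6)
        next_port += 1
    return forward[(src6, port6)]

print(outbound("2000:1234::abba", 4000))    # ('200.0.0.1', 4002)
print(reverse[("200.0.0.1", 4002)])         # ('2000:1234::abba', 4000)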

8.16: Epilog and Exercises
IPv4 has run out of large address blocks, as of 2011. IPv6 has reached a mature level of development. Most common operating
systems provide excellent IPv6 support.
Yet conversion has been slow. Many ISPs still provide limited (to nonexistent) support, and inexpensive IPv6 firewalls to replace
the ubiquitous consumer-grade NAT routers are just beginning to appear. Time will tell how all this evolves. However, while IPv6
has now been around for twenty years, top-level IPv4 address blocks disappeared much more recently. It is quite possible that this
will prove to be just the catalyst IPv6 needs.

8.16 Exercises
Exercises are given fractional (floating point) numbers, to allow for interpolation of new exercises.
1.0. Each IPv6 address is associated with a specific solicited-node multicast address.
(a). Explain why, on a typical Ethernet, if the original IPv6 host address was obtained via SLAAC then the LAN multicast group
corresponding to the host’s solicited-node multicast address is likely to be small, in many cases consisting of one host only.
(Packet delivery to small LAN multicast groups can be much more efficient than delivery to large multicast groups.)
(b). What steps might a DHCPv6 server take to ensure that, for the IPv6 addresses it hands out, the LAN multicast groups
corresponding to the host addresses’ solicited-node multicast addresses will be small?
2.0. If an attacker sends a large number of probe packets via IPv4, you can block them by blocking the attacker’s IP address. Now
suppose the attacker uses IPv6 to launch the probes; for each probe, the attacker changes the low-order 64 bits of the address. Can
these probes be blocked efficiently? If so, what do you have to block? Might you also be blocking other users?
3.0. Suppose someone tried to implement ping6 so that, if the address was a link-local address and no interface was specified, the
ICMPv6 Echo Request was sent out all non-loopback interfaces. Could the end result be different than conventional ping6 with the
correct interface supplied? If so, how likely is this?
4.0. Create an IPv6 ssh connection as in 8.13 IPv6 Examples Without a Router. Examine the connection’s packets using Wireshark
or the equivalent. Does the TCP handshake (12.3 TCP Connection Establishment) look any different over IPv6?
5.0. Create an IPv6 ssh connection using manually configured addresses as in 8.13.3 Manual address configuration. Again use
Wireshark or the equivalent to monitor the connection. Is DAD (8.7.1 Duplicate Address Detection) used?
6.0. An IPv6 fixed-header is 40 bytes. Taking this as the minimum packet size, how long will it take to send 10^15 (one
quadrillion) probe packets to a site, if the bandwidth is 1 Gbps?
7.0. Suppose host A gets its IPv6 traffic through tunnel provider H, as in 8.14 IPv6 Connectivity via Tunneling. To improve
security, A blocks all packets that are not part of connections it has initiated, and makes no exception for ICMPv6 traffic. H is
correctly configured to know the MTU of the A–H link. For (a) and (b), this MTU is 1280, the minimum allowed for IPv6. Much
of the Internet, however, allows larger MTU values.

A ─── H ─── Internet ─── B

(a). If A attempts to send a larger-than-1280-byte IPv6 packet to remote host B, will A be informed of the resultant failure? Why or
why not?
(b). Suppose B attempts to send a larger-than-1280-byte IPv6 packet to A. Will B receive an ICMPv6 Packet Too Big
message? Why or why not?
(c). Now suppose the MTU of the A–H link is raised to 1400 bytes. Outline a scenario in which A sends a packet of size greater
than 1280 bytes to remote host B, the packet is too big to make it all the way to B, and yet A receives no notification of this.

CHAPTER OVERVIEW

9: Routing-Update Algorithms
How do IP routers build and maintain their forwarding tables?
Ethernet bridges always have the option of fallback-to-flooding for unknown destinations, so they can afford to build their
forwarding tables “incrementally”, putting a host into the forwarding table only when that host is first seen as a sender. For IP,
there is no fallback delivery mechanism: forwarding tables must be built before delivery can succeed. While manual table
construction is possible, it is not practical.
9.1: Prelude to Routing-Update Algorithms
9.2: Distance-Vector Routing-Update Algorithm
9.3: Distance-Vector Slow-Convergence Problem
9.4: Observations on Minimizing Route Cost
9.5: Loop-Free Distance Vector Algorithms
9.6: Link-State Routing-Update Algorithm
9.7: Routing on Other Attributes
9.8: ECMP
9.9: Epilog and Exercises
Index

9.1: Prelude to Routing-Update Algorithms
In the literature it is common to refer to router-table construction as “routing algorithms”. We will avoid that term, however, to
avoid confusion with the fundamental datagram-forwarding algorithm; instead, we will call these “routing-update algorithms”.
The two classes of algorithms we will consider here are distance-vector and link-state. In the distance-vector approach, often used
at smaller sites, routers exchange information with their immediately neighboring routers; tables are built up this way through a
sequence of such periodic exchanges. In the link-state approach, routers rapidly propagate information about the state of each link;
all routers in the organization receive this link-state information and each one uses it to build and maintain a map of the entire
network. The forwarding table is then constructed (sometimes on demand) from this map.
Both approaches assume that consistent information is available as to the cost of each link (eg that the two routers at opposite ends
of each link know this cost, and agree on how the cost is determined). This requirement classifies these algorithms as interior
routing-update algorithms: the routers involved are internal to a larger organization or other common administrative regime that has
an established policy on how to assign link weights. The set of routers following a common policy is known as a routing domain
or (from the BGP protocol) an autonomous system.
The simplest link-weight strategy is to give each link a cost of 1; link costs can also be based on bandwidth, propagation delay,
financial cost, or administrative preference value. Careful assignment of link costs often plays a major role in herding traffic onto
the faster or “better” links.
In the following chapter we will look at the Border Gateway Protocol, or BGP, in which no link-cost calculations are made. BGP is
used to select routes that traverse other organizations, and financial rather than technical factors may therefore play the dominant
role in making routing choices.
Generally, all these algorithms apply to IPv6 as well as IPv4, though specific protocols of course may need modification.
Finally, we should point out that from the early days of the Internet, routing was allowed to depend not just on the destination, but
also on the “quality of service” (QoS) requested; thus, forwarding table entries are strictly speaking not ⟨destination, next_hop⟩ but
rather ⟨destination, QoS, next_hop⟩. Originally, the Type of Service field in the IPv4 header (7.1 The IPv4 Header) could be used to
specify QoS (often then called ToS). Packets could request low delay, high throughput or high reliability, and could be routed
accordingly. In practice, the Type of Service field was rarely used, and was eventually taken over by the DS field and ECN bits.
The first three bits of the Type of Service field, known as the precedence bits, remain available, however, and can still be used for
QoS routing purposes (see the Class Selector PHB of 20.7 Differentiated Services for examples of these bits). See also RFC 2386
[https://tools.ietf.org/html/rfc2386.html].
In much of the following, we are going to ignore QoS information, and assume that routing decisions are based only on the
destination. See, however, the first paragraph of 9.6 Link-State Routing-Update Algorithm, and also 9.7 Routing on Other
Attributes.

9.2: Distance-Vector Routing-Update Algorithm
Distance-vector is the simplest routing-update algorithm, used by the Routing Information Protocol, or RIP. Version 2 of the
protocol is specified in RFC 2453 [https://tools.ietf.org/html/rfc2453.html].
Routers identify their router neighbors (through some sort of neighbor-discovery mechanism), and add a third column to their
forwarding tables representing the total cost for delivery to the corresponding destination. These costs are the “distance” of the
algorithm name. Forwarding-table entries are now of the form ⟨destination,next_hop,cost⟩.
Costs are administratively assigned to each link, and the algorithm then calculates the total cost to a destination as the sum of the
link costs along the path. The simplest case is to assign a cost of 1 to each link, in which case the total cost to a destination will be
the number of links to that destination. This is known as the “hopcount” metric; it is also possible to assign link costs that reflect
each link’s bandwidth, or delay, or whatever else the network administrators wish. Thoughtful cost assignments are a form of traffic
engineering and sometimes play a large role in network performance.
At this point, each router then reports the ⟨destination,cost⟩ portion of its table to its neighboring routers at regular intervals; these
table portions are the “vectors” of the algorithm name. It does not matter if neighbors exchange reports at the same time, or even at
the same rate.
Each router also monitors its continued connectivity to each neighbor; if neighbor N becomes unreachable then its reachability cost
is set to infinity.
In a real IP network, actual destinations would be subnets attached to routers; one router might be directly connected to several
such destinations. In the following, however, we will identify all a router’s directly connected subnets with the router itself. That is,
we will build forwarding tables to reach every router. While it is possible that one destination subnet might be reachable by two or
more routers, thus breaking our identification of a router with its set of attached subnets, in practice this is of little concern. See
exercise 4.0 for an example in which subnets are not identified with adjacent routers.
In 18.5 IP Routers With Simple Distance-Vector Implementation we present a simplified working implementation of RIP using the
Mininet network emulator.

9.2.1 Distance-Vector Update Rules


Let A be a router receiving a report ⟨D,cD⟩ from neighbor N at cost cN. Note that this means A can reach D via N with cost c = cD +
cN. A updates its own table according to the following three rules:
1. New destination: D is a previously unknown destination. A adds ⟨D,N,c⟩ to its forwarding table.
2. Lower cost: D is a known destination with entry ⟨D,M,cold⟩, but the new total cost c is less than cold. A switches to the cheaper
route, updating its entry for D to ⟨D,N,c⟩. It is possible that M=N, meaning that N is now reporting a cost decrease to D. (If c =
cold, A ignores the new report; see exercise 5.5.)
3. Next_hop increase: A has an existing entry ⟨D,N,cold⟩, and the new total cost c is greater than cold. Because this is a cost
increase from the neighbor N that A is currently using to reach D, A must incorporate the increase in its table. A updates its
entry for D to ⟨D,N,c⟩.
The first two rules are for new destinations and a shorter path to existing destinations. In these cases, the cost to each destination
monotonically decreases (at least if we consider all unreachable destinations as being at cost ∞). Convergence is automatic, as the
costs cannot decrease forever.
The third rule, however, introduces the possibility of instability, as a cost may also go up. It represents the bad-news case, in that
neighbor N has learned that some link failure has driven up its own cost to reach D, and is now passing that “bad news” on to A,
which routes to D via N.
The next_hop-increase case only passes bad news along; the very first cost increase must always come from a router discovering
that a neighbor N is unreachable, and thus updating its cost to N to ∞. Similarly, if router A learns of a next_hop increase to
destination D from neighbor B, then we can follow the next_hops back until we reach a router C which is either the originator of
the cost=∞ report, or which has learned of an alternative route through one of the first two rules.
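These rules translate almost directly into code. Here is a minimal Python sketch, not from any production implementation; table maps each destination to a (next_hop, cost) pair, and reports is N’s list of ⟨destination,cost⟩ records:

def process_report(table, self_name, N, c_N, reports):
    # table: destination -> (next_hop, cost); c_N: cost of the link to N;
    # reports: the list of (destination, cost) pairs received from N
    for D, c_D in reports:
        if D == self_name:
            continue                        # ignore reports about ourselves
        c = c_D + c_N                       # cost to D if we route via N
        if D not in table:
            table[D] = (N, c)               # rule 1: new destination
        else:
            M, c_old = table[D]
            if c < c_old:
                table[D] = (N, c)           # rule 2: lower cost (possibly M == N)
            elif M == N and c > c_old:
                table[D] = (N, c)           # rule 3: next_hop increase
            # a report with c == c_old is ignored

# Example 1 below: D sends A the records ("A",1) and ("E",1), link cost 1
A_table = {"B": ("B", 1), "C": ("C", 1), "D": ("D", 1)}
process_report(A_table, "A", "D", 1, [("A", 1), ("E", 1)])
print(A_table)      # E appears as ("D", 2); the record about A itself is ignored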

9.2.2 Example 1
For our first example, no links will break and thus only the first two rules above will be used. We will start out with the network
below with empty forwarding tables; all link costs are 1.
[Figure: five-node network with links A–B, A–C, A–D, B–C, C–E and D–E; all link costs are 1]

After initial neighbor discovery, here are the forwarding tables. Each node has entries only for its directly connected neighbors:

A: ⟨B,B,1⟩ ⟨C,C,1⟩ ⟨D,D,1⟩


B: ⟨A,A,1⟩ ⟨C,C,1⟩
C: ⟨A,A,1⟩ ⟨B,B,1⟩ ⟨E,E,1⟩
D: ⟨A,A,1⟩ ⟨E,E,1⟩
E: ⟨C,C,1⟩ ⟨D,D,1⟩
Now let D report to A; it sends records ⟨A,1⟩ and ⟨E,1⟩. A ignores D’s ⟨A,1⟩ record, but ⟨E,1⟩ represents a new destination; A
therefore adds ⟨E,D,2⟩ to its table. Similarly, let A now report to D, sending ⟨B,1⟩ ⟨C,1⟩ ⟨D,1⟩ ⟨E,2⟩ (the last is the record we just
added). D ignores A’s records ⟨D,1⟩ and ⟨E,2⟩ but A’s records ⟨B,1⟩ and ⟨C,1⟩ cause D to create entries ⟨B,A,2⟩ and ⟨C,A,2⟩. A and
D’s tables are now, in fact, complete.
Now suppose C reports to B; this gives B an entry ⟨E,C,2⟩. If C also reports to E, then E’s table will have ⟨A,C,2⟩ and ⟨B,C,2⟩. The
tables are now:

A: ⟨B,B,1⟩ ⟨C,C,1⟩ ⟨D,D,1⟩ ⟨E,D,2⟩


B: ⟨A,A,1⟩ ⟨C,C,1⟩ ⟨E,C,2⟩
C: ⟨A,A,1⟩ ⟨B,B,1⟩ ⟨E,E,1⟩
D: ⟨A,A,1⟩ ⟨E,E,1⟩ ⟨B,A,2⟩ ⟨C,A,2⟩
E: ⟨C,C,1⟩ ⟨D,D,1⟩ ⟨A,C,2⟩ ⟨B,C,2⟩
We have two missing entries: B and C do not know how to reach D. If A reports to B and C, the tables will be complete; B and C
will each reach D via A at cost 2. However, the following sequence of reports might also have occurred:
E reports to C, causing C to add ⟨D,E,2⟩
C reports to B, causing B to add ⟨D,C,3⟩
In this case we have 100% reachability but B routes to D via the longer-than-necessary path B–C–E–D. However, one more report
will fix this: suppose A reports to B. B will receive ⟨D,1⟩ from A, and will update its entry ⟨D,C,3⟩ to ⟨D,A,2⟩.
Note that A routes to E via D while E routes to A via C; this asymmetry was due to indeterminateness in the order of initial table
exchanges.
If all link weights are 1, and if each pair of neighbors exchange tables once before any pair starts a second exchange, then the above
process will discover the routes in order of length, ie the shortest paths will be the first to be discovered. This is not, however, a
particularly important consideration.

9.2.3 Example 2
The next example illustrates link weights other than 1. The first route discovered between A and B is the direct route with cost 8;
eventually we discover the longer A–C–D–B route with cost 2+1+3=6.
[Figure: A and B joined by a direct link of cost 8 and by the path A–C–D–B, with link costs A–C = 2, C–D = 1 and D–B = 3]

The initial tables are these:

A: ⟨C,C,2⟩ ⟨B,B,8⟩
B: ⟨A,A,8⟩ ⟨D,D,3⟩
C: ⟨A,A,2⟩ ⟨D,D,1⟩

D: ⟨B,B,3⟩ ⟨C,C,1⟩
After A and C exchange, A has ⟨D,C,3⟩ and C has ⟨B,A,10⟩. After C and D exchange, C updates its ⟨B,A,10⟩ entry to ⟨B,D,4⟩ and
D adds ⟨A,C,3⟩; D receives C’s report of ⟨B,10⟩ but ignores it. Now finally suppose B and D exchange. D ignores B’s route to A,
as it has a better one. B, however, gets D’s report ⟨A,3⟩ and updates its entry for A to ⟨A,D,6⟩. At this point the tables are as
follows:

A: ⟨C,C,2⟩ ⟨B,B,8⟩ ⟨D,C,3⟩


B: ⟨A,D,6⟩ ⟨D,D,3⟩
C: ⟨A,A,2⟩ ⟨D,D,1⟩ ⟨B,D,4⟩
D: ⟨B,B,3⟩ ⟨C,C,1⟩ ⟨A,C,3⟩
We have two more things to fix before we are done: A has an inefficient route to B, and B has no route to C. The first will be fixed
when C reports ⟨B,4⟩ to A; A will replace its route to B with ⟨B,C,6⟩. The second will be fixed when D reports to B; if A reports to
B first then B will temporarily add the inefficient route ⟨C,A,10⟩; this will change to ⟨C,D,4⟩ when D’s report to B arrives. If we
look only at the A–B route, B discovers the lower-cost route to A once C has reported to D and D has then reported to B;
a similar sequence leads to A’s discovering the lower-cost route.

9.2.4 Example 3
Our third example will illustrate how the algorithm proceeds when a link breaks. We return to the first diagram, with all tables
completed, and then suppose the D–E link breaks. This is the “bad-news” case: a link has broken, and is no longer available; this
will bring the third rule into play.
[Figure: the same five-node network, with the D–E link broken]

We shall assume, as above, that A reaches E via D, but we will here assume – contrary to Example 1 – that C reaches D via A (see
exercise 3.5 for the original case).
Initially, upon discovering the break, D and E update their tables to ⟨E,-,∞⟩ and ⟨D,-,∞⟩ respectively (whether or not they actually
enter ∞ into their tables is implementation-dependent; we may consider this as equivalent to removing their entries for one another;
the “-” as next_hop indicates there is no next_hop).
Eventually D and E will report the break to their respective neighbors A and C. A will apply the “bad-news” rule above and update
its entry for E to ⟨E,-,∞⟩. We have assumed that C, however, routes to D via A, and so it will ignore E’s report.
We will suppose that the next steps are for C to report to E and to A. When C reports its route ⟨D,2⟩ to E, E will add the entry
⟨D,C,3⟩, and will again be able to reach D. When C reports to A, A will add the route ⟨E,C,2⟩. The final step will be when A next
reports to D, and D will have ⟨E,A,3⟩. Connectivity is restored.

9.2.5 Example 4
The previous examples have had a “global” perspective in that we looked at the entire network. In the next example, we look at
how one specific router, R, responds when it receives a distance-vector report from its neighbor S. Neither R nor S nor we have any
idea of what the entire network looks like. Suppose R’s table is initially as follows, and the S–R link has cost 1:

destination next_hop cost

A S 3

B T 4

C S 5

D U 6

S now sends R the following report, containing only destinations and its costs:

destination cost

A 2

B 3

C 5

D 4

E 2

R then updates its table as follows:

destination next_hop cost reason

A S 3 No change; S probably sent this report before

B T 4 No change; R’s cost via S is tied with R’s cost via T

C S 6 Next_hop increase

D S 5 Lower-cost route via S

E S 3 New destination

Whatever S’s cost to a destination, R’s cost to that destination via S is one greater.
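This update can be reproduced mechanically with the same three rules; the following self-contained Python fragment prints R’s table as given above:

R = {"A": ("S", 3), "B": ("T", 4), "C": ("S", 5), "D": ("U", 6)}
S_report = [("A", 2), ("B", 3), ("C", 5), ("D", 4), ("E", 2)]

for D, c_D in S_report:
    c = c_D + 1                     # the S-R link has cost 1
    if D not in R:
        R[D] = ("S", c)             # new destination: E
    else:
        M, c_old = R[D]
        if c < c_old:
            R[D] = ("S", c)         # lower-cost route: D
        elif M == "S" and c > c_old:
            R[D] = ("S", c)         # next_hop increase: C

print(R)    # {'A': ('S', 3), 'B': ('T', 4), 'C': ('S', 6), 'D': ('S', 5), 'E': ('S', 3)}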

9.3: Distance-Vector Slow-Convergence Problem
There is a significant problem with distance-vector table updates in the presence of broken links. Not only can routing loops form,
but the loops can persist indefinitely! As an example, suppose we have the following arrangement, with all links having cost 1:
[Figure: linear network D──A──B, all link costs 1]

Now suppose the D–A link breaks:


[Figure: the same network, with the D–A link broken]

If A immediately reports to B that D is no longer reachable (cost = ∞), then all is well. However, it is possible that B reports to A
first, telling A that it has a route to D, with cost 2, which B still believes it has.
This means A now installs the entry ⟨D,B,3⟩. At this point we have what we called in 1.6 Routing Loops a linear routing loop: if a
packet is addressed to D, A will forward it to B and B will forward it back to A.
Worse, this loop will be with us a while. At some point A will report ⟨D,3⟩ to B, at which point B will update its entry to ⟨D,A,4⟩.
Then B will report ⟨D,4⟩ to A, and A’s entry will be ⟨D,B,5⟩, etc. This process is known as slow convergence to infinity. If A and
B each report to the other once a minute, it will take 2,000 years for the costs to overflow an ordinary 32-bit integer.
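The count-to-infinity arithmetic can be simulated in a few lines of Python; this sketch models only the report ping-pong, using the small value infinity=16 adopted by RIP (discussed in the next subsection):

INFINITY = 16                   # RIP's value, discussed below
a_cost, b_cost = 3, 2           # B's stale report ("D",2) reached A first
exchanges = 0
while a_cost < INFINITY:
    b_cost = a_cost + 1         # A reports its cost to D; B's next_hop is A (rule 3)
    a_cost = b_cost + 1         # B reports back; A's next_hop is B (rule 3)
    exchanges += 1
print(exchanges, a_cost)        # 7 17: seven exchanges reach "infinity"; a 32-bit
                                # infinity would take about 2**30 exchanges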

9.3.1 Slow-Convergence Fixes


The simplest fix to this problem is to use a small value for infinity. Most flavors of the RIP protocol use infinity=16, with updates
every 30 seconds. The drawback to so small an infinity is that no path through the network can be longer than this; this makes paths
with weighted link costs difficult. Cisco IGRP uses a variable value for infinity up to a maximum of 256; the default infinity is 100.
There are several well-known other fixes:

9.3.1.1 Split Horizon


Under split horizon, if A uses N as its next_hop for destination D, then A simply does not report to N that it can reach D; that is, in
preparing its report to N it first deletes all entries that have N as next_hop. In the example above, split horizon would mean B
would never report to A about the reachability of D because A is B’s next_hop to D.
Split horizon prevents all linear routing loops. However, there are other topologies where it cannot prevent loops. One is the
following:
[Figure: A, B and C in a triangle, with D attached to A; all link costs 1]

Suppose the A-D link breaks, and A updates to ⟨D,-,∞⟩. A then reports ⟨D,∞⟩ to B, which updates its table to ⟨D,-,∞⟩. But then,
before A can also report ⟨D,∞⟩ to C, C reports ⟨D,2⟩ to B. B then updates to ⟨D,C,3⟩, and reports ⟨D,3⟩ back to A; neither this nor
the previous report violates split-horizon. Now A’s entry is ⟨D,B,4⟩. Eventually A will report to C, at which point C’s entry
becomes ⟨D,A,5⟩, and the numbers keep increasing as the reports circulate counterclockwise. The actual routing proceeds in the
other direction, clockwise.

Split horizon often also includes poison reverse: if A uses N as its next_hop to D, then A in fact reports ⟨D,∞⟩ to N, which is a
more definitive statement that A cannot reach D by itself. However, coming up with a scenario where poison reverse actually
affects the outcome is not trivial.
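In code, both mechanisms amount to a filter applied when a router prepares the report destined for a particular neighbor N. A minimal Python sketch (the table entries are illustrative):

INFINITY = 16

def prepare_report(table, N, poison_reverse=False):
    # table: destination -> (next_hop, cost); N: neighbor being reported to
    report = []
    for D, (next_hop, cost) in table.items():
        if next_hop == N:
            if poison_reverse:
                report.append((D, INFINITY))    # tell N that D is unreachable
            # plain split horizon: say nothing at all about D
        else:
            report.append((D, cost))
    return report

B_table = {"D": ("A", 2), "E": ("E", 1)}        # illustrative entries at B
print(prepare_report(B_table, "A"))                         # [('E', 1)]
print(prepare_report(B_table, "A", poison_reverse=True))    # [('D', 16), ('E', 1)]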

9.3.1.2 Triggered Updates


In the original example, if A was first to report to B then the loop resolved immediately; the loop occurred if B was first to report to
A. Nominally each outcome has probability 50%. Triggered updates means that any router should report immediately to its
neighbors whenever it detects any change for the worse. If A reports first to B in the first example, the problem goes away.
Similarly, in the second example, if A reports to both B and C before B or C report to one another, the problem goes away. There
remains, however, a small window where B could send its report to A just as A has discovered the problem, before A can report to
B.

9.3.1.3 Hold Down


Hold down is sort of a receiver-side version of triggered updates: the receiver does not use new alternative routes for a period of
time (perhaps two router-update cycles) following discovery of unreachability. This gives time for bad news to arrive. In the first
example, it would mean that when A received B’s report ⟨D,2⟩, it would set this aside. It would then report ⟨D,∞⟩ to B as usual;
B would then report ⟨D,∞⟩ back to A, at which point B’s earlier report ⟨D,2⟩ would be discarded. A significant
drawback of hold down is that legitimate new routes are also delayed by the hold-down period.

These mechanisms for preventing slow convergence are, in the real world, quite effective. The Routing Information Protocol (RIP,
RFC 2453 [https://tools.ietf.org/html/rfc2453.html]) implements all but hold-down, and has been widely adopted at smaller
installations.

However, the potential for routing loops and the limited value for infinity led to the development of alternatives. One alternative is
the link-state strategy, 9.6 Link-State Routing-Update Algorithm. Another alternative is Cisco’s Enhanced Interior Gateway
Routing Protocol, or EIGRP, 9.5.4 EIGRP. While part of the distance-vector family, EIGRP is provably loop-free, though to
achieve this it must sometimes suspend forwarding to some destinations while tables are in flux.
9.4: Observations on Minimizing Route Cost
Does distance-vector routing actually achieve minimum costs? For that matter, does each packet incur the cost its sender expects?
Suppose node A has a forwarding entry ⟨D,B,c⟩, meaning that A forwards packets to destination D via next_hop B, and expects the
total cost to be c. If A sends a packet to D, and we follow it on the actual path it takes, must the total link cost be c? If so, we will
say that the network has accurate costs.
The answer to the accurate-costs question, as it turns out, is yes for the distance-vector algorithm, if we follow the rules carefully,
and the network is stable (meaning that no routing reports are changing, or, more concretely, that every update report now
circulating is based on the current network state); a proof is below. However, if there is a routing loop, the answer is of course no:
the actual cost is now infinite. The answer would also be no if A’s neighbor B has just switched to using a longer route to D than it
last reported to A.
It turns out, however, that we seek the shortest route not because we are particularly trying to save money on transit costs; a route
50% longer would generally work just fine. (AT&T, back when they were the Phone Company, once ran a series of print
advertisements claiming longer routes as a feature: if the direct path was congested, they could still complete your call by routing
you the long way ‘round.) However, we are guaranteed that if all routers seek the shortest route – and if the network is stable – then
all paths are loop-free, because in this case the network will have accurate costs.
Here is a simple example illustrating the importance of global cost-minimization in preventing loops. Suppose we have a network
like this one:
[Figure: A–B with cost 1, A–C and B–C with cost 20 each, and C–D with cost 1]

Now suppose that A and B use distance-vector but are allowed to choose the shortest route to within 10%. A would get a report
from C that D could be reached with cost 1, for a total cost of 21. The forwarding entry via C would be ⟨D,C,21⟩. Similarly, A
would get a report from B that D could be reached with cost 21, for a total cost of 22: ⟨D,B,22⟩. Similarly, B has choices ⟨D,C,21⟩
and ⟨D,A,22⟩.
If A and B both choose the minimal route, no loop forms. But if A and B both use the 10%-overage rule, they would be allowed to
choose the other route: A could choose ⟨D,B,22⟩ and B could choose ⟨D,A,22⟩. If this happened, we would have a routing loop: A
would forward packets for D to B, and B would forward them right back to A.
As we apply distance-vector routing, each router independently builds its tables. A router might have some notion of the path its
packets would take to their destination; for example, in the case above A might believe that with forwarding entry ⟨D,B,22⟩ its
packets would take the path A–B–C–D (though in distance-vector routing, routers do not particularly worry about the big picture).
Consider again the accurate-cost question above. This fails in the 10%-overage example, because the actual path cost is now infinite.
We now prove that, in distance-vector routing, the network will have accurate costs, provided
each router selects what it believes to be the shortest path to the final destination, and
the network is stable, meaning that further dissemination of any reports would not result in changes
To see this, suppose the actual route taken by some packet from source to destination, as determined by application of the
distributed distance-vector algorithm, is longer than the cost calculated by the source. Choose an example of such a path with the
fewest number of links, among all such paths in the network. Let S be the source, D the destination, and k the number of links in
the actual path P. Let S’s forwarding entry for D be ⟨D,N,c⟩, where N is S’s next_hop neighbor.
To have obtained this route through the distance-vector algorithm, S must have received report ⟨D,cD⟩ from N, where we also have
the cost of the S–N link as cN and c = cD + cN. If we follow a packet from N to D, it must take the same path P with the first link
deleted; this sub-path has length k-1 and so, by our hypothesis that k was the length of the shortest path with non-accurate costs, the
cost from N to D is cD. But this means that the cost along path P, from S to D via N, must be cD + cN = c, contradicting our
selection of P as a path longer than its advertised cost.
There is one final observation to make about route costs: any cost-minimization can occur only within a single routing domain,
where full information about all links is available. If a path traverses multiple routing domains, each separate routing domain may
calculate the optimum path traversing that domain. But these “local minimums” do not necessarily add up to a globally minimal
path length, particularly when one domain calculates the minimum cost from one of its routers only to the other domain rather than
to a router within that other domain. Here is a simple example. Routers BR1 and BR2 are the border routers connecting the
domain LD to the left of the vertical dotted line with domain RD to the right. From A to B, LD will choose the shortest path to RD

(not to B, because LD is not likely to have information about links within RD). This is the path of length 3 through BR2. But this
leads to a total path length of 3+8=11 from A to B; the global minimum path length, however, is 4+1=5, through BR1.
[Figure: domains LD (left) and RD (right) meet at border routers BR1 and BR2; from A in LD the costs to BR1 and BR2 are 4 and 3 respectively, and from BR1 and BR2 the costs to B in RD are 1 and 8 respectively]

In this example, domains LD and RD join at two points. For a route across two domains joined at only a single point, the domain-
local shortest paths do add up to the globally shortest path.

9.5: Loop-Free Distance Vector Algorithms
It is possible for routing-update algorithms based on the distance-vector idea to eliminate routing loops – and thus the slow-
convergence problem – entirely. We present brief descriptions of two such algorithms.

9.5.1 DSDV
DSDV, or Destination-Sequenced Distance Vector, was proposed in [PB94]. It avoids routing loops by the introduction of
sequence numbers: each router will always prefer routes with the most recent sequence number, and bad-news information will
always have a lower sequence number than the next cycle of corrected information.
DSDV was originally proposed for MANETs (3.7.8 MANETs) and has some additional features for traffic minimization that, for
simplicity, we ignore here. It is perhaps best suited for wired networks and for small, relatively stable MANETs.
DSDV forwarding tables contain entries for every other reachable node in the system. One successor of DSDV, Ad Hoc On-
Demand Distance Vector routing or AODV, 9.5.2 AODV, allows forwarding tables to contain only those destinations in active use;
a mechanism is provided for discovery of routes to newly active destinations.
Under DSDV, each forwarding table entry contains, in addition to the destination, cost and next_hop, the current sequence number
for that destination. When neighboring nodes exchange their distance-vector reachability reports, the reports include these per-
destination sequence numbers.
When a router R receives a report from neighbor N for destination D, and the report contains a sequence number larger than the
sequence number for D currently in R’s forwarding table, then R always updates to use the new information. The three cost-
minimization rules of 9.2.1 Distance-Vector Update Rules above are used only when the incoming and existing sequence numbers
are equal.
Each time a router R sends a report to its neighbors, it includes a new value for its own sequence number, which it always
increments by 2. This number is then entered into each neighbor’s forwarding-table entry for R, and is then propagated throughout
the network via continuing report exchanges. Any sequence number originating this way will be even, and whenever another
node’s forwarding-table sequence number for R is even, then its cost for R will be finite.
Infinite-cost reports are generated in the usual way when former neighbors discover they can no longer reach one another; however,
in this case each node increments the sequence number for its former neighbor by 1, thus generating an odd value. Any forwarding-
table entry with infinite cost will thus always have an odd sequence number. If A and B are neighbors, and A’s current sequence
number is s, and the A–B link breaks, then B will start reporting A at cost ∞ with sequence number s+1 while A will start reporting
its own new sequence number s+2. Any other node now receiving a report originating with B (with sequence number s+1) will
mark A as having cost ∞, but will obtain a valid route to A upon receiving a report originating from A with new (and larger)
sequence number s+2.
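As a sketch, here is the resulting acceptance test in Python; the entry layout and the function name are invented for illustration, and for the equal-sequence-number case only the lower-cost rule is shown:

INF = float("inf")

def dsdv_accept(entry, N, c_N, D_cost, D_seq):
    # entry: current (next_hop, cost, seq) for destination D, or None
    new_cost = INF if D_cost == INF else D_cost + c_N
    if entry is None or D_seq > entry[2]:
        return (N, new_cost, D_seq)     # a newer sequence number always wins
    if D_seq == entry[2] and new_cost < entry[1]:
        return (N, new_cost, D_seq)     # equal sequence numbers: lower cost wins
    return entry

# a cost-infinity report with odd sequence number 5 is later overridden by
# the destination's own report with even sequence number 6
e = dsdv_accept(None, "B", 1, INF, 5)   # ('B', inf, 5): unreachable
e = dsdv_accept(e, "C", 1, 2, 6)        # ('C', 3, 6): reachable again
print(e)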
The triggered-update mechanism is used: if a node receives a report with some destinations newly marked with infinite cost, it will
in turn forward this information immediately to its other neighbors, and so on. This is, however, not essential; “bad” and “good”
reports are distinguished by sequence number, not by relative arrival time.
It is now straightforward to verify that the slow-convergence problem is solved. After a link break, if there is some alternative path
from router R to destination D, then R will eventually receive D’s latest even sequence number, which will be greater than any
sequence number associated with any report listing D as unreachable. If, on the other hand, the break partitioned the network and
there is no longer any path to D from R, then the highest sequence number circulating in R’s half of the original network will be
odd and the associated table entries will all list D at cost ∞. One way or another, the network will quickly settle down to a state
where every destination’s reachability is accurately described.
In fact, a stronger statement is true: not even transient routing loops are created. We outline a proof. First, whenever router R has
next_hop N for a destination D, then N’s sequence number for D must be greater than or equal to R’s, as R must have obtained its
current route to D from one of N’s reports. A consequence is that all routers participating in a loop for destination D must have the
same (even) sequence number s for D throughout. This means that the loop would have been created if only the reports with
sequence number s were circulating. As we noted in 9.2.1 Distance-Vector Update Rules, any application of the next_hop-increase

rule must trace back to a broken link, and thus must involve an odd sequence number. Thus, the loop must have formed from the
sequence-number-s reports by the application of the first two rules only. But this violates the claim in Exercise 10.0.
There is one drawback to DSDV: nodes may sometimes briefly switch to routes that are longer than optimum (though still correct).
This is because a router is required to use the route with the newest sequence number, even if that route is longer than the existing
route. If A and B are two neighbors of router R, and B is closer to destination D but slower to report, then every time D’s sequence
number is incremented R will receive A’s longer route first, and switch to using it, and B’s shorter route shortly thereafter.
DSDV implementations usually address this by having each router R keep track of the time interval between the first arrival at R of
a new route to a destination D with a given sequence number, and the arrival of the best route with that sequence number. During
this interval following the arrival of the first report with a new sequence number, R will use the new route, but will refrain from
including the route in the reports it sends to its neighbors, anticipating that a better route will soon arrive.
This works best when the hopcount cost metric is being used, because in this case the best route is likely to arrive first (as the news
had to travel the fewest hops), and at the very least will arrive soon after the first route. However, if the network’s cost metric is
unrelated to the hop count, then the time interval between first-route and best-route arrivals can involve multiple update cycles, and
can be substantial.

9.5.2 AODV
AODV, or Ad-hoc On-demand Distance Vector routing, is another routing mechanism often proposed for MANETs, though it is
suitable for some wired networks as well. Unlike DSDV, above, AODV messages circulate only if a link breaks, or when a node is
looking for a route to some other node; this second case is the rationale for the “on-demand” in the name. For larger MANETs, this
may result in a significant reduction in routing-management traffic. AODV is described in [PR99] and RFC 3561
[https://tools.ietf.org/html/rfc3561.html].
The “ad hoc” in the name was intended to suggest that the protocol is well-suited for mobile nodes forming an ad hoc network
(3.7.4 Access Points). It is, but the protocol also works well with infrastructure Wi-Fi networks (those with access points).
AODV has three kinds of messages: RouteRequest or RREQ, for nodes that are looking for a path to a destination, RouteReply or
RREP, as the response, and RouteError or RERR for the reporting of broken links.
AODV performs reasonably well for MANETs in which the nodes are highly mobile, though it does assume all routing nodes are
trustworthy.
AODV is loop-free, due to the way it uses sequence numbers. However, it does not always find the shortest route right away, and
may in fact not find the shortest route for an arbitrarily long interval.
Each AODV node maintains a node sequence number and also a broadcast counter. Every routing message contains a sequence
number for the destination, and every routing record kept by a node includes a field for the destination’s sequence number. Copies
of a node’s sequence number held by other nodes may not be the most current; however, nodes always discard routes with an older
(smaller) sequence number as soon as they hear about a route with a newer sequence number.
AODV nodes also keep track of other nodes that are directly reachable; in the diagram below we will assume these are the nodes
connected by a line.
If node A wishes to find a route to node F, as in the diagram below, the first step is for A to increment its sequence number and
send out a RouteRequest. This message contains the addresses of A and F, A’s just-incremented sequence number, the highest
sequence number of any previous route to F that is known to A (if any), a hopcount field set initially to 1, and A’s broadcast
counter. The end result should be a route from A to F, entered at each node along the path, and also a return route from F back to A.
[Figure: nodes with links A–B, A–C, B–C, B–D, C–E, D–E, D–G and E–F]

The RouteRequest is sent initially to A’s direct neighbors, B and C in the diagram above, using UDP. We will assume for the
moment that the RouteRequest reaches all the way to F before a RouteReply is generated. This is always the case if the “destination
only” flag is set, though if not then it is possible for an intermediate node to generate the RouteReply.
A node that receives a RouteRequest must flood it (“broadcast” it) out all its interfaces to all its directly reachable neighbors, after
incrementing the hopcount field. B therefore sends A’s message to C and D, and C sends it to B and E. For this example, we will
assume that C is a bit slow sending the message to E.

Each node receiving a RouteRequest must hang on to it for a short interval (typically 3 seconds). During this period, if it sees a
duplicate of the RouteRequest, identified by having the same source and the same broadcast counter, it discards it. This discard rule
ensures that RouteRequest messages do not circulate endlessly around loops; it may be compared to the reliable-flooding algorithm
in 9.6 Link-State Routing-Update Algorithm.
A node receiving a new RouteRequest also records (or updates) a routing-table entry for reaching the source of the RouteRequest.
Unless there was a pre-existing newer route (that is, with larger sequence number), the entry is marked with the sequence number
contained in the message, and with next_hop the neighbor from which the RouteRequest was received. This process ensures that, as
part of each node’s processing of a RouteRequest message, it installs a return route back to the originator.
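Here is a minimal Python sketch of this RouteRequest processing; the field names and the rebroadcast callback are invented for illustration:

seen = set()        # (source, broadcast_counter) pairs already processed
routes = {}         # destination -> (next_hop, hopcount, sequence number)

def handle_rreq(rreq, arrived_from, rebroadcast):
    key = (rreq["source"], rreq["broadcast_counter"])
    if key in seen:
        return                          # duplicate RouteRequest: discard
    seen.add(key)
    old = routes.get(rreq["source"])
    if old is None or rreq["source_seq"] > old[2]:
        # install the return route back toward the originator
        routes[rreq["source"]] = (arrived_from, rreq["hopcount"], rreq["source_seq"])
    rebroadcast(dict(rreq, hopcount=rreq["hopcount"] + 1))

# eg node E receiving A's request, forwarded by D with hopcount 3:
handle_rreq({"source": "A", "dest": "F", "broadcast_counter": 1,
             "source_seq": 2, "hopcount": 3}, "D", print)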
We will suppose that the following happen in the order indicated:
B forwards the RouteRequest to D
D forwards the RouteRequest to E and G
C forwards the RouteRequest to E
E forwards the RouteRequest to F
Because E receives D’s copy of the RouteRequest first, it ignores C’s copy. This will mean that, at least initially, the return path will
be longer than necessary. Variants of AODV (such as HWMP below) sometimes allow E to accept C’s message on the grounds that
C has a shorter path back to A. This does mean that initial RouteRequest messages farther on in the network now have incorrect
hopcount values, though these will be corrected by later RouteRequest messages.
After the above messages have been received, each node has a path back to A as indicated by the blue arrows below:
[Figure: aodv2.svg]

F now increments its own sequence number and creates a RouteReply message; F then sends it to A by following the highlighted
(unicast) arrows above, F→E→D→B→A. As each node on the path processes the message, it creates (or updates) its route to the
final destination, F; the return route to A had been created earlier when the node processed the corresponding RouteRequest.
At this point, A and F can communicate bidirectionally. (Each RouteRequest is acknowledged to ensure bidirectionality of each
individual link.)
This route F→E→D→B→A is longer than necessary; a shorter path is F→E→C→A. The shorter path will be adopted if, at some future
point, E learns that E→C→A is a better path, though there is no mechanism to seek out this route.
If the “destination only” flag were not set, any intermediate node reached by the RouteRequest flooding could have answered with
a route to F, if it had one. Such a node would generate the RouteReply on its own, without involving F. The sequence number of the
intermediate node’s route to F must be greater than the sequence number in the RouteRequest message.
If two neighboring nodes can no longer reach one another, each sends out a RouteError message, to invalidate the route. Nodes
keep track of what routes pass through them, for just this purpose. One node’s message will reach the source and the other’s the
destination, at which point the route is invalidated.
In larger networks, it is standard for the originator of a RouteRequest to set the IPv4 header TTL value (or the IPv6 Hop_Limit) to a
smallish value (RFC 3561 [https://tools.ietf.org/html/rfc3561.html] recommends an initial value of 1) to limit the scope of the
RouteRequest messages. If no answer is received, the originator tries again, with a slightly larger TTL value. In a large network,
this reduces the volume of RouteRequest messages that have gone too far and therefore cannot be of use in finding a route.
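This expanding-ring style of search can be sketched as a simple loop. Here send_and_wait(ttl) is a hypothetical function that floods a RouteRequest with the given TTL and reports whether a RouteReply arrived in time; the particular TTL sequence is illustrative, not the protocol’s actual parameter values.

def expanding_ring_search(send_and_wait, ttls=(1, 3, 5, 7, 35)):
    for ttl in ttls:              # try successively larger scopes
        if send_and_wait(ttl):
            return True           # route found without flooding the whole network
    return False                  # no RouteReply even at maximum scope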
AODV cannot form even short-term loops. To show this, we start with the observation that whenever a ⟨destination,next_hop⟩
forwarding entry is installed at a node, due either to a RouteRequest or to a RouteReply, the next_hop is always the node from which
the RouteRequest or RouteReply was received, and therefore the destination sequence number cannot get smaller as we move from
the original node to its next_hop. That is, as we follow any route to a destination, the destination sequence numbers are
nondecreasing. It immediately follows that, for a routing loop, the destination sequence number is constant along the loop. This
means that each node on the route must have heard of the route via the same RouteRequest or RouteReply message, as forwarded.
The second observation, completing the argument, is that the hopcount field must strictly decrease as we travel along the route to
the destination; the processing rules for RouteRequests and RouteReplies mean that each node installs a hopcount of one more than
that of the neighboring node from which the route was received. This is impossible for a route that returns to the same node.

9.4.3 HWMP
The Hybrid Wireless Mesh Protocol is based on AODV, and has been chosen for the IEEE 802.11s Wi-Fi mesh networking
standard (3.7.4.4 Mesh Networks). In the discussion here, we will assume HWMP is being used in a Wi-Fi network, though the
protocol applies to any type of network. A set of nodes is designated as the routing (or forwarding) nodes; ordinary Wi-Fi stations
may or may not be included here.
HWMP replaces the hopcount metric used in AODV with an “airtime link metric” which decreases as the link throughput increases
and as the link error rate decreases. This encourages the use of higher-quality wireless links.
HWMP has two route-generating modes: an on-demand mode very similar to AODV, and a proactive mode used when there is at
least one identified “root” node that connects to the Internet. In this case, the route-generating protocol determines a loop-free
subset of the relevant routing links (that is, a spanning tree) by which each routing node can reach the root (or one of the roots).
This tree-building process does not attempt to find best paths between pairs of non-root nodes, though such nodes can use the on-
demand mode as necessary.
In the first, on-demand, mode, HWMP implements a change to classic AODV in that if a node receives a RouteRequest message
and then later receives a second RouteRequest message with the same sequence number but a lower-cost route, then the second
route replaces the first.
In the proactive mode, the designated root node – typically the node with wired Internet access – periodically sends out specially
marked RouteRequest messages. These are sent to the broadcast address, rather than to any specific destination, but otherwise
propagate in the usual way. Routing nodes receiving two copies from two different neighbors pick the one with the shortest path.
Once this process stabilizes, each routing node knows the best path to the root (or to a root); the fact that each routing node chooses
the best path from among all RouteRequest messages received ensures eventual route optimality. Routing nodes that have traffic to
send can at any time generate a RouteReply, which will immediately set up a reverse route from the root to the node in question.
Finally, reversing each link to the root allows the root to send broadcast messages.
HWMP has yet another mode: the root nodes can send out RootAnnounce (RANN) messages. These let other routing nodes know
what the root is, but are not meant to result in the creation of routes to the root.

9.4.4 EIGRP
EIGRP, or the Enhanced Interior Gateway Routing Protocol, is a once-proprietary Cisco distance-vector protocol that was
released as an Internet Draft in February 2013. As with DSDV, it eliminates the risk of routing loops, even ephemeral ones. It is
based on the “diffusing update algorithm” (DUAL) of [JG93]. EIGRP is an actual protocol; we present here only the general
algorithm. Our discussion follows [CH99].
Each router R keeps a list of neighbor routers NR, as with any distance-vector algorithm. Each R also maintains a data structure
known (somewhat misleadingly) as its topology table. It contains, for each destination D and each N in NR, an indication of
whether N has reported the ability to reach D and, if so, the reported cost c(D,N). The router also keeps, for each N in NR, the cost
cN of the link from R to N. Finally, the forwarding-table entry for any destination can be marked “passive”, meaning safe to use, or
“active”, meaning updates are in process and the route is temporarily unavailable.
Initially, we expect that for each router R and each destination D, R’s next_hop to D in its forwarding table is the neighbor N for
which the following total cost is a minimum:

c(D,N) + cN
Now suppose R receives a distance-vector report from neighbor N1 that it can reach D with cost c(D,N1). This is processed in the
usual distance-vector way, unless it represents an increased cost and N1 is R’s next_hop to D; this is the third case in 9.1.1
Distance-Vector Update Rules. In this case, let C be R’s current cost to D, and let us say that neighbor N of R is a feasible next_hop
(feasible successor in Cisco’s terminology) if N’s cost to D (that is, c(D,N)) is strictly less than C. R then updates its route to D to
use the feasible neighbor N for which c(D,N) + cN is a minimum. Note that this may not in fact be the shortest path; it is possible
that there is another neighbor M for which c(D,M)+cM is smaller, but c(D,M)≥C. However, because N’s path to D is loop-free, and
because c(D,N) < C, this new path through N must also be loop-free; this is sometimes summarized by the statement “one cannot
create a loop by adopting a shorter route”.
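As a sketch of the feasible-successor computation (ours, not Cisco’s implementation): reported[N] holds c(D,N) from R’s topology table, link_cost[N] holds cN, and C is R’s current cost to D.

def best_feasible(reported, link_cost, C):
    # feasibility condition: N's own cost to D must be strictly less than C;
    # this is what guarantees the new path is loop-free
    feasible = [N for N, cDN in reported.items() if cDN < C]
    if not feasible:
        return None               # no feasible successor: invoke DUAL
    return min(feasible, key=lambda N: reported[N] + link_cost[N])

# M's total cost 4+1=5 beats N's 3+4=7, but M is rejected: c(D,M) = 4 is not < C
print(best_feasible({'N': 3, 'M': 4}, {'N': 4, 'M': 1}, C=4))    # prints N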

If no neighbor N of R is feasible – which would be the case in the D–A–B example of 9.2 Distance-Vector Slow-Convergence
Problem – then R invokes the “DUAL” algorithm. This is sometimes called a “diffusion” algorithm as it invokes a diffusion-like
spread of table changes proceeding away from R.
Let C in this case denote the new cost from R to D as based on N1’s report. R marks destination D as “active” (which suppresses
forwarding to D) and sends a special query to each of its neighbors, in the form of a distance-vector report indicating that its cost to
D has now increased to C. The algorithm terminates when all R’s neighbors reply back with their own distance-vector reports; at
that point R marks its entry for D as “passive” again.
Some neighbors may be able to process R’s report without further diffusion to other nodes, remain “passive”, and reply back to R
immediately. However, other neighbors may, like R, now become “active” and continue the DUAL algorithm. In the process, R
may receive other queries that elicit its distance-vector report; as long as R is “active” it will report its cost to D as C. We omit the
argument that this process – and thus the network – must eventually converge.


9.6: Link-State Routing-Update Algorithm
Link-state routing is an alternative to distance-vector. It is often – though certainly not always – considered to be the routing-update
algorithm class of choice for networks that are “sufficiently large”, such as those of ISPs. There are two specific link-state
protocols: the IETF’s Open Shortest Path First (OSPF, RFC 2328 [https://tools.ietf.org/html/rfc2328.html]), and OSI’s
Intermediate Systems to Intermediate Systems (IS-IS, documented unofficially in RFC 1142
[https://tools.ietf.org/html/rfc1142.html]).
In distance-vector routing, each node knows a bare minimum of network topology: it knows nothing about links beyond those to its
immediate neighbors. In the link-state approach, each node keeps a maximum amount of network information: a full map of all
nodes and all links. Routes are then computed locally from this map, using the shortest-path-first algorithm. The existence of this
map allows, in theory, the calculation of different routes for different quality-of-service requirements. The map also allows
calculation of a new route as soon as news of the failure of the existing route arrives; distance-vector protocols on the other hand
must wait for news of a new route after an existing route fails.
Link-state protocols distribute network map information through a modified form of broadcast of the status of each individual link.
Whenever either side of a link notices the link has died (or if a node notices that a new link has become available), it sends out
link-state packets (LSPs) that “flood” the network. This broadcast process is called reliable flooding. In general, broadcast
mechanisms are not compatible with networks that have topological looping (that is, redundant paths); broadcast packets may
circulate around the loop endlessly. Link-state protocols must be carefully designed to ensure that both every router sees every LSP,
and also that no LSPs circulate repeatedly. (The acronym LSP is used by IS-IS; the preferred acronym used by OSPF is LSA,
where A is for advertisement.) LSPs are sent immediately upon link-state changes, like triggered updates in distance-vector
protocols, except that there is no “race” between “bad news” and “good news”.
It is possible for ephemeral routing loops to exist; for example, if one router has received an LSP but another has not, the two may
have inconsistent views of the network, and each may select the other as its next_hop. However, as soon as the LSP has reached all routers involved,
the loop should vanish. There are no “race conditions”, as with distance-vector routing, that can lead to persistent routing loops.
The link-state flooding algorithm avoids the usual problems of broadcast in the presence of loops by having each node keep a
database of all LSP messages. The originator of each LSP includes its identity, information about the link that has changed status,
and also a sequence number. Other routers need only keep in their databases the LSP with the largest sequence number;
older LSPs can be discarded. When a router receives a LSP, it first checks its database to see if that LSP is old, or is current but has
been received before; in these cases, no further action is taken. If, however, an LSP arrives with a sequence number not seen
before, then in typical broadcast fashion the LSP is retransmitted over all links except the arrival interface.
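Here is a minimal sketch of this receive-and-flood rule, with hypothetical names: db maps each originator to the highest sequence number seen so far, and the function returns the interfaces over which the LSP should be retransmitted.

def process_lsp(db, origin, seq, arrival_iface, all_ifaces):
    if origin in db and seq <= db[origin]:
        return []                 # old, or current but seen before: no action
    db[origin] = seq              # record the newer LSP
    # retransmit over all links except the arrival interface
    return [i for i in all_ifaces if i != arrival_iface]

db = {}
print(process_lsp(db, 'A', 1, 'if0', ['if0', 'if1', 'if2']))   # ['if1', 'if2']
print(process_lsp(db, 'A', 1, 'if1', ['if0', 'if1', 'if2']))   # []: a duplicate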
As an example, consider the following arrangement of routers:
[Figure: loop_plus_E.svg]

Suppose the A–E link status changes. A sends LSPs to C and B. Both these will forward the LSPs to D; suppose B’s arrives first.
Then D will forward the LSP to C; the LSP traveling C→D and the LSP traveling D→C might even cross on the wire. D will
ignore the second LSP copy that it receives from C and C will ignore the second copy it receives from D.
It is important that LSP sequence numbers not wrap around. (Protocols that do allow a numeric field to wrap around usually have a
clear-cut idea of the “active range” that can be used to conclude that the numbering has wrapped rather than restarted; this is harder
to do in the link-state context.) OSPF uses lollipop sequence-numbering here: sequence numbers begin at −2^31 and increment to
2^31−1. At this point they wrap around back to 0. Thus, as long as a sequence number is less than zero, it is guaranteed unique; at the
same time, routing will not cease if more than 2^31 updates are needed. Other link-state implementations use 64-bit sequence
numbers.
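One plausible implementation of the comparison just described, as a sketch: the negative “stick” of the lollipop is ordered linearly, while nonnegative values on the circular part are compared with serial-number arithmetic. This helper is our illustration, not OSPF’s exact rule.

def lollipop_newer(a, b):
    # is sequence number a newer than sequence number b?
    if a < 0 or b < 0:
        return a > b                          # linear order on the stick
    # both on the circle: a is newer if it lies "ahead" of b, modulo 2**31
    return a != b and (a - b) % 2**31 < 2**30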
Actual link-state implementations often give link-state records a maximum lifetime; entries must be periodically renewed.

9.6.1 Shortest-Path-First Algorithm


The next step is to compute routes from the network map, using the shortest-path-first (SPF) algorithm. This algorithm computes
shortest paths from a given node, A in the example here, to all other nodes. Below is our example network; we are interested in the
shortest paths from A to B, C and D.

[Figure: SPFexample.svg]

Before starting the algorithm, we note the shortest path from A to D is A-B-C-D, which has cost 3+4+2=9.
The algorithm builds the set R of all shortest-path routes iteratively. Initially, R contains only the 0-length route to the start node;
one new destination and route is added to R at each stage of the iteration. At each stage we have a current node, representing the
node most recently added to R. The initial current node is our starting node, in this case, A.
We will also maintain a set T, for tentative, of routes to other destinations. This is also initialized to empty.
At each stage, we find all nodes which are immediate neighbors of the current node and which do not already have routes in the
set R. For each such node N, we calculate the cost of the route from the start node to N that goes through the current node. We see
if this is our first route to N, or if the route improves on any route to N already in T; if so, we add or update the route in T
accordingly. Doing this, the routes will be discovered in order of increasing (or nondecreasing) cost.
At the end of this process, we choose the shortest path in T, and move the route and destination node to R. The destination node of
this shortest path becomes the next current node. Ties can be resolved arbitrarily, but note that, as with distance-vector routing, we
must choose the minimum or else the accurate-costs property will fail.
We repeat this process until all nodes have routes in the set R.
For the example above, we start with current = A and R = {⟨A,A,0⟩}. The set T will be {⟨B,B,3⟩, ⟨C,C,10⟩, ⟨D,D,11⟩}. The
lowest-cost entry is ⟨B,B,3⟩, so we move that to R and continue with current = B. No path through C or D can possibly have lower
cost.
For the next stage, the neighbors of B without routes in R are C and D; the routes from A to these through B are ⟨C,B,7⟩ and
⟨D,B,12⟩. The former is an improvement on the existing T entry ⟨C,C,10⟩ and so replaces it; the latter is not an improvement over
⟨D,D,11⟩. T is now {⟨C,B,7⟩, ⟨D,D,11⟩}. The lowest-cost route in T is that to C, so we move this node and route to R and set C to
be current.
Again, ⟨C,B,7⟩ must be the shortest path to C. If any lower-cost path to C existed, then we would be selecting that shorter path – or
a prefix of it – at this point, instead of the ⟨C,B,7⟩ path; see the proof below.
For the next stage, D is the only non-R neighbor; the path from A to D via C has entry ⟨D,B,9⟩, an improvement over the existing
⟨D,D,11⟩ in T. The only entry in T is now ⟨D,B,9⟩; this has the lowest cost and thus we move it to R.
We now have routes in R to all nodes, and are done.
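The iteration above is Dijkstra’s shortest-path algorithm; here is a compact Python sketch using a heap for the set T. For readability it records entire paths rather than ⟨destination,next_hop,cost⟩ triples. The graph encodes the first example, with edge costs read off from the walkthrough above.

import heapq

def spf(graph, start):
    # graph: {node: {neighbor: cost}}; returns {node: (cost, path)}
    R = {}                                   # completed routes
    T = [(0, start, [start])]                # tentative routes, kept as a heap
    while T:
        cost, node, path = heapq.heappop(T)
        if node in R:
            continue                         # a lower-cost route was chosen earlier
        R[node] = (cost, path)               # lowest-cost tentative route moves to R
        for nbr, c in graph[node].items():   # examine neighbors of the new current
            if nbr not in R:
                heapq.heappush(T, (cost + c, nbr, path + [nbr]))
    return R

graph = {
    'A': {'B': 3, 'C': 10, 'D': 11},
    'B': {'A': 3, 'C': 4, 'D': 9},
    'C': {'A': 10, 'B': 4, 'D': 2},
    'D': {'A': 11, 'B': 9, 'C': 2},
}
print(spf(graph, 'A')['D'])                  # (9, ['A', 'B', 'C', 'D'])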
Here is another example, again with links labeled with costs:
[Figure: SPF2a.svg]

We start with current = A. At the end of the first stage, ⟨B,B,3⟩ is moved into R, T is {⟨D,D,12⟩}, and current is B. The second
stage adds ⟨C,B,5⟩ to T, and then moves this to R; current then becomes C. The third stage introduces the route (from A)
⟨D,B,10⟩; this is an improvement over ⟨D,D,12⟩ and so replaces it in T; at the end of the stage this route to D is moved to R.
In both the examples above, the current nodes progressed along a path, A→B→C→D. This is not generally the case; here is a
similar example but with different lengths in which current jumps from B to D:
[Figure: SPF2b.svg]

As in the previous example, at the end of the first stage ⟨B,B,3⟩ is moved into R, with T = {⟨D,D,4⟩}, and B becomes current. The
second stage adds ⟨C,B,6⟩ to T. However, the shortest path in T is now ⟨D,D,4⟩, and so it is D that becomes the next current. The
final stage replaces ⟨C,B,6⟩ in T with ⟨C,D,5⟩. At that point this route is added to R and the algorithm is completed.

Proof that SPF paths are shortest: suppose, by contradiction, that for some node a shorter path exists than the one generated by
SPF. Let A be the start node, and let U be the first node generated for which the SPF path is not shortest. Let T be the Tentative set
and let R be the set of completed routes at the point when we choose U as current, and let d be the cost of the new route to U. Let
P = ⟨A,…,X,Y,…,U⟩ be the shorter path to U, with cost c<d, where Y is the first node along the path not to have a route in R (it is
possible Y=U).
[Figure: spfproof.svg]

At some strictly earlier stage in the algorithm, we must have added a route to node X, as the route to X is in R. In the following
stage, we would have included the prefix ⟨A,…,X,Y⟩ of P in Tentative. This path to Y has cost ≤c. This route must still be in T at
the point we chose U as current, as there is no route to Y in R; but this means we should instead have chosen Y as current,
contradicting the choice of U.
A link-state source node S computes the entire path to a destination D (in fact it computes the path to every destination). But as far
as the actual path that a packet sent by S will take to D, S has direct control only as far as the first hop N. While the accurate-cost
rule we considered in distance-vector routing will still hold, the actual path taken by the packet may differ from the path computed
at the source, in the presence of alternative paths of the same length. For example, S may calculate a path S–N–A–D, and yet a
packet may take path S–N–B–D, so long as the N–A–D and N–B–D paths have the same length.
Link-state routing allows calculation of routes on demand (results are then cached), or larger-scale calculation. Link-state also
allows routes calculated with quality-of-service taken into account, via straightforward extension of the algorithm above.
Because the starting node is fixed, the shortest-path-first algorithm can be classified as a single-source approach. If the goal is to
compute the shortest paths between all pairs of nodes in a network, the Floyd-Warshall algorithm
[en.Wikipedia.org/wiki/Floyd%...all_algorithm] is an alternative, with slightly better performance in networks with large numbers
of links.


9.7: Routing on Other Attributes
There is sometimes a desire to route on packet attributes other than the destination, or the destination and QoS bits. For example,
we might want to route packets based in part on the packet source, or on the TCP port number. This kind of routing is decidedly
nonstandard, though it is often available, and often an important component of traffic engineering.
This option is often known as policy-based routing, because packets are routed according to attributes specified by local
administrative policy. (This term should not be confused with BGP routing policy (10.6 Border Gateway Protocol, BGP), which
means something quite different.)
Policy-based routing is not used frequently, but one routing decision of this type can have far-reaching effects. If an ISP wishes to
route customer voice traffic differently from customer data traffic, for example, it need only apply policy-based routing to classify
traffic at the point of entry, and send the voice traffic to its own router. After that, ordinary routers on the voice path and on the
separate data path can continue the forwarding without using policy-based methods.
Sometimes policy-based routing is used to mark packets for special processing; this might mean different routing further
downstream or it might mean being sent along the same path as the other traffic but with preferential treatment. For two packet-
marking strategies, see 20.7 Differentiated Services and 20.12 Multi-Protocol Label Switching (MPLS).
On Linux systems policy-based routing is part of the Linux Advanced Routing facility, often grouped with some advanced queuing
features known as Traffic Control; the combination is referred to as LARTC [www.lartc.org/].
As a simple example of what can be done, suppose a site has two links L1 and L2 to the Internet, with L1 the default route to the
Internet. Perhaps L1 is faster and L2 serves more as a backup; perhaps L2 has been added to increase outbound capacity. A site
may wish to route some outbound traffic via L2 for any of the following reasons:
the traffic may involve protocols deemed lower in priority (eg email)
the traffic may be real-time traffic that can benefit from reduced competition on L2
the traffic may come from lower-priority senders; eg some customers within the site may be relegated to using L2 because they
are paying less
a few large-volume elephant flows [en.Wikipedia.org/wiki/Elephant_flow] may be offloaded from L1 to L2
In the first two cases, routing might be based on the destination port numbers; in the third, it might be based on the source IP
address. In the fourth case, a site’s classification of its elephant flows may have accumulated over time.
Note that nothing can be done in the inbound direction unless L1 and L2 lead to the same ISP, and even there any special routing
would be at the discretion of that ISP.
The trick with LARTC is to be compatible with existing routing-update protocols; this would be a problem if the kernel forwarding
table simply added columns for other packet attributes that neighboring non-LARTC routers knew nothing about. Instead, the
forwarding table is split up into multiple ⟨dest, next_hop⟩ (or ⟨dest, QoS, next_hop⟩) tables. One of these tables is the main table,
and is the table that is updated by routing-update protocols interacting with neighbors. Before a packet is forwarded,
administratively supplied rules are consulted to determine which table to apply; these rules are allowed to consult other packet
attributes. The collection of tables and rules is known as the routing policy database.
As a simple example, in the situation above the main table would have an entry ⟨default, L1⟩ (more precisely, it would have the IP
address of the far end of the L1 link instead of L1 itself). There would also be another table, perhaps named slow, with a single
entry ⟨default, L2⟩. If a rule is created to have a packet routed using the “slow” table, then that packet will be forwarded via L2.
Here is one such Linux rule, applying to traffic from host 10.0.0.17:

ip rule add from 10.0.0.17 table slow
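Abstractly, the rule-then-table dispatch looks like the sketch below; the packet representation and the reduction of each table to a single default next_hop are simplifications of ours, not how the kernel stores things.

tables = {'main': 'L1', 'slow': 'L2'}        # each table reduced to its default route
rules = [(lambda pkt: pkt['src'] == '10.0.0.17', 'slow')]   # first match wins

def route(pkt):
    for matches, table_name in rules:        # consult the administrative rules
        if matches(pkt):
            return tables[table_name]
    return tables['main']                    # no rule matched: use the main table

print(route({'src': '10.0.0.17'}))           # L2
print(route({'src': '10.0.0.99'}))           # L1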

Now suppose we want to route traffic to port 25 (the SMTP port) via L2. This is harder; Linux provides no support here for routing
based on port numbers. However, we can instead use the iptables [en.Wikipedia.org/wiki/Iptables] mechanism to “mark” all
packets destined for port 25, and then create a routing-policy rule to have such marked traffic use the slow table. The mark is
known as the forwarding mark, or fwmark; its value is 0 by default. The fwmark is not actually part of the packet; it is
associated with the packet only while the latter remains within the kernel.

iptables --table mangle --append PREROUTING \
    --protocol tcp --dport 25 --jump MARK --set-mark 1

ip rule add fwmark 1 table slow

Consult the applicable man pages for further details.


The iptables mechanism can also be used to set the appropriate QoS bits – the IPv4 DS bits (7.1 The IPv4 Header) or the IPv6
Traffic Class bits (8.1 The IPv6 Header) – so that a single standard IP forwarding table can be used, though support for the IPv4
QoS bits is limited.


9.8: ECMP
Equal-Cost MultiPath routing, or ECMP, is a technique for combining two (or more) routes to a destination into a single unit, so
that traffic to that destination is distributed (not necessarily equally) among the routes. ECMP is supported by EIGRP (9.4.4
EIGRP) and the link-state implementations OSPF and IS-IS (9.6 Link-State Routing-Update Algorithm). At the Ethernet level,
ECMP is supported in spirit (if not in name) by TRILL and SPB (2.7 TRILL and SPB). It is also supported by BGP (10.6 Border
Gateway Protocol, BGP) for inter-AS routing.
A simpler alternative to ECMP is channel bonding, also known as link aggregation, and often based on the IEEE 802.3ad
[en.Wikipedia.org/wiki/Link_aggregation] standard. In channel bonding, two parallel Ethernet links are treated as a single unit. In
many cases it is simpler and cheaper to bond two or three 1 Gbps Ethernet links than to upgrade everything to support 10 Gbps.
Channel bonding applies, however, in limited circumstances; for example, the two channels must both be Ethernet, and must
represent a single link.
In the absence of channel bonding, equal-cost does not necessarily mean equal-propagation-delay. Even for two short parallel links,
queuing delays on one link may mean that packet delivery order is not preserved. As TCP usually interprets out-of-order packet
delivery as evidence of packet loss (13.3 TCP Tahoe and Fast Retransmit), this can lead to large numbers of spurious
retransmissions. For this reason, ECMP is usually configured to send all the packets of any one TCP connection over just one of
the links (as determined by a hash function); some channel-bonding implementations do the same, in fact. A consequence is that
ECMP configured this way must see many parallel TCP connections in order to utilize all participating paths reasonably equally. In
special cases, however, it may be practical to configure ECMP to alternate between the paths on a per-packet basis, using round-
robin transmission; this approach typically achieves much better load-balancing between the paths.
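Here is a sketch of such per-connection path selection, hashing the connection five-tuple so that every packet of one TCP connection takes the same path; real routers use their own hash functions, so this is illustrative only.

import hashlib

def pick_path(src_ip, dst_ip, src_port, dst_port, proto, paths):
    # all packets of one connection produce the same hash, preserving order
    key = f"{src_ip},{dst_ip},{src_port},{dst_port},{proto}".encode()
    h = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    return paths[h % len(paths)]

print(pick_path('10.0.0.1', '10.0.9.9', 34567, 80, 'tcp', ['L1', 'L2']))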
In terms of routing-update protocols, ECMP can be viewed as allowing two (or more) next_hop values, each with the same cost, to
be associated with the same destination.
See 18.9.4 multitrunk.py for an example of the use of software-defined networking to have multiple TCP connections take different
paths to the same destination, in a way similar to the ECMP approach.


9.9: Epilog and Exercises

At this point we have concluded the basics of IP routing, involving routing within large (relatively) homogeneous organizations
such as multi-site corporations or Internet Service Providers. Every router involved must agree to run the same protocol, and must
agree to a uniform assignment of link costs.
At the very largest scales, these requirements are impractical. The next chapter is devoted to this issue of very-large-scale IP
routing, eg on the global Internet.

9.9 Exercises
Exercises are given fractional (floating point) numbers, to allow for interpolation of new exercises. Exercise 2.5 is distinct, for
example, from exercises 2.0 and 3.0. Exercises marked with a ♢ have solutions or hints at 24.8 Solutions for Routing-Update
Algorithms.
1.0. Suppose the network is as follows, where distance-vector routing update is used. Each link has cost 1, and each router has
entries in its forwarding table only for its immediate neighbors (so A’s table contains ⟨B,B,1⟩, ⟨D,D,1⟩ and B’s table contains
⟨A,A,1⟩, ⟨C,C,1⟩).
[Figure: exercise1.svg]
(a). Suppose each node creates a report from its initial configuration and sends that to each of its neighbors. What will each node’s
forwarding table be after this set of exchanges? The exchanges, in other words, are all conducted simultaneously; each node first
sends out its own report and then processes the reports arriving from its two neighbors.
(b). What will each node’s table be after the simultaneous-and-parallel exchange process of part (a) is repeated a second time?
Hint: you do not have to go through each exchange in detail; the only information added by an exchange is additional reachability
information.
2.0. Now suppose the configuration of routers has the link weights shown below.
[Figure: exercise2.svg]
(a). As in the previous exercise, give each node’s forwarding table after each node exchanges with its immediate neighbors
simultaneously and in parallel.
(b). How many iterations of such parallel exchanges will it take before C learns to reach F via B; that is, before it creates the entry
⟨F,B,11⟩? Count the answer to part (a) as the first iteration.
2.5.♢ A router R has the following distance-vector table:

destination   cost   next hop
A             2      R1
B             3      R2
C             4      R1
D             5      R3

R now receives the following report from R1; the cost of the R–R1 link is 1.

destination   cost
A             1
B             2
C             4
D             3

Give R’s updated table after it processes R1’s report. For each entry that changes, give a brief explanation.
3.0. A router R has the following distance-vector table:

destination   cost   next hop
A             5      R1
B             6      R1
C             7      R2
D             8      R2
E             9      R3

R now receives the following report from R1; the cost of the R–R1 link is 1.

destination   cost
A             4
B             7
C             7
D             6
E             8
F             8

Give R’s updated table after it processes R1’s report. For each entry that changes, give a brief explanation, in the style of 9.1.5
Example 4.
3.5. At the start of Example 3 (9.1.4 Example 3), we changed C’s routing table so that it reached D via A instead of via E: C’s entry
⟨D,E,2⟩ was changed to ⟨D,A,2⟩. This meant that C had a valid route to D at the start.
How might the scenario of Example 3 play out if C’s table had not been altered? Give a sequence of reports that leads to correct
routing between D and E.
4.0. In the following exercise, A-D are routers and the attached subnets N1-N6, which are the ultimate destinations, are shown
explicitly. In the case of N1 through N4, the links are the subnets. Routers still exchange distance-vector reports with neighboring
routers, as usual. In the tables requested below, if a router has a direct connection to a subnet, you may report the next_hop as
“direct”, eg, from A’s table, ⟨N1,direct,0⟩
[Figure: routers_and_nets.svg]
(a). Give the initial tables for A through D, before any distance-vector exchanges.
(b). Give the tables after each router A-D exchanges with its immediate neighbors simultaneously and in parallel.
(c). At the end of (b), what subnets are not known by what routers?
5.0. Suppose A, B, C, D and E are connected as follows. Each link has cost 1, and so each forwarding table is uniquely determined;
B’s table is ⟨A,A,1⟩, ⟨C,C,1⟩, ⟨D,A,2⟩, ⟨E,C,2⟩. Distance-vector routing update is used.
[Figure: fiveloop_costs.svg]

Now suppose the D–E link fails, and so D updates its entry for E to ⟨E,-,∞⟩.
(a). Give A’s table after D reports ⟨E,∞⟩ to A
(b). Give B’s table after A reports to B
(c). Give A’s table after B reports to A; note that B has an entry ⟨E,C,2⟩
(d). Give D’s table after A reports to D.
5.5. In the network below, A receives alternating reports about destination D from neighbors B and C. Suppose A uses a modified
form of Rule 2 of 9.1.1 Distance-Vector Update Rules, in which it updates its forwarding table whenever the new cost c is less than
or equal to the existing cost c_old.
[Figure: dv_ties.svg]

Explain why A’s forwarding entry for destination D never stabilizes.

6.0. Consider the network in 9.2.1.1 Split Horizon, using distance-vector routing updates. B and C’s table entries for destination D
are shown. All link costs are 1.
[Figure: split_horizon_loop.svg]

Suppose the D–A link breaks and then these update reports occur:
A reports ⟨D,∞⟩ to B (as before)
C reports ⟨D,2⟩ to B (as before)
A now reports ⟨D,∞⟩ to C (instead of B reporting ⟨D,3⟩ to A)
(a). Give A, B and C’s forwarding-table records for destination D, including the cost, after these three reports.
(b). What additional reports (a pair should suffice) will lead to the formation of the routing loop?
(c). What (single) additional report will eliminate the possibility of the routing loop?
7.0. Suppose the network of 9.2 Distance-Vector Slow-Convergence Problem is changed to the following. Distance-vector update is
used; again, the D–A link breaks.
[Figure: hold_down.svg]
(a). Explain why B’s report back to A, after A reports ⟨D,-,∞⟩, is now valid.
(b). Explain why hold down (9.2.1.3 Hold Down) will delay the use of the new route A–B–E–D.
8.0. Suppose the routers are A, B, C, D, E and F, and all link costs are 1. The distance-vector forwarding tables for A and F are
below. Give the network with the fewest links that is consistent with these tables. Hint: any destination reached at cost 1 is directly
connected; if X reaches Y via Z at cost 2, then Z and Y must be directly connected.
A’s table

dest   cost   next_hop
B      1      B
C      1      C
D      2      C
E      2      C
F      3      B

F’s table

dest   cost   next_hop
A      3      E
B      2      D
C      2      D
D      1      D
E      1      E

9.0. (a) Suppose routers A and B somehow end up with respective forwarding-table entries ⟨D,B,n⟩ and ⟨D,A,m⟩, thus creating a
routing loop. Explain why the loop may be removed more quickly if A and B both use poison reverse with split horizon (9.2.1.1
Split Horizon), versus if A and B use split horizon only.
(b). Suppose the network looks like the following. The A–B link is extremely slow.
[Figure: poison_reverse.svg]

Suppose A and B send reports to each other advertising their routes to D, and immediately afterwards the C–D link breaks and C
reports to A and B that D is unreachable. After those unreachability reports from C are processed, A and B’s reports sent to each
other before the break finally arrive. Explain why the network is now in the state described in part (a).
10.0. Suppose the distance-vector algorithm is run on a network and no links break (so by the last paragraph of 9.1.1 Distance-
Vector Update Rules the next_hop-increase rule is never applied).

(a). Prove that whenever A is B’s next_hop to destination D, then A’s cost to D is strictly less than B’s. Hint: assume that if this
claim is true, then it remains true after any application of the rules in 9.1.1 Distance-Vector Update Rules. If the lower-cost rule is
applied to B after receiving a report from A, resulting in a change to B’s cost to D, then one needs to show A’s cost is less than B’s,
and also B’s new cost is less than that of any neighbor C that uses B as its next_hop to D.
(b). Use (a) to prove that no routing loops ever form.
11.0. It was mentioned in 9.6 Link-State Routing-Update Algorithm that link-state routing might give rise to an ephemeral routing
loop. Give a concrete scenario illustrating creation (and then dissolution) of such a loop.
12.0. Use the Shortest-Path-First algorithm to find the shortest path from A to E in the network below. Show the sets R and T, and
the node current, after each step.
[Figure: zigzag.svg]

13.0. Suppose you take a laptop, plug it into an Ethernet LAN, and connect to the same LAN via Wi-Fi. From laptop to LAN there
are now two routes. Which route will be preferred? How can you tell which way traffic is flowing? How can you configure your
OS to prefer one path or another? (See also 7.9.5 ARP and multihomed hosts, 7 IP version 4 exercise 11.0, and 12 TCP Transport
exercise 13.0.)



10: Large-Scale IP Routing


In the previous chapter we considered two classes of routing-update algorithms: distance-vector and link-state. Each of these
approaches requires that participating routers have agreed first to a common protocol, and then to a common understanding of how
link costs are to be assigned. We will address this further below in 10.6 Border Gateway Protocol, BGP, but a basic problem is that
if one site prefers the hop-count approach, assigning every link a cost of 1, while another site prefers to assign link costs in
proportion to their bandwidth, then meaningful path cost comparisons between the two sites simply cannot be done.
The term routing domain is used to refer to a set of routers under common administration, using a common link-cost assignment.
Another term for this is autonomous system. While use of a common routing-update protocol within the routing domain is not an
absolute requirement – for example, some subnets may internally use distance-vector while the site’s “backbone” routers use link-
state – we can assume that all routers have a uniform view of the site’s topology and cost metrics.
One of the things included in the term “large-scale” IP routing is the coordination of routing between multiple routing domains.
Even in the earliest Internet there were multiple routing domains, if for no other reason than that how to measure link costs was
(and still is) too unsettled to set in stone. However, another component of large-scale routing is support for hierarchical routing,
above the level of subnets; we turn to this next.
10.1: Classless Internet Domain Routing - CIDR
10.2: Hierarchical Routing
10.3: Legacy Routing
10.4: Provider-Based Routing
10.5: Geographical Routing
10.6: Border Gateway Protocol, BGP
10.7: Epilog and Exercises


10.1: Classless Internet Domain Routing - CIDR
CIDR is the mechanism for supporting hierarchical routing in the Internet backbone. Subnetting moves the network/host division
line further rightwards; CIDR allows moving it to the left as well. With subnetting, the revised division line is visible only within
the organization that owns the IP network address; subnetting is not visible outside. CIDR allows aggregation of IP address blocks
in a way that is visible to the Internet backbone.
When CIDR was introduced in 1993, the following were some of the justifications for it, all relating to the increasing size of the
backbone IP forwarding tables, and expressed in terms of the then-current Class A/B/C mechanism:
The Internet is running out of Class B addresses (this happened in the mid-1990’s)
There are too many Class C’s (the most numerous) for backbone forwarding tables to be efficient
Eventually IANA (the Internet Assigned Numbers Authority) will run out of IP addresses (this happened in 2011)
Assigning non-CIDRed multiple Class C’s in lieu of a single Class B would have helped with the first point in the list above, but
made the second point worse.
Ironically, the current (2013) very tight market for IP address blocks is likely to lead to larger and larger backbone IP forwarding
tables, as sites are forced to use multiple small address blocks instead of one large block.
By the year 2000, CIDR had essentially eliminated the Class A/B/C mechanism from the backbone Internet, and had more-or-less
completely changed how backbone routing worked. You purchased an address block from a provider or some other IP address
allocator, and it could be whatever size you needed, from /32 to /15.
What CIDR enabled is IP routing based on an address prefix of any length; the Class A/B/C mechanism of course used fixed prefix
lengths of 8, 16 and 24 bits. Furthermore, CIDR allows different routers, at different levels of the backbone, to route on prefixes of
different lengths. If organization P were allocated a /10 block, for example, then P could suballocate into /20 blocks. At the top
level, routing to P would likely be based on the first 10 bits, while routing within P would be based on the first 20 bits.
CIDR was formally introduced by RFC 1518 [https://tools.ietf.org/html/rfc1518.html] and RFC 1519
[https://tools.ietf.org/html/rfc1519.html]. For a while there were strategies in place to support compatibility with non-CIDR-aware
routers; these are now obsolete. In particular, it is no longer appropriate for large-scale routers to fall back on the Class A/B/C
mechanism in the absence of CIDR information; if the latter is missing, the routing should fail.
One way to look at the basic strategy of CIDR is as a mechanism to consolidate multiple network blocks going to the same
destination into a single entry. Suppose a router has four class C’s all to the same destination:

200.7.0.0/24 ⟶ foo
200.7.1.0/24 ⟶ foo
200.7.2.0/24 ⟶ foo
200.7.3.0/24 ⟶ foo
The router can replace all these with the single entry

200.7.0.0/22 ⟶ foo
It does not matter here if foo represents a single ultimate destination or if it represents four sites that just happen to be routed to the
same next_hop.
It is worth looking closely at the arithmetic to see why the single entry uses /22. This means that the first 22 bits must match
200.7.0.0; this is all of the first and second bytes and the first six bits of the third byte. Let us look at the third byte of the network
addresses above in binary:

200.7.000000 00.0/24 ⟶ foo
200.7.000000 01.0/24 ⟶ foo
200.7.000000 10.0/24 ⟶ foo
200.7.000000 11.0/24 ⟶ foo

The /24 means that the network addresses stop at the end of the third byte. The four entries above cover every possible combination
of the last two bits of the third byte; for an address to match one of the entries above it suffices to begin with 200.7 and then to
have 0-bits as the first six bits of the third byte. This is another way of saying the address must match 200.7.0.0/22.
Most implementations actually use a bitmask, eg 255.255.252.0, rather than the number 22. Note 252 is, in binary, 1111 1100, with
6 leading 1-bits, so 255.255.252.0 has 8+8+6=22 1-bits followed by 10 0-bits.
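As a sanity check of this arithmetic, Python’s ipaddress module can perform the consolidation directly; the snippet below is illustrative only.

from ipaddress import ip_network, collapse_addresses

blocks = [ip_network('200.7.{}.0/24'.format(i)) for i in range(4)]
print(list(collapse_addresses(blocks)))      # [IPv4Network('200.7.0.0/22')]
print(ip_network('200.7.0.0/22').netmask)    # 255.255.252.0, the 22-bit mask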
The IP delivery algorithm of 7.5 The Classless IP Delivery Algorithm still works with CIDR, with the understanding that the
router’s forwarding table can now have a network-prefix length associated with any entry. Given a destination D, we search the
forwarding table for network-prefix destinations B/k until we find a match; that is, equality of the first k bits. In terms of masks,
given a destination D and a list of table entries ⟨prefix,mask⟩ = ⟨B[i],M[i]⟩, we search for i such that (D & M[i]) = B[i].
But what about the possibility of multiple matches? For subnets, avoiding this was the responsibility of the subnetting site, but
responsibility for avoiding this with CIDR is much too distributed to be declared illegal by IETF mandate. Instead, CIDR
introduced the longest-match rule: if destination D matches both B1/k1 and B2/k2, with k1 < k2, then the longer match B2/k2
is to be used. (Note that if D matches two distinct entries B1/k1 and B2/k2 then either k1 < k2 or k2 < k1.)
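Here is a minimal sketch of longest-match lookup using Python’s ipaddress module; the more-specific /24 entry in the table is a hypothetical addition to the 200.7.0.0/22 example above.

from ipaddress import ip_address, ip_network

table = [(ip_network('200.7.0.0/22'), 'foo'),
         (ip_network('200.7.2.0/24'), 'bar')]   # hypothetical more-specific route

def lookup(dest):
    matches = [(net, hop) for net, hop in table if dest in net]
    # among all matching prefixes, use the one with the largest prefix length
    return max(matches, key=lambda m: m[0].prefixlen)[1] if matches else None

print(lookup(ip_address('200.7.2.9')))       # bar: the /24 wins over the /22
print(lookup(ip_address('200.7.1.9')))       # foo: only the /22 matches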


10.2: Hierarchical Routing
Strictly speaking, CIDR is simply a mechanism for routing to IP address blocks of any prefix length; that is, for setting the
network/host division point to an arbitrary place within the 32-bit IP address.
However, by making this network/host division point variable, CIDR introduced support for routing on different prefix lengths at
different places in the backbone routing infrastructure. For example, top-level routers might route on /8 or /9 prefixes, while
intermediate routers might route based on prefixes of length 14. This feature of routing on fewer bits at one point in the Internet and
more bits at another point is exactly what is meant by hierarchical routing.
We earlier saw hierarchical routing in the context of subnets: traffic might first be routed to a class-B site 147.126.0.0/16, and then,
within that site, to subnets such as 147.126.1.0/24, 147.126.2.0/24, etc. But with CIDR the hierarchy can be much more flexible:
the top level of the hierarchy can be much larger than the “customer” level, lower levels need not be administratively controlled by
the higher levels (as is the case with subnets), and more than two levels can be used.
CIDR is an address-block-allocation mechanism; it does not directly speak to the kinds of policy we might wish to implement with
it. Here are four possible applications; the latter two involve hierarchical routing:
Application 1 (legacy): CIDR allows the allocation of multiple blocks of Class C, or fragments of a Class A, to a single
customer, so as to require only a single forwarding-table entry for that customer
Application 2 (legacy): CIDR allows opportunistic aggregation of routes: a router that sees the four 200.7.x.0/24 routes above
in its table may consolidate them into a single entry.
Application 3 (current): CIDR allows huge provider blocks, with suballocation by the provider. This is known as provider-
based routing.
Application 4 (hypothetical): CIDR allows huge regional blocks, with suballocation within the region, somewhat like the
original scheme for US phone numbers with area codes. This is known as geographical routing.
Each of these has the potential to achieve a considerable reduction in the size of the backbone forwarding tables, which is arguably
the most important goal here. Each involves using CIDR to support the creation of arbitrary-sized address blocks and then routing
to them as a single unit. For example, the Internet backbone might be much happier if all its routers simply had to maintain a single
entry ⟨200.0.0.0/8, R1⟩, versus 256 entries ⟨200.x.0.0/16, R1⟩ for every value of x. (As we will see below, this is still useful even if
a few of the x’s have a different next_hop.) Secondary CIDR goals include bringing some order to IP address allocation and (for
the last two items in the list above) enabling a routing hierarchy that mirrors the actual flow of most traffic.
Hierarchical routing does introduce one new wrinkle: the routes chosen may no longer be globally optimal, at least if we also apply
the routing-update algorithms hierarchically. Suppose, for example, at the top level forwarding is based on the first eight bits of the
address, and all traffic to 200.0.0.0/8 is routed to router R1. At the second level, R1 then routes traffic (hierarchically) to
200.20.0.0/16 via R2. A packet sent to 200.20.1.2 by an independent router R3 might therefore pass through R1, even if there were
a lower-cost path R3→R4→R2 that bypassed R1. The top-level forwarding entry ⟨200.0.0.0/8,R1⟩, in other words, may represent a
simplification of the real situation. Prohibiting “back-door” routes like R3→R4→R2 is impractical (and would not be helpful
either); customers are independent entities.
This non-optimal routing issue cannot happen if all routers agree upon one of the shortest-path mechanisms of 9 Routing-Update
Algorithms; in that case R3 would learn of the lower-cost R3→R4→R2 path. But then the potential hierarchical benefits of
decreasing the size of forwarding tables would be lost. More seriously, complete global agreement of all routers on one common
update protocol is simply not practical; in fact, one of the goals of hierarchical routing is to provide a workable alternative. We will
return to this below in 10.4.3 Hierarchical Routing via Providers.


10.3: Legacy Routing
Back in the days of NSFNet, the Internet backbone was a single routing domain. While most customers did not connect directly to
the backbone, the intervening providers tended to be relatively compact, geographically – that is, regional – and often had a single
primary routing-exchange point with the backbone. IP addresses were allocated to subscribers directly by the IANA, and the
backbone forwarding tables contained entries for every site, even the Class C’s.
Because the NSFNet backbone and the regional providers did not necessarily share link-cost information, routes were even at this
early point not necessarily globally optimal; compromises and approximations were made. However, in the NSFNet model routers
generally did find a reasonable approximation to the shortest path to each site referenced by the backbone tables. While the legacy
backbone routing domain was not all-encompassing, if there were differences between two routes, at least the backbone portions –
the longest components – would be identical.


10.4: Provider-Based Routing
In provider-based routing, large CIDR blocks are allocated to large-scale providers. The different providers each know how to route
to one another. Subscribers (usually) obtain their IP addresses from within their providers’ blocks; thus, traffic from the outside is
routed first to the provider, and then, within the provider’s routing domain, to the subscriber. We may even have a hierarchy of
providers, so packets would be routed first to the large-scale provider, and eventually to the local provider. There may no longer be
a central backbone; instead, multiple providers may each build parallel transcontinental networks.
Here is a simpler example, in which providers have unique paths to one another. Suppose we have providers P0, P1 and P2, with
customers as follows:
P0: customers A,B,C
P1: customers D,E
P2: customers F,G
We will also assume that each provider has an IP address block as follows:
P0: 200.0.0.0/8
P1: 201.0.0.0/8
P2: 202.0.0.0/8
Let us now allocate addresses to the customers:

A: 200.0.0.0/16
B: 200.1.0.0/16
C: 200.2.16.0/20 (16 = 0001 0000)
D: 201.0.0.0/16
E: 201.1.0.0/16
F: 202.0.0.0/16
G: 202.1.0.0/16
The routing model is that packets are first routed to the appropriate provider, and then to the customer. While this model may not in
general guarantee the shortest end-to-end path, it does in this case because each provider has a single point of interconnection to the
others. Here is the network diagram:
[Figure: providers1.svg]

With this diagram, P0’s forwarding table looks something like this:

P0
destination     next_hop
200.0.0.0/16    A
200.1.0.0/16    B
200.2.16.0/20   C
201.0.0.0/8     P1
202.0.0.0/8     P2

That is, P0’s table consists of
one entry for each of P0’s own customers
one entry for each other provider
If we had 1,000,000 customers divided equally among 100 providers, then each provider’s table would have only 10,099 entries:
10,000 for its own customers and 99 for the other providers. Without CIDR, each provider’s forwarding table would have
1,000,000 entries.

CIDR enables hierarchical routing by allowing the routing decision to be made on different prefix lengths in different contexts. For
example, when a packet is sent from D to A, P1 looks at the first 8 bits while P0 looks at the first 16 bits. Within customer A,
routing might be made based on the first 24 bits.
Even if we have some additional “secondary” links, that is, additional links that do not create alternative paths between providers,
the routing remains relatively straightforward. Shown here are the private customer-to-customer links C–D and E–F; these are
likely used only by the customers they connect. Two customers, A and E, are multihomed; that is, they have connections to
alternative providers: A–P1 and E–P2. (The term “multihomed” is often applied to any host with multiple network interfaces on
different LANs, which includes any router; here we mean more specifically that there are multiple network interfaces connecting to
different providers.)
Typically, though, while A and E may use their alternative-provider links all they want for outbound traffic, their respective
inbound traffic would still go through their primary providers P0 and P1 respectively.
[Figure: providers2.svg]

10.4.1 Internet Exchange Points


The long links joining providers in these diagrams are somewhat misleading; providers do not always like sharing long links and
the attendant problems of sharing responsibility for failures. Instead, providers often connect to one another at Internet eXchange
Points or IXPs; the link P0──────P1 might actually be P0───IXP───P1, where P0 owns the left-hand link and P1 the right-
hand. IXPs can either be third-party sites open to all providers, or private exchange points. The term “Metropolitan Area
Exchange”, or MAE, appears in the names of the IXPs MAE-East, originally near Washington DC, and MAE-West, originally in
San Jose, California; each of these is now actually a set of IXPs. MAE in this context is now a trademark.

10.4.2 CIDR and Staying Out of Jail


Suppose we want to change providers. One way we can do this is to accept a new IP-address block from the new provider, and
change all our IP addresses. The paper Renumbering: Threat or Menace [LKCT96] was frequently cited – at least in the early days
of CIDR – as an intimation that such renumbering was inevitably a Bad Thing. In principle, therefore, we would like to allow at
least the option of keeping our IP address allocation while changing providers.
An address-allocation standard that did not allow changing of providers might even be a violation of the US Sherman Antitrust Act;
see American Society of Mechanical Engineers v Hydrolevel Corporation, 456 US 556 (1982). The IETF thus had the added
incentive of wanting to stay out of jail, when writing the CIDR standard so as to allow portability between providers (actually,
antitrust violations usually involve civil penalties).
The CIDR longest-match rule turns out to be exactly what we (and the IETF) need. Suppose, in the diagrams above, that customer
C wants to move from P0 to P1, and does not want to renumber. What routing changes need to be made? One solution is for P0 to
add a route ⟨200.2.16.0/20, P1⟩ that routes all of C’s traffic to P1; P1 will then forward that traffic on to C. P1’s table will be as
follows, and P1 will use the longest-match rule to distinguish traffic for its new customer C from traffic bound for P0.

P1
destination     next_hop
200.0.0.0/8     P0
202.0.0.0/8     P2
201.0.0.0/16    D
201.1.0.0/16    E
200.2.16.0/20   C

This does work, but all C’s inbound traffic except for that originating in P1 will now be routed through C’s ex-provider P0, which
as an ex-provider may not be on the best of terms with C. Also, the routing is inefficient: C’s traffic from P2 is routed P2→P0→P1
instead of the more direct P2→P1.

A better solution is for all providers other than P1 to add the route ⟨200.2.16.0/20, P1⟩. While traffic to 200.0.0.0/8 otherwise goes
to P0, this particular sub-block is instead routed by each provider to P1. The important case here is P2, as a stand-in for all other
providers and their routers: P2 routes 200.0.0.0/8 traffic to P0 except for the block 200.2.16.0/20, which goes to P1.
Having every other provider in the world need to add an entry for C has the potential to cost some money, and, one way or another,
C will be the one to pay. But at least there is a choice: C can consent to renumbering (which is not difficult if they have been
diligent in using DHCP and perhaps NAT too), or they can pay to keep their old address block.
As for the second diagram above, with the various private links (shown as dashed lines), it is likely that the longest-match rule is
not needed for these links to work. A’s “private” link to P1 might only mean that
A can send outbound traffic via P1
P1 forwards A’s traffic to A via the private link
P2, in other words, is still free to route to A via P0. P1 may not advertise its route to A to anyone else.

10.4.3 Hierarchical Routing via Providers


With provider-based routing, the route taken may no longer be end-to-end optimal; we have replaced the problem of finding an
optimal route from A to B with the two problems of finding an optimal route from A to B’s provider P, and then from P’s entry
point to B. This strategy mirrors the two-stage hierarchical routing process of first routing on the address bits that identify the
provider, and then routing on the address bits including the subscriber portion.
This two-stage strategy may not yield the same result as finding the globally optimal route. The result will be the same if B’s
customers can only be reached through P’s single entry-point router RP, which models the situation that P and its customers look
like a single site. However, either or both of the following can disrupt this model:
There may be multiple entry-point routers into provider P’s network, eg RP1, RP2 and RP3, with different costs from A.
P’s customer B may have an alternative connection to the outside world via a different provider, as in the second diagram in
10.4 Provider-Based Routing.
Consider the following example representing the first situation (the more important one in practice), in which providers P1 and P2
have three interconnection points IX1, IX2, IX3 (from Internet eXchange, 10.4.1 Internet Exchange Points). Links are labeled with
costs; we assume that P1’s costs happen to be comparable with P2’s costs.
[Figure three_ixp1.svg: providers P1 and P2 with interconnection points IX1, IX2 and IX3; links labeled with costs]
The globally shortest path between A and B is via the R2–IX2–S2 crossover, with total length 5+1+0+4=10. However, traffic from
A to B will be routed by P1 to its closest crossover to P2, namely the R3–IX3–S3 link. The total path is 3+0+1+8+4=16. Traffic
from B to A will be routed by P2 via the R1–IX1–S1 crossover, for a length of 3+0+1+7+5=16. This routing strategy is sometimes
called hot-potato routing; each provider tries to get rid of any traffic (the potatoes) as quickly as possible, by routing to the closest
exit point.
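A hot-potato decision reduces to choosing the exit with the least internal cost. The sketch below illustrates this; the distances from A to the three exchange points are inferred from the path sums above (5+1 for IX2, 3+0 for IX3, 5+7+1 for IX1), so treat them as illustrative rather than definitive.

    # P1's internal cost from A to each exchange point (inferred from the
    # text's path sums; illustrative values only).
    cost_from_A = {"IX1": 13, "IX2": 6, "IX3": 3}

    # Hot-potato routing: hand traffic off at the nearest exit, ignoring
    # whatever it will cost P2 to carry it the rest of the way.
    exit_point = min(cost_from_A, key=cost_from_A.get)
    print(exit_point)   # IX3, even though the globally shortest path uses IX2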
Not only are the paths taken inefficient, but the A⟶B and B⟶A paths are now asymmetric. This can be a problem if forward
and reverse timings are critical, or if one of P1 or P2 has significantly more bandwidth or less congestion than the other. In practice,
however, route asymmetry is seldom important.
As for the route inefficiency itself, this also is not necessarily a significant problem; the primary reason routing-update algorithms
focus on the shortest path is to guarantee that all computed paths are loop-free. As long as each half of a path is loop-free, and the
halves do not intersect except at their common midpoint, these paths too will be loop-free.
The BGP “MED” value (10.6.6.3 MULTI_EXIT_DISC) offers an optional mechanism for P1 to agree that A⟶B traffic should
take the R1–S1 crossover. This might be desired if P1’s network were “better” and customer A were willing to pay extra to keep its
traffic within P1’s network as long as possible.

10.4.4 IP Geolocation
In principle, provider-based addressing may mean that consecutive IP addresses are scattered all over a continent. In practice,
providers (even many mobile providers) do not do this; any given small address block – perhaps /24 – is used in a limited
geographical area. Different blocks are used in different areas. A consequence of this is that it is possible in principle to determine,
from a given IP address, the corresponding approximate geographical location; this is known as IP geolocation. Even satellite-
Internet users can be geolocated, although sometimes only to within a couple hundred miles. Several companies have created
detailed geolocation maps, identifying many locations roughly down to the zip code [en.Wikipedia.org/wiki/ZIP_Code], and
typically available as an online service.
IP geolocation was originally developed so that advertisers could serve up regionally appropriate advertisements. It is, however,
now used for a variety of purposes including identification of the closest CDN edge server (1.12.2 Content-Distribution Networks),
network security, compliance with national regulations, higher-level user tracking, and restricting the streaming of copyrighted
content.

10.5: Geographical Routing
The classical alternative to provider-based routing is geographical routing; the archetypal model for this is the telephone area code
system. A call from anywhere in the US to Loyola University’s main switchboard, 773-274-3000, would traditionally be routed
first to the 773 area code in Chicago. From there the call would be routed to the north-side 274 exchange, and from there to
subscriber 3000. A similar strategy can be used for IP routing.
Geographical addressing has some advantages. Figuring out a good route to a destination is usually straightforward, and close to
optimal in terms of the path physical distance. Changing providers never involves renumbering (though moving may). And
approximate IP address geolocation (determining a host’s location from its IP address) is automatic.
Geographical routing has some minor technical problems. First, routing may be inefficient between immediate neighbors A and B
that happen to be split by a boundary for larger geographical areas; the path might go from A to the center of A’s region to the
center of B’s region and then to B. Another problem is that some larger sites (eg large corporations) are themselves geographically
distributed; if efficiency is the goal, each office of such a site would need a separate IP address block appropriate for its physical
location.
But the real issue with geographical routing is apparently the business question of who carries the traffic. The provider-based
model has a very natural answer to this: every link is owned by a specific provider. For geographical IP routing, my local provider
might know at once from the prefix that a packet of mine is to be delivered from Chicago to San Francisco, but who will carry it
there? My provider might have to enter into different traffic contracts for multiple different regions. If different local providers
make different arrangements for long-haul packet delivery, the routing efficiency (at least in terms of table size) of geographical
routing is likely lost. Finally, there is no natural answer for who should own those long inter-region links. It may be useful to recall
that the present area-code system was created when the US telephone system was an AT&T monopoly, and the question of who
carried traffic did not exist.
That said, the five Regional Internet Registries represent geographical regions (usually continents), and provider-based
addressing is below that level. That is, the IANA handed out address blocks to the geographical RIRs, and the RIRs then allocated
address blocks to providers.
At the intercontinental level, geography does matter: some physical link paths are genuinely more expensive than other (shorter)
paths. It is much easier to string terrestrial cable than undersea cable. However, within a continent physical distance does not
always matter as much as might be supposed. Furthermore, a large geographically spread-out provider can always divide up its
address blocks by region, allowing internal geographical routing to the correct region.

Here is a diagram of IP address allocation as of 2006: https://xkcd.com/195.

10.6: Border Gateway Protocol, BGP
In 9 Routing-Update Algorithms, we considered interior routing-update protocols: those in which all the routers involved are
under common management. That management can then dictate the routing-update protocol to be used, and also the rules for
assigning per-link costs. For both Distance-Vector and Link State methods, the per-link cost played an essential role: by trying to
minimize the cost, we were assured that no routing loops would be present in a stable network (9.3 Observations on Minimizing
Route Cost).
But now consider the problem of exterior routing; that is, of choosing among routes that pass through independent organizations.
In the diagram below, suppose that A, B, C and D are each managed independently; it may be useful to think of A, B and C as three
ISPs and D as some destination.
[Figure diamond.svg: A with two routes to destination D, one via B and one via C]
Organization (or ISP) A has two routes to destination D – one via B and one via C – and must choose between them.
If A wanted to use one of the interior routing-update protocols to choose its path to D, it would face several purely technical
problems. First, what if B uses distance-vector while C speaks only in link-state LSP messages? Second, what if B measures its
path costs using the hopcount metric, while C assigns costs based on bandwidth, or congestion, or pecuniary considerations?
The mixing of unrelated metrics isn’t necessarily useless: all that is required for the shortest-path-is-loop-free result mentioned
above is that the two ends of each link agree on the cost assigned to that link. But apples-and-oranges comparison of different
metrics would completely undermine the intended use of those metrics to influence the selection of which links should carry the
most traffic. Sharing link-cost information without a common administrative policy to set those costs does not, in practical terms,
make sense.
But A also faces a larger issue: to reach D it must rely on having its traffic carried by an outsider – either B or C. Outsiders are
likely not inclined to offer this service without some form of compensation, either monetary or through reciprocal exchange. If A
reaches an understanding with B on this matter of traffic carriage, then A does not want its traffic routed via C even if that latter
route is of lower technical cost. If A is paying B, it is going to expect to use B. If A is not paying C, C is going to expect that A not
use C.
The Border Gateway Protocol, or BGP, is assigned the job of handling exterior routing; that is, of handling exchange of routing
information between neighboring independent organizations. The current version is BGP-4, documented in RFC 4271
[https://tools.ietf.org/html/rfc4271].
BGP’s primary goal is to provide support for what are sometimes called routing policies; that is, for choosing routes based on
managerial or administrative input. We address this below in 10.6.4 BGP Filtering and Routing Policies. (Routing policies have
nothing to do with the policy-based routing described in 9.6 Routing on Other Attributes, in which different packets with the same
destination address may be routed differently because a site has a “policy” to take packet attributes other than destination into
account. With BGP, once a site’s policies to choose a route to a given destination are applied, all traffic to that destination takes that
single route.)
Ultimately, the administrative input used by BGP very likely relates to who is paying what for the traffic carried. It is also possible,
though less common, to use BGP to implement other preferences, such as for domestic traffic to remain within national boundaries.
The BGP term for a routing domain under coordinated administration, and using one consistent interior protocol and link-cost
metric throughout, is Autonomous System, or AS. That said, all that is strictly required is that all BGP routers within an AS have
the same consistent view of routing, and in fact some Autonomous Systems do run multiple routing protocols and may even use
different metrics at different points. As indicated above, BGP does not support the exchange of link-cost information between
Autonomous Systems. Autonomous Systems are identified by a globally unique AS number, originally 16 bits but now extended
to 32 bits.
A site needs to run BGP (and so needs to have an AS number) if it connects to (or might connect to) more than one other AS; sites
that connect only to a single ISP do not need BGP. Every site running BGP will have one or more BGP speakers: the devices that
run BGP. If there is more than one, they must remain coordinated with one another so as to present a consistent view of the site’s
connections and advertisements; this coordination process is sometimes called internal BGP to distinguish it from the
communication with neighboring Autonomous Systems. The latter process is then known as external BGP.

The BGP speakers of a site are often not the busy border routers that connect directly to the neighboring AS, though they are
usually located near them and are often on the same subnet. Each interconnection point with a neighboring AS generally needs its
own BGP speaker. Connections between BGP speakers of neighboring Autonomous Systems – sometimes called BGP peers – are
generally configured administratively; they are not subject to a “neighbor discovery” process like that used by most interior routers.
The BGP speakers must maintain a database of all routes received, not just of the routes actually used. However, the speakers
forward to their neighbors only routes they (and thus their AS) actually use themselves; this is a firm BGP rule.
Many BGP implementations support Equal-Cost Multi-Path routing (9.7 ECMP), by which two (or more) links to the same
neighbor may be treated as one. The Internet Draft draft-lapukhov-bgp-ecmp-considerations-01
[https://tools.ietf.org/html/draft-lapukhov-bgp-ecmp-considerations-01] addresses this further.

10.6.1 AS-paths
At its most basic level, BGP involves the exchange of lists of reachable destinations, like distance-vector routing without the
distance information. But that strategy, alone, cannot detect routing loops. BGP solves the loop problem by having routers
exchange, not just destination information, but also the entire path used to reach each destination. Paths including each router
would be cumbersome; instead, BGP abbreviates the path to the list of AS’s traversed. This is called the AS-path. This allows
routers to make sure their routes do not traverse any AS more than once, and thus do not have loops.
As an example of this, consider the network below, in which we consider Autonomous Systems also to be destinations. Initially, we
will assume that each AS discovers its immediate neighbors. AS3 and AS5 will then each advertise to AS4 their routes to AS2, but
AS4 will have no reason at this level to prefer one route to the other (BGP does use the shortest AS-path as part of its tie-breaking
rule, but, before falling back on that rule, AS4 is likely to have a commercial preference for which of AS3 and AS5 it uses to reach
AS2).
[Figure five_ASs.svg: five interconnected Autonomous Systems AS1–AS5]
Also, AS2 will advertise to AS3 its route to reach AS1; that advertisement will contain the AS-path ⟨AS2,AS1⟩. Similarly, AS3
will advertise this route to AS4 and then AS4 will advertise it to AS5.
When AS5 in turn advertises this AS1-route to AS2, it has the potential to create a loop. It does not, however, because it will
include the entire AS-path ⟨AS5,AS4,AS3,AS2,AS1⟩ in the advertisement it sends to AS2. AS2 will know not to use this route
because it will see that it is a member of the AS-path. Thus, BGP is spared the kind of slow-convergence problem that traditional
distance-vector approaches were subject to.
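The loop check itself is simple; here is a sketch, with AS-paths represented as plain lists of AS numbers:

    MY_AS = 2   # pretend we are AS2

    def loop_free(as_path):
        """Accept a route only if our own AS does not already appear in its AS-path."""
        return MY_AS not in as_path

    print(loop_free([5, 4, 3, 2, 1]))   # False: AS2 sees itself and rejects the route
    print(loop_free([3, 1]))            # True: this route is safe to consider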
It is theoretically possible that the shortest path (in the sense, say, of the hopcount metric) from one host to another traverses some
AS twice. If so, BGP will not allow this route.
AS-paths potentially add considerably to the size of the AS database. The number of paths a site must keep track of is proportional
to the number of AS’s, because there will be one AS-path to each destination AS. (Actually, an AS may have to record many times
that many AS-paths, as an AS may hear of AS-paths that it elects not to use.) As of 2019 there were about 80 thousand AS’s in the
world. Let A be the number of AS’s, and let N be the number of destination networks (prefixes). Typically the average length of an
AS-path is about log(A), although this depends on connectivity; in 2019 this average length was about six. The amount of memory
required by BGP is

C×A×log(A) + K×N,

where C and K are constants.
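As a back-of-envelope illustration, here is the formula evaluated with the 2019 figures used in this chapter (A = 80,000 AS’s, average AS-path length about six, and N = 788,000 prefixes, as in 10.6.5 BGP Table Size); the constants C and K below are hypothetical per-entry byte counts, chosen only to show the order of magnitude.

    # Rough evaluation of C*A*log(A) + K*N; C and K are hypothetical constants.
    A = 80_000          # number of AS's (2019)
    N = 788_000         # number of prefixes (2019)
    avg_path_len = 6    # roughly log(A)
    C, K = 4, 64        # illustrative bytes per AS-path entry / per prefix

    print(C * A * avg_path_len + K * N)   # about 52 million: tens of megabytes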
The other major goal of BGP is to allow administrative input to what, for interior routing, is largely a technical calculation
(though an interior-routing administrator can set link costs). BGP is the interface between ISPs (and between ISPs and their larger
customers), and can be used to implement contractual agreements made regarding which ISPs will carry other ISPs’ traffic. If ISP2
tells ISP1 it has a route to destination D, but ISP1 chooses not to send traffic to ISP2, BGP can be used to implement this. Perhaps
more likely, if ISP2 has a route to D but does not want ISP1 to use it until they pay for the privilege, BGP can be used to implement
this as well.
Despite the exchange of AS-path information, temporary routing loops may still exist. This is because BGP may first decide to use
a route and only then export the new AS-path; the AS on the other side may realize there is a problem as soon as the AS-path is
received but by then the loop will have at least briefly been in existence. See the first example below in 10.6.10 Examples of BGP
Instability.
BGP’s predecessor was EGP, which guaranteed loop-free routes by allowing only a single route to any AS, thus forcing the Internet
into a tree topology, at least at the level of Autonomous Systems. The AS graph could contain no cycles or alternative routes, and
hence there could be no redundancy provided by alternative paths. EGP also thus avoided having to make decisions as to the
preferred path; there was never more than one choice. EGP was sometimes described as a reachability protocol; its only concern
was whether a given network was reachable.

10.6.2 AS-Paths and Route Aggregation


There is some conflict between the goal of reporting precise AS-paths to each destination, and of consolidating as many address
prefixes as possible into a single prefix (single CIDR block). Consider the following network:
[Figure four_ASs.svg: AS1 and AS2, with AS3 and AS4 behind AS2]
Suppose AS2 has paths

path=⟨AS2⟩, destination 200.0.0/23
path=⟨AS2,AS3⟩, destination 200.0.2/24
path=⟨AS2,AS4⟩, destination 200.0.3/24
If AS2 wants to optimize address-block aggregation using CIDR, it may prefer to aggregate the three destinations into the single
block 200.0.0/22. In this case there would be two options for how AS2 reports its routes to AS1:
Option 1: report 200.0.0/22 with path ⟨AS2⟩. But this ignores the AS’s AS3 and AS4! These are legitimately part of the AS-
paths to some of the destinations within the block 200.0.0/22; loop detection could conceivably now fail.
Option 2: report 200.0.0/22 with path ⟨AS2,AS3,AS4⟩, which is not a real path but which does include all the AS’s involved.
This ensures that the loop-detection algorithm works, but artificially inflates the length of the AS-path, which is used for certain
tie-breaking decisions.
As neither of these options is ideal, the concept of the AS-set was introduced. A list of Autonomous Systems traversed in order
now becomes an AS-sequence. In the example above, AS2 can thus report net 200.0.0/22 with
AS-sequence=⟨AS2⟩
AS-set={AS3,AS4}
AS2 thus both achieves the desired aggregation and also accurately reports the AS-path length.
The AS-path can in general be an arbitrary list of AS-sequence and AS-set parts, but in cases of simple aggregation such as the
example here, there will be one AS-sequence followed by one AS-set.
RFC 6472 [https://tools.ietf.org/html/rfc6472] now recommends against using AS-sets entirely, and recommends that
aggregation as above be avoided. One consequence of this recommendation is that every IP-address prefix announced by any
public Autonomous System will result in a corresponding entry in the forwarding tables of the backbone routers.

10.6.3 Transit Traffic


It is helpful to distinguish between two kinds of traffic, as seen from a given AS. Local traffic is traffic that either originates or
terminates at that AS; this is traffic that “belongs” to that AS. At leaf sites (that is, sites that connect only to their ISP and not to
other sites), all traffic is local.
The other kind of traffic is transit traffic; the AS is forwarding it along on behalf of some nonlocal party. For ISPs, most traffic is
transit traffic. A large almost-leaf site might also carry a small amount of transit traffic for one particular related (but autonomous!)
organization.
The decision as to whether to carry transit traffic is a classic example of an administrative choice, implemented by BGP’s support
for routing policies. Most real-world BGP configuration issues relate to the carriage (or non-carriage) of transit traffic.

10.6.4 BGP Filtering and Routing Policies
As stated above, one of the goals of BGP is to support routing policies; that is, routing based on managerial or administrative
concerns in addition to technical ones. A BGP speaker may be aware of multiple routes to a destination. To choose the one route
that we will use, it may combine a mixture of optimization rules and policy rules. Some examples of policy rules might be:
do not use AS13 as we have an adversarial relationship with them
do not allow transit traffic
BGP implements policy through filtering rules – that is, rules that allow rejection of certain routes – at three different stages:

1. Import filtering is applied to the lists of routes a BGP speaker receives from its neighbors.
2. Best-path selection is then applied as that BGP speaker chooses which of the routes accepted by the first step it will actually use.
3. Export filtering is done to decide what routes from the previous step a BGP speaker will actually advertise. A BGP speaker can only advertise paths it uses, but does not have to advertise every such path.
While there are standard default rules for all these (accept everything imported, use simple tie-breakers, export everything), a site
will usually implement at least some policy rules through this filtering process (eg “prefer routes through the ISP we have a
contract with”).
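A schematic sketch of the three stages follows; the specific rules are hypothetical, echoing the policy examples above (“do not use AS13”, “prefer routes through the ISP we have a contract with”).

    received = [
        {"dest": "D", "as_path": [13, 7], "neighbor": "other-ISP"},
        {"dest": "D", "as_path": [9, 7],  "neighbor": "contracted-ISP"},
    ]

    def import_ok(route):                    # stage 1: import filtering
        return 13 not in route["as_path"]    # "do not use AS13"

    def local_pref(route):                   # stage 2: best-path selection
        return 110 if route["neighbor"] == "contracted-ISP" else 100

    def export_ok(route):                    # stage 3: export filtering
        return True                          # a no-transit site would return False
                                             # here for non-local destinations

    candidates = [r for r in received if import_ok(r)]
    best = max(candidates, key=local_pref)
    advertised = [best] if export_ok(best) else []
    print(best["as_path"])                   # [9, 7]: the AS13 route never got past stage 1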
As an example of import filtering, a site might elect to ignore all routes from a particular neighbor, or to ignore all routes whose
AS-path contains a particular AS, or to ignore temporarily all routes from a neighbor that has demonstrated too much recent “route
instability” (that is, rapidly changing routes). Import filtering can also be done in the best-path-selection stage, by having the best-
path-selection process ignore routes from selected neighbors.
BGP Breakdowns
In the real world, it sometimes happens that a small regional ISP is misconfigured to attempt to report to some high-level provider
that it can reach, say, every site in the world. Export filtering on the part of the small ISP and best-path selection and import
filtering on the part of the large ISP usually – though not always – catches this. Occasionally, such incidents represent malicious
BGP hijacking.
The next stage is best-path selection, to pick the preferred routes from among all those just imported. The first step is to eliminate
AS-paths with loops. Even if the neighbors have been diligent in not advertising paths with loops, an AS will still need to reject
routes that contain itself in the associated AS-path.
The next step in the best-path-selection stage, generally the most important in BGP configuration, is to assign a local_preference,
or weight, to each route received. An AS may have policies that add a certain amount to the local_preference for routes that use a
certain AS, etc. Very commonly, larger sites will have preferences based on contractual arrangements with particular neighbors.
Provider AS’s, for example, will in general prefer routes learned from their customers, as these are “cheaper”. A smaller ISP that
connects to two larger ones might be paying to route the majority of its outbound traffic through a particular one of the two; its
local_preference values will then implement this choice. After BGP calculates the local_preference value for every route, the routes
with the best local_preference are then selected.
Domains are free to choose their local_preference rules however they wish. In principle this can involve rather strange criteria; for
example, in 10.6.10 Examples of BGP Instability we will consider an example where AS1 prefers routes with AS-path ⟨AS3,AS2⟩
to the strictly shorter path ⟨AS2⟩. That example, however, demonstrates instability; domains are encouraged to set their rules in
accordance with some standard principles, below, to avoid this.
Local_preference values are communicated internally via the LOCAL_PREF path attribute, below. They are not shared with other
Autonomous Systems.
In the event of ties – two routes to the same destination with the same local_preference – a first tie-breaker rule is to prefer the
route with the shorter AS-path. While this superficially resembles a shortest-path algorithm, the real work should have been done in
administratively assigning local_preference values. The shorter-AS-path tie-breaker is perhaps best thought of as similar in spirit to
the smaller-AS-number tie-breaker (although the sometimes-significant Multi-Exit-Discriminator tie-breaker, next, comes between
them).
The final significant step of the route-selection phase is to apply the Multi-Exit-Discriminator value, 10.6.6.3
MULTI_EXIT_DISC. A site may very well choose to ignore this value entirely.
Finally we get to the trivial tie-breaker rules, though if a tie-breaker rule assigns significant traffic to one AS over another then it
may have economic consequences and shouldn’t be considered “trivial”. If this situation is detected, it would probably be
addressed in the local-preferences phase. The trivial tie-breakers take into account the internal routing cost, the numeric value of
the AS number, and the numeric value of the neighbor’s IP address.
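The ordering just described can be summarized as a sort key; a sketch, with hypothetical route records and values:

    def selection_key(route):
        # Python compares tuples left-to-right, matching the precedence order above.
        return (
            -route["local_pref"],      # 1. higher local_preference first
            len(route["as_path"]),     # 2. then shorter AS-path
            route["med"],              # 3. then lower MED (if the site compares MEDs)
            route["igp_cost"],         # 4. then lower internal routing cost
            route["neighbor_as"],      # 5. then smaller neighbor AS number
            route["neighbor_ip"],      # 6. then smaller neighbor IP
                                       #    (string compare, as a simplification)
        )

    routes = [
        {"local_pref": 100, "as_path": [3, 9], "med": 10,
         "igp_cost": 2, "neighbor_as": 3, "neighbor_ip": "10.0.0.3"},
        {"local_pref": 110, "as_path": [2, 5, 9], "med": 50,
         "igp_cost": 7, "neighbor_as": 2, "neighbor_ip": "10.0.0.2"},
    ]
    best = min(routes, key=selection_key)
    print(best["neighbor_as"])   # 2: local_preference outranks the shorter AS-path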
After the best-path-selection stage is complete, the BGP speaker has now selected the routes its own Autonomous System will use.
These routes are then communicated to the actual routers, which are often different devices.
The final stage is to decide what rules will be exported to which neighbors. Only routes the BGP speaker has decided it will use –
that is, routes that have made it to this point – can be exported; a site cannot route to destination D through AS1 but export a route
claiming D can be reached through AS2.
It is at the export-filtering stage that an AS can enforce no-transit rules. If it does not wish to carry transit traffic to destination D, it
will not advertise D to any of its AS-neighbors.
The export stage can lead to anomalies. Suppose, for example, that AS1 reaches D and AS5 via AS2, and announces this to AS4.
[Figure AS_loop.svg: AS1 reaching D and AS5 via AS2 (or alternatively AS3), announcing routes to AS4]
Later, we imagine, AS1 switches to reaching D via AS3, but is forbidden by policy to announce to AS4 any routes with AS-path
containing AS3; such a policy is straightforward to implement via export filtering. Then AS1 must simply withdraw the
announcement to AS4 that it could reach D at all, even though the route to D via AS2 is still there.

10.6.5 BGP Table Size


In principle, there is a one-to-one correspondence between IP address prefixes announced by public Autonomous Systems and
entries in the backbone IP forwarding table. (The now-obsolete technique of route aggregation, 10.6.2 AS-Paths and Route
Aggregation, used to create a modest discrepancy here.)
The set of all routes received by a BGP speaker, after import filtering, is sometimes called the Routing Information Base, or RIB.
The resultant forwarding table created after best-path selection is then the Forwarding Information Base, or FIB, although the full
FIB may also contain routes learned via non-BGP protocols. Each FIB entry will also contain the actual next-hop router, versus the
next-AS information actually received via BGP. For simplicity, we will refer to the forwarding table generated from BGP records
only as the BGP FIB.
The size of the IPv4 BGP FIB – that is, the number of distinct prefixes in a backbone IPv4 forwarding table – is plotted in the chart
below, based on data courtesy of bgp.potaroo.net, with some modest smoothing applied.
[Figure: graph of backbone IP forwarding table size vs time, 1990–2019]

The time range is from 1994 to July 2019; at the end, there are 788 thousand IP prefixes from (not shown in the graph) around 65
thousand Autonomous Systems. The graph is flat from 2001 to 2002, reflecting the aftereffects of the so-called dot-com bubble
[en.Wikipedia.org/wiki/Dot-com_bubble]. Overall the increase with time is roughly quadratic, but in the last decade has been closer
to linear.
The graph does not entirely represent growth of the Internet; it also represents fragmentation. In recent years, only smaller address
blocks have been available, and so many sites and providers have cobbled together their Internet presence from multiple such
blocks, where they might have preferred a single block.

10.6.6 BGP Path attributes


BGP supports the inclusion of various path attributes when exchanging routing information. Attributes exchanged with neighbors
can be transitive or non-transitive; the difference is that if a neighbor AS does not recognize a received path attribute then it
should pass it along anyway if it is marked transitive, but not otherwise. Some path attributes are entirely local, that is, internal to
the AS of origin. Other flags are used to indicate whether recognition of a path attribute is required or optional, and whether
recognition can be partial or must be complete.
The AS-path itself is perhaps the most fundamental path attribute. Here are a few other common attributes:

10.6.6.1 NEXT_HOP
This mandatory external attribute allows BGP speaker B1 of AS1 to inform its BGP peer B2 of AS2 what actual router to use to
reach a given destination. If B1, B2 and AS1’s actual border router R1 are all on the same subnet, B1 will include R1’s IP address
as its NEXT_HOP attribute. If B1 is not on the same subnet as B2, it may not know R1’s IP address; in this case it may include its
own IP address as the NEXT_HOP attribute. Routers on AS2’s side will then look up the “immediate next hop” they would use as
the first step to reach B1, and forward traffic there. This should either be R1 or should lead to R1, which will then route the traffic
properly (not necessarily on to B1).
[Figure AS_border.svg: BGP speakers B1 and B2 with border router R1 at the AS1–AS2 boundary]
10.6.6.2 LOCAL_PREF
If one BGP speaker in an AS has been configured with local_preference values, used in the best-path-selection phase above, it uses
the LOCAL_PREF path attribute to share those preferences with all other BGP speakers at a site. In other words, once one BGP
speaker has determined the local_preference value of a given route, the LOCAL_PREF attribute is used to distribute that value
uniformly throughout the AS.

10.6.6.3 MULTI_EXIT_DISC
The Multi-Exit Discriminator, or MED, attribute allows one AS to learn something of the internal structure of another AS, should it
elect to do so. Using the MED information provided by a neighbor has the potential to cause an AS to incur higher costs, as it may
end up carrying traffic for longer distances internally; MED values received from a neighboring AS are therefore only recognized
when there is an explicit administrative decision to do so.
Specifically, if an autonomous system AS1 has multiple links to neighbor AS2, then AS1 can, when advertising an internal
destination D to AS2, have each of its BGP speakers provide associated MED values so that AS2 can know which link AS1 would
prefer that AS2 use to reach D. This allows AS2 to route traffic to D so that it is carried primarily by AS2 rather than by AS1. The
alternative is for AS2 to use only the closest gateway to AS1, which means traffic is likely carried primarily by AS1.
MED values are considered late in the best-path-selection process; in this sense the use of MED values is a tie-breaker when two
routes have the same local_preference.
As an example, consider the following network (from 10.4.3 Hierarchical Routing via Providers, with providers now replaced by
Autonomous Systems); the numeric values on links are their relative costs. We will assume that each site has three BGP speakers
co-located at the exchange points IX1, IX2 and IX3.
[Figure three_ixp2.svg: AS1 and AS2 with BGP speakers at exchange points IX1, IX2 and IX3; links labeled with relative costs]
In the absence of the MED, AS1 will send traffic from A to B via the R3–IX3–S3 link, and AS2 will return the traffic via S1–IX1–
R1. These are the links that are closest to R and S, respectively, representing AS1 and AS2’s desire to hand off the outbound traffic
as quickly as possible.
However, AS1’s BGP speakers at IX1, IX2 and IX3 can provide MED values to AS2 when advertising destination A, indicating a
preference for AS2→AS1 traffic to use the rightmost link:
IX1: destination A has MED 200
IX2: destination A has MED 150
IX3: destination A has MED 100
If this is done, and AS2 abides by this information, then AS2 will route traffic from B to A via IX3; that is, via the exchange point
with the lowest MED value. Note the importance of the fact that AS2 is allowed to ignore the MED; use of it may shift costs from
AS1 to AS2!

The relative order of the MED values for R1 and R2 is irrelevant, unless the IX3 exchange becomes disabled, in which case the
numeric MED values above would mean that AS2 should then prefer IX2 for reaching A.
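In effect, an AS2 that honors these MEDs simply picks the advertised exit with the smallest value; a minimal sketch using the numbers above:

    med_for_A = {"IX1": 200, "IX2": 150, "IX3": 100}

    # AS2 reaches A via the exchange point advertising the lowest MED.
    print(min(med_for_A, key=med_for_A.get))   # IX3

    # If IX3 goes down, the remaining values rank IX2 ahead of IX1,
    # as the text notes.
    del med_for_A["IX3"]
    print(min(med_for_A, key=med_for_A.get))   # IX2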
We cannot use MED values to cause A–B traffic to take the path through IX2; that path has minimal cost only in the global sense,
and the only way to achieve global cost minimization is for the two AS’s to agree to use a common distance metric and a common
metric-based routing algorithm, in effect becoming one AS. While AS1 does provide different numeric MED values for the three
exchange points, they are used only in ranking precedence, not as numeric measures of cost (though they are sometimes derived
from that).
In the example above, importing and using MED values raises AS2’s costs, by causing it to route AS2-to-AS1 traffic so that it
stays for a longer path within AS2’s network. This is, in fact, almost always the case when using MED values. Why, then, would
AS2 agree to this? One simple reason might be that AS2 and AS1 have, together, negotiated this arrangement; perhaps AS1 gives
AS2 a break on interconnection (“peering”) fees in exchange for AS2’s accepting and using AS1’s MED data. It is also possible
that AS2’s use of AS1’s MED data may improve the quality of service AS2 can offer to its customers; we will return to an example
of this in 10.6.7.1 MED values and traffic engineering.
Also in the example above, the MED values are used to decide between multiple routes to the same destination that all pass through
the same AS, namely AS1. Some BGP implementations allow the use of MED values to decide between different routes through
different neighbor AS’s. The different neighbors must all have the same local_preference values. For example, AS2 might connect
to AS3 and AS4 and receive the following BGP information:
AS3: destination A has MED 200
AS4: destination A has MED 100
Assuming AS2 assigns the same local_preference to AS3 and AS4, it might be configured to use these MED values as the tie-
breaker, thus routing traffic to A via AS4, the neighbor advertising the lower MED. On Cisco routers, the always-compare-med
command is used to create this behavior.
MED values are not intended to be used to communicate routing preferences to non-neighboring AS’s.
Additional information on the use of MED values can be found in RFC 4451 [https://tools.ietf.org/html/rfc4451].

10.6.6.4 COMMUNITY
This is simply a tag to attach to routes. Routes can have multiple tags corresponding to membership in multiple communities. Some
communities are defined globally; for example, NO_EXPORT and NO_ADVERTISE. A route marked with one of these two
communities will not be shared further. Other communities may be relevant only to a particular AS.
The importance of communities is that they allow one AS to place some of its routes into specific categories when advertising them
to another AS; the categories must have been created and recognized by the receiving AS. The receiving AS is not obligated to
honor the community memberships, of course, but doing so has the effect of allowing the original AS to “configure itself” without
involving the receiving AS in the process. Communities are often used, for example, by (large) customers of an ISP to request
specific routing treatment.
A customer would have to find out from the provider what communities the provider defines, and what their numeric codes are. At
that point the customer can place itself into the provider’s community at will.
Here are some of the community values once supported by a no-longer-extant ISP that we shall call AS1. The full community value
would have included AS1’s AS-number.

value   action

90      set local_preference used by AS1 to 90
100     set local_preference used by AS1 to 100, the default
105     set local_preference used by AS1 to 105
110     set local_preference used by AS1 to 110
990     the route will not leave AS1’s domain; equivalent to NO_EXPORT
991     route will only be exported to AS1’s other customers
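Here is a sketch of how a provider like AS1 might act on the local_preference communities above when importing a customer’s routes; the route format is hypothetical, and the export-restricting communities 990 and 991 are omitted.

    # Communities 90/100/105/110 set local_preference inside AS1.
    pref_by_community = {90: 90, 100: 100, 105: 105, 110: 110}

    def apply_communities(route):
        for c in route["communities"]:
            if c in pref_by_community:
                route["local_pref"] = pref_by_community[c]
        return route

    r = apply_communities({"dest": "198.51.100.0/24", "communities": [105]})
    print(r["local_pref"])   # 105: the customer has raised its own preference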

10.6.7 BGP and Traffic Engineering
BGP is the mechanism for inter-autonomous-system traffic engineering. The first-line tools are import and export filtering and
best-path selection. For autonomous systems with multiple interconnection points, the Multi-Exit Discriminator above also may
play a large role.
After establishing basic connectivity, perhaps the most important decision a site makes via its BGP configuration is whether or not
it will accept transit traffic. As a first example of this, let us consider the case of configuring a private link, such as the dashed
link1 below between “friendly” but unaffiliated sites A and B (link1 can be either a shared “real” link or a short “jumper” link
within an Internet exchange point):
[Figure linked_sites.svg: sites A and B with providers ISP1 and ISP2, joined by the private link1]
Suppose A exports its link1 route to B to its provider ISP1. Then ISP1 may in turn announce this route to the Internet at large, and
so some or all of B’s inbound traffic may be routed through ISP1 (paid by A) and through A itself. Similarly, B may end up paying
to carry A’s traffic if B exports its link1 route to A to ISP2.
Economically, carrying someone else’s transit traffic is not desirable unless you are compensated for it. The primary issue here is the
use of the ISP1–A link by B and of the ISP2–B link by A; use of the shared link1 might be a secondary issue, depending on the
relative bandwidths and on A and B’s understandings of appropriate uses for link1.
Two common options A and B might agree to regarding link1 are no-transit and backup.
For the no-transit option, A and B simply do not export the route to their respective ISPs at all. This is done via export filtering. If
ISP1 does not know A can reach B, it will not send any of B’s traffic to A.
For the backup option, the intent is that traffic to A will normally arrive via ISP1, but if the ISP1 link is down then A’s traffic will
be allowed to travel through ISP2 and B. To achieve this, A and B can export their link1-route to each other, but arrange for ISP1
and ISP2 respectively to assign this route a low local_preference value. As long as ISP1 hears of a route to B from its upstream
provider, it will reach B that way, and will not advertise the existence of the link1 route to B; ditto ISP2. However, if the ISP2 route
to B fails, then A’s upstream provider will stop advertising any route to B, and so ISP1 will begin to use the link1 route to B and
begin advertising it to the Internet. The link1 route will be the primary route to B until ISP2’s service is restored.
A and B must convince their respective ISPs to assign the link1 route a low local_preference; they cannot mandate this directly.
However, if their ISPs recognize community attributes that, as above, allow customers to influence their local_preference value,
then A and B can use this to create the desired local_preference.
To use the shared link for backup outbound traffic, A and B will need a way to send through one another if their own ISP link is
down. If A detects that its ISP link is down, it can simply change its default route to point to B. One way to automate this is for A
and B to view their default-route path (eg to 0.0.0.0/0) to be a concrete destination within BGP. ISP1 advertises this to A, using
BGP, but so does B, and A has configured its import rules so B’s route to 0.0.0.0/0 has a higher cost. Then A will route to 0.0.0.0/0
through ISP1 – that is, will use ISP1 as its default route – as long as it is available, and will switch to B when it is not.
A and B might also wish to use their shared private link for load balancing, but for this BGP offers limited help. If ISP1 and ISP2
both export routes to A, then A has lost all control over how other sites will prefer one to the other. A may be able to make one path
artificially appear more expensive, perhaps by duplicating one of the ISPs in the AS-path. A might then be able to keep tweaking
this cost until the inbound loads are comparable, but there is no guarantee (or even likelihood) this will be stable. Outbound load-
balancing is up to A and B’s respective internal routers.
Providers in the business of carrying transit traffic must also make decisions about exactly whose traffic they will carry; these
decisions are again implemented with BGP. In the diagram below, two transit-providing Autonomous Systems B and C connect to
individual sites (or regional ISPs) A and D.
[Figure dual_transit2.svg: transit providers B and C connecting sites A and D at exchange points IXP1 and IXP2]
In the diagram above, the left and right interconnections are shown taking place at Internet exchange points IXP1 and IXP2 (10.4.1
Internet Exchange Points). IXPs are typically where such interconnections take place but are not required; the essential topology is
simply this:

[Figure dual_transit1.svg: the same topology without the exchange points shown]
B would like to make sure C does not attempt to save on its long-haul transit costs by forwarding A⟶D traffic over to B at IXP1,
and D⟶A traffic over to B at IXP2. B avoids this problem by not advertising to C that it can reach A and D, and similarly with C.
Transit providers are quite careful about not advertising reachability to any other AS for whom they do not intend to provide transit
service, because to do so is likely to mean getting stuck with that traffic.
If B advertises to A that it can reach D, then A may accept that route, and send all its D-bound traffic via B, with C not involved at
all. B is not likely to do this unless A pays for the privilege. If B and C both advertise to A that they can reach D, then A has a
choice, which it will make via its best-path-selection rules. But in such a case A will want to be sure that it does not end up paying
full price to both B and C to carry its traffic while using only one of them. Site A might, for example, agree to payment based on
the actual volume of carried traffic, meaning that if it prefers B’s route then it will pay only B.
It is quite possible that B advertises to A that it can reach D, but does not advertise to D that it can reach A. As we have seen, B
advertises to A that it can reach D only if A has paid for this privilege; perhaps D prefers to do business with C rather than with B.
In that case, A-to-D traffic would travel via B, while D-to-A traffic would travel via C.
In the unlikely event that B and C both advertise to one another at IXP1 their route to D, a routing loop may even be created. B
might forward D-bound traffic to C while C forwards it back to B. But in that case B would state, in its next BGP advertisement to
C at IXP1, that it reaches D via an AS-path that begins with C, and C would do similarly. B and C would then see themselves in the
AS-paths they receive and would stop using these routes.

10.6.7.1 MED values and traffic engineering


Let us now address why an AS would bother with importing and using MED values, given that doing so will almost always
increase the site’s cost. Consider the following diagram of autonomous systems AS1 and AS2, with link costs shown:
[Figure two_ixp.svg: AS1 (R1–R–R2, with datacenter DC) and AS2 (with user site A), interconnected at IX1 and IX2; links labeled with costs]
Site DC in the diagram above is a datacenter that wants its user – at site A – to experience high-performance downloads. Perhaps
DC delivers high-performance streaming video, and needs to minimize both congestion and packet losses. In order to achieve this
superior quality, it builds a particularly robust network R1–R–R2, shown above as AS1.
A first step is to have AS1 connect (or peer) directly to customer networks such as AS2, rather than relying on the Internet
backbone. Two such interconnection points are shown above, IX1 and IX2.
At this point, traffic from A to DC will take IX1 (on the shortest path from A to AS1), and so will travel most of the way in AS1.
This is good, but traffic from A to DC is probably mostly acknowledgments; these are unlikely to benefit from the special network.
The actual data, sent from DC to A, will take IX2, because that is AS1’s shortest path to reach AS2. The data will thus travel most
of the way in AS2, bypassing AS1’s high-performance network. This is not what DC wants.
However, the picture changes if AS1 agrees to accept MED information from AS2 (and other providers). If AS2 tells AS1 that
AS2’s preferred link for reaching A is via IX1, then traffic from DC to A will travel through R1 to IX1, and from there onto A.
This keeps DC’s outbound traffic in the AS1 network as long as possible, instead of handing it off to the other network of lower
quality. This is what DC wants; this is why DC built the high-performance network.
Rather than building its own high-performance network, DC might simply contract with an existing high-performance network.
That would make AS1’s business model the following:
peering with as many potential customer networks as possible
importing and using the MED information from those networks
advertising to potential customers like DC that their network will give DC’s users a better experience

10.6.8 BGP and Anycast


In 7.8.4 DNS and CDNs we discussed how some CDNs use DNS tricks to arrange for user traffic to be delivered to the closest edge
server. Another CDN option is anycast: using the same IP address for all the edge servers, and arranging for routers to deliver to
the closest server. IPv6 routers can be configured to have some awareness of anycast delivery, but in IPv4 this must be done more
passively, using BGP.

To implement the anycast approach, the CDN uses the same IP address block at each of its datacenter locations. Each customer has
a server at each CDN datacenter, and each of these servers is assigned the same IP address. It is up to the CDN to make sure that
the content made available at each server is identical.
At each of its locations, the CDN then announces this address block to its local BGP neighbors. Reachability information for the
address block then propagates, via BGP, throughout the Internet. An AS connected to a single CDN datacenter will route the
CDN’s address block to that datacenter. If AS1 hears about the CDN from neighbors AS2 and AS3, then AS1 will apply its usual
best-path-selection process to determine whether to route the CDN’s block via AS2 or AS3. Ultimately, every AS on the Internet
will deploy exactly one route to the CDN. Each such route will lead to one of the CDN’s datacenters, but different ASes may
deploy routes to different datacenters.
One advantage to the anycast approach, over the DNS approach, is that users who use a geographically distant DNS resolver will
not pay a penalty. Another is that the BGP best-path-selection process is likely to produce better routes in general than a process
based solely on geographical distance; for example, ASes may choose best paths based on available bandwidth rather than distance.
In IP routing, geography is not destiny.
It may at first seem odd to have multiple servers with the same public IP address, given that such configuration within an
organization usually represents a dire error. However, none of the CDN’s data centers will use these addresses to talk to one
another; the CDN will arrange for the use of other IP addresses for inter-datacenter traffic.

10.6.9 BGP Relationships


Arbitrarily complex policies may be created through BGP, and, as we shall see in the following section, convergence to a stable set
of routes is not guaranteed. Nonconvergence does not mean distance-vector’s “slow convergence to infinity”, but rather a regular
oscillation of routes among competing alternatives.
It turns out, however, that if some constraints are applied to the different AS-to-AS relationships, then better behavior is obtained.
The paper [LG01] analyzed BGP networks in which each AS-to-AS relationship fit one of the following three business patterns,
discussed further below:
1. Customer to provider (the most common pattern)
2. Peer to peer (eg two top-level providers mutually exchanging traffic)
3. Sibling to sibling (for very close AS-to-AS relationships)
A major consequence of these relationships is the extent to which the autonomous systems involved accept one another’s “non-
customer” routes (below), and hence the extent to which they provide each other with transit services. We start with the most basic
case, that of customer and provider.
If autonomous systems C and P have a customer-to-provider relationship, with C as the customer and P as the provider, then C is
paying P to carry some or all of its traffic to the “outside world”. P may not carry all such traffic, because C may also be a customer
of another provider Pʹ. C may also have its own sub-customers, such as Cʹ:
[Figure customerprovider.svg: customer C with providers P and Pʹ, and C’s own customer Cʹ]
In offering itself as a provider, P will export all the routes it has, from all sources, to C, in effect telling C “this is what I can reach”.
If C has no other providers it might accept these routes in the form of a single default-route entry pointing to P; if C has another
provider Pʹ then it might accept some routes from P and some from Pʹ.
Similarly, C will always export its own routes to P. If C has customers of its own, such as Cʹ, then it will also export those routes to
P. Collectively, we will say that C’s own routes and the routes of its own customers and sub-customers are its customer routes.
But what about non-customer routes, eg routes learned from other providers? These C generally does not export. If C were to
export to P a route to destination D that it learned from second provider Pʹ, then C might end up providing transport service to P,
carrying P’s D-bound traffic to Pʹ. As a customer, this is probably not what C intends.
To summarize, a provider does export its non-customer routes to its customer, but a customer generally does not export its non-
customer routes to its providers. This rule is not, in the world of real business relationships, absolute; AS’s may negotiate all sorts
of special arrangements. A nominal customer might, for example, agree to provide transit service for some set of destinations, in
exchange for a lower-priced rate for the handling of its other traffic. Nonetheless, the rule is largely accurate, and provides a helpful
starting point to understanding customer-provider relationships. Below, in 10.6.9.1 BGP No-Valley Theorem, we will in effect use
this rule as a definition of customer-provider relationships.
Now let us consider a peer-to-peer relationship, which is a connection between two transit providers that have agreed to exchange
all their customer traffic with each other; thus carrying transit traffic for one another. Often the idea is for the interconnection to be
seen as equally valuable by both parties (eg because the parties exchange comparable volumes of traffic); in such a case the
relationship would likely be “settlement-free”, that is, involving no monetary exchange. If, however, the volume flow is
significantly asymmetric then compensation can certainly be negotiated, making the relationship more like customer-to-provider.
As with customers and providers, two peers P1 and P2 each export all their customer routes to the other; that way, P2 knows how
to reach P1’s customers and vice-versa. By doing this, P1 and P2 each carry transit traffic for their own customers.
Peers do not, however, generally export their non-customer routes, in either direction. If P1 learns of a route to destination D from
another peer (or provider) P3, it does not export this to P2. If it were to do so, then P1 would carry non-customer transit traffic from
P2 to P3. Instead, P2 is expected also to peer with P3, and learn of P3’s route to D that way. Alternatively, P3 can become a
customer of P1, and thus pay for P1’s transit carriage of P3’s traffic.
The so-called tier-1 providers are those that are not customers of anyone; these represent the top-level “backbone” providers. Each
tier-1 AS must, as a rule, peer with every other tier-1 AS, though AS’s are free to negotiate exceptions.
Finally, some autonomous system relationships that do not fit the customer-to-provider or peer-to-peer patterns can be
characterized as sibling-to-sibling. Siblings are ISPs that have a close relationship; often siblings are AS’s that, due to mergers, are
now part of the same organization. Siblings may also be nominal competitors who intend to use their mutual link as a cooperative
backup, as in 10.6.7 BGP and Traffic Engineering. Two siblings may or may not have the same upstream ISP as provider.
Siblings typically export everything to one another – both customer and non-customer routes – and thus do potentially use their
connection for transit traffic in both directions (although they may rank routes through one another at low preference, so as to use
the shared link only when nothing else is available).
We can summarize the three kinds of relationships in terms of how they export non-customer routes:
in peer-to-peer relationships, non-customer routes are not exported in either direction.
in customer-to-provider relationships, non-customer routes are exported only from the provider to the customer.
in sibling-to-sibling relationships, non-customer routes are exported in both directions.
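This summary can be captured in a few lines of code; a sketch only, since real-world arrangements are negotiated and more nuanced:

    def should_export(link_type, learned_from):
        """link_type: relationship on the exporting link; learned_from: how the
        route was learned ('customer' covers the AS's own routes as well)."""
        customer_route = (learned_from == "customer")
        if link_type in ("to_customer", "to_sibling"):
            return True               # providers and siblings export everything
        # to a peer or to one's provider: only customer routes are exported
        return customer_route

    print(should_export("to_peer", "provider"))      # False: no transit for peers
    print(should_export("to_provider", "customer"))  # True: customer routes go up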
It is possible to make at least some inferences about BGP relationships from sites’ actual export information, though accuracy is
imperfect because sites may negotiate non-standard arrangements; see [LG01].
In the real world, BGP sibling relationships are relatively rare, probably because they do not really fit the model of traffic carriage
as a service. This may be fortunate, as sibling relationships, with universal and bidirectional route export, tend to introduce the
greatest complexity. The non-convergence examples of 10.6.10 Examples of BGP Instability all require sibling relationships.
One problematic sibling case is the following, in which P1 and P2 are providers for C1 and C2, respectively, and C1 and C2 are
siblings:
[Figure siblingprovider.svg: C1 and C2 are siblings; P1 is C1’s provider and P2 is C2’s provider]
Suppose P1 exports to C1 a route to destination D. C1 then exports it to sibling C2. If C2 treats this as a customer route, it will
export it to P2, in which case C1 and C2 are now providing transit service to traffic from P2 bound for D.
Sibling relationships can be tamed considerably, however, if we adopt a requirement that collections of linked siblings act as a unit,
keeping track of the original non-sibling source (that is, customer, provider or peer) of each route. Let us say that autonomous
systems S and Sʹ are in the same sibling family if there is a chain of autonomous systems S0…Sn so that S=S0, Sn=Sʹ, and each
consecutive pair Si-1 and Si, 1≤i≤n, are siblings. We can then define the following property:

Selective Export Property: A sibling family satisfies this property if, whenever one
member of the family learns of a route from a provider (respectively peer or customer)
then all other members of the family treat the route as a provider (respectively peer or
customer) route when deciding whether to export.

In other words, in the situation diagrammed above, in which C1 has learned of a route to D from its provider P1, C2 will also treat
this route as a non-customer route and will not export it to P2.
In the real world, BGP relationships may not fit any of the above three categories, or else there may be many sibling relationships
for which the selective-export property fails. However, quite often these relationships do hold to a useful degree.
We can also specialize the relationships to a particular set of destinations, or even to an individual destination; for example,
autonomous systems C and P might be said to have a customer-to-provider relationship for destination D if C learned its route to D
from a non-customer, does not export this route to P, and P does export to C its own route to D.
BGP certainly allows for complicated variations: if a regional provider is a customer of a large transit backbone, then the backbone
might announce only the routes listed in the transit agreement (rather than all routes, as above). There is a supposition here that the
regional provider has multiple connections, and has contracted with that particular transit backbone only for certain routes. But we
can fit this into the classification above either by restricting attention to the set of routes listed in the agreement, or by declaring that
in principle the transit provider exports all routes, but the regional customer doesn’t import the ones it hasn’t paid for.

10.6.9.1 BGP No-Valley Theorem


A consequence of adherence to the above classification and attendant export rules is the no-valley theorem of [LG01]: Suppose
every pair of adjacent AS’s has a relationship described by the customer-provider, peer-to-peer or sibling rules above (now taken to
be definitions of these three relationships). In addition, every sibling family abides by the selective-export property. Let A=A0 be an
autonomous system that has received a route to destination D with AS-path ⟨A1,A2,…,An⟩. Then: in this AS-path, there is at most
one peer-to-peer link. Links to the left of the peer-to-peer link (that is, closer to A) are either customer→provider links or
sibling→sibling links; that is, they are non-downwards. To the right of the peer-to-peer link, there are only provider→customer or
sibling→sibling links; that is, these are non-upwards. If there is no peer-to-peer link, then we can still divide the AS-path into a
non-downwards first part and a non-upwards second part.
Intuitively, autonomous systems on the right (non-upwards) part of the path export the route to D as a customer route. Autonomous
systems on the left (non-downwards) part of the path export the route from provider to customer.
The no-valley theorem can be seen as an illustration of the power of the restrictions built into the customer-to-provider and peer-to-
peer export rules.
We give an informal argument for the case in which the AS-path has no peer-to-peer link. First, note that BGP rules mean that each
autonomous system Ai in the path has received the route to D from neighbor Ai+1 with AS-path ⟨Ai+1,…,An⟩.
If the no-valley theorem were to fail, then somewhere along the AS-path in order of increasing i we would have a downward link
followed by, eventually, an upward link. Choose the largest i for which this arrangement appears, and let k be the position of the
first subsequent upward link, so that
Ai to Ai+1 is provider-to-customer
Aj to Aj+1 is sibling-to-sibling for i<j<k-1
Ak-1 to Ak is customer-to-provider.
Then the route to D was acquired by Ak-1 from its provider Ak, and so is a provider route. The set {Ai+1,…,Ak-1} is a sibling
family, and so by the selective-export rule Ai+1 also treats this route to D as a provider route. It therefore cannot export this non-
customer route to the provider Ai, a contradiction.
For the case with a peer-to-peer edge, see exercise 12.0.
If the hypotheses of the no-valley theorem hold only for routes involving a particular destination or set of destinations, then the
theorem is still true for those routes.
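The valley-free condition itself is easy to check mechanically. The following Java sketch (hypothetical names) walks the links of an AS-path from A0 towards An and verifies that no upwards or peer-to-peer link appears once the path has started downwards:

// Hypothetical checker for the no-valley property of an AS-path ⟨A0,…,An⟩,
// given the relationship on each link Ai-to-Ai+1, in left-to-right order.
public class ValleyFree {
    enum Link { CUST_TO_PROV /* upwards */, PROV_TO_CUST /* downwards */, PEER, SIBLING }

    static boolean isValleyFree(Link[] path) {
        boolean descending = false;   // true once we cross the peer link or any downwards link
        for (Link l : path) {
            switch (l) {
                case SIBLING:                          // allowed anywhere
                    break;
                case CUST_TO_PROV:                     // upwards: allowed only before any descent
                    if (descending) return false;
                    break;
                case PEER:                             // at most one, at the "top" of the path
                    if (descending) return false;
                    descending = true;
                    break;
                case PROV_TO_CUST:                     // downwards: begin or continue the descent
                    descending = true;
                    break;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // a valley: down to a customer and then back up to another provider
        Link[] valley = { Link.PROV_TO_CUST, Link.CUST_TO_PROV };
        System.out.println(isValleyFree(valley));   // prints false
    }
}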
The hypotheses of the no-valley theorem are not quite sufficient to guarantee convergence of the BGP system to a stable set of
routes. To ensure convergence in the case without sibling relationships, it is shown in [GR01] that the following simple
local_preference rule suffices:

If AS1 gets two routes r1 and r2 to a destination D, and the first AS of the r1 route is a
customer of AS1, and the first AS of r2 is not, then r1 will be assigned a higher
local_preference value than r2.
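In terms of the Rel type from the export sketch earlier, this rule might be coded as follows; the numeric values are hypothetical, as only their relative order matters:

// Hypothetical sketch of the [GR01] local_preference rule: routes whose
// first AS is a customer get strictly higher preference than all others.
class PreferenceRule {
    static int localPreference(ExportPolicy.Rel firstHop) {
        return (firstHop == ExportPolicy.Rel.CUSTOMER) ? 200 : 100;
    }
}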

More complex rules exist that allow for cases when the local_preference values can be equal; one such rule states that strict
inequality is only required when r2 is a provider route. Other straightforward rules handle the case of sibling relationships, eg by
requiring that siblings have local_preference rules consistent with the use of their shared connection only for backup.
As a practical matter, whether or not actual BGP relationships are consistent with the rules above, arrangements resulting in actual
BGP instability appear rare on the Internet.

10.6.10 Examples of BGP Instability


What if the “normal” rules regarding BGP preferences are not followed? It turns out that BGP allows genuinely unstable situations
to occur; this is a consequence of allowing each AS a completely independent hand in selecting preference functions. Here are two
simple examples, from [GR01].
Example 1: A stable state exists, but convergence to it is not guaranteed. Consider the following network arrangement:
[Figure bgp_instability1.svg: AS1 and AS2 are connected to each other, and each is connected to AS0, to which destination D is attached]

We assume AS1 prefers AS-paths to destination D in the following order:

⟨AS2,AS0⟩, ⟨AS0⟩
That is, ⟨AS2,AS0⟩ is preferred to the direct path ⟨AS0⟩ (one way to express this preference might be “prefer routes for which the
AS-PATH begins with AS2”; perhaps the AS1–AS0 link is more expensive). Similarly, we assume AS2 prefers paths to D in the
order ⟨AS1,AS0⟩, ⟨AS0⟩. Both AS1 and AS2 start out using path ⟨AS0⟩; they advertise this to each other. As each receives the
other’s advertisement, they apply their preference order and therefore each switches to routing D’s traffic to the other; that is, AS1
switches to the route with AS-path ⟨AS2,AS0⟩ and AS2 switches to ⟨AS1,AS0⟩. This, of course, causes a routing loop! However,
once each exports its new path to the other, each will detect the loop in the AS-path, reject the new route, and switch back to
⟨AS0⟩.
This oscillation may continue indefinitely, as long as both AS1 and AS2 switch away from ⟨AS0⟩ at the same moment. If, however,
AS1 switches to ⟨AS2,AS0⟩ while AS2 continues to use ⟨AS0⟩, then AS2 is “stuck” and the situation is stable. In practice,
therefore, eventual convergence to a stable state is likely.
AS1 and AS2 might choose not to export their D-route to each other to avoid this instability. Because they do export this route to
one another, they are siblings in the sense of the previous section.
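The oscillation is easy to reproduce in a toy simulation, under the assumption that the two ASes recompute their routes in perfect synchrony; the class and method names here are hypothetical:

import java.util.*;

// Toy simulation of Example 1: AS1 and AS2 each prefer the path through the
// other, and both recompute and switch routes simultaneously each round.
public class BgpOscillation {
    public static void main(String[] args) {
        List<String> p1 = new ArrayList<>(Arrays.asList("AS0"));   // AS1's current path to D
        List<String> p2 = new ArrayList<>(Arrays.asList("AS0"));   // AS2's current path to D
        for (int round = 1; round <= 6; round++) {
            List<String> next1 = choose("AS1", "AS2", p2);
            List<String> next2 = choose("AS2", "AS1", p1);
            p1 = next1; p2 = next2;   // simultaneous switch, as in the text
            System.out.println("round " + round + ": AS1 uses " + p1 + ", AS2 uses " + p2);
        }
    }

    // Prefer the route through the neighbor, unless its AS-path loops back through us.
    static List<String> choose(String self, String neighbor, List<String> neighborPath) {
        if (!neighborPath.contains(self)) {
            List<String> via = new ArrayList<>();
            via.add(neighbor);
            via.addAll(neighborPath);
            return via;
        }
        return new ArrayList<>(Arrays.asList("AS0"));   // fall back to the direct route
    }
}

The output alternates forever between both ASes using the indirect path and both using ⟨AS0⟩; delaying either AS by one round leaves the other “stuck” and the system stable, as described above.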
Example 2: No stable state exists. This example is from [VGE00]. Assume that the destination D is attached to AS0, and that AS0
in turn connects to AS1, AS2 and AS3 as in the following diagram:
[Figure bgp_example2.svg: destination D attaches to AS0; AS1, AS2 and AS3 each link to AS0 and to one another]

AS1-AS3 each have a direct route to AS0, but we assume each prefers the AS-path that takes their clockwise neighbor; that is, AS1
prefers ⟨AS3,AS0⟩ to ⟨AS0⟩; AS3 prefers ⟨AS2,AS0⟩ to ⟨AS0⟩, and AS2 prefers ⟨AS1,AS0⟩ to ⟨AS0⟩. This is a peculiar, but legal,
example of input filtering.
Suppose all initially adopt AS-path ⟨AS0⟩, and advertise this, and AS1 is the first to look at the incoming advertisements. AS1
switches to the route ⟨AS3,AS0⟩, and announces this to AS2 and AS3.
At this point, AS2 sees that AS1 uses ⟨AS3,AS0⟩; if AS2 switches to AS1 then its path would be ⟨AS1,AS3,AS0⟩ rather than
⟨AS1,AS0⟩ and so it does not make the switch.
But AS3 does switch: it prefers ⟨AS2,AS0⟩ and this is still available. Once it makes this switch, and advertises it, AS1 sees that the
route it had been using, ⟨AS3,AS0⟩, has become ⟨AS3,AS1,AS0⟩. At this point AS1 switches back to ⟨AS0⟩.
Now AS2 can switch to using ⟨AS1,AS0⟩, and does so. After that, AS3 finds it is now using ⟨AS2,AS1,AS0⟩ and it switches back
to ⟨AS0⟩. This allows AS1 to switch to the longer route, and then AS2 switches back to the direct route, and then AS3 gets the
longer route, then AS2 again, etc, forever rotating clockwise.
Because each of AS1, AS2 and AS3 export their route to D to both their neighbors, they must all be siblings of one another.

10.7: Epilog and Exercises
CIDR was a deceptively simple idea. At first glance it is a straightforward extension of the subnet concept, moving the net/host
division point to the left as well as to the right. But it has ushered in true hierarchical routing, most often provider-based. While
CIDR was originally offered as a solution to some early crises in IPv4 address-space allocation, it has been adopted into the core of
IPv6 routing as well.
Interior routing – using either distance-vector or link-state protocols – is neat and mathematical. Exterior routing with BGP is
messy and arbitrary. Perhaps the most surprising thing about BGP is that the Internet works as well as it does, given the complexity
of provider interconnections. The business side of routing almost never has an impact on ordinary users. To an extent, BGP works
well because providers voluntarily limit the complexity of their filtering preferences, but that is largely because the business
relationships of real-world ISPs seldom require complex filtering.

10.8 Exercises
Exercises are given fractional (floating point) numbers, to allow for interpolation of new exercises. Exercise 5.5 is distinct, for
example, from exercises 5.0 and 6.0. Exercises marked with a ♢ have solutions or hints at 24.9 Solutions for Large-Scale IP
Routing.
0.5.♢ Consider the following IP forwarding table that uses CIDR.

destination next_hop

200.0.0.0/8 A

200.64.0.0/10 B

200.64.0.0/12 C

200.64.0.0/16 D

For each of the following IP addresses, indicate to what destination it is forwarded. 64 is 0x40, or 0100 0000 in binary.

(i) 200.63.1.1
(ii) 200.80.1.1
(iii) 200.72.1.1
(iv) 200.64.1.1
1.0. Consider the following IP forwarding table that uses CIDR. IP address bytes are in hexadecimal here, so each hex digit
corresponds to four address bits. This makes prefixes such as /12 and /20 align with hex-digit boundaries. As a reminder of the
hexadecimal numbering, “:” is used as the separator rather than “.”

destination next_hop

81:30:0:0/12 A

81:3c:0:0/16 B

81:3c:50:0/20 C

81:40:0:0/12 D

81:44:0:0/14 E

For each of the following IP addresses, give the next_hop for each entry in the table above that it matches. If there are multiple
matches, use the longest-match rule to identify where the packet would be forwarded.

(i) 81:3b:15:49
(ii) 81:3c:56:14

(iii) 81:3c:85:2e
(iv) 81:4a:35:29
(v) 81:47:21:97
(vi) 81:43:01:c0
2.0. Consider the following IP forwarding table, using CIDR. As in exercise 1.0, IP address bytes are in hexadecimal, and “:” is used
as the separator as a reminder.

destination next_hop

00:0:0:0/2 A

40:0:0:0/2 B

80:0:0:0/2 C

c0:0:0:0/2 D

(a). To what next_hop would each of the following be routed? 63:b1:82:15, 9e:00:15:01, de:ad:be:ef
(b). Explain why every IP address is routed somewhere, even though there is no default entry. Hint: convert the first bytes to binary.
3.0. Give an IPv4 forwarding table – using CIDR – that will route all Class A addresses (first bit 0) to next_hop A, all Class B
addresses (first two bits 10) to next_hop B, and all Class C addresses (first three bits 110) to next_hop C.
4.0. Suppose a router using CIDR has the following entries. Address bytes are in decimal except for the third byte, which is in
binary.

destination next_hop

37.149.0000 0000.0/18 A

37.149.0100 0000.0/18 A

37.149.1000 0000.0/18 A

37.149.1100 0000.0/18 B

If the next_hop for the last entry were also A, we could consolidate these four into a single entry 37.149.0.0/16 → A. But with the
final next_hop as B, how could these four be consolidated into two entries? You will need to assume the longest-match rule.
5.0. Suppose P, Q and R are ISPs with respective CIDR address blocks (with bytes in decimal) 51.0.0.0/8, 52.0.0.0/8 and
53.0.0.0/8. P then has customers A and B, to which it assigns address blocks as follows:

A: 51.10.0.0/16
B: 51.23.0.0/16
Q has customers C and D and assigns them address blocks as follows:

C: 52.14.0.0/16
D: 52.15.0.0/16
(a).♢ Give forwarding tables for P, Q and R assuming they connect to each other and to each of their own customers.
(b). Now suppose A switches from provider P to provider Q, and takes its address block with it. Give the changes to the forwarding
tables for P, Q and R; the longest-match rule will be needed to resolve conflicts.
5.5 Let P, Q and R be the ISPs of exercise 5.0. This time, suppose customer C switches from provider Q to provider R. R will now
have a new entry 52.14.0.0/16 → C. Give the changes to the forwarding tables of P and Q.
6.0. Suppose P, Q and R are ISPs as in exercise 5.0. This time, P and R do not connect directly; they route traffic to one another via
Q. In addition, customer B is multihomed and has a secondary connection to provider R; customer D is also multihomed and has a
secondary connection to provider P. R and P use these secondary connections to send to B and D respectively; however, these
secondary connections are not advertised to other providers. Give forwarding tables for P, Q and R.

7.0. Consider the following network of providers P-S, all using BGP. The providers are the horizontal lines; each provider is its
own AS.
[Figure PQRS.svg: providers P, Q, R and S, drawn as horizontal lines, with their interconnecting links; networks NQ, NR and NS attach to Q, R and S respectively]
(a).♢ What routes to network NS will P receive, assuming each provider exports all its routes to its neighbors without filtering?
For each route, list the AS-path.
(b). What routes to network NQ will P receive? For each route, list the AS-path.
(c). Suppose R now uses export filtering so as not to advertise any of its routes to P, though it does continue to advertise its routes
to S. What routes to network NR will P receive, with AS-paths?
8.0. Consider the following network of Autonomous Systems AS1 through AS6, which double as destinations. When AS1
advertises itself to AS2, for example, the AS-path it provides is ⟨AS1⟩.

AS1────────AS2────────AS3
│ :
│ :
│ :
AS4────────AS5────────AS6

(a). If neither AS3 nor AS6 exports their AS3–AS6 link to their neighbors AS2 and AS5 to the left, what routes will AS2 receive to
reach AS5? Specify routes by AS-path.
(b). What routes will AS2 receive to reach AS6?
(c). Suppose AS3 exports to AS2 its link to AS6, but AS6 continues not to export the AS3–AS6 link to AS5. How will AS5 now
reach AS3? How will AS2 now reach AS6? Assume that there are no local preferences in use in BGP best-path selection, and that
the shortest AS-path wins.
9.0. Suppose that Internet routing in the US used geographical routing, and the first 12 bits of every IP address represent a
geographical area similar in size to a telephone area code. Megacorp gets the prefix 12.34.0.0/16, based geographically in Chicago,
and allocates subnets from this prefix to its offices in all 50 states. Megacorp routes all its internal traffic over its own network.
(a). Assuming all Megacorp traffic must enter and exit in Chicago, what is the route of traffic to and from the San Diego office to a
client also in San Diego?
(b). Now suppose each office has its own link to a local ISP, but still uses its 12.34.0.0/16 IP addresses. Now what is the route of
traffic between the San Diego office and its neighbor?
(c). Suppose Megacorp gives up and gets a separate geographical prefix for each office, eg 12.35.1.0/24 for San Diego and
12.37.3.0/24 for Boston. How must it configure its internal IP forwarding tables to ensure that its internal traffic is still routed
entirely over its own network?
10.0. Suppose we try to use BGP’s strategy of exchanging destinations plus paths as an interior routing-update strategy, perhaps
replacing distance-vector routing. No costs or hop-counts are used, but routers attach to each destination a list of the routers used to
reach that destination. Routers can also have route preferences, such as “prefer my link to B whenever possible”.
(a). Consider the network of 9.2 Distance-Vector Slow-Convergence Problem:

D───────────A───────────B

The D–A link breaks, and B offers A what it thinks is its own route to D. Explain how exchanging path information prevents a
routing loop here.
(b). Suppose the network is as below, and initially each router knows about itself and its immediately adjacent neighbors. What
sequence of router announcements can lead to A reaching F via A→D→E→B→C→F, and what individual router preferences
would be necessary? (Initially, for example, A would reach B directly; what preference might make it prefer A→D→E→B?)

A────────B────────C
│ │ │
│ │ │
│ │ │
D────────E────────F

(c). Explain why this method is equivalent to using the hopcount metric with either distance-vector or link-state routing, if routers
are not allowed to have preferences and if the router-path length is used as a tie-breaker.
11.0. In the following AS-path from AS0 to AS4, with customers lower than providers, how far can a customer route of AS0 be
exported towards AS4? How far can a customer route of AS4 be exported towards AS0?

          AS1
         /   \
        /     \
       /       AS2--peer--AS3
      /                      \
   AS0                       AS4

12.0. Complete the proof of the no-valley theorem of 10.6.9 BGP Relationships to include peer-to-peer links.
(a). Show that the existing argument also works if the Ai-to-Ai+1 link was peer-to-peer rather than provider-to-customer,
establishing that an upwards link cannot appear to the right of a peer-to-peer link.
(b). Show that the existing argument works if the Ak-1-to-Ak link was peer-to-peer rather than customer-to-provider, establishing
that a downwards link cannot appear to the left of a peer-to-peer link.
(c). Show that there cannot be two peer-to-peer links.

11: UDP Transport
The standard transport protocols riding above the IP layer are TCP and UDP. As we saw in Chapter 1, UDP provides simple
datagram delivery to remote sockets, that is, to ⟨host,port⟩ pairs. TCP provides a much richer functionality for sending data, but
requires that the remote socket first be connected. In this chapter, we start with the much-simpler UDP, including the UDP-based
Trivial File Transfer Protocol.
We also review some fundamental issues any transport protocol must address, such as lost final packets and packets arriving late
enough to be subject to misinterpretation upon arrival. These fundamental issues will be equally applicable to TCP connections.

11.1 User Datagram Protocol – UDP


RFC 1122 [https://tools.ietf.org/html/rfc1122.html] refers to UDP as “almost a null protocol”; while that is something of a harsh
assessment, UDP is indeed fairly basic. The two features it adds beyond the IP layer are port numbers and a checksum. The UDP
header consists of the following:
 0              16             32
 +--------------+--------------+
 | Source Port  |  Dest. Port  |
 +--------------+--------------+
 |    Length    |   Checksum   |
 +--------------+--------------+

The port numbers are what makes UDP into a real transport protocol: with them, an application can now connect to an individual
server process (that is, the process “owning” the port number in question), rather than simply to a host.
UDP is unreliable, in that there is no UDP-layer attempt at timeouts, acknowledgment and retransmission; applications written for
UDP must implement these. As with TCP, a UDP ⟨host,port⟩ pair is known as a socket (though UDP ports are considered a
separate namespace from TCP ports). UDP is also unconnected, or stateless; if an application has opened a port on a host, any
other host on the Internet may deliver packets to that ⟨host,port⟩ socket without preliminary negotiation.
An old bit of Internet humor about UDP’s unreliability has it that if I send you a UDP joke, you might not get it.
UDP packets use the 16-bit Internet checksum (5.4 Error Detection) on the data. While it is seldom done today, the checksum can
be disabled and the field set to the all-0-bits value, which never occurs as an actual ones-complement sum. The UDP checksum
covers the UDP header, the UDP data and also a “pseudo-IP header” that includes the source and destination IP addresses. If a NAT
router rewrites an IP address or port, the UDP checksum must be updated.
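The checksum computation itself is simple ones-complement arithmetic; below is a minimal Java sketch, assuming the caller has already concatenated the pseudo-IP header, the UDP header (with its checksum field zeroed), and the data into a single byte array:

// Hypothetical sketch of the 16-bit Internet checksum (5.4 Error Detection),
// applied to buf = pseudo-IP header + UDP header (checksum field 0) + data.
public class InternetChecksum {
    static int checksum(byte[] buf) {
        long sum = 0;
        for (int i = 0; i + 1 < buf.length; i += 2)      // add up 16-bit big-endian words
            sum += ((buf[i] & 0xff) << 8) | (buf[i + 1] & 0xff);
        if (buf.length % 2 == 1)                         // odd length: pad with a zero byte
            sum += (buf[buf.length - 1] & 0xff) << 8;
        while ((sum >> 16) != 0)                         // fold carries back in: ones-complement addition
            sum = (sum & 0xffff) + (sum >> 16);
        return (int) (~sum & 0xffff);                    // final complement
    }

    public static void main(String[] args) {
        byte[] demo = { 0x00, 0x01, (byte) 0xf2, 0x03 };
        System.out.printf("0x%04x%n", checksum(demo));   // 0x0001 + 0xf203 = 0xf204; prints 0x0dfb
    }
}

A computed result of 0 is transmitted as 0xffff, since the all-0-bits value is reserved to mean that the checksum is disabled.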
UDP packets can be dropped due to queue overflows either at an intervening router or at the receiving host. When the latter
happens, it means that packets are arriving faster than the receiver can process them. Higher-level protocols that define ACK
packets (eg UDP-based RPC, below) typically include some form of flow control to prevent this.
UDP is popular for “local” transport, confined to one LAN. In this setting it is common to use UDP as the transport basis for a
Remote Procedure Call, or RPC, protocol. The conceptual idea behind RPC is that one host invokes a procedure on another host;
the parameters and the return value are transported back and forth by UDP. We will consider RPC in greater detail below, in 11.5
Remote Procedure Call (RPC); for now, the point of UDP is that on a local LAN we can fall back on rather simple mechanisms for
timeout and retransmission.
UDP is well-suited for “request-reply” semantics beyond RPC; one can use TCP to send a message and get a reply, but there is the
additional overhead of setting up and tearing down a connection. DNS uses UDP, largely for this reason. However, if there is any
chance that a sequence of request-reply operations will be performed in short order then TCP may be worth the overhead.
UDP is also popular for real-time transport; the issue here is head-of-line blocking. If a TCP packet is lost, then the receiving host
queues any later data until the lost data is retransmitted successfully, which can take several RTTs; there is no option for the
receiving application to request different behavior. UDP, on the other hand, gives the receiving application the freedom simply to
ignore lost packets. This approach is very successful for voice and video, which are loss-tolerant in that small losses simply
degrade the received signal slightly, but delay-intolerant in that packets arriving too late for playback might as well not have
arrived at all. Similarly, in a computer game a lost position update is moot after any subsequent update. Loss tolerance is the reason
the Real-time Transport Protocol, or RTP, is built on top of UDP rather than TCP. It is common for VoIP telephone calls to use
RTP and UDP. See also the NoTCP Manifesto [notcp.io/].
There is a dark side to UDP: it is sometimes the protocol of choice in flooding attacks on the Internet, as it is easy to send UDP
packets with spoofed source address. See the Internet Draft draft-byrne-opsec-udp-advisory [http://tools.ietf.org/html/draft-byr...p-
advisory-00]. That said, it is not especially hard to send TCP connection-request (SYN) packets with spoofed source address. It is,

however, quite difficult to get TCP source-address spoofing to work for long enough that data is delivered to an application
process; see 12.10.1 ISNs and spoofing.
UDP also sometimes enables what are called traffic amplification attacks: the attacker sends a small message to a server, with
spoofed source address, and the server then responds to the spoofed address with a much larger response message. This creates a
larger volume of traffic to the victim than the attacker would be able to generate directly. One approach is for the server to limit the
size of its response – ideally to the size of the client’s request – until it has been able to verify that the client actually receives
packets sent to its claimed IP address. QUIC uses this approach; see 12.22.4.4 Connection handshake and TLS encryption.

11.1.1 QUIC
Sometimes UDP is used simply because it allows new or experimental protocols to run entirely as user-space applications; no
kernel updates are required, as would be the case with TCP changes. Google has created a protocol named QUIC (Quick UDP
Internet Connections, chromium.org/quic [www.chromium.org/quic]) in this category, rather specifically to support the HTTP
protocol. QUIC can in fact be viewed as a transport protocol specifically tailored to HTTPS [en.Wikipedia.org/wiki/HTTPS]:
HTTP plus TLS encryption (22.10.2 TLS).
QUIC also takes advantage of UDP’s freedom from head-of-line blocking. For example, one of QUIC’s goals includes supporting
multiplexed streams in a single connection (eg for the multiple components of a web page). A lost packet blocks its own stream
until it is retransmitted, but the other streams can continue without waiting. An early version of QUIC supported error-correcting
codes (5.4.2 Error-Correcting Codes); this is another feature that would be difficult to add to TCP.
In many cases QUIC eliminates the initial RTT needed for setting up a TCP connection, allowing data delivery with the very first
packet. This usually requires a recent previous connection, however, as otherwise accepting data in the first packet opens the
recipient up to certain spoofing attacks. Also, QUIC usually eliminates the second (and maybe third) RTT needed for negotiating
TLS encryption (22.10.2 TLS).
QUIC provides support for advanced congestion control, currently (2014) including a UDP analog of TCP CUBIC (15.15 TCP
CUBIC). QUIC does this at the application layer but new congestion-control mechanisms within TCP often require client
operating-system changes even when the mechanism lives primarily at the server end. (QUIC may require kernel support to make
use of ECN congestion feedback, 14.8.3 Explicit Congestion Notification (ECN), as this requires setting bits in the IP header.)
QUIC represents a promising approach to using UDP’s flexibility to support innovative or experimental transport-layer features.
One downside of QUIC is its nonstandard programming interface, but note that Google can (and does) achieve widespread web
utilization of QUIC simply by distributing the client side in its Chrome browser. Another downside, more insidious, is that QUIC
breaks the “social contract” that everyone should use TCP so that everyone is on the same footing regarding congestion. It turns
out, though, that TCP users are not in fact all on the same footing, as there are now multiple TCP variants (15 Newer TCP
Implementations). Furthermore, QUIC is supposed to compete fairly with TCP. Still, QUIC does open an interesting can of worms.
Because many of the specific features of QUIC were chosen in response to perceived difficulties with TCP, we will explore the
protocol’s details after introducing TCP, in 12.22.4 QUIC Revisited.

11.1.2 DCCP
The Datagram Congestion Control Protocol, or DCCP, is another transport protocol built atop UDP, preserving UDP’s
fundamental tolerance to packet loss. It is outlined in RFC 4340 [https://tools.ietf.org/html/rfc4340.html]. DCCP adds a number of
TCP-like features to UDP; for our purposes the most significant are connection setup and teardown (see 12.22.3 DCCP) and TCP-
like congestion management (see 14.6.3 DCCP Congestion Control).
DCCP data packets, while numbered, are delivered to the application in order of arrival rather than in order of sequence number.
DCCP also adds acknowledgments to UDP, but in a specialized form primarily for congestion control. There is no assumption that
unacknowledged data packets will ever be retransmitted; that decision is entirely up to the application. Acknowledgments can
acknowledge single packets or, through the DCCP acknowledgment-vector format, all packets received in a range of recent
sequence numbers (SACK TCP, 13.6 Selective Acknowledgments (SACK), also supports this).
DCCP does support reliable delivery of control packets, used for connection setup, teardown and option negotiation. Option
negotiation can occur at any point during a connection.

DCCP packets include not only the usual application-specific UDP port numbers, but also a 32-bit service code. This allows finer-
grained packet handling as it unambiguously identifies the processing requested by an incoming packet. The use of service codes
also resolves problems created when applications are forced to use nonstandard port numbers due to conflicts.
DCCP is specifically intended to run in the operating-system kernel, rather than in user space. This is because the ECN
congestion-feedback mechanism (14.8.3 Explicit Congestion Notification (ECN)) requires setting flag bits in the IP header, and
most kernels do not allow user-space applications to do this.

11.1.3 UDP Simplex-Talk


One of the early standard examples for socket programming is simplex-talk. The client side reads lines of text from the user’s
terminal and sends them over the network to the server; the server then displays them on its terminal. The server does not
acknowledge anything sent to it, or in fact send any response to the client at all. “Simplex” here refers to the one-way nature of the
flow; “duplex talk” is the basis for Instant Messaging, or IM.
Even at this simple level we have some details to attend to regarding the data protocol: we assume here that the lines are sent with a
trailing end-of-line marker. In a world where different OS’s use different end-of-line marks, including them in the transmitted data
can be problematic. However, when we get to the TCP version, if arriving packets are queued for any reason then the embedded
end-of-line character will be the only thing to separate the arriving data into lines.
As with almost every Internet protocol, the server side must select a port number, which with the server’s IP address will form the
socket address to which clients connect. Clients must discover that port number or have it written into their application code.
Clients too will have a port number, but it is largely invisible.
On the server side, simplex-talk must do the following:
ask for a designated port number
create a socket, the sending/receiving endpoint
bind the socket to the socket address, if this is not done at the point of socket creation
receive packets sent to the socket
for each packet received, print its sender and its content
The client side has a similar list:
look up the server’s IP address, using DNS
create an “anonymous” socket; we don’t care what the client’s port number is
read a line from the terminal, and send it to the socket address ⟨server_IP,port⟩
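Before turning to the server, here is a minimal sketch of a client following the list above; it sends a single fixed line rather than reading from the terminal, and the host name is a placeholder:

import java.net.*;
import java.io.*;

// Minimal client sketch: look up the server, create an anonymous socket,
// and send one newline-terminated line to port 5432.
public class stalkc_sketch {
    public static void main(String[] args) throws IOException {
        InetAddress dest = InetAddress.getByName("localhost");   // DNS lookup; placeholder host
        DatagramSocket s = new DatagramSocket();                 // anonymous socket: any port
        byte[] msg = "hello\n".getBytes();                       // the newline is part of the data
        s.send(new DatagramPacket(msg, msg.length, dest, 5432));
        s.close();
    }
}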

11.1.3.1 The Server


We will start with the server side, presented here in Java. The Java socket implementation is based mostly on the BSD socket
library, 1.16 Berkeley Unix. We will use port 5432; this can easily be changed if, for example, on startup an error message like
“cannot create socket with port 5432” appears. The port we use here, 5432, has also been adopted by PostgreSQL for TCP
connections. (The client, of course, would also need to be changed.)
Java DatagramPacket type
Java DatagramPacket objects contain the packet data and the ⟨IP_address,port⟩ source or destination. Packets themselves
combine both data and address, of course, but nonetheless combining these in a single programming-language object is not an
especially common design choice. The original BSD socket library implemented data and address as separate parameters, and
many other languages have followed that precedent. A case can be made that the Java approach violates the single-responsibility
principle [en.Wikipedia.org/wiki/Single...ity_principle], because data and address are so often handled separately.
The socket-creation and port-binding operations are combined into the single operation
new DatagramSocket(destport) . Once created, this socket will receive packets from any host that addresses a packet to
it; there is no need for preliminary connection. In the original BSD socket library, a socket is created with socket() and bound
to an address with the separate operation bind() .

The server application needs no parameters; it just starts. (That said, we could make the port number a parameter, to allow easy
change.) The server accepts both IPv4 and IPv6 connections; we return to this below.
Though it plays no role in the protocol, we will also have the server time out every 15 seconds and display a message, just to show
how this is done. Implementations of real UDP protocols must essentially always arrange, when attempting to receive a packet, to
time out after a certain interval with no response.
The file below is at udp_stalks.java.

/* simplex-talk server, UDP version */

import java.net.*;
import java.io.*;

public class stalks {

    static public int destport = 5432;
    static public int bufsize = 512;
    static public final int timeout = 15000; // time in milliseconds

    static public void main(String args[]) {
        DatagramSocket s; // UDP uses DatagramSockets

        try {
            s = new DatagramSocket(destport);
        }
        catch (SocketException se) {
            System.err.println("cannot create socket with port " + destport);
            return;
        }
        try {
            s.setSoTimeout(timeout); // set timeout in milliseconds
        } catch (SocketException se) {
            System.err.println("socket exception: timeout not set!");
        }

        // create DatagramPacket object for receiving data:
        DatagramPacket msg = new DatagramPacket(new byte[bufsize], bufsize);

        while(true) { // read loop
            try {
                msg.setLength(bufsize); // max received packet size
                s.receive(msg); // the actual receive operation
                System.err.println("message from <" +
                    msg.getAddress().getHostAddress() + "," + msg.getPort() + ">");
            } catch (SocketTimeoutException ste) { // receive() timed out
                System.err.println("Response timed out!");
                continue;
            } catch (IOException ioe) { // should never happen!
                System.err.println("Bad receive");
                break;
            }

            String str = new String(msg.getData(), 0, msg.getLength());
            System.out.print(str); // newline must be part of str
        }
        s.close();
    } // end of main
}

12: TCP Transport
The standard transport protocols riding above the IP layer are TCP and UDP. As we saw in 11 UDP Transport, UDP provides
simple datagram delivery to remote sockets, that is, to ⟨host,port⟩ pairs. TCP provides a much richer functionality for sending data
to (connected) sockets.
TCP is quite different in several dimensions from UDP. TCP is stream-oriented, meaning that the application can write data in
very small or very large amounts and the TCP layer will take care of appropriate packetization (and also that TCP transmits a
stream of bytes, not messages or records; cf 12.22.2 SCTP). TCP is connection-oriented, meaning that a connection must be
established before the beginning of any data transfer. TCP is reliable, in that TCP uses sequence numbers to ensure the correct
order of delivery and a timeout/retransmission mechanism to make sure no data is lost short of massive network failure. Finally,
TCP automatically uses the sliding-windows algorithm to achieve throughput relatively close to the maximum available.
These features mean that TCP is very well suited for the transfer of large files. The two endpoints open a connection, the file data is
written by one end into the connection and read by the other end, and the features above ensure that the file will be received
correctly. TCP also works quite well for interactive applications where each side is sending and receiving streams of small packets.
Examples of this include ssh or telnet, where packets are exchanged on each keystroke, and database connections that may carry
many queries per second. TCP even works reasonably well for request/reply protocols, where one side sends a single message, the
other side responds, and the connection is closed. The drawback here, however, is the overhead of setting up a new connection for
each request; a better application-protocol design might be to allow multiple request/reply pairs over a single TCP connection.
Note that the connection-orientation and reliability of TCP represent abstract features built on top of the IP layer, which supports
neither of them.
The connection-oriented nature of TCP warrants further explanation. With UDP, if a server opens a socket (the OS object, with
corresponding socket address), then any client on the Internet can send to that socket, via its socket address. Any UDP application,
therefore, must be prepared to check the source address of each packet that arrives. With TCP, all data arriving at a connected
socket must come from the other endpoint of the connection. When a server S initially opens a socket s, that socket is
“unconnected”; it is said to be in the LISTEN state. While it still has a socket address consisting of its host and port, a LISTENing
socket will never receive data directly. If a client C somewhere on the Internet wishes to send data to s, it must first establish a
connection, which will be defined by the socketpair consisting of the socket addresses (that is, the ⟨IP_addr,port⟩ pairs) at both C
and S. As part of this connection process, a new connected child socket sC will be created; it is sC that will receive any data sent
from C. Usually, the server will also create a new thread or process to handle communication with sC. Typically the server will
have multiple connected children of the original socket s, and, for each one, a process attached to it.
If C1 and C2 both connect to s, two connected sockets at S will be created, s1 and s2, and likely two separate processes. When a
packet arrives at S addressed to the socket address of s, the source socket address will also be examined to determine whether the
data is part of the C1–S or the C2–S connection, and thus whether a read on s1 or on s2, respectively, will see the data.
If S is acting as an ssh server, the LISTENing socket listens on port 22, and the connected child sockets correspond to the separate
user login connections; the process on each child socket represents the login process of that user, and may run for hours or days.
In Chapter 1 we likened TCP sockets to telephone connections, with the server like one high-volume phone number 800-BUY-
NOWW. The unconnected socket corresponds to the number everyone dials; the connected sockets correspond to the actual calls.
(This analogy breaks down, however, if one looks closely at the way such multi-operator phone lines are actually configured: each
typically does have its own number.)

12.1 The End-to-End Principle


The End-to-End Principle is spelled out in [SRC84]; it states in effect that transport issues are the responsibility of the endpoints in
question and thus should not be delegated to the core network. This idea has been very influential in TCP design.
Two issues falling under this category are data corruption and congestion. For the first, even though essentially all links on the
Internet have link-layer checksums to protect against data corruption, TCP still adds its own checksum (in part because of a history
of data errors introduced within routers). For the latter, TCP is today essentially the only layer that addresses congestion
management.
Saltzer, Reed and Clark categorized functions that were subject to the End-to-End principle this way:

The function in question can completely and correctly be implemented only with the
knowledge and help of the application standing at the end points of the communication
system. Therefore, providing that questioned function as a feature of the communication
system itself is not possible. (Sometimes an incomplete version of the function provided by
the communication system may be useful as a performance enhancement.)
This does not mean that the backbone Internet should not concern itself with congestion; it means that backbone congestion-
management mechanisms should not completely replace end-to-end congestion management.

12.2 TCP Header


Below is a diagram of the TCP header. As with UDP, source and destination ports are 16 bits. The 4-bit Data Offset field specifies
the number of 32-bit words in the header; if no options are present its value is 5.
As with UDP, the checksum covers the TCP header, the TCP data and an IP “pseudo header” that includes the source and
destination IP addresses. The checksum must be updated by a NAT router that modifies any header values. (Although the IP and
TCP layers are theoretically separate, and RFC 793 in some places appears to suggest that TCP can be run over a non-IP
internetwork layer, RFC 793 also explicitly defines 4-byte addresses for the pseudo header. RFC 2460 officially redefined the
pseudo header to allow IPv6 addresses.)
 0                   8                  16                  24                 32
 +---------------------------------------+---------------------------------------+
 |              Source Port              |           Destination Port            |
 +---------------------------------------+---------------------------------------+
 |                                Sequence Number                                |
 +-------------------------------------------------------------------------------+
 |                             Acknowledgment Number                             |
 +---------+---------+-+-+-+-+-+-+-+-+---+---------------------------------------+
 |  Data   |         |C|E|U|A|P|R|S|F|   |                                       |
 | Offset  |Reserved |W|C|R|C|S|S|Y|I|   |              Window Size              |
 |         |         |R|E|G|K|H|T|N|N|   |                                       |
 +---------+---------+-+-+-+-+-+-+-+-+---+---------------------------------------+
 |               Checksum                |            Urgent Pointer             |
 +---------------------------------------+---------------------------------------+
 |                            Options                            |    Padding    |
 +-------------------------------------------------------------------------------+
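To make the layout concrete, here is a hypothetical Java sketch extracting a few of these fields from a raw header; all multi-byte values are big-endian (network byte order):

// Hypothetical sketch: extracting selected fields from a raw TCP header.
public class TcpHeader {
    static int sourcePort(byte[] buf) { return ((buf[0] & 0xff) << 8) | (buf[1] & 0xff); }
    static int destPort(byte[] buf)   { return ((buf[2] & 0xff) << 8) | (buf[3] & 0xff); }

    static long seqNumber(byte[] buf) {        // bytes 4-7, as an unsigned 32-bit value
        return ((long) (buf[4] & 0xff) << 24) | ((buf[5] & 0xff) << 16)
             | ((buf[6] & 0xff) << 8) | (buf[7] & 0xff);
    }

    static int headerLength(byte[] buf) {      // Data Offset is in units of 32-bit words
        return ((buf[12] & 0xf0) >>> 4) * 4;   // 20 bytes if no options are present
    }

    static boolean synFlag(byte[] buf) { return (buf[13] & 0x02) != 0; }
    static boolean ackFlag(byte[] buf) { return (buf[13] & 0x10) != 0; }
    static boolean finFlag(byte[] buf) { return (buf[13] & 0x01) != 0; }
}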

The sequence and acknowledgment numbers are for numbering the data, at the byte level. This allows TCP to send 1024-byte
blocks of data, incrementing the sequence number by 1024 between successive packets, or to send 1-byte telnet packets,
incrementing the sequence number by 1 each time. There is no distinction between DATA and ACK packets; all packets carrying
data from A to B also carry the most current acknowledgment of data sent from B to A. Many TCP applications are largely
unidirectional, in which case the sender would include essentially the same acknowledgment number in each packet while the
receiver would include essentially the same sequence number.
It is traditional to refer to the data portion of TCP packets as segments.
TCP History
The clear-cut division between the IP and TCP headers did not spring forth fully formed. See [CK74] for a discussion of a proto-
TCP in which the sequence number (but not the acknowledgment number) appeared in the equivalent of the IP header (perhaps so
it could be used for fragment reassembly).
The value of the sequence number, in relative terms, is the position of the first byte of the packet in the data stream, or the position
of what would be the first byte in the case that no data was sent. The value of the acknowledgment number, again in relative terms,
represents the byte position for the next byte expected. Thus, if a packet contains 1024 bytes of data and the first byte is number 1,
then that would be the sequence number. The data bytes would be positions 1-1024, and the ACK returned would have
acknowledgment number 1025.
The sequence and acknowledgment numbers, as sent, represent these relative values plus an Initial Sequence Number, or ISN,
that is fixed for the lifetime of the connection. Each direction of a connection has its own ISN; see below.
TCP acknowledgments are cumulative: when an endpoint sends a packet with an acknowledgment number of N, it is
acknowledging receipt of all data bytes numbered less than N. Standard TCP provides no mechanism for acknowledging receipt of
packets 1, 2, 3 and 5; the highest cumulative acknowledgment that could be sent in that situation would be to acknowledge packet
3.

The TCP header defines some important flag bits; the brief definitions here are expanded upon in the sequel:
SYN: for SYNchronize; marks packets that are part of the new-connection handshake
ACK: indicates that the header Acknowledgment field is valid; that is, all but the first packet
FIN: for FINish; marks packets involved in the connection closing
PSH: for PuSH; marks “non-full” packets that should be delivered promptly at the far end
RST: for ReSeT; indicates various error conditions
URG: for URGent; part of a now-seldom-used mechanism for high-priority data
CWR and ECE: part of the Explicit Congestion Notification mechanism, 14.8.3 Explicit Congestion Notification (ECN)

12.3 TCP Connection Establishment


TCP connections are established via an exchange known as the three-way handshake. If A is the client and B is the LISTENing
server, then the handshake proceeds as follows:
A sends B a packet with the SYN bit set (a SYN packet)
B responds with a SYN packet of its own; the ACK bit is now also set
A responds to B’s SYN with its own ACK
A                        B
SYN ──────────▶
    ◀────────── SYN+ACK
ACK ──────────▶

TCP three-way handshake

Normally, the three-way handshake is triggered by an application’s request to connect; data can be sent only after the handshake
completes. This means a one-RTT delay before any data can be sent. The original TCP standard RFC 793 does allow data to be
sent with the first SYN packet, as part of the handshake, but such data cannot be released to the remote-endpoint application until
the handshake completes. Most traditional TCP programming interfaces offer no support for this early-data option.
There are recurrent calls for TCP to support data transmission within the handshake itself, so as to achieve request/reply turnaround
comparable to that with RPC (11.5 Remote Procedure Call (RPC)). We return to this in 12.12 TCP Faster Opening.
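In the Java socket API used for this book’s examples, the handshake is carried out by the kernel: the client’s Socket constructor sends the SYN and blocks until the handshake completes, and the server’s accept() returns the connected child socket. A minimal single-process sketch (the port number is arbitrary):

import java.net.*;

public class HandshakeDemo {
    public static void main(String[] args) throws Exception {
        ServerSocket ss = new ServerSocket(5432);       // the LISTENing socket

        // Normally in a separate process: the constructor performs the
        // client side of the three-way handshake.
        Socket client = new Socket("localhost", 5432);

        Socket child = ss.accept();                     // the connected child socket
        System.out.println("connected: " + child.getRemoteSocketAddress());

        client.close(); child.close(); ss.close();
    }
}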
The three-way handshake is vulnerable to an attack known as SYN flooding. The attacker sends a large number of SYN packets to
a server B. For each arriving SYN, B must allocate resources to keep track of what appears to be a legitimate connection request;
with enough requests, B’s resources may face exhaustion. SYN flooding is easiest if the SYN packets are simply spoofed, with
forged, untraceable source-IP addresses; see spoofing at 7.1 The IPv4 Header, and 12.10.1 ISNs and spoofing below. SYN-flood
attacks can also take the form of a large number of real connection attempts from a large number of real clients – often
compromised and pressed into service by some earlier attack – but this requires considerably more resources on the part of the
attacker. See 12.22.2 SCTP for an alternative handshake protocol (unfortunately not available to TCP) intended to mitigate SYN-
flood attacks, at least from spoofed SYNs.
To close the connection, a superficially similar exchange involving FIN packets may occur:
A sends B a packet with the FIN bit set (a FIN packet), announcing that it has finished sending data
B sends A an ACK of the FIN
B may continue to send additional data to A
When B is also ready to cease sending, it sends its own FIN to A
A sends B an ACK of the FIN; this is the final packet in the exchange
Here’s the ladder diagram for this:

A                        B
FIN ──────────▶
    ◀────────── ACK
    ◀────────── More DATA (optional)
    ◀────────── FIN
ACK ──────────▶

A typical TCP close

The FIN handshake is really more like two separate two-way FIN/ACK handshakes. We will return to TCP connection closing in
12.7.1 Closing a connection.
Now let us look at the full exchange of packets in a representative connection, in which A sends strings “abc”, “defg”, and “foobar”
(RFC 3092). B replies with “hello”, at which point A sends “goodbye” and closes the connection. In the following table, relative
sequence numbers are used, which is to say that sequence numbers begin with 0 on each side. The SEQ numbers in bold on the A
side correspond to the ACK numbers in bold on the B side; they both count data flowing from A to B.

       A sends                                    B sends
 1     SYN, seq=0
 2                                                SYN+ACK, seq=0, ack=1 (expecting)
 3     ACK, seq=1, ack=1 (ACK of SYN)
 4     “abc”, seq=1, ack=1
 5                                                ACK, seq=1, ack=4
 6     “defg”, seq=4, ack=1
 7                                                seq=1, ack=8
 8     “foobar”, seq=8, ack=1
 9                                                seq=1, ack=14, “hello”
 10    seq=14, ack=6, “goodbye”
 11,12 seq=21, ack=6, FIN                         seq=6, ack=21 ;; ACK of “goodbye”, crossing packets
 13                                               seq=6, ack=22 ;; ACK of FIN
 14                                               seq=6, ack=22, FIN
 15    seq=22, ack=7 ;; ACK of FIN
(We will see below that this table is slightly idealized, in that real sequence numbers do not start at 0.)
Here is the ladder diagram corresponding to this connection:

A                          B
SYN ───────────▶
    ◀─────────── SYN+ACK
ACK ───────────▶
“abc” ─────────▶
    ◀─────────── ACK
“defg” ────────▶
    ◀─────────── ACK
“foobar” ──────▶
    ◀─────────── ACK, “hello”
“goodbye” ─────▶
FIN ───────────▶
    ◀─────────── ACK of “goodbye”   (crossing packets)
    ◀─────────── ACK of FIN
    ◀─────────── FIN
ACK ───────────▶

In terms of the sequence and acknowledgment numbers, SYNs count as 1 byte, as do FINs. Thus, the SYN counts as sequence
number 0, and the first byte of data (the “a” of “abc”) counts as sequence number 1. Similarly, the ack=21 sent by the B side is the
acknowledgment of “goodbye”, while the ack=22 is the acknowledgment of A’s subsequent FIN.
Whenever B sends ACK=n, A follows by sending more data with SEQ=n.
TCP does not in fact transport relative sequence numbers, that is, sequence numbers as transmitted do not begin at 0. Instead, each
side chooses its Initial Sequence Number, or ISN, and sends that in its initial SYN. The third ACK of the three-way handshake is
an acknowledgment that the server side’s SYN response was received correctly. All further sequence numbers sent are the ISN
chosen by that side plus the relative sequence number (that is, the sequence number as if numbering did begin at 0). If A chose
ISNA=1000, we would add 1000 to all the bold entries above: A would send SYN(seq=1000), B would reply with ISNB and
ack=1001, and the last two lines would involve ack=1022 and seq=1022 respectively. Similarly, if B chose ISNB=7000, then we
would add 7000 to all the seq values in the “B sends” column and all the ack values in the “A sends” column. The table above, up to
the point where B sends “goodbye”, with actual sequence numbers instead of relative sequence numbers, is below:

       A, ISN=1000                                B, ISN=7000
 1     SYN, seq=1000
 2                                                SYN+ACK, seq=7000, ack=1001
 3     ACK, seq=1001, ack=7001
 4     “abc”, seq=1001, ack=7001
 5                                                ACK, seq=7001, ack=1004
 6     “defg”, seq=1004, ack=7001
 7                                                seq=7001, ack=1008
 8     “foobar”, seq=1008, ack=7001
 9                                                seq=7001, ack=1014, “hello”
 10    seq=1014, ack=7006, “goodbye”
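The conversion between relative and absolute sequence numbers is ordinary addition, except that it wraps around: sequence arithmetic is modulo 2^32. A minimal sketch, with hypothetical names:

// Hypothetical helper: convert a relative TCP sequence number to the value
// actually transmitted, given the connection's ISN. Arithmetic is mod 2^32.
public class SeqNumbers {
    static long absolute(long isn, long relative) {
        return (isn + relative) & 0xffffffffL;
    }

    public static void main(String[] args) {
        System.out.println(absolute(1000, 22));   // prints 1022, as in the ISNA=1000 example
    }
}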

If B had not been LISTENing at the port to which A sent its SYN, its response would have been RST (“reset”), meaning in this
context “connection refused”. Similarly, if A sent data to B before the SYN packet, the response would have been RST.
Finally, a RST can be sent by either side at any time to abort the connection. Sometimes routers along the path send “spoofed”
RSTs to tear down TCP connections they are configured to regard as undesired; see 7.7.2 Middleboxes and RFC 3360. Worse,
sometimes external attackers are able to tear down a TCP connection with a spoofed RST; this requires brute-force guessing the
endpoint port numbers and an in-window sequence number (RFC 793 does not require the RST packet’s ACK value to match). In
the days of 4 kB window sizes, guessing a valid sequence number was a one-in-a-million chance (a 4 kB window covers 2^12 of
the 2^32 possible sequence numbers, or 1 in 2^20), but window sizes have steadily increased (14.9 The High-Bandwidth TCP
Problem); a 4 MB window size makes such guessing quite feasible. See also RFC 4953, the RST-validation fix proposed in RFC
5961 §3.2, and exercise 6.5.
If A sends a series of small packets to B, then B has the option of assembling them into a full-sized I/O buffer before releasing
them to the receiving application. However, if A sets the PSH bit on each packet, then B should release each packet immediately to
the receiving application. In Berkeley Unix and most (if not all) BSD-derived socket-library implementations, there is in fact no
way to set the PSH bit; it is set automatically for each write. (But even this is not guaranteed as the sender may leave the bit off or
consolidate several PuSHed writes into one packet; this makes using the PSH bit as a record separator difficult. In the program
written to generate the WireShark packet trace, below, most of the time the strings “abc”, “defg”, etc were PuSHed separately but
occasionally they were consolidated into one packet.)
As for the URG bit, imagine a telnet (or ssh) connection, in which A has sent a large amount of data to B, which is momentarily
stalled processing it. The application at A wishes to abort that processing by sending the interrupt character CNTL-C. Under
normal conditions, the application at B would have to finish processing all the pending data before getting to the CNTL-C;
however, the use of the URG bit can enable immediate asynchronous delivery of the CNTL-C. The bit is set, and the TCP header’s
Urgent Pointer field points to the CNTL-C in the current packet, far ahead in the normal data stream. The receiving application
then skips ahead in its processing of the arriving data stream until it reaches the urgent data. For this to work, the receiving
application process must have signed up to receive an asynchronous signal when urgent data arrives.
The urgent data does appear as part of the ordinary TCP data stream, and it is up to the protocol to determine the start of the data
that is to be considered urgent, and what to do with the unread, buffered data sent ahead of the urgent data. For the CNTL-C
example in the telnet protocol (RFC 854), the urgent data might consist of the telnet “Interrupt Process” byte, preceded by the
“Interpret as Command” escape byte, and the earlier data is simply discarded.
Officially, the Urgent Pointer value, when the URG bit is set, contains the offset from the start of the current packet data to the end
of the urgent data; it is meant to tell the receiver “you should read up to this point as soon as you can”. The original intent was for
the urgent pointer to mark the last byte of the urgent data, but §3.1 of RFC 793 got this wrong and declared that it pointed to the
first byte following the urgent data. This was corrected in RFC 1122, but most implementations to this day abide by the “incorrect”
interpretation. RFC 6093 discusses this and proposes, first, that the near-universal “incorrect” interpretation be accepted as
standard, and, second, that developers avoid the use of the TCP urgent-data feature.

12.4 TCP and WireShark


Below is a screenshot of the WireShark program displaying a tcpdump capture intended to represent the TCP exchange above.
Both hosts involved in the packet exchange were Linux systems. Side A uses socket address ⟨10.0.0.3,45815⟩ and side B (the
server) uses ⟨10.0.0.1,54321⟩.
WireShark is displaying relative TCP sequence numbers. The first three packets correspond to the three-way handshake, and packet
4 is the first data packet. Every data packet has the flags [PSH, ACK] displayed. The data in the packet can be inferred from the
WireShark Len field, as each of the data strings sent has a different length.

The packets are numbered the same as in the table above up through packet 8, containing the string “foobar”. At that point the table
shows B replying by a combined ACK plus the string “hello”; in fact, TCP sent the ACK alone and then the string “hello”; these
are WireShark packets 9 and 10 (note packet 10 has Len=5). WireShark packet 11 is then a standalone ACK from A to B,
acknowledging the “hello”. WireShark packet 12 (the packet highlighted) then corresponds to table packet 10, and contains
“goodbye” (Len=7); this string can be seen at the right side of the bottom pane.
The table view shows A’s FIN (packet 11) crossing with B’s ACK of “goodbye” (packet 12). In the WireShark view, A’s FIN is
packet 13, and is sent about 0.01 seconds after “goodbye”; B then ACKs them both with packet 14. That is, the table-view packet
12 does not exist in the WireShark view.
Packets 11, 13, 14 and 15 in the table and 13, 14, 15 and 16 in the WireShark screen dump correspond to the connection closing.
The program that generated the exchange at B’s side had to include a “sleep” delay of 40 ms between detecting the closed
connection (that is, reading A’s FIN) and closing its own connection (and sending its own FIN); otherwise the ACK of A’s FIN
traveled in the same packet with B’s FIN.
The ISN for A in this example was 551144795 and B’s ISN was 1366676578. The actual pcap packet-capture file is at
demo_tcp_connection.pcap.

12.5 TCP Offloading


In the Wireshark example above, the hardware involved used TCP checksum offloading, or TCO, to have the network-interface
card do the actual checksum calculations; this permits a modest amount of parallelism. As a result, the checksums for outbound
packets are wrong in the capture file. WireShark has an option to disable the reporting of this.
It is also possible, with many newer network-interface cards, to offload the TCP segmentation process to the LAN hardware; this is
most useful when the application is writing data continuously and is known as TCP segmentation offloading, or TSO. The use of
TSO requires TCO, but not vice-versa.
TSO can be divided into large send offloading, LSO, for outbound traffic, and large receive offloading, LRO, for inbound. For
outbound offloading, the host system transfers to the network card a large buffer of data (perhaps 64 kB), together with information
about the headers. The network card then divides the buffer into 1500-byte packets, with proper TCP/IP headers, and sends them
off.
For inbound offloading, the network card accumulates multiple inbound packets that are part of the same TCP connection, and
consolidates them in proper sequence to one much larger packet. This means that the network card, upon receiving one packet,
must wait to see if there will be more. This wait is very short, however, at most a few milliseconds. Specifically, all consolidated
incoming packets must have the same TCP Timestamp value (12.11 Anomalous TCP scenarios).
TSO is of particular importance at very high bandwidths. At 10 Gbps, a system can send or receive close to a million packets per
second, and offloading some of the packet processing to the network card can be essential to maintaining high throughput. TSO
allows a host system to behave as if it were reading or writing very large packets, and yet the actual packet size on the wire remains
at the standard 1500 bytes.
On Linux systems, the status of TCO and TSO can be checked using the command ethtool --show-offload interface.
TSO can be disabled with ethtool --offload interface tso off .

12.6 TCP simplex-talk


Here is a Java version of the simplex-talk server for TCP. As with the UDP version, we start by setting up the socket, here a
ServerSocket called ss . This socket remains in the LISTEN state throughout. The main while loop then begins with
the call ss.accept() at the start; this call blocks until an incoming connection is established, at which point it returns the
connected child socket s. The accept() call models the TCP protocol behavior of waiting for three-way handshakes initiated
by remote hosts and, for each, setting up a new connection.
Connections will be accepted from all IP addresses of the server host, eg the “normal” IP address, the loopback address 127.0.0.1
and, if the server is multihomed, any additional IP addresses. Unlike the UDP case (11.1.3.2 UDP and IP addresses), RFC 1122
requires (§4.2.3.7) that server response packets always be sent from the same server IP address that the client first used to contact
the server. (See exercise 13.0 for an example of non-compliance.)
A server application can process these connected children either serially or in parallel. The stalk version here can handle both
situations, either one connection at a time ( THREADING = false ), or by creating a new thread for each connection (
THREADING = true ). Either way, the connected child socket is turned over to line_talker() , either as a synchronous
procedure call or as a new thread. Data is then read from the socket’s associated InputStream using the ordinary read()
call, versus the receive() used to read UDP packets. The main loop within line_talker() does not terminate until the
client closes the connection (or there is an error).
In the serial, non-threading mode, if a second client connection is made while the first is still active, then data can be sent on the
second connection but it sits in limbo until the first connection closes, at which point control returns to the ss.accept() call,
the second connection is processed, and the second connection’s data suddenly appears.
In the threading mode, the main loop spends almost all its time waiting in ss.accept() ; when this returns a child connection
we immediately spawn a new thread to handle it, allowing the parent process to go back to ss.accept() . This allows the
program to accept multiple concurrent client connections, like the UDP version.
The code here serves as a very basic example of the creation of Java threads. The inner class Talker has a run() method,
needed to implement the runnable interface. To start a new thread, we create a new Talker instance; the start() call
then begins Talker.run() , which runs for as long as the client keeps the connection open. The file here is tcp_stalks.java.

/* THREADED simplex-talk TCP server */
/* can handle multiple CONCURRENT client connections */
/* newline is to be included at client side */

import java.net.*;
import java.io.*;

public class tstalks {

    static public int destport = 5431;
    static public int bufsize = 512;
    static public boolean THREADING = true;

    static public void main(String args[]) {
        ServerSocket ss;
        Socket s;
        try {
            ss = new ServerSocket(destport);
        } catch (IOException ioe) {
            System.err.println("can't create server socket");
            return;
        }
        System.err.println("server starting on port " + ss.getLocalPort());

        while(true) { // accept loop
            try {
                s = ss.accept();
            } catch (IOException ioe) {
                System.err.println("Can't accept");
                break;
            }

            if (THREADING) {
                Talker talk = new Talker(s);
                (new Thread(talk)).start();
            } else {
                line_talker(s);
            }
        } // accept loop
    } // end of main

    public static void line_talker(Socket s) {
        int port = s.getPort();
        InputStream istr;
        try { istr = s.getInputStream(); }
        catch (IOException ioe) {
            System.err.println("cannot get input stream");    // most likely cause
            return;
        }
        System.err.println("New connection from <" +
            s.getInetAddress().getHostAddress() + "," + s.getPort() + ">");
        byte[] buf = new byte[bufsize];
        int len;

        while (true) { // while not done reading the socket
            try {
                len = istr.read(buf, 0, bufsize);
            }
            catch (SocketTimeoutException ste) {
                System.out.println("socket timeout");
                continue;
            }
            catch (IOException ioe) {
                System.err.println("bad read");
                break;    // probably a socket ABORT; treat as a close
            }
            if (len == -1) break;    // other end closed gracefully
            String str = new String(buf, 0, len);
            System.out.print("" + port + ": " + str);    // str should contain newline
        } // while reading from s

        try {istr.close();}
        catch (IOException ioe) {System.err.println("bad stream close"); return;}
        try {s.close();}
        catch (IOException ioe) {System.err.println("bad socket close"); return;}
        System.err.println("socket to port " + port + " closed");
    } // line_talker

    static class Talker implements Runnable {
        private Socket _s;

        public Talker (Socket s) {
            _s = s;
        }

        public void run() {
            line_talker(_s);
        } // run
    } // class Talker
}

12.6.1 The TCP Client


Here is the corresponding client tcp_stalkc.java. As with the UDP version, the default host to connect to is localhost . We
first call InetAddress.getByName() to perform the DNS lookup. Part of the construction of the Socket object is the
connection to the desired dest and destport. Within the main while loop, we use an ordinary write() call to write strings
to the socket’s associated OutputStream .

// TCP simplex-talk CLIENT in java

import java.net.*;
import java.io.*;

public class stalkc {

    static public BufferedReader bin;
    static public int destport = 5431;

    static public void main(String args[]) {
        String desthost = "localhost";
        if (args.length >= 1) desthost = args[0];
        bin = new BufferedReader(new InputStreamReader(System.in));

        InetAddress dest;
        System.err.print("Looking up address of " + desthost + "...");
        try {
            dest = InetAddress.getByName(desthost);
        }
        catch (UnknownHostException uhe) {
            System.err.println("unknown host: " + desthost);
            return;
        }
        System.err.println(" got it!");

        System.err.println("connecting to port " + destport);

        Socket s;
        try {
            s = new Socket(dest, destport);
        }
        catch(IOException ioe) {
            System.err.println("cannot connect to <" + desthost + "," + destport + ">");
            return;
        }

        OutputStream sout;
        try {
            sout = s.getOutputStream();
        }
        catch (IOException ioe) {
            System.err.println("I/O failure!");
            return;
        }

        //============================================================

        while (true) {
            String buf;
            try {
                buf = bin.readLine();
            }
            catch (IOException ioe) {
                System.err.println("readLine() failed");
                return;
            }
            if (buf == null) break;    // user typed EOF character

            buf = buf + "\n";    // protocol requires sender includes \n
            byte[] bbuf = buf.getBytes();

            try {
                sout.write(bbuf);
            }
            catch (IOException ioe) {
                System.err.println("write() failed");
                return;
            }
        } // while
    }
}

A Python3 version of the stalk client is available at tcp_stalkc.py.


Here are some things to try with THREADING=false in the server:
- start up two clients while the server is running. Type some message lines into both. Then exit the first client.
- start up the client before the server.
- start up the server, and then the client. Kill the server and then type some message lines to the client. What happens to the client? (It may take a couple message lines.)
- start the server, then the client. Kill the server and restart it. Now what happens to the client?
With THREADING=true, try connecting multiple clients simultaneously to the server. How does this behave differently from the
first example above?
See also exercise 14.0.

12.6.2 netcat again


As with UDP (11.1.4 netcat), we can use the netcat utility to act as either end of the TCP simplex-talk connection. As the
client we can use

netcat localhost 5431

while as the server we can use

netcat -l -k 5431

Here (but not with UDP) the -k option causes the server to accept multiple connections in sequence. The connections are
handled one at a time, as is the case in the stalk server above with THREADING=false .
We can also use netcat to download web pages, using the HTTP protocol. The command below sends an HTTP GET request
(version 1.1; RFC 2616 and updates) to retrieve part of the website for this book; it has been broken over two lines for
convenience.

echo -e 'GET /index.html HTTP/1.1\r\nHOST: intronetworks.cs.luc.edu\r\n'|
netcat intronetworks.cs.luc.edu 80

The \r\n represents the officially mandatory carriage-return/newline line-ending sequence, though \n will often work. The
index.html identifies the file being requested; as index.html is the default it is often omitted, though the preceding /
is still required. The webserver may support other websites as well via virtual hosting (7.8.1 nslookup (and dig)); the HOST:
specification identifies to the server the specific site we are looking for. Version 2 of HTTP is described in RFC 7540; its primary
format is binary. (For production command-line retrieval of web pages, cURL and wget are standard choices.)
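The same request can be made from a short Java program, along the lines of the stalk client above. The following is a minimal sketch; the Connection: close header is an addition here (not part of the netcat example) so that the read loop terminates when the server closes the connection.

import java.net.*;
import java.io.*;

// Minimal sketch: fetch a web page over a raw TCP socket, as netcat does above.
public class webget {
    public static void main(String args[]) throws Exception {
        String host = "intronetworks.cs.luc.edu";
        Socket s = new Socket(host, 80);
        String req = "GET /index.html HTTP/1.1\r\nHost: " + host
                   + "\r\nConnection: close\r\n\r\n";    // blank line ends the request
        s.getOutputStream().write(req.getBytes());
        InputStream istr = s.getInputStream();
        byte[] buf = new byte[4096];
        int len;
        while ((len = istr.read(buf)) != -1)    // read until the server closes
            System.out.write(buf, 0, len);
        System.out.flush();
        s.close();
    }
}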

12.7 TCP state diagram


A formal definition of TCP involves the state diagram, with conditions for transferring from one state to another, and responses to
all packets from each state. The state diagram originally appeared in RFC 793; the following diagram came from
http://commons.wikimedia.org/wiki/File:Tcp_state_diagram_fixed.svg. The blue arrows indicate the sequence of state transitions
typically followed by the server; the brown arrows represent the client. Arrows are labeled with event / action; that is, we move
from LISTEN to SYN_RECD upon receipt of a SYN packet; the action is to respond with SYN+ACK.

[Figure: TCP state diagram (from the Wikimedia SVG above). The server's usual path (blue) is CLOSED → LISTEN (passive open) → SYN_RECEIVED (on SYN, replying SYN+ACK) → ESTABLISHED (on ACK); the client's path (brown) is CLOSED → SYN_SENT (CONNECT, sending SYN) → ESTABLISHED (on SYN+ACK, replying ACK). A simultaneous open takes SYN_SENT to SYN_RECEIVED. The active close runs ESTABLISHED → FIN_WAIT_1 (CLOSE, sending FIN) → FIN_WAIT_2 (on ACK-of-FIN) → TIME_WAIT (on FIN, replying ACK) → CLOSED after the timeout; a simultaneous close passes through CLOSING instead, and receiving FIN and ACK-of-FIN together takes FIN_WAIT_1 directly to TIME_WAIT. The passive close runs ESTABLISHED → CLOSE_WAIT (on FIN, replying ACK) → LAST_ACK (CLOSE, sending FIN) → CLOSED (on ACK).]

In general, this state-machine approach to protocol specification has proven very effective, and is used for most protocols. It makes
it very clear to the implementer how the system should respond to each packet arrival. It is also a useful model for the
implementation itself.
It is visually impractical to list every possible transition within the state diagram; full details are usually left to the accompanying
text. For example, although this does not appear in the state diagram above, the per-state response rules of TCP require that in the
ESTABLISHED state, if the receiver sends an ACK outside the current sliding window, then the correct response is to reply with
one’s own current ACK. This includes the case where the receiver acknowledges data not yet sent.
The ESTABLISHED state and the states below it are sometimes called the synchronized states, as in these states both sides have
confirmed one another’s ISN values.
Here is the ladder diagram for the 14-packet connection described above, this time labeled with TCP states.

[Ladder diagram of the 14-packet exchange, labeled with TCP states. A: CLOSED → SYN_SENT (sending SYN) → ESTABLISHED (on SYN+ACK, sending ACK) → FIN_WAIT_1 (sending FIN) → FIN_WAIT_2 → TIMEWAIT → CLOSED. B: LISTEN → SYN_RECD → ESTABLISHED → CLOSEWAIT (on A's FIN) → LAST_ACK (sending FIN) → CLOSED. In between, the data strings “abc”, “defg”, “foobar”, “hello” and “goodbye” are exchanged and ACKed.]

Although it essentially never occurs in practice, it is possible for each side to send the other a SYN, requesting a connection,
simultaneously (that is, the SYNs cross on the wire). The telephony analogue occurs when each party dials the other
simultaneously. On traditional land-lines, each party then gets a busy signal. On cell phones, your mileage may vary. With TCP, a
single connection is created. With OSI TP4, two connections are created. The OSI approach is not possible in TCP, as a connection
is determined only by the socketpair involved; if there is only one socketpair then there can be only one connection.
It is possible to view connection states under either Linux or Windows with netstat -a . Most states are ephemeral,
exceptions being LISTEN, ESTABLISHED, TIMEWAIT, and CLOSE_WAIT. One sometimes sees large numbers of connections
in CLOSE_WAIT, meaning that the remote endpoint has closed the connection and sent its FIN, but the process at your end has not
executed close() on its socket. Often this represents a programming error; alternatively, the process at the local end is still
working on something. Given a local port number p in state CLOSE_WAIT on a Linux system, the (privileged) command
lsof -i :p will identify the process using port p.
The reader who is implementing TCP is encouraged to consult RFC 793 and updates. For the rest of us, below are a few general
observations about closing connections.

12.7.1 Closing a connection


The “normal” TCP close sequence is as follows:

[Ladder diagram, “Normal close”: A, in ESTABLISHED, sends FIN and moves to FIN_WAIT_1; B moves to CLOSE_WAIT and ACKs A's FIN, after which A is in FIN_WAIT_2. B may then send more data (optional), which A ACKs. When done, B sends its own FIN and moves to LAST_ACK; A ACKs it, moving to TIMEWAIT and eventually CLOSED, while B moves to CLOSED.]

A’s FIN is, in effect, a promise to B not to send any more. However, A must still be prepared to receive data from B, hence the
optional data shown in the diagram. A good example of this occurs when A is sending a stream of data to B to be sorted; A sends
FIN to indicate that it is done sending, and only then does B sort the data and begin sending it back to A. This can be generated
with the command, on A, cat thefile | ssh B sort . That said, the presence of the optional B-to-A data above
following A’s FIN is relatively uncommon.
In the diagram above, A sends a FIN to B and receives an ACK, and then, later, B sends a FIN to A and receives an ACK. This
essentially amounts to two separate two-way closure handshakes.
Either side may elect to close the connection, just as either party to a telephone call may elect to hang up. The first side to send a
FIN – A in the diagram above – takes the Active CLOSE path; the other side takes the Passive CLOSE path. In the diagram,
active-closer A moves from state ESTABLISHED to FIN_WAIT_1 to FIN_WAIT_2 (upon receipt of B’s ACK of A’s FIN), and
then to TIMEWAIT and finally to CLOSED. Passive-closer B moves from ESTABLISHED to CLOSE_WAIT to LAST_ACK to
CLOSED.
A simultaneous close – having both sides send each other FINs before receiving the other side’s FIN – is a little more likely than a
simultaneous open, earlier above, though still not very. Each side would send its FIN and move to state FIN_WAIT_1. Then, upon
receiving each other’s FIN packets, each side would send its final ACK and move to CLOSING. See exercises 4.0 and 4.5.
A TCP endpoint is half-closed if it has sent its FIN (thus promising not to send any more data) and is waiting for the other side’s
FIN; this corresponds to A in the diagram above in states FIN_WAIT_1 and FIN_WAIT_2. With the BSD socket library, an
application can half-close its connection with the appropriate call to shutdown() .
Unrelatedly, a TCP endpoint is half-open if it is in the ESTABLISHED state, but during a lull in the exchange of packets the other
side has rebooted; this has nothing to do with the close protocol. As soon as the ESTABLISHED side sends a packet, the rebooted
side will respond with RST and the connection will be fully closed.
In the absence of the optional data from B to A after A sends its FIN, the closing sequence reduces to the left-hand diagram below:

[Two ladder diagrams, “Two TCP close scenarios with no B-to-A data”. Left: A's FIN, B's ACK, B's FIN, A's ACK, as four separate packets. Right: B combines its ACK and FIN into a single FIN+ACK packet, which A then ACKs, for three packets in all.]

If B is ready to close immediately, it is possible for B’s ACK and FIN to be combined, as in the right-hand diagram above, in which
case the resultant diagram superficially resembles the connection-opening three-way handshake. In this case, A moves directly
from FIN_WAIT_1 to TIMEWAIT, following the state-diagram link labeled “FIN + ACK-of-FIN”. In theory this is rare, as the
ACK of A’s FIN is generated by the kernel but B’s FIN cannot be sent until B’s process is scheduled to run on the CPU. If the TCP
layer adopts a policy of immediately sending ACKs upon receipt of any packet, this will never happen, as the ACK will be sent
well before B’s process can be scheduled to do anything. However, if B delays its ACKs slightly (and if it has no more data to
send), then it is possible – and in fact not uncommon – for B’s ACK and FIN to be sent together. Delayed ACKs are, as we shall
see below, a common strategy (12.15 TCP Delayed ACKs). To create the scenario of 12.4 TCP and WireShark, it was necessary to
introduce an artificial delay to prevent the simultaneous transmission of B’s ACK and FIN.

12.7.2 Calling close()


Most network programming interfaces provide a close() method for ending a connection, based on the close operation for
files. However, it usually closes bidirectionally and so models the TCP closure protocol rather imperfectly.
As we have seen in the previous section, the TCP close sequence is followed more naturally if the active-closing endpoint calls
shutdown() – promising not to send more, but allowing for continued receiving – before the final close() . Here is what
should happen at the application layer if endpoint A of a TCP connection wishes to initiate the closing of its connection with
endpoint B:
- A’s application calls shutdown() , thereby promising not to send any more data. A’s FIN is sent to B. A’s application is expected to continue reading, however.
- The connection is now half-closed. On receipt of A’s FIN, B’s TCP layer knows this. If B’s application attempts to read more data, it will receive an end-of-file indication (this is typically a read() or recv() operation that returns immediately with 0 bytes received).
- B’s application is now done reading data, but it may or may not have more data to send. When B’s application is done sending, it calls close() , at which point B’s FIN is sent to A. Because the connection is already half-closed, B’s close() is really a second half-close, ending further transmission by B.
- A’s application keeps reading until it too receives an end-of-file indication, corresponding to B’s FIN.
- The connection is now fully closed. No data has been lost.
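In the Java socket library, A's shutdown() corresponds to Socket.shutdownOutput(); the following sketch, under the assumption that s is a connected socket on which A has finished writing, follows the sequence above.

import java.net.*;
import java.io.*;

// Sketch of the active close at A: half-close, drain B's remaining data, then close.
public class politeclose {
    public static void politeClose(Socket s) throws IOException {
        s.shutdownOutput();                  // sends A's FIN; A can still read
        InputStream istr = s.getInputStream();
        byte[] buf = new byte[4096];
        while (istr.read(buf) != -1)
            ;                                // read (or just drain) data until B's FIN arrives
        s.close();                           // now fully closed; no RST, no lost data
    }
}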
It is sometimes the case that it is evident to A from the application protocol that B will not send more data. In such cases, A might
simply call close() instead of shutdown() . This is risky, however, unless the protocol is crystal clear: if A calls
close() and B later does send a little more data after all, or if B has already sent some data but A has not actually read it, A’s
TCP layer may send RST to B to indicate that not all B’s data was received properly. RFC 1122 puts it this way:

If such a host issues a CLOSE call while received data is still pending in TCP, or if new
data is received after CLOSE is called, its TCP SHOULD send a RST to show that data
was lost.
If A’s RST arrives at B before all of A’s sent data has been processed by B’s application, it is entirely possible that data sent by A
will be lost, that is, will never be read by B.
In the BSD socket library, A can set the SO_LINGER option, which causes A’s close() to block until A’s data has been
delivered to B (or until the SO_LINGER timeout has expired). However, SO_LINGER has no bearing on the issue above;
post-close data from B to A will still cause A to send a RST.
In the simplex-talk program at 12.6 TCP simplex-talk, the client does not call shutdown() (it implicitly calls close()
when it exits). When the client is done, the server calls s.close() . However, the fact that there is no data at all sent from the
server to the client prevents the problem discussed above.
See exercises 14.0 and 15.0.

12.8 TCP Old Duplicates


Conceptually, perhaps the most serious threat facing the integrity of TCP data is external old duplicates (11.3 Fundamental
Transport Issues), that is, very late packets from a previous instance of the connection. Suppose a TCP connection is opened
between A and B. One packet from A to B is duplicated and unduly delayed, with sequence number N. The connection is closed,
and then another instance is reopened, that is, a connection is created using the same ports. At some point in the second connection,
when an arriving packet with seq=N would be acceptable at B, the old duplicate shows up. Later, of course, B is likely to receive a
seq=N packet from the new instance of the connection, but that packet will be seen by B as a duplicate (even though the data does
not match), and (we will assume) be ignored.

[Ladder diagram: the first instance of the connection, between A's port 2000 and B's port 100, opens with SYN, SYN+ACK and ACK and closes with FIN, FIN+ACK and ACK; one data packet (shown in red) is duplicated and delayed. A second instance of the connection then opens on the same ports, and B mistakenly accepts the old duplicate from the first instance.]

For TCP, it is the actual sequence numbers, rather than the relative sequence numbers, that would have to match up. The diagram
above ignores that.
As with TFTP, coming up with a possible scenario accounting for the generation of such a late packet is not easy. Nonetheless,
many of the design details of TCP represent attempts to minimize this risk.
Solutions to the old-duplicates problem generally involve setting an upper bound on the lifetime of any packet, the MSL, as we
shall see in the next section. T/TCP (12.12 TCP Faster Opening) introduced a connection-count field for this.
TCP is also vulnerable to sequence-number wraparound: arrival of old duplicates from the same instance of the connection. However, if we take the MSL to be 60 seconds, sequence-number wrap requires sending 2^32 bytes in 60 seconds, which requires a
data-transfer rate in excess of 500 Mbps. TCP offers a fix for this (Protection Against Wrapped Segments, or PAWS), but it was
introduced relatively late; we return to this in 12.11 Anomalous TCP scenarios.

12.9 TIMEWAIT
The TIMEWAIT state is entered by whichever side initiates the connection close; in the event of a simultaneous close, both sides
enter TIMEWAIT. It is to last for a time 2×MSL, where MSL = Maximum Segment Lifetime is an agreed-upon value for the
maximum lifetime on the Internet of an IP packet. Traditionally MSL was taken to be 60 seconds, but more modern
implementations often assume 30 seconds (for a TIMEWAIT period of 60 seconds).
One function of TIMEWAIT is to solve the external-old-duplicates problem. TIMEWAIT requires that between closing and
reopening a connection, a long enough interval must pass that any packets from the first instance will disappear. After the
expiration of the TIMEWAIT interval, an old duplicate cannot arrive.
A second function of TIMEWAIT is to address the lost-final-ACK problem (11.3 Fundamental Transport Issues). If host A sends
its final ACK to host B and this is lost, then B will eventually retransmit its final packet, which will be its FIN. As long as A
remains in state TIMEWAIT, it can appropriately reply to a retransmitted FIN from B with a duplicate final ACK. As with TFTP, it
is possible (though unlikely) for the final ACK to be lost as well as all the retransmitted final FINs sent during the TIMEWAIT
period; should this happen, one side thinks the connection closed normally while the other side thinks it did not. See exercise 12.0.
TIMEWAIT only blocks reconnections for which both sides reuse the same port they used before. If A connects to B and closes the
connection, A is free to connect again to B using a different port at A’s end.
Conceptually, a host may have many old connections to the same port simultaneously in TIMEWAIT; the host must thus maintain
for each of its ports a list of all the remote ⟨IP_address,port⟩ sockets currently in TIMEWAIT for that port. If a host is connecting
as a client, this list likely will amount to a list of recently used ports; no port is likely to have been used twice within the
TIMEWAIT interval. If a host is a server, however, accepting connections on a standardized port, and happens to be the side that
initiates the active close and thus later goes into TIMEWAIT, then its TIMEWAIT list for that port can grow quite long.
Generally, busy servers prefer to be free from these bookkeeping requirements of TIMEWAIT, so many protocols are designed so
that it is the client that initiates the active close. In the original HTTP protocol, version 1.0, the server sent back the data stream
requested by the http GET message (12.6.2 netcat again), and indicated the end of this stream by closing the connection. In HTTP
1.1 this was fixed so that the client initiated the close; this required a new mechanism by which the server could indicate “I am
done sending this file”. HTTP 1.1 also used this new mechanism to allow the server to send back multiple files over one
connection.
In an environment in which many short-lived connections are made from host A to the same port on server B, port exhaustion –
having all ports tied up in TIMEWAIT – is a theoretical possibility. If A makes 1000 connections per second, then after 60 seconds
it has gone through 60,000 available ports, and there are essentially none left. While this rate is high, early Berkeley-Unix TCP
implementations often made only about 4,000 ports available to clients; with a 120-second TIMEWAIT interval, port exhaustion
would occur with only 33 connections per second.
If you use ssh to connect to a server and then issue the netstat -a command on your own host (or, more conveniently,
netstat -a |grep -i tcp ), you should see your connection in ESTABLISHED state. If you close your connection and
check again, your connection should be in TIMEWAIT.

12.10 The Three-Way Handshake Revisited


As stated earlier in 12.3 TCP Connection Establishment, both sides choose an ISN; actual sequence numbers are the sum of the
sender’s ISN and the relative sequence number. There are two original reasons for this mechanism, and one later one (12.10.1 ISNs
and spoofing). The original TCP specification, as clarified in RFC 1122, called for the ISN to be determined by a special clock,
incremented by 1 every 4 microseconds.
The most basic reason for using ISNs is to detect duplicate SYNs. Suppose A initiates a connection to B by sending a SYN packet.
B replies with SYN+ACK, but this is lost. A then times out and retransmits its SYN. B now receives A’s second SYN while in state
SYN_RECEIVED. Does this represent an entirely new request (perhaps A has suddenly restarted), or is it a duplicate? If A uses the
clock-driven ISN strategy, B can tell (almost certainly) whether A’s second SYN is new or a duplicate: only in the latter case will
the ISN values in the two SYNs match.
While there is no danger to data integrity if A sends a SYN, restarts, and sends the SYN again as part of reopening the same connection, the arrival of a second SYN with a new ISN means that the original connection cannot proceed, because that ISN is
now wrong. The receiver of the duplicate SYN should drop any connection state it has recorded so far, and restart processing the
second SYN from scratch.
The clock-driven ISN also originally added a second layer of protection against external old duplicates. Suppose that A opens a
connection to B, and chooses a clock-based ISN N1. A then transfers M bytes of data, closes the connection, and reopens it with
ISN N2. If N1 + M < N2, then the old-duplicates problem cannot occur: all of the absolute sequence numbers used in the first
instance of the connection are less than or equal to N1 + M, and all of the absolute sequence numbers used in the second instance
will be greater than N2. In fact, early Berkeley-Unix implementations of the socket library often allowed a second connection
meeting this ISN requirement to be reopened before TIMEWAIT would have expired; this potentially addressed the problem of
port exhaustion. Of course, if the first instance of the connection transferred data faster than the ISN clock rate, that is at more than
250,000 bytes/sec, then N1 + M would be greater than N2, and TIMEWAIT would have to be enforced. But in the era in which TCP
was first developed, sustained transfers exceeding 250,000 bytes/sec were not common.
The three-way handshake was extensively analyzed by Dalal and Sunshine in [DS78]. The authors noted that with a two-way
handshake, the second side receives no confirmation that its ISN was correctly received. The authors also observed that a four-way
handshake – in which the ACK of ISN_A is sent separately from ISN_B, as in the diagram below – could fail if one side restarted.
[Ladder diagram, “Possible four-way handshake”, less secure than the three-way version: line 1, A sends SYN; line 2, B ACKs it; line 3, B sends its own SYN; line 4, A ACKs B's SYN.]

For this failure to occur, assume that after sending the SYN in line 1, with ISN_A1, A restarts. The ACK in line 2 is either ignored or not received. B now sends its SYN in line 3, but A interprets this as a new connection request; it will respond after line 4 by sending a fifth SYN packet containing a different ISN_A2. For B the connection is now ESTABLISHED, and if B acknowledges this fifth packet but fails to update its record of A’s ISN, the connection will fail as A and B would have different notions of ISN_A.

12.10.1 ISNs and spoofing


The clock-based ISN proved to have a significant weakness: it often allowed an attacker to guess the ISN a remote host might use.
It did not help any that an early version of Berkeley Unix, instead of incrementing the ISN 250,000 times a second, incremented it
once a second, by 250,000 (plus something for each connection). By guessing the ISN a remote host would choose, an attacker
might be able to mimic a local, trusted host, and thus gain privileged access.
Specifically, suppose host A trusts its neighbor B, and executes with privileged status commands sent by B; this situation was
typical in the era of the rhost command. A authenticates these commands because the connection comes from B’s IP address.
The bad guy, M, wants to send packets to A so as to pretend to be B, and thus get a privileged command invoked. The connection
only needs to be started; if the ruse is discovered after the command is executed, it is too late. M can easily send a SYN packet to A
with B’s IP address in the source-IP field; M can probably temporarily disable B too, so that A’s SYN-ACK response, which is sent
to B, goes unnoticed. What is harder is for M to figure out how to ACK ISN_A. But if A generates ISNs with a slowly incrementing clock, M can guess the pattern of the clock from previous connection attempts, and can thus guess ISN_A with a considerable degree of accuracy. So M sends SYN to A with B as source, A sends SYN-ACK to B containing ISN_A, and M guesses this value and sends ACK(ISN_A+1) to A, again with B listed in the IP header as source, followed by a single-packet command.
This TCP-layer IP-spoofing technique was first described by Robert T Morris in [RTM85]; Morris went on to launch the Internet
Worm of 1988 using unrelated attacks. The IP-spoofing technique was used in the 1994 Christmas Day attack against UCSD,
launched from Loyola’s own apollo.it.luc.edu; the attack was associated with Kevin Mitnick, though apparently not actually carried
out by him. Mitnick was arrested a few months later.
RFC 1948, in May 1996, introduced a technique for introducing a degree of randomization in ISN selection, while still ensuring
that the same ISN would not be used twice in a row for the same connection. The ISN is to be the sum of the 4-µs clock, C(t), and a
secure hash of the connection information as follows:

ISN = C(t) + hash(local_addr, local_port, remote_addr, remote_port, key)

The key value is a random value chosen by the host on startup. While M, above, can poll A for its current ISN, and can
probably guess the hash function and the first four parameters above, without knowing the key it cannot determine (or easily guess)
the ISN value A would have sent to B. Legitimate connections between A and B, on the other hand, see the ISN increasing at the 4-
µs rate.
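As an illustration only (no particular TCP stack is being quoted here), the RFC 1948 computation might be sketched in Java as follows, with System.nanoTime() standing in for the 4-µs clock and with MD5, the hash suggested in RFC 1948, as the secure hash.

import java.security.MessageDigest;
import java.security.SecureRandom;

// Illustrative sketch of RFC 1948 ISN generation: ISN = C(t) + hash(connection, key).
public class isndemo {
    static byte[] key = new byte[16];                // per-host secret
    static { new SecureRandom().nextBytes(key); }    // chosen once at startup

    static int isn(String localAddr, int localPort,
                   String remoteAddr, int remotePort) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");   // hash suggested in RFC 1948
        md.update((localAddr + ":" + localPort + ":"
                 + remoteAddr + ":" + remotePort).getBytes());
        md.update(key);
        byte[] h = md.digest();
        int hash32 = (h[0] << 24) | ((h[1] & 0xff) << 16)      // fold digest to 32 bits
                   | ((h[2] & 0xff) << 8) | (h[3] & 0xff);
        int clock = (int) (System.nanoTime() / 4000);          // C(t): one tick per 4 µs
        return clock + hash32;                                 // 32-bit wraparound is fine
    }
}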
RFC 5925 addresses spoofing and related attacks by introducing an optional TCP authentication mechanism: the TCP header
includes an option containing a secure hash (22.6 Secure Hashes) of the rest of the TCP header and a shared secret key. The need
for key management limits when this mechanism can be used; the classic use case is BGP connections between routers (10.6
Border Gateway Protocol, BGP).
Another approach to the prevention of spoofing attacks is to ask sites and ISPs to refuse to forward outwards any IP packets with a
source address not from within that site or ISP. If an attacker’s ISP implements this, the attacker will be unable to launch spoofing
attacks against the outside world. A concrete proposal can be found in RFC 2827. Unfortunately, it has been (as of 2015) almost
entirely ignored.
See also the discussion of SYN flooding at 12.3 TCP Connection Establishment, although that attack does not involve ISN
manipulation.

12.11 Anomalous TCP scenarios


TCP, like any transport protocol, must address the transport issues in 11.3 Fundamental Transport Issues.
As we saw above, TCP addresses the Duplicate Connection Request (Duplicate SYN) issue by noting whether the ISN has
changed. This is handled at the kernel level by TCP, versus TFTP’s application-level (and rather desultory) approach to handling
Duplicate RRQs.
TCP addresses Loss of Final ACK through TIMEWAIT: as long as the TIMEWAIT period has not expired, if the final ACK is lost
and the other side resends its final FIN, TCP will still be able to reissue that final ACK. TIMEWAIT in this sense serves a similar
function to TFTP’s DALLY state.
External Old Duplicates, arriving as part of a previous instance of the connection, are prevented by TIMEWAIT, and may also be
prevented by the use of a clock-driven ISN.
Internal Old Duplicates, from the same instance of the connection, that is, sequence number wraparound, is only an issue for
bandwidths exceeding 500 Mbps: only at bandwidths above that can 4 GB be sent in one 60-second MSL. TCP implementations
now address this with PAWS: Protection Against Wrapped Segments (RFC 1323). PAWS adds a 32-bit “timestamp option” to the
TCP header. The granularity of the timestamp clock is left unspecified; one tick must be small enough that sequence numbers
cannot wrap in that interval (eg less than 3 seconds for 10,000 Mbps), and large enough that the timestamps cannot wrap in time
MSL. On Linux systems the timestamp clock granularity is typically 1 to 10 ms; measurements on the author’s systems have been
4 ms. With timestamps, an old duplicate due to sequence-number wraparound can now easily be detected.
The PAWS mechanism also requires ACK packets to echo back the sender’s timestamp, in addition to including their own. This
allows senders to accurately measure round-trip times.
Reboots are a potential problem as the host presumably has no record of what aborted connections need to remain in TIMEWAIT.
TCP addresses this on paper by requiring hosts to implement Quiet Time on Startup: no new connections are to be accepted for
1*MSL. No known implementations actually do this; instead, they assume that the restarting process itself will take at least one
MSL. This is no longer as certain as it once was, but serious consequences have not ensued.

12.12 TCP Faster Opening


If a client wants to connect to a server, send a request and receive an immediate reply, TCP mandates one full RTT for the three-
way handshake before data can be delivered. This makes TCP one RTT slower than UDP-based request-reply protocols. There
have been periodic calls to allow TCP clients to include data with the first SYN packet and have it be delivered immediately upon
arrival – this is known as accelerated open.
If there will be a series of requests and replies, the simplest fix is to pipeline all the requests and replies over one persistent
connection; the one-RTT delay then applies only to the first request. If the pipeline connection is idle for a long-enough interval, it
may be closed, and then reopened later if necessary.

An early accelerated-open proposal was T/TCP, or TCP for Transactions, specified in RFC 1644. T/TCP introduced a connection
count TCP option, called CC; each participant would include a 32-bit CC value in its SYN; each participant’s own CC values were
to be monotonically increasing. Accelerated open was allowed if the server side had the client’s previous CC in a cache, and the
new CC value was strictly greater than this cached value. This ensured that the new SYN was not a duplicate of an older SYN.
Unfortunately, this also bypasses the modest authentication of the client’s IP address provided by the full three-way handshake,
worsening the spoofing problem of 12.10.1 ISNs and spoofing. If malicious host M wants to pretend to be B when sending a
privileged request to A, all M has to do is send a single SYN+Data packet with an extremely large value for CC. Generally, the
accelerated open succeeded as long as the CC value presented was larger than the value A had cached for B; it did not have to be
larger by exactly 1.
The recent TCP Fast Open proposal, described in RFC 7413, involves a secure “cookie” sent by the client as a TCP option; if a
SYN+Data packet has a valid cookie, then the client has proven its identity and the data may be released immediately to the
receiving application. Cookies are cryptographically secure, and are requested ahead of time from the server.
Because cookies have an expiration date and must be requested ahead of time, TCP Fast Open is not fundamentally faster than the
connection-pipeline option, except that holding a TCP connection open uses more resources than simply storing a cookie. The
likely application for TCP Fast Open is in accessing web servers. Web clients and servers already keep a persistent connection open
for a while, but often “a while” here amounts only to several seconds; TCP Fast Open cookies could remain active for much longer.
One serious practical problem with TCP Fast Open is that some middleboxes (7.7.2 Middleboxes) remove TCP options they do not
understand, or even block the connection attempt entirely.

12.13 Path MTU Discovery


TCP connections are more efficient if they can keep large packets flowing between the endpoints. Once upon a time, TCP
endpoints included just 512 bytes of data in each packet that was not destined for local delivery, to avoid fragmentation. TCP
endpoints now typically engage in Path MTU Discovery which almost always allows them to send larger packets; backbone ISPs
are now usually able to carry 1500-byte packets. The Path MTU is the largest packet size that can be sent along a path without
fragmentation.
The IPv4 strategy is to send an initial data packet with the IPv4 DONT_FRAG bit set. If the ICMP message
Frag_Required/DONT_FRAG_Set comes back, or if the packet times out, the sender tries a smaller size. If the sender
receives a TCP ACK for the packet, on the other hand, indicating that it made it through to the other end, it might try a larger size.
Usually, the size range of 512-1500 bytes is covered by less than a dozen discrete values; the point is not to find the exact Path
MTU but to determine a reasonable approximation rapidly.
IPv6 has no DONT_FRAG bit. Path MTU Discovery over IPv6 involves the periodic sending of larger packets; if the ICMPv6
message Packet Too Big is received, a smaller packet size must be used. RFC 1981 has details.

12.14 TCP Sliding Windows


TCP implements sliding windows, in order to improve throughput. Window sizes are measured in terms of bytes rather than
packets; this leaves TCP free to packetize the data in whatever segment size it elects. In the initial three-way handshake, each side
specifies the maximum window size it is willing to accept, in the Window Size field of the TCP header. This 16-bit field can only
go to 64 kB, and a 1 Gbps × 100 ms bandwidth×delay product is 12 MB; as a result, there is a TCP Window Scale option that can
also be negotiated in the opening handshake. The scale option specifies a power of 2 that is to be multiplied by the actual Window
Size value. In the WireShark example above, the client specified a Window Size field of 5888 (= 4 × 1472) in the third packet, but
with a Window Scale value of 2^6 = 64 in the first packet, for an effective window size of 64 × 5888 bytes = 256 segments of 1472 bytes. The server side specified a window size of 5792 and a scaling factor of 2^5 = 32.
TCP may either transmit a bulk stream of data, using sliding windows fully, or it may send slowly generated interactive data; in the
latter case, TCP may never have even one full segment outstanding.
In the following chapter we will see that a sender frequently reduces the actual TCP window size, in order to avoid congestion; the
window size included in the TCP header is known as the Advertised Window Size. On startup, TCP does not send a full window
all at once; it uses a mechanism called “slow start”.

12.15 TCP Delayed ACKs
TCP receivers are allowed briefly to delay their ACK responses to new data. This offers perhaps the most benefit for interactive
applications that exchange small packets, such as ssh and telnet. If A sends a data packet to B and expects an immediate response,
delaying B’s ACK allows the receiving application on B time to wake up and generate that application-level response, which can
then be sent together with B’s ACK. Without delayed ACKs, the kernel layer on B may send its ACK before the receiving
application on B has even been scheduled to run. If response packets are small, that doubles the total traffic. The maximum ACK
delay is 500 ms, according to RFC 1122 and RFC 2581.
For bulk traffic, delayed ACKs simply mean that the ACK traffic volume is reduced. Because ACKs are cumulative, one ACK
from the receiver can in principle acknowledge multiple data packets from the sender. Unfortunately, acknowledging too many data
packets with one ACK can interfere with the self-clocking aspect of sliding windows; the arrival of that ACK will then trigger a
burst of additional data packets, which would otherwise have been transmitted at regular intervals. Because of this, the RFCs above
specify that an ACK be sent, at a minimum, for every other data packet. For a discussion of how the sender should respond to
delayed ACKs, see 13.2.1 TCP Reno Per-ACK Responses.
Bandwidth Conservation
Delayed ACKs and the Nagle algorithm both originated in a bygone era, when bandwidth was in much shorter supply than it is
today. In RFC 896, John Nagle writes (in 1984, well before TCP Reno, 13 TCP Reno and Congestion Management) “In general,
we have not been able to afford the luxury of excess long-haul bandwidth that the ARPANET possesses, and our long-haul links are
heavily loaded during peak periods. Transit times of several seconds are thus common in our network.” Today, it is unlikely that
extra small packets would cause detectable, let alone significant, problems.
The TCP ACK-delay time can usually be adjusted globally as a system parameter. Linux offers a TCP_QUICKACK option, as a
flag to setsockopt() , to disable delayed ACKs on a per-connection basis, but only until the next TCP system call. It must be
invoked immediately after every receive operation to disable delayed ACKs entirely. This option is also not very portable.
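For what it is worth, recent JDKs (10 and later) expose this Linux-only option to Java programs as jdk.net.ExtendedSocketOptions.TCP_QUICKACK; a minimal sketch:

import java.io.IOException;
import java.net.Socket;
import jdk.net.ExtendedSocketOptions;

// Sketch: disabling delayed ACKs on one connection (Linux only, JDK 10 or later).
public class quickack {
    static void enableQuickAck(Socket s) throws IOException {
        // as noted above, this lasts only until the next TCP system call,
        // so it must be re-applied after each receive to remain in effect
        s.setOption(ExtendedSocketOptions.TCP_QUICKACK, true);
    }
}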
The TSO option of 12.5 TCP Offloading, used at the receiver, can also reduce the number of ACKs sent. If every two arriving data
packets are consolidated via TSO into a single packet, then the receiver will appear to the sender to be acknowledging every other
data packet. The ACK delay introduced by TSO is, however, usually quite small.

12.16 Nagle Algorithm


Like delayed ACKs, the Nagle algorithm (RFC 896) also attempts to improve the behavior of interactive small-packet applications.
It specifies that a TCP endpoint generating small data segments should queue them until either it accumulates a full segment’s
worth or receives an ACK for the previous batch of small segments. If the full-segment threshold is not reached, this means that
only one (consolidated) segment will be sent per RTT.
As an example, suppose A wishes to send to B packets containing consecutive letters, starting with “a”. The application on A
generates these every 100 ms, but the RTT is 501 ms. At T=0, A transmits “a”. The application on A continues to generate “b”, “c”,
“d”, “e” and “f” at times 100 ms through 500 ms, but A does not send them immediately. At T=501 ms, ACK(“a”) arrives; at this
point A transmits its backlogged “bcdef”. The ACK for this arrives at T=1002, by which point A has queued “ghijk”. The end
result is that A sends a fifth as many packets as it would without the Nagle algorithm. If these letters are generated by a user typing
them with telnet, and the ACKs also include the echoed responses, then if the user pauses the echoed responses will very soon
catch up.
The Nagle algorithm does not always interact well with delayed ACKs, or with user expectations; see exercises 10.0 and 10.5. It
can usually be disabled on a per-connection basis, in the BSD socket library by calling setsockopt() with the
TCP_NODELAY flag.
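In the Java socket library the corresponding call is Socket.setTcpNoDelay(); a one-method sketch:

import java.io.IOException;
import java.net.Socket;

// Sketch: disabling the Nagle algorithm on a Java socket (the TCP_NODELAY option).
public class nonagle {
    static void disableNagle(Socket s) throws IOException {
        s.setTcpNoDelay(true);    // true enables TCP_NODELAY, ie turns Nagle off
    }
}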

12.17 TCP Flow Control


It is possible for a TCP sender to send data faster than the receiver can process it. When this happens, a TCP receiver may reduce
the advertised Window Size value of an open connection, thus informing the sender to switch to a smaller window size. This
provides support for flow control.
The window-size reduction appears in the ACKs sent back by the receiver. A given ACK is not supposed to reduce the window
size by so much that the upper end of the window gets smaller. A window might shrink from the byte range [20,000..28,000] to
[22,000..28,000] but never to [20,000..26,000].

If a TCP receiver uses this technique to shrink the advertised window size to 0, this means that the sender may not send data. The
receiver has thus informed the sender that, yes, the data was received, but that, no, more may not yet be sent. This corresponds to
the ACKWAIT suggested in 6.1.3 Flow Control. Eventually, when the receiver is ready to receive data, it will send an ACK
increasing the advertised window size again.
If the TCP sender has its window size reduced to 0, and the ACK from the receiver increasing the window is lost, then the
connection would be deadlocked. TCP has a special feature specifically to avoid this: if the window size is reduced to zero, the
sender sends dataless packets to the receiver, at regular intervals. Each of these “polling” packets elicits the receiver’s current
ACK; the end result is that the sender will receive the eventual window-enlargement announcement reliably. These “polling”
packets are regulated by the so-called persist timer.

12.18 Silly Window Syndrome


The silly-window syndrome is a term for a scenario in which TCP transfers only small amounts of data at a time. Because TCP/IP
packets have a minimum fixed header size of 40 bytes, sending small packets uses the network inefficiently. The silly-window
syndrome can occur either when the receiving application consumes data slowly or when the sending application generates data slowly.
As an example involving a slow-consuming receiver, suppose a TCP connection has a window size of 1000 bytes, but the receiving
application consumes data only 10 bytes at a time, at intervals about equal to the RTT. The following can then happen:
- The sender sends bytes 1-1000. The receiving application consumes 10 bytes, numbered 1-10. The receiving TCP buffers the remaining 990 bytes and sends an ACK reducing the window size to 10, per 12.17 TCP Flow Control.
- Upon receipt of the ACK, the sender sends 10 bytes numbered 1001-1010, the most it is permitted. In the meantime, the receiving application has consumed bytes 11-20. The window size therefore remains at 10 in the next ACK.
- The sender sends bytes 1011-1020 while the application consumes bytes 21-30. The window size remains at 10.
The sender may end up sending 10 bytes at a time indefinitely. This is of no benefit to either side; the sender might as well send
larger packets less often. The standard fix, set forth in RFC 1122, is for the receiver to use its ACKs to keep the window at 0 until
it has consumed one full packet’s worth (or half the window, for small window sizes). At that point the sender is invited – by an
appropriate window-size advertisement in the next ACK – to send another full packet of data.
The silly-window syndrome can also occur if the sender is generating data slowly, say 10 bytes at a time. The Nagle algorithm,
above, can be used to prevent this, though for interactive applications sending small amounts of data in separate but closely spaced
packets may actually be useful.

12.19 TCP Timeout and Retransmission


When TCP sends a packet containing user data (this excludes ACK-only packets), it sets a timeout. If that timeout expires before
the packet data is acknowledged, it is retransmitted. Acknowledgments are sent for every arriving data packet (unless Delayed
ACKs are implemented, 12.15 TCP Delayed ACKs); this amounts to receiver-side retransmit-on-duplicate of 6.1.1 Packet Loss.
Because ACKs are cumulative, and so a later ACK can replace an earlier one, lost ACKs are seldom a problem.
For TCP to work well for both intra-server-room and trans-global connections, with RTTs ranging from well under 1 ms to close to
1 second, the length of the timeout interval must adapt. TCP manages this by maintaining a running estimate of the RTT, EstRTT.
In the original version, TCP then set TimeOut = 2×EstRTT (in the literature, the TCP TimeOut value is often known as RTO, for
Retransmission TimeOut). EstRTT itself was a running average of periodically measured SampleRTT values, according to

EstRTT = α×EstRTT + (1-α)×SampleRTT

for a fixed α, 0<α<1. Typical values of α might be α=1/2 or α=7/8. For α close to 1 this is “conservative” in that EstRTT is slow to change. For α closer to 0, EstRTT is more volatile.
There is a potential RTT measurement ambiguity: if a packet is sent twice, the ACK received could be in response to the first
transmission or the second. The Karn/Partridge algorithm resolves this: on packet loss (and retransmission), the sender
- Doubles Timeout
- Stops recording SampleRTT
- Uses the doubled Timeout as EstRTT when things resume

Setting TimeOut = 2×EstRTT proved too short during congestion periods and too long other times. Jacobson and Karels ([JK88])
introduced a way of calculating the TimeOut value based on the statistical variability of EstRTT. After each SampleRTT value was
collected, the sender would also update EstDeviation according to

SampleDev = | SampleRTT − EstRTT |


EstDeviation = β×EstDeviation + (1-β)×SampleDev

for a fixed β, 0<β<1. Timeout was then set to EstRTT + 4×EstDeviation. EstDeviation is an estimate of the so-called mean
deviation; 4 mean deviations corresponds (for normally distributed data) to about 5 standard deviations. If the SampleRTT values
were normally distributed (which they are not), this would mean that the chance that a non-lost packet would arrive outside the
TimeOut period is vanishingly small.
For further details, see [JK88] and [AP99].
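The updates above translate directly into code. Here is a minimal sketch, with α and β both taken to be 7/8; these values, and the use of doubles measured in seconds, are assumptions for illustration.

// Sketch of the Jacobson/Karels RTT estimation; alpha = beta = 7/8 here.
public class rttestimate {
    double estRTT = 0.0;           // seconds
    double estDeviation = 0.0;

    void update(double sampleRTT) {
        if (estRTT == 0.0) estRTT = sampleRTT;    // initialize on the first sample
        double sampleDev = Math.abs(sampleRTT - estRTT);
        estRTT = 0.875 * estRTT + 0.125 * sampleRTT;
        estDeviation = 0.875 * estDeviation + 0.125 * sampleDev;
    }

    double timeout() {
        return estRTT + 4 * estDeviation;    // TimeOut = EstRTT + 4×EstDeviation
    }
}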
Keeping track of when packets time out is usually handled by putting a record for each packet sent into a timer list. Each record
contains the packet’s timeout time, and the list is kept sorted by these times. Periodically, eg every 100 ms, the list is inspected and
all packets with expired timeout are then retransmitted. When an ACK arrives, the corresponding packet timeout record is removed
from the list. Note that this approach means that a packet’s timeout processing may be slightly late.
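Such a timer list might be sketched as follows, using a priority queue ordered by timeout time; the record fields and the retransmit() hook are hypothetical. Removal of a record when its ACK arrives (not shown) requires locating it in the queue; an implementation might instead simply mark acknowledged records and discard them lazily when they reach the front.

import java.util.PriorityQueue;

// Sketch of a retransmission timer list: one record per outstanding packet,
// kept sorted by timeout time.
public class timerlist {
    static class TimerRecord {
        long expiry;    // timeout time, in milliseconds
        int seqnum;     // identifies the packet to retransmit
        TimerRecord(long expiry, int seqnum) { this.expiry = expiry; this.seqnum = seqnum; }
    }

    PriorityQueue<TimerRecord> records =
        new PriorityQueue<>((a, b) -> Long.compare(a.expiry, b.expiry));

    void packetSent(int seqnum, long now, long timeoutInterval) {
        records.add(new TimerRecord(now + timeoutInterval, seqnum));
    }

    void checkExpired(long now) {       // called periodically, eg every 100 ms
        while (!records.isEmpty() && records.peek().expiry <= now) {
            TimerRecord r = records.poll();
            retransmit(r.seqnum);       // hypothetical retransmission hook
        }
    }

    void retransmit(int seqnum) { /* resend the packet numbered seqnum */ }
}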

12.20 KeepAlive
There is no reason that a TCP connection should not be idle for a long period of time; ssh/telnet connections, for example, might go
unused for days. However, there is the turned-off-at-night problem: a workstation might telnet into a server, and then be shut off
(not shut down gracefully) at the end of the day. The connection would now be half-open, but the server would not generate any
traffic and so might never detect this; the connection itself would continue to tie up resources.
KeepAlive in action
One evening long ago, when dialed up (yes, that long ago) into the Internet, my phone line disconnected while I was typing an
email message in an ssh window. I dutifully reconnected, expecting to find my message in the file “dead.letter”, which is what
would have happened had I been disconnected while using the even-older tty dialup. Alas, nothing was there. I reconstructed my
email as best I could and logged off.
The next morning, there was my lost email in a file “dead.letter”, dated two hours after the initial crash! What had happened,
apparently, was that the original ssh connection on the server side just hung there, half-open. Then, after two hours, KeepAlive
kicked in, and aborted the connection. At that point ssh sent my mail program the HangUp signal, and the mail program wrote out
what it had in “dead.letter”.
To avoid this, TCP supports an optional KeepAlive mechanism: each side “polls” the other with a dataless packet. The original
RFC 1122 KeepAlive timeout was 2 hours, but this could be reduced to 15 minutes. If a connection failed the KeepAlive test, it
would be closed.
Supposedly, some TCP implementations are not exactly RFC 1122-compliant: either KeepAlives are enabled by default, or the
KeepAlive interval is much smaller than called for in the specification.
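
On Linux, an application can both enable KeepAlive and shrink the two-hour default through socket options, as in the Python sketch below; the TCP_KEEP* constants are Linux-specific, and the 900-second idle time corresponds to the 15-minute minimum mentioned above.

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)      # enable KeepAlive
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 900)   # idle seconds before first probe
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60)   # seconds between probes
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)      # failed probes before close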

12.21 TCP timers


To summarize, TCP maintains the following four kinds of timers. All of them can be maintained by a single timer list, above.
TimeOut: a per-segment timer; TimeOut values vary widely
2×MSL TIMEWAIT: a per-connection timer
Persist: the timer used to poll the receiving end when winsize = 0
KeepAlive, above

12.22 Variants and Alternatives


One alternative to TCP is UDP with programmer-implemented timeout and retransmission; many RPC implementations (11.5
Remote Procedure Call (RPC)) do exactly this, with reasonable results. Within a LAN a static timeout of around half a second
usually works quite well (unless the LAN has some tunneled links), and implementation of a simple timeout-retransmission

mechanism is quite straightforward. Implementing adaptive timeouts as in 12.19 TCP Timeout and Retransmission can, however,
be a bit trickier. QUIC (11.1.1 QUIC) is an example of this strategy.
We here consider four other protocols. The first, MPTCP, is based on TCP itself. The second, SCTP, is a message-oriented
alternative to TCP that is an entirely separate protocol. The last two, DCCP and QUIC, are attempts to create a TCP-like transport
layer on top of UDP.

12.22.1 MPTCP
Multipath TCP, or MPTCP, allows connections to use multiple network interfaces on a host, either sequentially or simultaneously.
MPTCP architectural principles are outlined in RFC 6182; implementation details are in RFC 6824.
To carry the actual traffic, MPTCP arranges for the creation of multiple standard-TCP subflows between the sending and receiving
hosts; these subflows typically connect between different pairs of IP addresses on the respective hosts.
For example, a connection to a server can start using the client’s wired Ethernet interface, and continue via Wi-Fi after the user has
unplugged. If the client then moves out of Wi-Fi range, the connection might continue via a mobile network. Alternatively, MPTCP
allows the parallel use of multiple Ethernet interfaces on both client and server for higher throughput.
MPTCP officially forbids the creation of multiple TCP connections between a single pair of interfaces in order to simulate
Highspeed TCP (15.5 Highspeed TCP); RFC 6356 spells out an MPTCP congestion-control algorithm to enforce this.
Suppose host A, with two interfaces with IP addresses A1 and A2, wishes to connect to host B with IP addresses B1 and B2.
Connection establishment proceeds via the ordinary TCP three-way handshake, between one of A’s IP addresses, say A1, and one
of B’s, B1. The SYN packets must each carry the MP_CAPABLE TCP option, to signal one another that MPTCP is supported. As
part of the MP_CAPABLE option, A and B also exchange pseudorandom 64-bit connection keys, sent unencrypted; these will be
used to sign later messages as in 22.6.1 Secure Hashes and Authentication. This first connection is the initial subflow.
Once the MPTCP initial subflow has been established, additional subflow connections can be made. Usually these will be initiated
from the client side, here A, though the B side can also do this. At this point, however, A does not know of B’s address B2, so the
only possible second subflow will be from A2 to B1. New subflows will carry the MP_JOIN option with their initial SYN
packets, along with digital signatures signed by the original connection keys verifying that the new subflow is indeed part of this
MPTCP connection.
At this point A and B can send data to one another using both connections simultaneously. To keep track of data, each side
maintains a 64-bit data sequence number, DSN, for the data it sends; each side also maintains a mapping between the DSN and the
subflow sequence numbers. For example, A might send 1000-byte blocks of data alternating between the A1 and A2 connections;
the blocks might have DSN values 10000, 11000, 12000, 13000, …. The A1 subflow would then carry blocks 10000, 12000, etc,
numbering these consecutively (perhaps 20000, 21000, …) with its own sequence numbers. The sides exchange DSN mapping
information with a DSS TCP option. This mechanism means that all data transmitted over the MPTCP connection can be
delivered in the proper order, and that if one subflow fails, its data can be retransmitted on another subflow.
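
The sender-side bookkeeping in this example might look like the following sketch (the names and the round-robin scheduling here are ours; a real MPTCP implementation does this in the kernel, via the DSS option):

# Two subflows, A1 and A2, with arbitrary per-subflow initial sequence numbers
subflow_seq = {'A1': 20000, 'A2': 50000}
dsn_map = {}                        # DSN -> (subflow, subflow sequence number)
next_dsn = 10000
subflows = ['A1', 'A2']

def send_block(data):
    global next_dsn
    sf = subflows.pop(0)            # alternate between the two subflows
    subflows.append(sf)
    dsn_map[next_dsn] = (sf, subflow_seq[sf])
    # ... hand data to subflow sf's TCP connection here ...
    subflow_seq[sf] += len(data)
    next_dsn += len(data)

If one subflow fails, the blocks that dsn_map assigns to it can be retransmitted on the other.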
B can inform A of its second IP address, B2, using the ADD_ADDR option. Of course, it is possible that B2 is not directly
reachable by A; for example, it might be behind a NAT router. But if B2 is reachable, A can now open two more subflows A1──B2
and A2──B2.
All the above works equally well if either or both of A’s addresses is behind a NAT router, simply because the NAT router is able to
properly forward the subflow TCP connections. Addresses sent from one host to another, such as B’s transmission of its address B2,
may be rendered invalid by NAT, but in this case A’s attempt to open a connection to B2 simply fails.
Generally, hosts can be configured to use multiple subflows in parallel, or to use one interface only as a backup, when the primary
interface is unplugged or out of range. APIs have been proposed that allow applications control over MPTCP behavior on a
per-connection basis.

12.22.2 SCTP
The Stream Control Transmission Protocol, SCTP, is an entirely separate protocol from TCP, running directly above IP. It is, in
effect, a message-oriented alternative to TCP: an application writes a sequence of messages and SCTP delivers each one as a unit,
fragmenting and reassembling it as necessary. Like TCP, SCTP is connection-oriented and reliable. SCTP uses a form of sliding
windows, and, like TCP, adjusts the window size to manage congestion.

An SCTP connection can support multiple message streams; the exact number is negotiated at startup. A retransmission delay in
one stream never blocks delivery in other streams. Within each stream, SCTP messages are sequentially numbered, and are
normally delivered in order of message number. A receiver can request, however, to receive messages immediately upon successful
delivery, that is, potentially out of order. Either way, the data within each message is guaranteed to be delivered in order and
without loss.
Internally, message data is divided into SCTP chunks for inclusion in packets. One SCTP packet can contain data chunks from
different messages and different streams; packets can also contain control chunks.
Messages themselves can be quite large; there is no set limit. Very large messages may need to be received in multiple system calls
(eg calls to recvmsg() ).
SCTP supports an MPTCP-like feature by which each endpoint can use multiple network interfaces.
SCTP connections are set up using a four-way handshake, versus TCP’s three-way handshake. The extra packet provides some
protection against so-called SYN flooding (12.3 TCP Connection Establishment). The central idea is that if client A initiates a
connection request with server B, then B allocates no resources to the connection until after B has received a response to its own
message to A. This means that, at a minimum, A is a real host with a real IP address.
The full four-way handshake between client A and server B is, in brief, as follows:
A sends B an INIT chunk (corresponding to SYN), along with a pseudorandom TagA.
B sends A an INIT ACK , with TagB and a state cookie. The state cookie contains all the information B needs to allocate
resources to the connection, and is digitally signed (22.6.1 Secure Hashes and Authentication) with a key known only to B.
Crucially, B does not at this point allocate any resources to the incipient connection.
A returns the state cookie to B in a COOKIE ECHO packet.
B enters the ESTABLISHED state and sends a COOKIE ACK to A. Upon receipt, A enters the ESTABLISHED state.
When B receives the COOKIE ECHO , it verifies the signature. At this point B knows that it sent the cookie to A and received a
response, so A must exist. Only then does B allocate memory resources to the connection. Spoofed INITs in the first step cost B
essentially nothing.
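
The state cookie can be as simple as the connection parameters together with an HMAC signature over them, computed with a key known only to B; here is a sketch (the cookie layout is our own invention, not SCTP’s):

import hmac, hashlib, json

SERVER_KEY = b'key known only to B'

def make_cookie(client_addr, tag_a, tag_b):
    state = json.dumps({'addr': client_addr, 'tagA': tag_a, 'tagB': tag_b})
    sig = hmac.new(SERVER_KEY, state.encode(), hashlib.sha256).hexdigest()
    return state + '|' + sig        # B sends this to A and keeps no state itself

def check_cookie(cookie):
    state, sig = cookie.rsplit('|', 1)
    good = hmac.new(SERVER_KEY, state.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, good)   # True means B really issued this cookie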
The TagA and TagB in the first two packets are called verification tags. From this point on, B will include TagA in every packet it
sends to A, and vice-versa. Although these tags are sent unencrypted, they nonetheless make it much harder for an attacker to inject
data into the connection.
Data can be included in the third and fourth packets above; ie A can begin sending data after one RTT.
Unfortunately for potential SCTP applications, few if any NAT routers recognize SCTP; this limits the use of SCTP to Internet
paths along which NAT is not used. In principle SCTP could simplify delivery of web pages, transmitting one page component per
message, but lack of NAT support makes this infeasible. SCTP is also blocked by some middleboxes (7.7.2 Middleboxes) on the
grounds that it is an unknown protocol, and therefore suspect. While this is not quite as common as the NAT problem, it is common
enough to prevent by itself the widespread adoption of SCTP in the general Internet. SCTP is widely used for telecommunications
signaling, both within and between providers, where NAT and recalcitrant middleboxes can be banished.

12.22.3 DCCP
As we saw in 11.1.2 DCCP, DCCP is a UDP-based transport protocol that supports, among other things, connection establishment.
While it is used much less often than TCP, it provides an alternative example of how transport can be done.
DCCP defines a set of distinct packet types, rather than TCP’s independent packet flags; this disallows unforeseen combinations
such as TCP SYN+RST. Connection establishment involves Request and Respond; data transmission involves Data, ACK and
DataACK, and teardown involves CloseReq, Close and Reset. While one cannot have, for example, a Respond+ACK, Respond
packets do carry an acknowledgment field.
Like TCP, DCCP uses a three-way handshake to open a connection; here is a diagram:

A                                      B
CLOSED                                 LISTEN
REQUEST     Request ───────>
            <─────── Response          RESPOND
PARTOPEN    ACK ───────>
            DATA or ACK ───────>       OPEN
OPEN

DCCP “three”-way handshake
(The fourth packet, DATA or ACK, is not considered part of the handshake itself.)

The OPEN state corresponds to TCP’s ESTABLISHED state. Like TCP, each side chooses an ISN (not shown in the diagram).
Because packet delivery is not reliable, and because ACKs are not cumulative, the client remains in PARTOPEN state until it has
confirmed that the server has received its ACK of the server’s Response. While in state PARTOPEN, the client can send ACK and
DataACK but not ACK-less Data packets.
Packets are numbered sequentially. The numbering includes all packets, not just Data packets, and is by packet rather than by byte.
The DCCP state diagram is shown below. It is simpler than the TCP state diagram because DCCP does not support simultaneous
opens.
In outline (transitions labeled received-event / response):

Server side:
CLOSED ──passive open──> LISTEN ──Request / Response──> RESPOND ──ACK / any──> OPEN
OPEN ──server close / CloseReq──> CLOSEREQ ──Close / Reset──> CLOSED

Client side:
CLOSED ──open / Request──> REQUEST ──Response / ACK──> PARTOPEN ──ACK /──> OPEN
OPEN ──client close / Close──> CLOSING ──Reset /──> TIMEWAIT ──2×MSL timer──> CLOSED

DCCP State Diagram

To close a connection, one side sends Close and the other responds with Reset. Reset is used for normal close as well as for
exceptional conditions. Because whoever sends the Close is then stuck with TIMEWAIT, the server side may send CloseReq to ask
the client to send Close.
There are also two special packet formats, Sync and SyncAck, for resynchronizing sequence numbers after a burst of lost packets.
The other major TCP-like feature supported by DCCP is congestion control; see 14.6.3 DCCP Congestion Control.

12.22.4 QUIC Revisited


Like DCCP, QUIC is also a UDP-based transport protocol, aimed rather squarely at HTTP plus TLS (22.10.2 TLS). The
fundamental goal of QUIC is to provide TLS encryption protection with as little overhead as possible, in a manner that competes
fairly with TCP in the presence of congestion. Opening a QUIC connection, encryption included, takes a single RTT. QUIC can
also be seen, however, as a complete rewrite of TCP from the ground up; a reading of specific features sheds quite a bit of light on

how the corresponding TCP features have fared over the past thirty-odd years. QUIC is currently (2018) documented in a set of
Internet Drafts:
Transport basics: draft-ietf-quic-transport
Loss and congestion management: draft-ietf-quic-recovery
TLS encryption over QUIC: draft-ietf-quic-tls
HTTP over QUIC: draft-ietf-quic-http
The last Internet Draft above was retitled in November 2018 as Hypertext Transfer Protocol Version 3 (HTTP/3); that is, it
proposes that QUIC should become HTTP version 3. If and when this happens, it may represent the beginning of the end for TCP,
given that most Internet connections are for HTTP or HTTPS. Still, many TCP design issues (13 TCP Reno and Congestion
Management, 14 Dynamics of TCP, 15 Newer TCP Implementations) carry over very naturally to QUIC; a shift from TCP to
QUIC should best be viewed as evolutionary.
The design of QUIC was influenced by the fate of SCTP above; the latter, as a new protocol above IP, was often blocked by overly
security-conscious middleboxes (7.7.2 Middleboxes).
The fact that the QUIC layer resides within an application (or within a library) rather than within the kernel has meant that QUIC is
able to evolve much faster than TCP. The long-term consequences of having the transport layer live outside the kernel are not yet
completely clear, however; it may, for example, make it easier for users to install unfair congestion-management schemes.

12.22.4.1 Headers
We will start with the QUIC header. While there are some alternative forms, the basic header is diagrammed below, with a 1-byte
Type field, an 8-byte Connection ID , and 4-byte Version and Packet Number fields.

Type (1 byte) │ Connection ID (8 bytes) │ Version (4 bytes) │ Packet Number (4 bytes) │ Payload…

Typical QUIC Long Header

Perhaps the most striking thing about this header is that 4-byte alignment – used consistently in the IPv4, IPv6, UDP and TCP
headers – has been completely abandoned. On most contemporary processors, the performance advantages of alignment are
negligible; see the last paragraph at 7.1 The IPv4 Header.
IP packets are identified as such by the Ethernet type field, and TCP and UDP packets are identified as such by the IPv4-header
Protocol field. But QUIC packets are not identified as such by any flag in the preceding IP or UDP headers; there is in fact no place
in those headers for a QUIC marker to go. QUIC appears to an observer as just another form of UDP traffic. This acts as a form of
middlebox defense; QUIC packets cannot be identified as such in isolation. WireShark, sidebar below, identifies QUIC packets by
looking at the whole history of the connection, and even then must make some (educated) guesses. Middleboxes could do that too,
but it would take work.
The initial Connection ID consists of 64 random bits chosen by the client. The server, upon accepting the connection, may
change the Connection ID ; at that point the Connection ID is fixed for the lifetime of the connection. The
Connection ID may be omitted for packets whose connection can be determined from the associated IP address and port
values; this is signaled by the Type field. The Connection ID can also be used to migrate a connection to a different IP
address and port, as might happen if a mobile device moves out of range of Wi-Fi and the mobile-data plan continues the
communication. This may also happen if a connection passes through a NAT router. The NAT forwarding entry may time out (see
the comment on UDP and inactivity at 7.7 Network Address Translation), and the connection may be assigned a different outbound
UDP port if it later resumes. QUIC uses the Connection ID to recognize that the reassigned connection is still the same one
as before.
The Version field gets dropped as soon as the version is negotiated. As part of the version negotiation, a packet might have
multiple version fields. Such packets put a random value into the low-order seven bits of the Type field, as a prevention against

middleboxes’ blocking unknown types. This way, aggressive middlebox behavior should be discovered early, before it becomes
widespread.
QUIC-watching
QUIC packets can be observed in WireShark by using the filter string “quic”. To generate QUIC traffic, use a Chromium-based
browser and go to a Google-operated site, say, google.com. Often the only non-encrypted fields are the Type field and the packet
number. It may be necessary to enable QUIC in the browser, done in Chrome via chrome://flags; see also chrome://net-internals.
The packet number can be reduced to one or two bytes once the connection is established; this is signaled by the Type field.
Internally, QUIC uses packet numbers in the range 0 to 2⁶²; these internal numbers are not allowed to wrap around. The low-order
32 bits (or 16 bits or 8 bits) of the internal number are what is transmitted in the packet header. A packet receiver infers the high-
order bits from the most recent acknowledgment.
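
This inference can be expressed as a short function, along the lines of the pseudocode in the QUIC transport draft: choose the full packet number agreeing with the truncated value in its low-order bits that lies closest to one more than the largest packet number acknowledged. The sketch below omits the boundary checks near 0 and 2⁶².

def infer_packet_number(truncated, nbits, largest_acked):
    expected = largest_acked + 1          # the packet number we most expect next
    window = 1 << nbits                   # range representable in nbits bits
    candidate = (expected & ~(window - 1)) | truncated
    if candidate <= expected - window // 2:
        candidate += window               # truncated value wrapped past expected
    elif candidate > expected + window // 2:
        candidate -= window
    return candidate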
The initial packet number is to be chosen randomly in the range 0 to 2³²−1025. This corresponds to TCP’s use of Initial Sequence
Numbers.
Use of 16-bit or 8-bit transmitted packet numbers is restricted to cases where there can be no ambiguity. At a minimum, this means
that the number of outstanding packets (the QUIC winsize) cannot exceed 2⁷−1 for 8-bit packet numbering or 2¹⁵−1 for 16-bit
packet numbering. These maximum winsizes represent the ideal case where there is no packet reordering; smaller values are likely
to be used in practice. (See 6.5 Exercises, exercise 9.0.)

12.22.4.2 Frames and streams


Data in a QUIC packet is partitioned into one or more frames. Each frame’s data is prefixed by a simple frame header indicating its
length and type. Some frames contain management information; frames containing higher-layer data are called STREAM frames.
Each frame must be fully contained in one packet.
The application’s data can be divided into multiple streams, depending on the application requirements. This is particularly useful
with HTTP, as a client may request a large number of different resources (html, images, javascript, etc) simultaneously. Stream data
is contained in STREAM frames. Streams are numbered, with Stream 0 reserved for the TLS cryptographic handshake. The
HTTP/2 protocol has introduced its own notion of streams; these map neatly onto QUIC streams.
The two low-order bits of each stream number indicate whether the stream was initiated by the client or by the server, and whether
it is bi- or uni-directional. This design decision means that either side can create a stream and send data on it immediately, without
negotiation; this is important for reducing unnecessary RTTs.
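
Decoding the two low-order bits is straightforward; a sketch per the transport draft (details here could shift between draft versions):

def stream_kind(stream_id):
    initiator = 'server' if stream_id & 0x1 else 'client'
    direction = 'unidirectional' if stream_id & 0x2 else 'bidirectional'
    return initiator, direction

# stream_kind(0) -> ('client', 'bidirectional'); stream 0 carries the TLS handshake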
Each individual stream is guaranteed in-order delivery, but there are no ordering guarantees between different streams. Within a
packet, the data for a particular stream is contained in a frame for that stream.
One packet can contain stream frames for multiple streams. However, if a packet is lost, streams that have frames contained in that
packet are blocked until retransmission. Other streams can continue without interruption. This creates an incentive for keeping
separate streams in separate packets.
Stream frames contain the byte offset of the frame’s block of stream data (starting from 0), to enable in-order stream reassembly.
TCP, as we have seen, uses this byte-numbering approach exclusively, though starting with the Initial Sequence Number rather than
zero. QUIC’s stream-level numbering by byte is unrelated to its top-level numbering by packet.
In addition to stream frames, there are a large number of management frames. Here are a few of them:
RST_STREAM : like TCP RST, but for one stream only.
MAX_DATA : this corresponds to the TCP advertised window size. As with TCP, it can be reduced to zero to pause the flow of
data and thereby implement flow control. There is also a similar MAX_STREAM_DATA , applying per stream.
PING and PONG : to verify that the other endpoint is still responding. These serve as the equivalent of TCP KEEPALIVEs,
among other things.
CONNECTION_CLOSE and APPLICATION_CLOSE : these initiate termination of the connection; they differ only in that
a CONNECTION_CLOSE might be accompanied by a QUIC-layer error or explanation message while an
APPLICATION_CLOSE might be accompanied by, say, an HTTP error/explanation message.
PAD : to pad out the packet to a larger size.
ACK : for acknowledgments, below.

12.22.4.3 Acknowledgments
QUIC assigns a new, sequential packet number (the Packet ID ) to every packet, including retransmissions. TCP, by
comparison, assigns sequence numbers to each byte. (QUIC stream frames do number data by byte, as noted above.)
Lost QUIC packets are retransmitted, but with a new packet number. This makes it impossible for a receiver to send cumulative
acknowledgments, as lost packets will never be acknowledged. The receiver handles this as below. At the sender side, the sender
maintains a list of packets it has sent that are both unacknowledged and also not known to be lost. These represent the packets in
flight. When a packet is retransmitted, its old packet number is removed from this list, as lost, and the new packet number replaces
it.
To the extent possible given this retransmission-renumbering policy, QUIC follows the spirit of sliding windows. It maintains a
state variable bytes_in_flight , corresponding to TCP’s winsize, listing the total size of all the packets in flight. As with
TCP, new acknowledgments allow new transmissions.
Acknowledgments themselves are sent in special acknowledgment frames. These begin with the number of the highest packet
received. This is followed by a list of pairs, as long as will fit into the frame, consisting of the length of the next block of
contiguous packets received followed by the length of the intervening gap of packets not received. The TCP Selective ACK (13.6
Selective Acknowledgments (SACK)) option is similar, but is limited to three blocks of received packets. It is quite possible that
some of the gaps in a QUIC ACK frame refer to lost packets that were long since retransmitted with new packet numbers, but this
does not matter.
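
One way to produce such a (largest, block length, gap length, …) description from a set of received packet numbers is sketched below; a real implementation would also have to respect the frame-size limit.

def ack_ranges(received):
    """Encode received packet numbers as (largest, [block, gap, block, ...])"""
    nums = sorted(received, reverse=True)
    largest = nums[0]
    lengths = []
    block = 1
    for prev, cur in zip(nums, nums[1:]):
        if cur == prev - 1:
            block += 1                      # still in a contiguous block
        else:
            lengths.append(block)           # length of block received
            lengths.append(prev - cur - 1)  # length of gap not received
            block = 1
    lengths.append(block)
    return largest, lengths

# ack_ranges({10,9,8,5,4,1}) returns (10, [3, 2, 2, 2, 1])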
The sender is allowed to skip packet numbers occasionally, to prevent the receiver from trying to increase throughput by
acknowledging packets not yet received. Unlike with TCP, acknowledging an unsent packet is considered to be a fatal error, and the
connection is terminated.
As with TCP, there is a delayed-ACK timer, but, while TCP’s is typically 250 ms, QUIC’s is 25 ms. QUIC also includes in each
ACK frame the receiver’s best estimate of the elapsed time between arrival of the most recent packet and the sending of the ACK it
triggered; this allows the sender to better estimate the RTT. The primary advantage of the design decision not to reuse packet IDs is
that there is never any ambiguity as to a retransmitted packet’s RTT, as there is in TCP (12.19 TCP Timeout and Retransmission).
Note, however, that because QUIC runs in a user process and not the kernel, it may not be able to respond immediately to an
arriving packet, and so the time-delay estimate may be slightly short.
ACK frames are not themselves acknowledged. This means that, in a one-way data flow, the receiver may have no idea if its ACKs
are getting through (a TCP receiver may be in the same situation). The QUIC receiver may send a PING frame to the sender, which
will respond not only with a matching PONG frame but also an ACK frame acknowledging the receiver’s recent acknowledgment
packets.
QUIC adjusts its bytes_in_flight value to manage congestion, much as TCP manages its winsize (or more properly its
cwnd , 13 TCP Reno and Congestion Management) for the same purpose. Specifically, QUIC attempts to mimic the congestion
response of TCP Cubic, 15.15 TCP CUBIC, and so should in theory compete fairly with TCP Cubic connections. However, it is
straightforward to arrange for QUIC to model the behavior of any other flavor of TCP (15 Newer TCP Implementations).

12.22.4.4 Connection handshake and TLS encryption


The opening of a QUIC connection makes use of the TLS handshake, 22.10.2 TLS, specifically TLS v1.3, 22.10.2.4.3 TLS version
1.3. A client wishing to connect sends a QUIC Initial packet, containing the TLS ClientHello message. The server
responds (with a ServerHello ) in a QUIC Handshake packet. (There is also a Retry packet, for special situations.)
The TLS negotiation is contained in QUIC’s Stream 0. While the TLS and QUIC handshake rules are rather precise, there is as yet
no formal state-diagram description of connection opening.
The Initial packet also contains a set of QUIC transport parameters declared unilaterally by the client; the server makes a
similar declaration in its response. These parameters include, among other things, the maximum packet size, the connection’s idle
timeout, and initial value for MAX_DATA , above.
An important feature of TLS v1.3 is that, if the client has connected to the server previously and still has the key negotiated in that
earlier session, it can use that old key to send an encrypted application-layer request (in a STREAM frame) immediately following
the Initial packet. This is called 0-RTT protection (or encryption). The advantage of this is that the client may receive an
answer from the server within a single RTT, versus four RTTs for traditional TCP (one for the TCP three-way handshake, two for

TLS negotiation, and one for the application request/reply). As discussed at 22.10.2.4.4 TLS v1.3 0-RTT mode, requests submitted
with 0-RTT protection must be idempotent, to prevent replay attacks.
Once the server’s first Handshake packet makes it back to the client, the client is in possession of the key negotiated by the
new session, and will encrypt everything using that going forward. This is known as the 1-RTT key, and all further data is said to
be 1-RTT protected. The negotiated key is initially calculated by the TLS layer, which then exports it to QUIC. The QUIC layer
then encrypts the entire data portion of its packets, using the format of RFC 5116.
The QUIC header is not encrypted, but is still covered by an authentication checksum, making it impossible for middleboxes to
rewrite anything. Such rewriting has been observed for TCP, and has sometimes complicated TCP evolution.
The type field of a QUIC packet contains a special code to mark 0-RTT data, ensuring that the receiver will know what level of
protection is in effect.
When a QUIC server receives the ClientHello and sends off its ServerHello , it has not yet received any evidence that
the client “owns” the IP address it claims to have; that is, that the client is not spoofing its IP address (12.10.1 ISNs and spoofing).
Because of the idempotency restriction on responses to 0-RTT data, the server cannot give away privileges if spoofed in this way
by a client. The server may, however, be an unwitting participant in a traffic-amplification attack, if the real client can trigger the
sending by the server to a spoofed client of a larger response than the real client sends directly. The solution here is to require that
the QUIC Initial packet, containing the ClientHello , be at least 1200 bytes. The server’s Handshake response is
likely to be smaller, and so represents no amplification of traffic.
To close the connection, one side sends a CONNECTION_CLOSE or APPLICATION_CLOSE . It may continue to send these
in response to packets from the other side. When the other side receives the CLOSE packet, it should send its own, and then enter
the so-called draining state. When the initiator of the close receives the other side’s echoed CLOSE , it too will enter the draining
state. Once in this state, an endpoint may not send any packets. The draining state corresponds to TCP’s TIMEWAIT (12.9
TIMEWAIT), for the purpose of any lost final ACKs; it should last three RTTs. There is no need of a TIMEWAIT analog to
prevent old duplicates, as a second QUIC connection will select a new Connection ID .
QUIC connection closing has no analog of TCP’s feature in which one side sends FIN and the other continues to send data
indefinitely, 12.7.1 Closing a connection. This use of FIN, however, is allowed in bidirectional streams; the per-stream (and per-
direction) FIN bit lives in the stream header. Alternatively, one side can send its request and close its stream, and the other side can
then answer on a different stream.

12.23 Epilog
At this point we have covered the basic mechanics of TCP, but have one important topic remaining: how TCP manages its window
size so as to limit congestion, while maintaining fairness. This turns out to be complex, and will be the focus of the next three
chapters.

12.24 Exercises
Exercises are given fractional (floating point) numbers, to allow for interpolation of new exercises. Exercise 4.5 is distinct, for
example, from exercises 4.0 and 5.0.
1.0. Experiment with the TCP version of simplex-talk. How does the server respond differently with threading enabled and without,
if two simultaneous attempts to connect are made, from two different client instances?
2.0. Trace the states visited if nodes A and B attempt to create a TCP connection by simultaneously sending each other SYN
packets, that then cross in the network. Draw the ladder diagram, and label the states on each side. Hint: there should be two pairs
of crossing packets. A SYN+ACK counts as an ACK.
2.5. Suppose nodes A and B are each behind their own NAT firewall (7.7 Network Address Translation).

A ──── NAT_A ──── Internet ──── NAT_B ──── B

A and B attempt to connect to one another simultaneously, using TCP. A sends to the public IPv4 address of NAT_B, and vice-
versa for B. Assume that neither NAT_A nor NAT_B changes the port numbers in outgoing packets, at least for the packets
involved in this connection attempt. Show that the connection succeeds.

3.0. When two nodes A and B simultaneously attempt to connect to one another using the OSI TP4 protocol, two bidirectional
network connections are created (rather than one, as with TCP). If TCP had instead chosen the TP4 semantics here, what would
have to be added to the TCP header? Hint: if a packet from ⟨A,port1⟩ arrives at ⟨B,port2⟩, how would we tell to which of the two
possible connections it belongs?
4.0. Simultaneous connection initiations are rare, but simultaneous connection termination is relatively common. How do two TCP
nodes negotiate the simultaneous sending of FIN packets to one another? Draw the ladder diagram, and label the states on each
side. Which node goes into TIMEWAIT state? Hint: there should be two pairs of crossing packets.
4.5. The state diagram at 12.7 TCP state diagram shows a dashed path from FIN_WAIT_1 to TIMEWAIT on receipt of FIN+ACK.
All FIN packets contain a valid ACK field, but that is not what is meant here. Under what circumstances is this direct arc from
FIN_WAIT_1 to TIMEWAIT taken? Explain why this arc can never be used during simultaneous close. Hint: consider the ladder
diagram of a “normal” close.
5.0. (a) Suppose you see multiple connections on your workstation in state FIN_WAIT_1. What is likely going on? Whose fault is
it?
(b). What might be going on if you see connections languishing in state FIN_WAIT_2?
6.0. Suppose that, after downloading a file, the client host is unplugged from the network, so it can send no further packets. The
server’s connection is still in the ESTABLISHED state. In each case below, use the TCP state diagram to list all states that are
reachable by the server.
(a). Before being unplugged, the client was in state ESTABLISHED; ie it had not sent the first FIN.
(b). Before being unplugged the client had sent its FIN, and moved to FIN_WAIT_1.
Eventually, the server connection here would in fact transition to CLOSED due to repeated timeouts. For this exercise, though,
assume only transitions explicitly shown in the state diagram are allowed.
6.5. In 12.3 TCP Connection Establishment we noted that RST packets had to have a valid SYN value, but that “RFC 793 does not
require the RST packet’s ACK value to match”. There is an exception for RST packets arriving at state SYN-SENT: “the RST is
acceptable if the ACK field acknowledges the SYN”. Explain the reasoning behind this exception.
7.0. Suppose A and B create a TCP connection with ISNA=20,000 and ISNB=5,000. A sends three 1000-byte packets (Data1, Data2
and Data3 below), and B ACKs each. Then B sends a 1000-byte packet DataB to A and terminates the connection with a FIN. In
the table below, fill in the SEQ and ACK fields for each packet shown.

A sends                              B sends
SYN, ISNA=20,000
                                     SYN, ISNB=5,000, ACK=______
ACK, SEQ=______, ACK=______
Data1, SEQ=______, ACK=______
                                     ACK, SEQ=______, ACK=______
Data2, SEQ=______, ACK=______
                                     ACK, SEQ=______, ACK=______
Data3, SEQ=______, ACK=______
                                     ACK, SEQ=______, ACK=______
                                     DataB, SEQ=______, ACK=______
ACK, SEQ=______, ACK=______
                                     FIN, SEQ=______, ACK=______

8.0. Suppose you are downloading a large file, and there is a progress bar showing how much of the file has been downloaded. For
definiteness, assume the progress bar moves 1 mm per MB, the throughput averages 0.5 MB per second (so the progress bar
advances at a rate of 0.5 mm/sec), and the winsize is 5 MB.

A packet is now lost, and is retransmitted after a timeout. What will happen to the progress bar? If someone measured the progress
bar at two times 1 second apart, just before and just after the lost packet arrived, what value would they calculate for the
throughput?
9.0. Suppose you are creating software for a streaming-video site. You want to limit the video read-ahead – the gap between how
much has been downloaded and how much the viewer has actually watched – to approximately 1 MB; the server should pause in
sending when necessary to enforce this. On the other hand, you do want the receiver to be able to read ahead by up to this much.
You should assume that the TCP connection throughput will be higher than the actual video-data-consumption rate.
(a). Suppose the TCP window size happens to be exactly 1 MB. If the receiver simply reads each video frame from the TCP
connection, displays it, and then pauses briefly before reading the next frame in accordance with the frame rate, explain how the
flow-control mechanism of 12.17 TCP Flow Control will achieve the desired effect.
(b). Applications, however, cannot control their TCP window size. What support would you have to add to the video-transfer
application to allow it to read ahead by 1 MB but not to exceed this? Hint: both client and server sides of the application will have
to implement something to enable this feature.
10.0. A user moves the computer mouse and sees the mouse-cursor’s position updated on the screen. Suppose the mouse-position
updates are being transmitted over a TCP connection with a relatively long RTT. The user attempts to move the cursor to a specific
point. How will the user perceive the mouse’s motion
(a). with the Nagle algorithm
(b). without the Nagle algorithm
10.5. Host A sends two single-byte packets, one containing “x” and the other containing “y”, to host B. A implements the Nagle
algorithm and B implements delayed ACKs, with a 500 ms maximum delay. The RTT is negligible. How long does the
transmission take? Draw a ladder diagram.
11.0. Suppose you have fallen in with a group that wants to add to TCP a feature so that, if A and B1 are connected, then B1 can
hand off its connection to a different host B2; the end result is that A and B2 are connected and A has received an uninterrupted
stream of data. Either A or B1 can initiate the handoff.
(a). Suppose B1 is the host to send the final FIN (or HANDOFF) packet to A. How would you handle appropriate analogues of the
TIMEWAIT state for host B1? Does the fact that A is continuing the connection, just not with B1, matter?
(b). Now suppose A is the party to send the final FIN/HANDOFF, to B1. What changes to TIMEWAIT would have to be made at
A’s end? Note that A may potentially hand off the connection again and again, eg to B3, B4 and then B5.
12.0. Suppose A connects to B via TCP, and sends the message “Attack at noon”, followed by FIN. Upon receiving this, B is sure it
has received the entire message.
(a). What can A be sure of upon receiving B’s own FIN+ACK?
(b). What can B be sure of upon receiving A’s final ACK?
(c). What is A not absolutely sure of after sending its final ACK?
13.0. Host A connects to the Internet via Wi-Fi, receiving IPv4 address 10.0.0.2, and then opens a TCP connection conn1 to remote
host B. After conn1 is established, A’s Ethernet cable is plugged in. A’s Ethernet interface receives IP address 10.0.0.3, and A
automatically selects this new Ethernet connection as its default route. Assume that A now starts using 10.0.0.3 as the source
address of packets it sends as part of conn1 (contrary to RFC 1122).
Assume also that A’s TCP implementation is such that when a packet arrives from ⟨BIP, Bport⟩ to ⟨AIP, Aport⟩ and this socketpair is
to be matched to an existing TCP connection, the field AIP is allowed to be any of A’s IP addresses (that is, either 10.0.0.2 or
10.0.0.3); it does not have to match the IP address with which the connection was originally negotiated.
(a). Explain why conn1 will now fail, as soon as any packet is sent from A. Hint: the packet will be sent from 10.0.0.3. What will B
send in response? In light of the second assumption, how will A react to B’s response packet?
(The author regularly sees connections appear to fail this way. Perhaps some justification for this behavior is that, at the time of
establishment of conn1, A was not yet multihomed.)

(b). Now suppose all four fields of the socketpair (⟨BIP, Bport⟩, ⟨AIP, Aport⟩) are used to match an incoming packet to its
corresponding TCP connection. The connection conn1 still fails, though not as immediately. Explain what happens.
See also 7.9.5 ARP and multihomed hosts, 7 IP version 4 exercise 11.0, and 9 Routing-Update Algorithms exercise 13.0.
14.0. Modify the simplex-talk server of 12.6 TCP simplex-talk so that line_talker() breaks out of the while loop as
soon as it has printed the first string received (or simply remove the while loop). Once out of the while loop, the existing code
calls s.close() .
(a). Start up the modified server, and connect to it with a client. Send a single message line, and use netstat to examine the
TCP states of the client and server. What are these states?
(b). Send two message lines to the server. What are the TCP states of the client and server?
(c). Send three message lines to the server. Is there an error message at the client?
(d). Send two message lines to the server, while monitoring packets with WireShark. The WireShark filter expression
tcp.port == 5431 may be useful for eliminating irrelevant traffic. What FIN packets do you see? Do you see a RST
packet?
15.0. Outline a scenario in which TCP endpoint A sends data to B and then calls close() on its socket, and after the
connection terminates B has not received all the data, even though the network has not failed. In the style of 12.6.1 The TCP Client,
A’s code might look like this:

s = new Socket(dest, destport);


sout = s.getOutputStream();
sout.write(large_buffer)
s.close()

Hint: see 12.7.2 Calling close().

This page titled 12: TCP Transport is shared under a CC BY-NC-ND license and was authored, remixed, and/or curated by Peter Lars Dordal.

13: TCP Reno and Congestion Management

This page titled 13: TCP Reno and Congestion Management is shared under a CC BY-NC-ND license and was authored, remixed, and/or curated
by Peter Lars Dordal.


14: Dynamics of TCP

This page titled 14: Dynamics of TCP is shared under a CC BY-NC-ND license and was authored, remixed, and/or curated by Peter Lars Dordal.

15: Newer TCP Implementations

This page titled 15: Newer TCP Implementations is shared under a CC BY-NC-ND license and was authored, remixed, and/or curated by Peter
Lars Dordal.

16: Network Simulations - ns-2

This page titled 16: Network Simulations - ns-2 is shared under a CC BY-NC-ND license and was authored, remixed, and/or curated by Peter Lars
Dordal.

17: The ns-3 Network Simulator



In this chapter we take a somewhat cursory look at the ns-3 simulator, intended as a replacement for ns-2. The project is managed
by the NS-3 Consortium, and all materials are available at nsnam.org.
Ns-3 represents a rather sharp break from ns-2. Gone is the Tcl programming interface; instead, ns-3 simulation programs are
written in the C++ language, with extensive calls to the ns-3 library, although they are often still referred to as simulation “scripts”.
As the simulator core itself is also written in C++, this in some cases allows improved interaction between configuration and
execution. However, configuration and execution are still in most cases quite separate: at the end of the simulation script comes a
call Simulator::Run() – akin to ns-2’s $ns run – at which point the user-written C++ has done its job and the library
takes over.
To configure a simple simulation, an ns-2 Tcl script had to create nodes and links, create network-connection “agents” attached to
nodes, and create traffic-generating applications attached to agents. Much the same applies to ns-3, but in addition each node must
be configured with its network interfaces, and each network interface must be assigned an IP address.

17.1 Installing and Running ns-3


We here outline the steps for installing ns-3 under Linux from the “allinone” tar file, assuming that all prerequisite packages
(such as gcc) are already in place. Much more general installation instructions can be found at www.nsnam.org. In particular,
serious users are likely to want to download the current Mercurial repository directly. Information is also available for Windows
and Macintosh installation, although perhaps the simplest option for Windows users is to run ns-3 in a Linux virtual machine.
The first step is to unzip the tar file; this should leave a directory named ns-allinone-3.nn, where nn reflects the version number
(20 in the author’s installation as of this 2014 writing). This directory is the root of the ns-3 system; it contains a build.py
(python) script and the primary ns-3 directory ns-3.nn. All that is necessary is to run the build.py script:

./build.py
Considerable configuration and then compiler output should ensue, hopefully terminating with a list of “Modules built” and
“Modules not built”.
From this point on, most ns-3 work will take place in the subdirectory ns-3.nn, that is, in ns-allinone-3.nn/ns-3.nn. This
development directory contains the source directory src , the script directory scratch , and the execution script waf
(https://code.google.com/p/waf/).
The development directory also contains a directory examples containing a rich set of example scripts. The scripts in
examples/tutorial are described in depth in the ns-3 tutorial in doc/tutorial .

17.1.1 Running a Script


Let us now run a script, for example, the file first.cc included in the examples/tutorial directory. We first copy this file
into the directory “scratch”, and then, in the parent development directory, enter the command

./waf --run first


The program is compiled and, if compilation is successful, is run.
In fact, every uncompiled program in the scratch directory is compiled, meaning that projects in progress that are not yet
compilable must be kept elsewhere. One convenient strategy is to maintain multiple project directories, and link them symbolically
to scratch as needed.
The ns-3 system includes support for command-line options; the following example illustrates the passing by command line of the
value 3 for the variable nCsma :

./waf --run "second --nCsma=3"

This page titled 17: The ns-3 Network Simulator is shared under a CC BY-NC-ND license and was authored, remixed, and/or curated by Peter
Lars Dordal.

18: Mininet
Sometimes simulations are not possible or not practical, and network experiments must be run on actual machines. One can always
use a set of interconnected virtual machines, but even pared-down virtual machines consume sufficient resources that it is hard to
create a network of more than a handful of nodes. Mininet is a system that supports the creation of lightweight logical nodes that
can be connected into networks. These nodes are sometimes called containers, or, more accurately, network namespaces. Virtual-
machine technology is not used. These containers consume sufficiently few resources that networks of over a thousand nodes have
been created, running on a single laptop. While Mininet was originally developed as a testbed for software-defined networking (2.8
Software-Defined Networking), it works just as well for demonstrations and experiments involving traditional networking.
A Mininet container is a process (or group of processes) that no longer has access to all the host system’s “native” network
interfaces, much as a process that has executed the chroot() system call no longer has access to the full filesystem. Mininet
containers then are assigned virtual Ethernet interfaces (see the ip-link man page entries for veth), which are connected to other
containers through virtual Ethernet links. The use of veth links ensures that the virtual links behave like Ethernet, though it may be
necessary to disable TSO (12.5 TCP Offloading) to view Ethernet packets in WireShark as they would appear on the (virtual) wire.
Any process started within a Mininet container inherits the container’s view of network interfaces.
For efficiency, Mininet containers all share the same filesystem by default. This makes setup simple, but sometimes causes
problems with applications that expect individualized configuration files in specified locations. Mininet containers can be
configured with different filesystem views, though we will not do this here.
Mininet is a form of network emulation, as opposed to simulation. An important advantage of emulation is that all network
software, at any layer, is simply run “as is”. In a simulator environment, on the other hand, applications and protocol
implementations need to be ported to run within the simulator before they can be used. A drawback of emulation is that as the
network gets large and complex the emulation may slow down. In particular, it is not possible to emulate link speeds faster than the
underlying hardware can support. (It is also not possible to emulate non-Linux network software.)
The Mininet group maintains extensive documentation; three useful starting places are the Overview, the Introduction and the
FAQ.
The goal of this chapter is to present a series of Mininet examples. Most examples are in the form of a self-contained Python2 file
(Mininet does not at this time support Python3). Each Mininet Python2 file configures the network and then starts up the Mininet
command-line interface (which is necessary to start commands on the various node containers). The use of self-contained Python
files arguably makes the configurations easier to edit, and avoids the complex command-line arguments of many standard Mininet
examples. The Python code uses what the Mininet documentation calls the mid-level API.
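To give the flavor of this mid-level API, here is a minimal self-contained Python2 script (our own sketch, not one of the distribution’s examples) that builds the h1–s1–h2 topology of 18.2 below and then starts the Mininet CLI:

#!/usr/bin/python
from mininet.net import Mininet
from mininet.cli import CLI
from mininet.log import setLogLevel

def run():
    net = Mininet()                 # default settings
    net.addController('c0')         # default controller, as with the mn command
    h1 = net.addHost('h1')          # assigned 10.0.0.1 by default
    h2 = net.addHost('h2')          # assigned 10.0.0.2
    s1 = net.addSwitch('s1')
    net.addLink(h1, s1)
    net.addLink(h2, s1)
    net.start()
    CLI(net)                        # the mininet> command line
    net.stop()

if __name__ == '__main__':
    setLogLevel('info')
    run()
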
The Mininet distribution comes with its own set of examples, in the directory of that name. A few of particular interest are listed
below; with the exception of linuxrouter.py, the examples presented here do not use any of these techniques.
bind.py: demonstrates how to give each Mininet node its own private directory (otherwise all nodes share a common filesystem)
controllers.py: demonstrates how to arrange for multiple SDN controllers, with different switches connecting to different
controllers
limit.py: demonstrates how to set CPU utilization limits (and link bandwidths)
linuxrouter.py: creates a node that acts as a router. Any host node can act as a router, though, provided we enable forwarding
with sysctl net.ipv4.ip_forward=1
miniedit.py: a graphical editor for Mininet networks
mobility.py: demonstrates how to move a host from one switch to another
nat.py: demonstrates how to connect hosts to the Internet
tree1024.py: creates a network with 1024 nodes
We will occasionally need supplemental programs as well, eg for sending, monitoring or receiving traffic. These are meant to be
modified as necessary to meet circumstances; they contain few command-line option settings. Most of these supplemental
programs are written, perhaps confusingly, in Python3. Python2 files are run with the python command, while Python3’s
command is python3 . Alternatively, given that all these programs are running under Linux, one can make all Python files
executable and be sure that the first line is either #!/usr/bin/python or #!/usr/bin/python3 as appropriate.

18.1 Installing Mininet
Mininet runs only under the Linux operating system. Windows and Mac users can, however, easily run Mininet in a single Linux
virtual machine. Even Linux users may wish to do this, as running Mininet has a nontrivial potential to affect normal operation (a
virtual-switch process started by Mininet has, for example, interfered with the suspend feature on the author’s laptop).
The Mininet group maintains a virtual machine with a current Mininet installation at their downloads site. The download file is
actually a .zip file, which unzips to a modest .ovf file defining the specifications of the virtual machine and a much larger (~2 GB)
.vmdk file representing the virtual disk image. (Some unzip versions have trouble with unzipping very large files; if that happens,
search online for an alternative unzipper.)
There are several choices for virtual-machine software; two options that are well supported and free (as of 2017) for personal use
are VirtualBox and VMware Workstation Player. The .ovf file should open in either (in VirtualBox with the “import appliance”
option). However, it may be easier simply to create a new Linux virtual machine and specify that it is to use an existing virtual
disk; then select the downloaded .vmdk file as that disk.
Both the login name and the password for the virtual machine is “mininet”. Once logged in, the sudo command can be used to
obtain root privileges, which are needed to run Mininet. It is safest to do this on a command-by-command basis; eg
sudo python switchline.py . It is also possible to keep a terminal window open that is permanently logged in as root,
eg via sudo bash .
Another option is to set up a Linux virtual machine from scratch (eg via the Ubuntu distribution) and then install Mininet on it,
although the preinstalled version also comes with other useful software, such as the Pox controller for OpenFlow switches.
The preinstalled version does not, however, come with any graphical-interface desktop. One can install the full Ubuntu desktop
with the command (as root) apt-get install ubuntu-desktop . This will, however, add more than 4 GB to the virtual
disk. A lighter-weight option, recommended by the Mininet site, is to install the alternative desktop environment lxde; it is half the
size of Ubuntu. Install it with

apt-get install xinit lxde

The standard graphical text editor included with lxde is leafpad, though of course others (eg gedit or emacs) can be installed as
well.
After desktop installation, the command startx will be necessary after login to start the graphical environment (though one
can automate this). A standard recommendation for new Debian-based Linux systems, before installing anything else, is

apt-get update
apt-get upgrade

Most virtual-machine software offers a special package to improve compatibility with the host system. One of the most annoying
incompatibilities is the tendency of the virtual machine to grab the mouse and not allow it to be dragged outside the virtual-
machine window. (Usually a special keypress releases the mouse; on VirtualBox it is the right-hand Control key and on VMWare
Player it is Control-Alt.) Installation of the compatibility package (in VirtualBox called Guest Additions) usually requires
mounting a CD image, with the command

mount /dev/cdrom /media/cdrom

The Mininet installation itself can be upgraded as follows:

cd /home/mininet/mininet
git fetch
git checkout master # Or a specific version like 2.2.1
git pull
make install

The simplest environment for beginners is to install a graphical desktop (eg lxde) and then work within it. This allows seamless
opening of xterm and WireShark as necessary. Enabling copy/paste between the virtual system and the host is also convenient.
However, it is also possible to work entirely without the desktop, by using multiple ssh logins with X-windows forwarding enabled:

ssh -X -l username mininet

This does require an X-server on the host system, but these are available even for Windows (see, for example, Cygwin/X). At this
point one can open a graphical program on the ssh command line, eg wireshark & or gedit mininet-demo.py & ,
and have the program window display properly (or close to properly).
Finally, it is possible to access the Mininet virtual machine solely via ssh terminal sessions, without X-windows, though one then
cannot launch xterm or WireShark.

18.2 A Simple Mininet Example


Starting Mininet via the mn command (as root!), with no command-line arguments, creates a simple network of two hosts and
one switch, h1–s1–h2, and starts up the Mininet command-line interface (CLI). By convention, Mininet host names begin with ‘h’
and switch names begin with ‘s’; numbering begins with 1.
At this point one can issue various Mininet-CLI commands. The command nodes , for example, yields the following output:

available nodes are:


c0 h1 h2 s1

The node c0 is the controller for the switch s1 . The default controller action here makes s1 behave like an Ethernet
learning switch (2.4.1 Ethernet Learning Algorithm). The command intfs lists the interfaces for each of the nodes, and
links lists the connections, but the most useful command is net , which shows the nodes, the interfaces and the connections:

h1 h1-eth0:s1-eth1
h2 h2-eth0:s1-eth2
s1 lo: s1-eth1:h1-eth0 s1-eth2:h2-eth0

From the above, we can see that the network looks like this:

h1 (h1-eth0) ────── (s1-eth1) s1 (s1-eth2) ────── (h2-eth0) h2

18.2.1 Running Commands on Nodes


The next step is to run commands on individual nodes. To do this, we use the Mininet CLI and prefix the command name with the
node name:

h1 ifconfig
h1 ping h2

The first command here shows that h1 (or, more properly, h1-eth0) has IP address 10.0.0.1. Note that the name ‘h2’ in the second command is
recognized. The ifconfig command also shows the MAC address of h1-eth0, which may vary but might be something like
62:91:68:bf:97:a0. We will see in the following section how to get more human-readable MAC addresses.
There is a special Mininet command pingall that generates pings between each pair of hosts.
We can open a full shell window on node h1 using the Mininet command below; this works for both host nodes and switch
nodes.

xterm h1

Note that the xterm runs with root privileges. From within the xterm, the command ping h2 now fails, as hostname h2 is
not recognized. We can switch to ping 10.0.0.2 , or else add entries to /etc/hosts for the IP addresses of h1 and h2 :

10.0.0.1 h1
10.0.0.2 h2

As the Mininet system shares its filesystem with h1 and h2 , this means that the names h1 and h2 are now defined
everywhere within Mininet (though be forewarned that when a different Mininet configuration assigns different addresses to h1
or h2 , chaos will ensue).
From within the xterm on h1 we might try logging into h2 via ssh: ssh h2 (if h2 is defined in /etc/hosts as above). But
the connection is refused: the ssh server is not running on node h2 . We will return to this in the following example.
We can also start up WireShark, and have it listen on interface h1-eth0, and see the progress of our pings. (We can also usually start
WireShark from the mininet> prompt using h1 wireshark & .)
Similarly, we can start an xterm on the switch and start WireShark there. However, there is another option, as switches by default
share all their network systems with the Mininet host system. (In terms of the container model, switches do not by default get their
own network namespace; they share the “root” namespace with the host.) We can see this by running the following from the
Mininet command line

s1 ifconfig

and comparing the output with that of ifconfig run on the Mininet host, while Mininet is running but outside of the Mininet
process itself. We see these interfaces:

eth0
lo
s1
s1-eth1
s1-eth2

We see the same interfaces on the controller node c0 , even though the net and intfs commands above showed no
interfaces for c0 .
Running WireShark on, say, s1-eth1 is an excellent way to observe traffic on a nearly idle network; by default, the Mininet
nodes are not connected to the outside world. As an example, suppose we start up xterm windows on h1 and h2, and run
netcat -l 5432 on h2 and then netcat 10.0.0.2 5432 on h1. We can then watch the ARP exchange, the TCP
three-way handshake, the content delivery and the connection teardown, with most likely no other traffic at all. Wireshark filtering
is not needed.

18.3 Multiple Switches in a Line


The next example creates the topology below. All hosts are on the same subnet.
h1              h2              h3              h4
| h1-eth0       | h2-eth0       | h3-eth0       | h4-eth0
| s1-eth1       | s2-eth1       | s3-eth1       | s4-eth1
s1 ------------ s2 ------------ s3 ------------ s4
   s1-eth2  s2-eth2    s2-eth3  s3-eth2    s3-eth3  s4-eth2

The Mininet-CLI command links can be used to determine which switch interface is connected to which neighboring switch
interface.
The full Python2 program is switchline.py; to run it use

python switchline.py

This configures the network and starts the Mininet CLI. The default number of host/switch pairs is 4, but this can be changed with
the -N command-line parameter, for example python switchline.py -N 5 .
We next describe selected parts of switchline.py. The program starts by building the network topology object, LineTopo ,
extending the built-in Mininet class Topo , and then calls Topo.addHost() to create the host nodes. (We here override
__init__() , but overriding build() is actually more common.)

class LineTopo( Topo ):

    def __init__( self , **kwargs):
        "Create linear topology"
        super(LineTopo, self).__init__(**kwargs)
        h = []          # list of hosts; h[0] will be h1, etc
        s = []          # list of switches

        for key in kwargs:
            if key == 'N': N=kwargs[key]

        # add N hosts h1..hN
        for i in range(1,N+1):
            h.append(self.addHost('h' + str(i)))

Method Topo.addHost() takes a string, such as “h2”, and builds a host object of that name. We immediately append the new
host object to the list h[] . Next we do the same to switches, using Topo.addSwitch() :

# add N switches s1..sN
for i in range(1,N+1):
    s.append(self.addSwitch('s' + str(i)))

Now we build the links, with Topo.addLink . Note that h[0]..h[N-1] represent h1..hN. First we build the host-switch links,
and then the switch-switch links.

for i in range(N):              # Add links from hi to si
    self.addLink(h[i], s[i])

for i in range(N-1):            # link switches
    self.addLink(s[i], s[i+1])

Now we get to the main program. We use argparse to support the -N command-line argument.

def main(**kwargs):
    parser = argparse.ArgumentParser()
    parser.add_argument('-N', '--N', type=int)
    args = parser.parse_args()
    if args.N is None:
        N = 4
    else:
        N = args.N

Next we create a LineTopo object, defined above. We also set the log-level to ‘info’; if we were having problems we would set
it to ‘debug’.

ltopo = LineTopo(N=N)
setLogLevel('info')

Finally we’re ready to create the Mininet net object, and start it. We’ve specified the type of switch here, though at this point
that does not really matter. It does matter that we’re using the DefaultController, as otherwise the switches will not behave
automatically as Ethernet learning switches. The autoSetMacs option sets the host MAC addresses to 00:00:00:00:00:01
through 00:00:00:00:00:04 (for N=4), which can be a great convenience when manually examining Ethernet addresses.

net = Mininet(topo = ltopo, switch = OVSKernelSwitch,
              controller = DefaultController,
              autoSetMacs = True)
net.start()

The next bit starts /usr/sbin/sshd on each node. This command automatically puts itself in the background; otherwise we
would need to add an ‘&’ to the string to run the command in the background.

for i in range(1, N+1):
    hi = net['h' + str(i)]
    hi.cmd('/usr/sbin/sshd')

Finally we start the Mininet CLI, and, when that exits, we stop the emulation.

CLI(net)
net.stop()

Using sshd requires a small bit of configuration, if ssh for the root user has not been set up already. We must first run
ssh-keygen , which creates the directory /root/.ssh and then the public and private key files, id_rsa.pub and
id_rsa respectively. There is no need, in this setting, to protect the keys with a password. The second step is to go to the .ssh
directory and copy id_rsa.pub to the (new) file authorized_keys (if the latter file already exists, append
id_rsa.pub to it). This will allow passwordless ssh connections between the different Mininet hosts.
Because we started sshd on each host, the command ssh 10.0.0.4 on h1 should successfully connect to h4. The first
time a connection is made from h1 to h4 (as root), ssh will ask for confirmation, and then store h4’s key in /root/.ssh/known_hosts.
As this is the same file for all Mininet nodes, due to the common filesystem, a subsequent request to connect from h2 to h4 will
succeed immediately; h4 has already been authenticated for all nodes.

18.3.1 Running a webserver


Now let’s run a web server on, say, host 10.0.0.4 of the switchline.py example above. Python includes a simple implementation that
serves up the files in the directory in which it is started. After switchline.py is running, start an xterm on host h4, and then change
directory to /usr/share/doc (where there are some html files). Then run the following command (the 8000 is the server port
number):

python -m SimpleHTTPServer 8000

If this is run in the background somewhere, output should be redirected to /dev/null or else the server will eventually hang.
The next step is to start a browser. If the lxde environment has been installed (18.1 Installing Mininet), then the chromium browser
should be available. Start an xterm on host h1, and on h1 run the following (the --no-sandbox option is necessary to run
chromium as root):

chromium-browser --no-sandbox

Assuming chromium opens successfully, enter the following URL: 10.0.0.4:8000 . If chromium does not start, try
wget 10.0.0.4:8000 , which stores what it receives as the file index.html . Either way, you should see a listing of the
/usr/share/doc directory. It is possible to browse subdirectories, but only browser-recognized filetypes (eg .html ) will
open directly. A few directories with subdirectories named html are iperf , iptables and xarchiver ; try
navigating to these.

18.4 IP Routers in a Line


In the next example we build a Mininet example involving a router rather than a switch. A router here is simply a multi-interface
Mininet host that has IP forwarding enabled in its Linux kernel. Mininet support for multi-interface hosts is somewhat fragile;
interfaces may need to be initialized in a specific order, and IP addresses often cannot be assigned at the point when the link is
created. In the code presented below we assign IP addresses using calls to Node.cmd() used to invoke the Linux command
ifconfig (Mininet containers do not fully support the use of the alternative ip addr command).
Our first router topology has only two hosts, one at each end, and N routers in between; below is the diagram with N=3. All subnets
are /24. The program to set this up is routerline.py, here invoked as python routerline.py -N 3 . We will use N=3 in
most of the examples below. A somewhat simpler version of the program, which sets up the topology specifically for N=3, is
routerline3.py.

h1 ------------ r1 ------------ r2 ------------ r3 ------------ h2
10.0.0.10  10.0.0.2  10.0.1.1  10.0.1.2  10.0.2.1  10.0.2.2  10.0.3.1  10.0.3.10

In both versions of the program, routing entries are created to route traffic from h1 to h2, but not back again. That is, every router
has a route to 10.0.3.0/24, but only r1 knows how to reach 10.0.0.0/24 (to which r1 is directly connected). We can verify the “one-
way” connectedness by running WireShark or tcpdump on h2 (perhaps first starting an xterm on h2), and then
running ping 10.0.3.10 on h1 (perhaps using the Mininet command h1 ping h2 ). WireShark or tcpdump
should show the arriving ICMP ping packets from h1 , and also the arriving ICMP Destination Network Unreachable packets
from r3 as h2 tries to reply (see 7.11 Internet Control Message Protocol).
It turns out that one-way routing is considered to be suspicious; one interpretation is that the packets involved have a source
address that shouldn’t be possible, perhaps spoofed. Linux provides the interface configuration option rp_filter – reverse-
path filter – to block the forwarding of packets for which the router does not have a route back to the packet’s source. This must be
disabled for the one-way example to work; see the notes on the code below.
Despite the lack of connectivity, we can reach h2 from h1 via a hop-by-hop sequence of ssh connections (the program enables
sshd on each host and router):

h1: slogin 10.0.0.2
r1: slogin 10.0.1.2
r2: slogin 10.0.2.2
r3: slogin 10.0.3.10    (that is, h2)

To get the one-way routing to work from h1 to h2, we needed to tell r1 and r2 how to reach destination 10.0.3.0/24. This can be
done with the following commands (which are executed automatically if we set
ENABLE_LEFT_TO_RIGHT_ROUTING = True in the program):

r1: ip route add to 10.0.3.0/24 via 10.0.1.2
r2: ip route add to 10.0.3.0/24 via 10.0.2.2
To get full, bidirectional connectivity, we can create the following routes to 10.0.0.0/24:

r2: ip route add to 10.0.0.0/24 via 10.0.1.1
r3: ip route add to 10.0.0.0/24 via 10.0.2.1

When building the network topology, the single-interface hosts can have all their attributes set at once (the code below is from
routerline3.py):

h1 = self.addHost( 'h1', ip='10.0.0.10/24', defaultRoute='via 10.0.0.2' )


h2 = self.addHost( 'h2', ip='10.0.3.10/24', defaultRoute='via 10.0.3.1' )

The routers are also created with addHost() , but with separate steps:

r1 = self.addHost( 'r1' )
r2 = self.addHost( 'r2' )
...

self.addLink( h1, r1, intfName1 = 'h1-eth0', intfName2 = 'r1-eth0')
self.addLink( r1, r2, intfName1 = 'r1-eth1', intfName2 = 'r2-eth0')

Later on the routers get their IPv4 addresses:

r1 = net['r1']
r1.cmd('ifconfig r1-eth0 10.0.0.2/24')
r1.cmd('ifconfig r1-eth1 10.0.1.1/24')
r1.cmd('sysctl net.ipv4.ip_forward=1')
rp_disable(r1)

The sysctl command here enables forwarding in r1. The rp_disable(r1) call disables Linux’s default refusal to
forward packets if the router does not have a route back to the packet’s source; this is often what is wanted in the real world but not
necessarily in routing demonstrations. It too is ultimately implemented via sysctl commands.

18.5 IP Routers With Simple Distance-Vector Implementation


The next step is to automate the discovery of the route from h1 to h2 (and back) by using a simple distance-vector routing-update
protocol. We present a partial implementation of the Routing Information Protocol, RIP, as defined in RFC 2453.
The distance-vector algorithm is described in 9.1 Distance-Vector Routing-Update Algorithm. In brief, the idea is to add a cost
attribute to the forwarding table, so entries have the form ⟨destination,next_hop,cost⟩. Routers then send ⟨destination,cost⟩ lists to
their neighbors; these lists are referred to in the RIP specification as update messages. Routers receiving these messages then process
them to figure out the lowest-cost route to each destination. The format of the update messages is diagrammed below:

+----------------------+----------------------+
|     Addr Family      |      route_tag       |
+----------------------+----------------------+
|                 IP Address                  |
+---------------------------------------------+
|                   Netmask                   |
+---------------------------------------------+
|              Next_hop Address               |
+---------------------------------------------+
|                   metric                    |
+---------------------------------------------+

The full RIP specification also includes request messages, but the implementation here omits these. The full specification also
includes split horizon, poison reverse and triggered updates (9.2.1.1 Split Horizon and 9.2.1.2 Triggered Updates); we omit these as
well. Finally, while we include code for the third next_hop increase case of 9.1.1 Distance-Vector Update Rules, we do not include
any test for whether a link is down, so this case is never triggered.

The implementation is in the Python3 file rip.py. Most of the time, the program is waiting to read update messages from other
routers. Every UPDATE_INTERVAL seconds the program sends out its own update messages. All communication is via UDP
packets sent using IP multicast, to the official RIP multicast address 224.0.0.9. Port 520 is used for both sending and receiving.
Rather than creating separate threads for receiving and sending, we configure a short (1 second) recv() timeout, and then after
each timeout we check whether it is time to send the next update. An update can be up to 1 second late with this approach, but this
does not matter.
The program maintains a “shadow” copy RTable of the real system forwarding table, with an added cost column. The real table
is updated whenever a route in the shadow table changes. In the program, RTable is a dictionary mapping TableKey values
(consisting of the IP address and mask) to TableValue objects containing the interface name, the cost, and the next_hop.
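A plausible shape for these structures, for illustration only (rip.py’s actual definitions may differ), is the following:

from collections import namedtuple

# assumed illustrative definitions; the real rip.py classes may differ
TableKey   = namedtuple('TableKey',   ['ipaddrn', 'netmaskn'])    # binary IP address and netmask
TableValue = namedtuple('TableValue', ['interface', 'next_hop', 'cost'])

RTable = {}          # maps TableKey values to TableValue objects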
To run the program, a “production” approach would be to use Mininet’s Node.cmd() to start up rip.py on each router, eg via
r.cmd('python3 rip.py &') (assuming the file rip.py is located in the same directory in which Mininet was started).
For demonstrations, the program output can be observed if the program is started in an xterm on each router.

18.5.1 Multicast Programming


Sending IP multicast involves special considerations that do not arise with TCP or UDP connections. The first issue is that we are
sending to a multicast group – 224.0.0.9 – but don’t have any multicast routes (multicast trees, 20.5 Global IP Multicast)
configured. What we would like is to have, at each router, traffic to 224.0.0.9 forwarded to each of its neighboring routers.
However, we do not actually want to configure multicast routes; all we want is to reach the immediate neighbors. Setting up a
multicast tree presumes we know something about the network topology, and, at the point where RIP comes into play, we do not.
The multicast packets we send should in fact not be forwarded by the neighbors (we will enforce this below by setting TTL); the
multicast model here is very local. Even if we did want to configure multicast routes, Linux does not provide a standard utility for
manual multicast-routing configuration; see the ip-mroute.8 man page.
So what we do instead is to create a socket for each separate router interface, and configure the socket so that it forwards its traffic
only out its associated interface. This introduces a complication: we need to get the list of all interfaces, and then, for each
interface, get its associated IPv4 addresses with netmasks. (To simplify life a little, we will assume that each interface has only a
single IPv4 address.) The function getifaddrdict() returns a dictionary with interface names (strings) as keys and pairs
(ipaddr,netmask) as values. If ifaddrs is this dictionary, for example, then ifaddrs['r1-eth0'] might be
('10.0.0.2','255.255.255.0') . We could implement getifaddrdict() straightforwardly using the Python
module netifaces, though for demonstration purposes we do it here via low-level system calls.
We get the list of interfaces using myInterfaces = os.listdir('/sys/class/net/') . For each interface, we then
get its IP address and netmask (in get_ip_info(intf) ) with the following:

import socket, struct, fcntl

def get_ip_info(intf):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    SIOCGIFADDR    = 0x8915     # from /usr/include/linux/sockios.h
    SIOCGIFNETMASK = 0x891b
    intfpack = struct.pack('256s', bytes(intf, 'ascii'))
    # ifreq, below, is like struct ifreq in /usr/include/linux/if.h
    ifreq = fcntl.ioctl(s.fileno(), SIOCGIFADDR, intfpack)
    ipaddrn = ifreq[20:24]      # 20 is the offset of the IP addr in ifreq
    ipaddr = socket.inet_ntoa(ipaddrn)
    netmaskn = fcntl.ioctl(s.fileno(), SIOCGIFNETMASK, intfpack)[20:24]
    netmask = socket.inet_ntoa(netmaskn)
    return (ipaddr, netmask)

We need to create the socket here (never connected) in order to call ioctl() . The SIOCGIFADDR and
SIOCGIFNETMASK values come from the C language include file; the Python3 libraries do not make these constants available
but the Python3 fcntl.ioctl() call does pass the values we provide directly to the underlying C ioctl() call. This call
returns its result in a C struct ifreq ; the ifreq above is a Python version of this. The binary-format IPv4 address (or
netmask) is at offset 20.

18.5.1.1 createMcastSockets()
We are now in a position, for each interface, to create a UDP socket to be used to send and receive on that interface. Much of the
information here comes from the Linux socket.7 and ip.7 man pages. The function createMcastSockets(ifaddrs)
takes the dictionary above mapping interface names to (ipaddr,netmask) pairs and, for each interface intf , configures it as
follows. The list of all the newly configured sockets is then returned.
The first step is to obtain the interface’s address and mask, and then convert these to 32-bit integer format as ipaddrn and
netmaskn . We then enter the subnet corresponding to the interface into the shadow routing table RTable with a cost of 1
(and with a next_hop of None ), via

RTable[TableKey(subnetn, netmaskn)] = TableValue(intf, None, 1)

Next we create the socket and begin configuring it, first by setting its read timeout to a short value. We then set the TTL value used
by outbound packets to 1. This goes in the IPv4 header Time To Live field (7.1 The IPv4 Header); this means that no downstream
routers will ever forward the packet. This is exactly what we want; RIP uses multicast only to send to immediate neighbors.

sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)

We also want to be able to bind the same socket source address, 224.0.0.9 and port 520, to all the sockets we are creating
here (the actual bind() call is below):

sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

The next call makes the socket receive only packets arriving on the specified interface:

sock.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE, bytes(intf, 'ascii'))

We add the following to prevent packets sent on the interface from being delivered back to the sender; otherwise multicast delivery
may do just that:

sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_LOOP, False)

The next call makes the socket send on the specified interface. Multicast packets do have IPv4 destination addresses, and normally
the kernel chooses the sending interface based on the IP forwarding table. This call overrides that, in effect telling the kernel how to
route packets sent via this socket. (The kernel may also be able to figure out how to route the packet from the subsequent call
joining the socket to the multicast group.)

sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF, socket.inet_aton(ipaddr))

Finally we can join the socket to the multicast group represented by 224.0.0.9 . We also need the interface’s IP address,
ipaddr .

addrpair = socket.inet_aton('224.0.0.9') + socket.inet_aton(ipaddr)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, addrpair)

The last step is to bind the socket to the desired address and port, with sock.bind(('224.0.0.9', 520)) . This specifies
the source address of outbound packets; it would fail (given that we are using the same socket address for multiple interfaces)
without the SO_REUSEADDR configuration above.
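Putting the steps above together, a condensed sketch of createMcastSockets() might look like the following; the RTable bookkeeping step is omitted, and the actual code may differ in detail:

import socket

def createMcastSockets(ifaddrs):
    socklist = []
    for intf, (ipaddr, netmask) in ifaddrs.items():
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.settimeout(1.0)        # short read timeout
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE, bytes(intf, 'ascii'))
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_LOOP, False)
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF, socket.inet_aton(ipaddr))
        addrpair = socket.inet_aton('224.0.0.9') + socket.inet_aton(ipaddr)
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, addrpair)
        sock.bind(('224.0.0.9', 520))
        socklist.append(sock)
    return socklist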

18.5.2 The RIP Main Loop
The rest of the implementation is relatively nontechnical. One nicety is the use of select() to wait for arriving packets on any
of the sockets created by createMcastSockets() above; the alternatives might be to poll each socket in turn with a short
read timeout or else to create a separate thread for each socket. The select() call takes the list of sockets (and a timeout
value) and returns a sublist consisting of those sockets that have data ready to read. Almost always, this will be just one of the
sockets. We then read the data with s.recvfrom() , recording the source address src which will be used when we, next,
call update_tables() . When a socket closes, it must be removed from the select() list, but the sockets here do not
close; for more on this, see 18.6.1.2 dualreceive.py.
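A sketch of this receive loop, assuming socklist is the list returned by createMcastSockets() and parse_msg() and update_tables() are the rip.py functions named here:

from select import select

while True:
    sl,_,_ = select(socklist, [], [], 1.0)     # 1-second timeout
    for s in sl:
        msg, src = s.recvfrom(4096)            # src is an (ipaddr,port) pair
        update_tables(parse_msg(msg), src[0])
    # after each timeout or batch of reads, check whether UPDATE_INTERVAL
    # has elapsed and, if so, send our own update messages (not shown)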
The update_tables() function takes the incoming message (parsed into a list of RipEntry objects via
parse_msg() ) and the IP address from which it arrives, and runs the distance-vector algorithm of 9.1.1 Distance-Vector
Update Rules. TK is the TableKey object representing the new destination (as an (addr,netmask) pair). The new destination
rule from 9.1.1 Distance-Vector Update Rules is applied when TK is not present in the existing RTable . The lower cost rule
is applied when newcost < currentcost, and the third next_hop increase rule is applied when
newcost > currentcost but currentnexthop == update_sender.
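In outline, the three rules might be applied as follows; this is a sketch only, with assumed RipEntry field names and a hypothetical helper iface_of() mapping the sender’s IP address to the arrival interface:

def update_tables(entries, update_sender):
    for entry in entries:                       # RipEntry fields here are assumptions
        TK = TableKey(entry.addr, entry.netmask)
        newcost = entry.metric + 1              # one extra hop to reach the sender
        if TK not in RTable:                    # new-destination rule
            RTable[TK] = TableValue(iface_of(update_sender), update_sender, newcost)
            continue
        tv = RTable[TK]
        if newcost < tv.cost:                   # lower-cost rule
            RTable[TK] = TableValue(tv.interface, update_sender, newcost)
        elif newcost > tv.cost and tv.next_hop == update_sender:
            # next_hop-increase rule: our current route just got worse
            RTable[TK] = TableValue(tv.interface, update_sender, newcost)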

18.6 TCP Competition: Reno vs Vegas


The next routing example uses the following topology in order to emulate competition between two TCP connections h1→h3 and
h2→h3. We introduce Mininet features to set, on the links, an emulated bandwidth and delay, and to set on the router an emulated
queue size. Our first application will be to arrange a competition between TCP Reno (13 TCP Reno and Congestion Management)
and TCP Vegas (15.6 TCP Vegas). The Python2 file for running this Mininet configuration is competition.py.

h1 ---------\
   80 MBit   \
              r ------------------ h3
   80 MBit   /    8 MBit, 110 ms
h2 ---------/

To create links with bandwidth/delay support, we simply set Link=TCLink in the Mininet() call in main() . The
TCLink class represents a Traffic Controlled Link. Next, in the topology section calls to addLink() , we add keyword
parameters such as bw=BottleneckBW and delay=DELAY . To implement the bandwidth limit, Mininet then takes care of
creating the virtual-Ethernet links with a rate constraint.
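For example, the topology-section calls might look like the following sketch; BottleneckBW and DELAY are the constants just mentioned, here assumed to be 8 (MBit) and '110ms' to match the diagram:

self.addLink(h1, r, bw=80)                          # 80 MBit, no added delay
self.addLink(h2, r, bw=80)
self.addLink(r, h3, bw=BottleneckBW, delay=DELAY)   # the 8 MBit, 110 ms bottleneck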
To implement the delay, Mininet uses a queuing hierarchy (19.7 Hierarchical Queuing). The hierarchy is managed by the tc (traffic
control) command, part of the LARTC system. In the topology above, Mininet sets up h3’s queue as an htb queue (19.13.2 Linux
htb, 18.8 Linux Traffic Control (tc)) with a netem queue below it (see the man page for tc-netem.8). The latter has a delay
parameter set as requested, to 110 ms in our example here. Note that this means that the delay from h3 to r will be 110 ms, and the
delay from r to h3 will be 0 ms.
The queue configuration is also handled via the tc command. Again Mininet configures r’s r-eth3 interface to have an htb queue
with a netem queue below it. Using the tc qdisc show command we can see that the “handle” of the netem queue is 10:; we
can now set the maximum queue size to, for example, 25 with the following command on r:

tc qdisc change dev r-eth3 handle 10: netem limit 25

18.6.1 Running A TCP Competition


In order to arrange a TCP competition, we need the following tools:
sender.py, to open the TCP connection and send bulk data, after requesting a specific TCP congestion-control mechanism (Reno
or Vegas)
dualreceive.py, to receive data from two connections and track the results
randomtelnet.py, to send random additional data to break TCP phase effects.
wintracker.py, to monitor the number of packets a connection has in flight (a good estimator of cwnd ).

18.6.1.1 sender.py
The Python3 program sender.py is similar to tcp_stalkc.py, except that it allows specification of the TCP congestion algorithm.
This is done with the following setsockopt() call:

s.setsockopt(socket.IPPROTO_TCP, TCP_CONGESTION, cong)

where cong is “reno” or “cubic” or some other available TCP flavor. The list is at
/proc/sys/net/ipv4/tcp_allowed_congestion_control.
See also 15.1 Choosing a TCP on Linux.
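The constant TCP_CONGESTION is not exported by older Python versions; a small sketch of a portable workaround (its value, 13, is from Linux’s <netinet/tcp.h>):

import socket

TCP_CONGESTION = getattr(socket, 'TCP_CONGESTION', 13)   # 13 on Linux

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cong = 'reno'                                            # or 'vegas', 'cubic', ...
s.setsockopt(socket.IPPROTO_TCP, TCP_CONGESTION, cong.encode())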

18.6.1.2 dualreceive.py
The receiver for sender.py’s data is dualreceive.py. It listens on two ports, by default 5430 and 5431, and, when both connections
have been made, begins reading. The main loop starts with a call to select() , where sset is the list of all (both)
connected sockets:

sl,_,_ = select(sset, [], [])

The value sl is a sublist of sset consisting of the sockets with data ready to read. It will normally be a list consisting of a single
socket, though with so much data arriving it may sometimes contain both. We then call s.recv() for s in sl, and record in
either count1 or count2 the running total of bytes received.
If a sender closes a socket, this results in a read of 0 bytes. At this point dualreceive.py must close the socket, at which point it must
be removed from sset as it will otherwise always appear in the sl list.
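A sketch of this loop, with the two byte counters replaced by a dictionary for brevity (the real dualreceive.py differs in detail):

from select import select

counts = {s: 0 for s in sset}          # bytes received per socket
while sset:
    sl,_,_ = select(sset, [], [])
    for s in sl:
        data = s.recv(4096)
        if len(data) == 0:             # the sender closed the connection
            s.close()
            sset.remove(s)             # else s would appear in sl forever
        else:
            counts[s] += len(data)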
We repeatedly set a timer (in printstats() ) to print the values of count1 and count2 at 0.1 second intervals,
reflecting the cumulative amounts of data received by the connections. (If the variable PRINT_CUMULATIVE is set to
False , then the values printed are the amounts of data received in the last 0.1 seconds.) If the TCP competition is fair,
count1 and count2 should stay approximately equal. When printstats() detects no change in count1 and
count2 , it exits.
In Python, calling exit() only exits the current thread; the other threads keep running.

18.6.1.3 randomtelnet.py
In 16.3.4 Phase Effects we show that, with completely deterministic travel times, two competing TCP connections can have
throughputs differing by a factor of as much as 10 simply because of unfortunate synchronizations of transmission times. We must
introduce at least some degree of packet-arrival-time randomization in order to obtain meaningful results.
In 16.3.6 Phase Effects and overhead we used the ns2 overhead attribute for this. This is not available in real networks,
however. The next-best thing is to introduce some random telnet-like traffic, as in 16.3.7 Phase Effects and telnet traffic. This is the
purpose of randomtelnet.py.
This program sends packets at random intervals; the lengths of the intervals are exponentially distributed, meaning that to find the
length of the next interval we choose X randomly between 0 and 1 (with a uniform distribution), and then set the length of the wait
interval to a constant times -log(X). The packet sizes are 210 bytes (a very atypical value for real telnet traffic). Crucially, the
average rate of sending is held to a small fraction (by default 1%) of the available bottleneck bandwidth, which is supplied as a
constant BottleneckBW . This means the randomtelnet traffic should not interfere significantly with the competing TCP
connections (which, of course, have no additional interval whatsoever between packet transmissions, beyond what is dictated by
sliding windows). The randomtelnet traffic appears to be quite effective at eliminating TCP phase effects.
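A sketch of the exponential-interval computation just described, where MEANWAIT is an assumed constant chosen so that the average sending rate comes to 1% of BottleneckBW:

import math, random, time

def wait_next_packet(meanwait):
    X = 1.0 - random.random()              # uniform on (0,1]; avoids log(0)
    time.sleep(-meanwait * math.log(X))    # exponentially distributed wait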
Randomtelnet.py sends to port 5433 by default. We will usually use netcat (12.6.2 netcat again) as the receiver, as we are not
interested in measuring throughput for this traffic.

18.6.1.4 Monitoring cwnd with wintracker.py


At the end of the competition, we can look at the dualreceive.py output and determine the overall throughput of each connection, as
of the time when the first connection to send all its data has just finished. We can also plot throughput at intervals by plotting

successive differences of the cumulative-throughput values.
However, this does not give us a view of each connection’s cwnd , which is readily available when modeling competition in a
simulator. Indeed, getting direct access to a connection’s cwnd is nearly impossible, as it is a state variable in the sender’s
kernel.
However, we can do the next best thing: monitor the number of packets (or bytes) a connection has in flight; this is the difference
between the highest byte sent and the highest byte acknowledged. The highest byte ACKed is one less than the value of the ACK
field in the most recent ACK packet, and the highest byte sent is one less than the value of the SEQ field, plus the packet length, in
the most recent DATA packet.
To get these ACK and SEQ numbers, however, requires eavesdropping on the network connections. We can do this using a packet-
capture library such as libpcap. The Pcapy Python2 (not Python3) module is a wrapper for libpcap.
The program wintracker.py uses Pcapy to monitor packets on the interfaces r-eth1 and r-eth2 of router r. It would be slightly more
accurate to monitor on h1-eth0 and h2-eth0, but that entails separate monitoring of two different nodes, and the difference is small
as the h1–r and h2–r links have negligible delay and no queuing. Wintracker.py must be configured to monitor only the two TCP
connections that are competing.
The way libpcap, and thus Pcapy, works is that we first create a packet filter to identify the packets we want to capture. The filter
for both connections is

host 10.0.3.10 and tcp and portrange 5430-5431

The host is, of course, h3; packets are captured if either source host or destination host is h3. Similarly, packets are captured if
either the source port or the destination port is either 5430 or 5431. The connection from h1 to h3 is to port 5430 on h3, and the
connection from h2 to h3 is to port 5431 on h3.
For the h1–h3 connection, each time a packet arrives heading from h1 to h3 (in the code below we determine this because the
destination port dport is 5430), we save in seq1 the TCP header SEQ field plus the packet length. Each time a packet is
seen heading from h3 to h1 (that is, with source port 5430), we record in ack1 the TCP header ACK field. The packets
themselves are captured as arrays of bytes, but we can determine the offset of the TCP header and read the four-byte SEQ/ACK
values with appropriate helper functions:

_,p = cap1.next()       # p is the captured packet
...
(_,iphdr,tcphdr,data) = parsepacket(p)      # find the headers
sport = int2(tcphdr, TCP_SRCPORT_OFFSET)    # extract port numbers
dport = int2(tcphdr, TCP_DSTPORT_OFFSET)
if dport == port1:      # port1 == 5430
    seq1 = int4(tcphdr, TCP_SEQ_OFFSET) + len(data)
elif sport == port1:
    ack1 = int4(tcphdr, TCP_ACK_OFFSET)

Separate threads are used for each connection, as there is no variant of select() available to return the next captured packet
of either connection.
Both the SEQ and ACK fields have had the connection’s initial sequence number added to them, but this will cancel out when we
subtract. The SEQ and ACK values are subject to 32-bit wraparound, but subtraction again saves us here.
As with dualreceive.py, a timer fires every 100 ms and prints out the differences seq1-ack1 and seq2-ack2 . This isn’t
completely thread-safe, but it is close enough. There is some noise in the results; we can minimize that by taking the average of
several differences in a row.

18.6.1.5 Synchronizing the start
The next issue is to get both senders to start at about the same time. We could use two ssh commands, but ssh commands can take
several hundred milliseconds to complete. A faster method is to use netcat to trigger the start. On h1 and h2 we run shell scripts like
the one below (separate values for $PORT and $CONG are needed for each of h1 and h2, which is simplest to implement with
separate scripts, say h1.sh and h2.sh):

netcat -l 2345
python3 sender.py $BLOCKS 10.0.3.10 $PORT $CONG

We then start both at very close to the same time with the following on r (not on h3, due to the delay on the r–h3 link); these
commands typically complete in under ten milliseconds.

echo hello | netcat h1 2345
echo hello | netcat h2 2345

The full sequence of steps is


1. On h3, start the netcat -l ... listeners for the randomtelnet.py output (on two different ports)
2. On h1 and h2, start the randomtelnet.py senders
3. On h3, start dualreceive.py
4. On h1 and h2, start the scripts (eg h1.sh and h2.sh) that wait for the signal and start sender.py
5. On r, send the two start triggers via netcat
This is somewhat cumbersome; it helps to incorporate everything into a single shell script with ssh used to run subscripts on the
appropriate host.

18.6.1.6 Reno vs Vegas results


In the Reno-Vegas graph at 16.5 TCP Reno versus TCP Vegas, we set the Vegas parameters α and β to 3 and 6 respectively.
The implementation of TCP Vegas on the Mininet virtual machine does not, however, support changing α and β, and the
default values are more like 1 and 3. To give Vegas a fighting chance, we reduce the queue size at r to 10 in competition.py. Here is
the graph, with the packets-in-flight monitoring above and the throughput below:
the graph, with the packets-in-flight monitoring above and the throughput below:
[Graph: packets-in-flight (upper curves) and throughput (lower curves) for TCP Reno vs TCP Vegas; vertical axis 0 to 2e+06, horizontal axis 0 to 450]

TCP Vegas is getting a smaller share of the bandwidth (overall about 40% to TCP Reno’s 60%), but it is consistently holding its
own. It turns out that TCP Vegas is greatly helped by the small queue size; if the queue size is doubled to 20, then Vegas gets a 17%
share.

In the upper part of the graph, we can see the Reno sawteeth versus the Vegas triangular teeth (sloping down as well as sloping up);
compare to the red-and-green graph at 16.5 TCP Reno versus TCP Vegas. The tooth shapes are somewhat mirrored in the
throughput graph as well, as throughput is proportional to queue utilization which is proportional to the number of packets in flight.

18.7 TCP Competition: Reno vs BBR


We can apply the same technique to compare TCP Reno to TCP BBR. This was done to create the graph at 15.16 TCP BBR. The
Mininet approach was usable as soon as a TCP BBR module for Linux was released (in source form); to use a simulator, on the other
hand, would have entailed waiting for TCP BBR to be ported to the simulator.
One nicety is that it is essential that the fq queuing discipline be enabled for the TCP BBR sender. If that is h2 , for example,
then the following Mininet code (perhaps in competition.py ) removes any existing queuing discipline and adds fq :

h2.cmd('tc qdisc del dev h2-eth root')
h2.cmd('tc qdisc add dev h2-eth root fq')

The purpose of the fq queuing is to enable pacing; that is, the transmission of packets at regular, very small intervals.

18.8 Linux Traffic Control (tc)


The Linux tc command, for traffic control, allows the attachment of any implemented queuing discipline (19 Queuing and
Scheduling) to any network interface (usually of a router). A hierarchical example appears in 19.13.2 Linux htb. The tc command is
also used extensively by Mininet to control, for example, link queue capacities. An explicit example, of adding the fq queuing
discipline, appears immediately above.
The two examples presented in this section involve “simple” token-bucket filtering, using tbf, and then “classful” token-bucket
filtering, using htb. We will use the latter example to apply token-bucket filtering only to one class of connections; other
connections receive no filtering.
The granularity of tc-tbf rate control is limited by the cpu-interrupt timer granularity; typically tbf is able to schedule packets only every
10 ms. If the transmission rate is 6 MB/s, or about four 1500-byte packets per millisecond, then tbf will schedule 40 packets for
transmission every 10 ms. They will, however, most likely be sent as a burst at the start of the 10-ms interval. Some tc schedulers
are able to achieve much finer pacing control; eg the ‘fq’ qdisc of 18.7 TCP Competition: Reno vs BBR above.
The Mininet topology in both cases involves a single router between two hosts, h1—r—h2. We will here use the routerline.py
example with the option -N 1 ; the router is then r1 with interfaces r1-eth0 connecting to h1 and r1-eth1
connecting to h2 . The desired topology can also be built using competition.py and then ignoring the third host.
To send data we will use sender.py (18.6.1.1 sender.py), though with the default TCP congestion algorithm. To receive data we will
use dualreceive.py, though initially with just one connection sending any significant data. We will set the constant
PRINT_CUMULATIVE to False , so dualreceive.py prints at intervals the number of bytes received during the most
recent interval; we will call this modified version dualreceive_incr.py . We will also redirect the stderr messages to
/dev/null , and start this on h2 :

python3 dualreceive_incr.py 2>/dev/null

We start the main sender on h1 with the following, where h2 has IPv4 address 10.0.1.10 and 1,000,000 is the number of
blocks:

python3 sender.py 1000000 10.0.1.10 5430

The dualreceive program will not do any reading until both connections are enabled, so we also need to create a second connection
from h1 in order to get started; this second connection sends only a single block of data:

python3 sender.py 1 10.0.1.10 5431

At this point dualreceive should generate output somewhat like the following (with timestamps in the first column rounded to the
nearest millisecond). The byte-count numbers in the middle column are rather hardware-dependent

1.016 14079000 0
1.106 12702000 0
1.216 14724000 0
1.316 13666448 0
1.406 11877552 0

This means that, on average, h2 is receiving about 13 MB every 100ms, which is about 1.0 Gbps.
Now we run the command below on r1 to reduce the rate ( tc requires the abbreviation mbit for megabit/sec; it treats
mbps as MegaBytes per second). The token-bucket filter parameters are rate and burst . The purpose of the limit
parameter – used by netem and several other qdiscs as well – is to specify the maximum queue size for the waiting packets. Its
value here is not very significant, but too low a value can lead to packet loss and thus to momentarily plunging bandwidth. Too
high a value, on the other hand, can lead to bufferbloat (14.8.1 Bufferbloat).

tc qdisc add dev r1-eth1 root tbf rate 40mbit burst 50kb limit 200kb

We get output something like this:

1.002 477840 0
1.102 477840 0
1.202 477840 0
1.302 482184 0
1.402 473496 0

477840 bytes per 100 ms is 38.2 Mbps. That is received application data; the extra 5% or so to 40 Mbps corresponds mostly to
packet headers (66 bytes out of every 1514, though to see this with WireShark we need to disable TSO, 12.5 TCP Offloading).
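As a quick check of the arithmetic here (an aside, not part of the output above):

data_bps = 477840 * 10 * 8           # bytes per 100 ms, converted to bits/sec: about 38.2 mbit
wire_bps = data_bps * 1514 / 1448    # add the 66 header bytes per 1448 data bytes
# wire_bps comes to very nearly 40 mbit, the configured tbf rate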
We can also change the rate dynamically:

tc qdisc change dev r1-eth1 root tbf rate 20mbit burst 100kb limit 200kb

The above use of tbf allows us to throttle (or police) all traffic through interface r1-eth1 . But what if we want to police only
selected traffic? Then we can use hierarchical token bucket, or htb. We set up an htb root node, with no limits, and then create two
child nodes, one for policed traffic and one for default traffic.

                 root htb qdisc, handle 1:
                           |
             root class, 1000 mbit, classid 1:1
              /                           \
  policed leaf class,              default leaf class,
  40 mbit, classid 1:2             1000 mbit, classid 1:10

To create the htb hierarchy we will first create the root qdisc and associated root class. We need the raw interface rate, here taken to
be 1000mbit. Class identifiers are of the form major:minor, where major is the integer root “handle” and minor is another integer.

tc qdisc add dev r1-eth1 root handle 1: htb default 10
tc class add dev r1-eth1 parent 1: classid 1:1 htb rate 1000mbit

We now create the two child classes (not qdiscs), one for the rate-limited traffic and one for default traffic. The rate-limited class
has classid 1:2 here; the default class has classid 1:10.

tc class add dev r1-eth1 parent 1: classid 1:2 htb rate 40mbit
tc class add dev r1-eth1 parent 1: classid 1:10 htb rate 1000mbit

We still need a classifier (or filter) to assign selected traffic to class 1:2. Our goal is to police traffic to port 5430 (by default,
dualreceive.py accepts traffic at ports 5430 and 5431).
There are several classifiers available; for example u32 (man tc-u32) and bpf (man tc-bpf). The latter is based on the Berkeley
Packet Filter virtual machine for packet recognition. However, what we use here – mainly because it seems to work most reliably –
is the iptables fwmark mechanism, used earlier in 9.6 Routing on Other Attributes. Iptables is intended for filtering – and
sometimes modifying – packets; we can associate a fwmark value of 2 to packets bound for TCP port 5430 with the command
below (the fwmark value does not become part of the packet; it exists only while the packet remains in the kernel).

iptables --append FORWARD --table mangle --protocol tcp --dport 5430 --jump MARK --set-mark 2

When this is run on r1 , then packets forwarded by r1 to TCP port 5430 receive the fwmark upon arrival.
The next step is to tell the tc subsystem that packets with a fwmark value of 2 are to be placed in class 1:2; this is the rate-
limited class above. In the following command, flowid may be used as a synonym for classid .

tc filter add dev r1-eth1 parent 1:0 protocol ip handle 2 fw classid 1:2

We can view all these settings with

tc qdisc show dev r1-eth1
tc class show dev r1-eth1
tc filter show dev r1-eth1 parent 1:1
iptables --table mangle --list

We now verify that all this works. As with tbf , we start dualreceive_incr.py on h2 and two senders on h1 . This
time, both senders send large amounts of data:

h2: python3 dualreceive_incr.py 2>/dev/null
h1: python3 sender.py 500000 10.0.1.10 5430
h1: python3 sender.py 500000 10.0.1.10 5431

If everything works, then shortly after the second sender starts we should see something like the output below (taken after the
cwnd of each TCP connection has stabilized). The middle column is the number of received data bytes to the policed port, 5430.

1.000 453224 10425600
1.100 457568 10230120
1.200 461912  9934728
1.300 476392 10655832
1.401 438744 10230120

With 66 bytes of TCP/IP headers in every 1514-byte packet, our requested 40 mbit data-rate cap should yield about 478,000 bytes
every 0.1 sec. The slight reduction above appears to be related to TCP competition; the full 478,000-byte rate is achieved after the
port-5431 connection terminates.

18.9 OpenFlow and the POX Controller
In this section we introduce the POX controller for OpenFlow (2.8.1 OpenFlow Switches) switches, allowing exploration of
software-defined networking (2.8 Software-Defined Networking). In the switchline.py Ethernet-switch example from
earlier, the Mininet() call included a parameter controller=DefaultController ; this causes each switch to
behave like an ordinary Ethernet learning switch. By using Pox to create customized controllers, we can investigate other options
for switch operation. Pox is preinstalled on the Mininet virtual machine.
Pox is, like Mininet, written in Python2. It receives and sends OpenFlow messages, in response to events. Event-related messages,
for our purposes here, can be grouped into the following categories:
PacketIn: a switch is informing the controller about an arriving packet, usually because the switch does not know how to
forward the packet or does not know how to forward the packet without flooding. Often, but not always, PacketIn events will
result in the controller providing new forwarding instructions.
ConnectionUP: a switch has connected to the controller. This will be the point at which the controller gives the switch its initial
packet-handling instructions.
LinkEvent: a switch is informing the controller of a link becoming available or becoming unavailable; this includes initial
reports of link availability.
BarrierEvent: a switch’s response to an OpenFlow Barrier message, meaning the switch has completed its responses to all
messages received before the Barrier and now may begin to respond to messages received after the Barrier.
The Pox program comes with several demonstration modules illustrating how controllers can be programmed; these are in the
pox/misc and pox/forwarding directories. The starting point for Pox documentation is the Pox wiki (archived copy at poxwiki.pdf),
which among other things includes brief outlines of these programs. We now review a few of these programs; most were written by
James McCauley and are licensed under the Apache license.
The Pox code data structures are very closely tied to the OpenFlow Switch Specification, versions of which can be found at the
OpenNetworking.org technical library.

18.9.1 hub.py
As a first example of Pox, suppose we take a copy of the switchline.py file and make the following changes:
change the controller specification, inside the Mininet() call, from controller=DefaultController to
controller=RemoteController .
add the following lines immediately following the Mininet() call:

c = RemoteController( 'c', ip='127.0.0.1', port=6633 )
net.addController(c)

This modified version is available as switchline_rc.py, “rc” for remote controller. If we now run this modified version, then pings
fail because the RemoteController, c, does not yet exist; in the absence of a controller, the switches’ default response is to do
nothing.
We now start Pox, in the directory /home/mininet/pox, as follows; this loads the file pox/forwarding/hub.py

./pox.py forwarding.hub

Ping connectivity should be restored! The switch connects to the controller at IPv4 address 127.0.0.1 (more on this below) and TCP
port 6633. At this point the controller is able to tell the switch what to do.
The hub.py example configures each switch as a simple hub, flooding each arriving packet out all other interfaces (though for
the linear topology of switchline_rc.py , this doesn’t matter much). The relevant code is here:

def _handle_ConnectionUp (event):
    msg = of.ofp_flow_mod()
    msg.actions.append(of.ofp_action_output(port = of.OFPP_FLOOD))
    event.connection.send(msg)

This is the handler for ConnectionUp events; it is invoked when a switch first reports for duty. As each switch connects to the
controller, the hub.py code instructs the switch to forward each arriving packet to the virtual port OFPP_FLOOD , which
means to forward out all other ports.
The event parameter is of class ConnectionUp , a subclass of class Event . It is defined in
pox/openflow/__init__.py . Most switch-event objects throughout Pox include a connection field, which the
controller can use to send messages back to the switch, and a dpid field, representing the switch identification number.
Generally the Mininet switch s1 will have a dpid of 1, etc.
The code above creates an OpenFlow modify-flow-table message, msg ; this is one of several types of controller-to-switch
messages that are defined in the OpenFlow standard. The field msg.actions is a list of actions to be taken; to this list we
append the action of forwarding on the designated (virtual) port OFPP_FLOOD .
Normally we would also append to the list msg.match the matching rules for the packets to be forwarded, but here we want to
forward all packets and so no matching is needed.
A different – though functionally equivalent – approach is taken in pox/misc/of_tutorial.py . Here, the response to the
ConnectionUp event involves no communication with the switch (though the connection is stored in Tutorial.__init__()
). Instead, as the switch reports each arriving packet to the controller, the controller responds by telling the switch to flood the
packet out every port (this approach does result in sufficient unnecessary traffic that it would not be used in production code). The
code (slightly consolidated) looks something like this:

def _handle_PacketIn (self, event):
    packet = event.parsed       # This is the parsed packet data.
    packet_in = event.ofp       # The actual ofp_packet_in message.
    self.act_like_hub(packet, packet_in)

def act_like_hub (self, packet, packet_in):
    msg = of.ofp_packet_out()
    msg.data = packet_in
    action = of.ofp_action_output(port = of.OFPP_ALL)
    msg.actions.append(action)
    self.connection.send(msg)

The event here is now an instance of class PacketIn. This time the controller sends a packet-out message to the switch. The packet
and packet_in objects are two different views of the packet; the first is parsed and so is generally easier to obtain information
from, while the second represents the entire packet as it was received by the switch. It is the latter format that is sent back to the
switch in the msg.data field. The virtual port OFPP_ALL is equivalent to OFPP_FLOOD .
For either hub implementation, if we start WireShark on h2 and then ping from h4 to h1, we will see the pings at h2. This
demonstrates, for example, that s2 is behaving like a hub rather than a switch.

18.9.2 l2_pairs.py
The next Pox example, l2_pairs.py , implements a real Ethernet learning switch. This is the pairs-based switch
implementation discussed in 2.8.2 Learning Switches in OpenFlow. This module acts at the Ethernet address layer (layer 2, the l2
part of the name), and flows are specified by (src,dst) pairs of addresses. The l2_pairs.py module is started with the Pox
command ./pox.py forwarding.l2_pairs .
A straightforward implementation of an Ethernet learning switch runs into a problem: the switch needs to contact the controller
whenever the packet source address has not been seen before, so the controller can send back to the switch the forwarding rule for
how to reach that source address. But the primary lookup in the switch flow table must be by destination address. The approach

used here uses a single OpenFlow table, versus the two-table mechanism of 18.9.3 l2_nx.py. However, the learned flow table match
entries will all include match rules for both the source and the destination address of the packet, so that a separate entry is necessary
for each pair of communicating hosts. The number of flow entries thus scales as O(N²), which presents a scaling problem for very
large switches but which we will ignore here.
When a switch sees a packet with an unmatched (dst,src) address pair, it forwards it to the controller, which has two cases to
consider:
If the controller does not know how to reach the destination address from the current switch, it tells the switch to flood the
packet. However, the controller also records, for later reference, the packet source address and its arrival interface.
If the controller knows that the destination address can be reached from this switch via switch port dst_port, it sends to the
switch instructions to create a forwarding entry for (dst,src)→dst_port. At the same time, the controller also sends to the switch
a reverse forwarding entry for (src,dst), forwarding via the port by which the packet arrived.
The controller maintains its partial map from addresses to switch ports in a dictionary table , which takes a (switch,destination)
pair as its key and which returns switch port numbers as values. The switch is represented by the event.connection object
used to reach the switch, and destination addresses are represented as Pox EthAddr objects.
The program handles only PacketIn events. The main steps of the PacketIn handler are as follows. First, when a packet arrives, we
put its switch and source into table :

table[(event.connection,packet.src)] = event.port

The next step is to check to see if there is an entry in table for the destination, by looking up
table[(event.connection,packet.dst)] . If there is not an entry, then the packet gets flooded by the same
mechanism as in of_tutorial.py above: we create a packet-out message containing the to-be-flooded packet and send it
back to the switch.
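That flooding branch looks something like the following sketch, using the same Pox calls as in hub.py:

msg = of.ofp_packet_out()
msg.data = event.ofp       # the unmatched packet, to be flooded
msg.actions.append(of.ofp_action_output(port = of.OFPP_FLOOD))
event.connection.send(msg)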
If, on the other hand, the controller finds that the destination address can be reached via switch port dst_port , it proceeds as
follows. We first create the reverse entry; event.port is the port by which the packet just arrived:

msg = of.ofp_flow_mod()
msg.match.dl_dst = packet.src # reversed dst and src
msg.match.dl_src = packet.dst # reversed dst and src
msg.actions.append(of.ofp_action_output(port = event.port))
event.connection.send(msg)

This is like the forwarding rule created in hub.py , except that we here are forwarding via the specific port event.port
rather than the virtual port OFPP_FLOOD , and, perhaps more importantly, we are adding two packet-matching rules to
msg.match .
The next step is to create a similar matching rule for the src-to-dst flow, and to include the packet to be retransmitted. The modify-
flow-table message thus does double duty as a packet-out message as well.

msg = of.ofp_flow_mod()
msg.data = event.ofp # Forward the incoming packet
msg.match.dl_src = packet.src # not reversed this time!
msg.match.dl_dst = packet.dst
msg.actions.append(of.ofp_action_output(port = dst_port))
event.connection.send(msg)

The msg.match object has quite a few potential matching fields; the following is taken from the Pox-Wiki:

Attribute    Meaning
in_port      Switch port number the packet arrived on
dl_src       Ethernet source address
dl_dst       Ethernet destination address
dl_type      Ethertype / length (e.g. 0x0800 = IPv4)
nw_tos       IPv4 TOS/DS bits
nw_proto     IPv4 protocol (e.g., 6 = TCP), or lower 8 bits of ARP opcode
nw_src       IPv4 source address
nw_dst       IP destination address
tp_src       TCP/UDP source port
tp_dst       TCP/UDP destination port

It is also possible to create a msg.match object that matches all fields of a given packet.
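For example, using Pox’s from_packet() helper (a one-line sketch):

msg.match = of.ofp_match.from_packet(packet, event.port)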
We can watch the forwarding entries created by l2_pairs.py with the Linux program ovs-ofctl. Suppose we start
switchline_rc.py and then the Pox module l2_pairs.py . Next, from within Mininet, we have h1 ping h4 and h2
ping h4. If we now run the command (on the Mininet virtual machine but from a Linux prompt)

ovs-ofctl dump-flows s2

we get

cookie=0x0, …,dl_src=00:00:00:00:00:01,dl_dst=00:00:00:00:00:04 actions=output:3
cookie=0x0, …,dl_src=00:00:00:00:00:04,dl_dst=00:00:00:00:00:02 actions=output:1
cookie=0x0, …,dl_src=00:00:00:00:00:02,dl_dst=00:00:00:00:00:04 actions=output:3
cookie=0x0, …,dl_src=00:00:00:00:00:04,dl_dst=00:00:00:00:00:01 actions=output:2
Because we used the autoSetMacs=True option in the Mininet() call in switchline_rc.py , the Ethernet
addresses assigned to hosts are easy to follow: h1 is 00:00:00:00:00:01, etc. The first and fourth lines above result from h1 pinging
h4; we can see from the output port at the end of each line that s1 must be reachable from s2 via port 2 and s3 via port 3. Similarly,
the middle two lines result from h2 pinging h4; h2 lies off s2’s port 1. These port numbers correspond to the interface numbers
shown in the diagram at 18.3 Multiple Switches in a Line.

18.9.3 l2_nx.py
The l2_nx.py example accomplishes the same Ethernet-switch effect as l2_pairs.py , but using only O(N) space. It
does, however, use two OpenFlow tables, one for destination addresses and one for source addresses. In the implementation here,
source addresses are held in table 0, while destination addresses are held in table 1; this is the reverse of the multiple-table
approach outlined in 2.8.2 Learning Switches in OpenFlow. The l2 again refers to layer 2, the Ethernet layer, and the nx refers to the so-called
Nicira extensions to Pox, which enable the use of multiple flow tables.
Initially, table 0 is set up so that it tries a match on the source address. If there is no match, the packet is forwarded to the controller,
and sent on to table 1. If there is a match, the packet is sent on to table 1 but not to the controller.
Table 1 then looks for a match on the destination address. If one is found then the packet is forwarded to the destination, and if
there is no match then the packet is flooded.
Using two OpenFlow tables in Pox requires the loading of the so-called Nicira extensions (hence the “nx” in the module name
here). These require a slightly more complex command line:

./pox.py openflow.nicira --convert-packet-in forwarding.l2_nx

Nicira will also require, eg, nx.nx_flow_mod() instead of of.ofp_flow_mod() .


The no-match actions for each table are set during the handling of the ConnectionUp events. An action becomes the default action
when no msg.match() rules are included, and the priority is low; recall (2.8.1 OpenFlow Switches) that if a packet matches
multiple flow-table entries then the entry with the highest priority wins. The priority is here set to 1; the Pox default priority –
which will be used (implicitly) for later, more-specific flow-table entries – is 32768. The first step is to arrange for table 0 to
forward to the controller and to table 1.

msg = nx.nx_flow_mod()
msg.table_id = 0 # not necessary as this is the default
msg.priority = 1 # low priority
msg.actions.append(of.ofp_action_output(port = of.OFPP_CONTROLLER))
msg.actions.append(nx.nx_action_resubmit.resubmit_table(table = 1))
event.connection.send(msg)

Next we tell table 1 to flood packets by default:

msg = nx.nx_flow_mod()
msg.table_id = 1
msg.priority = 1
msg.actions.append(of.ofp_action_output(port = of.OFPP_FLOOD))
event.connection.send(msg)
Now we define the PacketIn handler. First comes the table 0 match on the packet source; if there is a match, then the source address
has been seen by the controller, and so the packet is no longer forwarded to the controller (it is forwarded to table 1 only).

msg = nx.nx_flow_mod()
msg.table_id = 0
msg.match.of_eth_src = packet.src # match the source
msg.actions.append(nx.nx_action_resubmit.resubmit_table(table = 1))
event.connection.send(msg)

Now comes table 1, where we match on the destination address. All we know at this point is that the packet with source address
packet.src came from port event.port , and we forward any packets addressed to packet.src via that port:

msg = nx.nx_flow_mod()
msg.table_id = 1
msg.match.of_eth_dst = packet.src    # this rule applies only for packets to packet.src
msg.actions.append(of.ofp_action_output(port = event.port))
event.connection.send(msg)
Note that there is no network state maintained at the controller; there is no analog here of the table dictionary of
l2_pairs.py .
Suppose we have a simple network h1–s1–h2. When h1 sends to h2, the controller will add to s1’s table 0 an entry indicating that
h1 is a known source address. It will also add to s1’s table 1 an entry indicating that h1 is reachable via the port on s1’s left.
Similarly, when h2 replies, s1 will have h2 added to its table 0, and then to its table 1.

18.9.4 multitrunk.py
The goal of the multitrunk example is to illustrate how different TCP connections between two hosts can be routed via different
paths; in this case, via different “trunk lines”. This example and the next are not part of the standard distributions of either Mininet
or Pox. Unlike the other examples discussed here, these examples consist of Mininet code to set up a specific network topology and
a corresponding Pox controller module that is written to work properly only with that topology. Most real networks evolve with
time, making such a tight link between topology and controller impractical (though this may sometimes work well in datacenters).
The purpose here, however, is to illustrate specific OpenFlow possibilities in a (relatively) simple setting.
The multitrunk topology involves multiple “trunk lines” between host h1 and h2 , as in the following diagram; the trunk lines
are the s1 – s3 and s2 – s4 links.

              s2 ─────────── s4    (no packets flooded along this link)
             /                 \
h1 ────── s5                    s6 ────── h2
             \                 /
              s1 ─────────── s3

          Multitrunk topology, N=1, K=2

The Mininet file is multitrunk12.py and the corresponding Pox module is multitrunkpox.py. The number of trunk lines is K=2 by
default, but can be changed by setting the variable K. We will prevent looping of broadcast traffic by never flooding along the
s2 – s4 link.
TCP traffic takes either the s1 – s3 trunk or the s2 – s4 trunk. We will refer to the two directions h1 → h2 and h2
→ h1 of a TCP connection as flows, consistent with the usage in 8.1 The IPv6 Header. Only h1 → h2 flows will have their
routing vary; flows h2 → h1 will always take the s1 – s3 path. It does not matter if the original connection is opened from
h1 to h2 or from h2 to h1 .
The first TCP flow from h1 to h2 goes via s1 – s3 . After that, subsequent connections alternate in round-robin fashion
between s1 – s3 and s2 – s4 . To achieve this we must, of course, include TCP ports in the OpenFlow forwarding
information.
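A hypothetical sketch of this round-robin selection is below; the actual picktrunk() in multitrunkpox.py may differ in detail, and the names trunklist , nexttrunk and isforward() are assumptions here:

trunklist = []       # to be filled with the SwitchNode objects heading each trunk (s1, s2)
nexttrunk = 0        # round-robin position

def picktrunk(flow):
    global nexttrunk
    if not flow.isforward():        # hypothetical direction test
        return trunklist[0]         # h2 -> h1 flows always use the s1 - s3 trunk
    trunk = trunklist[nexttrunk]
    nexttrunk = (nexttrunk + 1) % len(trunklist)    # alternate among the K trunks
    return trunk
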
All links will have a bandwidth set in Mininet. This involves using the link=TCLink option; TC here stands for Traffic
Control. We do not otherwise make use of the bandwidth limits. TCLinks can also have a queue size set, as in 18.6 TCP
Competition: Reno vs Vegas.
For ARP and ICMP traffic, two OpenFlow tables are used as in 18.9.3 l2_nx.py. The PacketIn messages for ARP and ICMP
packets are how switches learn of the MAC addresses of hosts, and also how the controller learns which switch ports are directly
connected to hosts. TCP traffic is handled differently, below.
During the initial handling of ConnectionUp messages, switches receive their default packet-handling instructions for ARP
and ICMP packets, and a SwitchNode object is created in the controller for each switch. These objects will eventually contain
information about what neighbor switch or host is reached by each switch port, but at this point none of that information is yet
available.
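A hypothetical sketch of such an object is below; the real SwitchNode class in multitrunkpox.py may differ:

class SwitchNode:
    def __init__(self, dpid, connection):
        self.dpid = dpid                # the switch's OpenFlow datapath id
        self.connection = connection    # used to send OpenFlow messages to the switch
        self.ports = {}                 # port number -> neighbor SwitchNode or host, once learned

    def set_neighbor(self, portnum, neighbor):
        self.ports[portnum] = neighbor  # filled in by the LinkEvent and PacketIn handlers
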
The next step is the handling of LinkEvent messages, which are initiated by the discovery module. This module must be
included on the ./pox.py command line in order for this example to work. The discovery module sends each switch, as
it connects to the controller, a special discovery packet in the Link Layer Discovery Protocol (LLDP) format; this packet includes
the originating switch’s dpid value and the switch port by which the originating switch sent the packet. When an LLDP packet
is received by the neighboring switch, that switch forwards it back to the controller, together with the dpid and port for the
receiving switch. At this point the controller knows the switches and port numbers at each end of the link. The controller then
reports this to our multitrunkpox module via a LinkEvent event.
As LinkEvent messages are processed, the multitrunkpox module learns, for each switch, which ports connect directly
to neighboring switches. At the end of the LinkEvent phase, which generally takes several seconds, each switch’s
SwitchNode knows about all directly connected neighbor switches. Nothing is yet known about directly connected neighbor
hosts though, as hosts have not yet sent any packets.
Once hosts h1 and h2 exchange a pair of packets, the associated PacketIn events tell multitrunkpox what switch
ports are connected to hosts. Ethernet address learning also takes place. If we execute h1 ping h2 , for example, then
afterwards the information contained in the SwitchNode graph is complete.

Now suppose h1 tries to open a TCP connection to h2 , eg via ssh. The first packet is a TCP SYN packet. The switch s5
will see this packet and forward it to the controller, where the PacketIn handler will process it. We create a flow for the
packet,

flow = Flow(psrc, pdst, ipv4.srcip, ipv4.dstip, tcp.srcport, tcp.dstport)

and then see if a path has already been assigned to this flow in the dictionary flow_to_path . For the very first packet this will
never be the case. If no path exists, we create one, first picking a trunk:

trunkswitch = picktrunk(flow)
path = findpath(flow, trunkswitch)

The first path will be the Python list [h1, s5, s1, s3, s6, h2], where the switches are represented by SwitchNode objects.
The supposedly final step is to call

result = create_path_entries(flow, path)

to create the forwarding rules for each switch. With the path as above, the SwitchNode objects know what port s5 should
use to reach s1 , etc. Because the first TCP SYN packet must have been preceded by an ARP exchange, and because the ARP
exchange will result in s6 learning what port to use to reach h2 , this should work.
But in fact it does not, at least not always. The problem is that Pox creates separate internal threads for the ARP-packet handling
and the TCP-packet handling, and the former thread may not yet have installed the location of h2 into the appropriate
SwitchNode object by the time the latter thread calls create_path_entries() and needs the location of h2 . This
race condition is unfortunate, but cannot be avoided. As a fallback, if creating a path fails, we flood the TCP packet along the s1
– s3 link (even if the chosen trunk is the s2 – s4 link) and wait for the next TCP packet to try again. Very soon, s6 will
know how to reach h2 , and so create_path_entries() will succeed.
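As an aside, because flow_to_path is a Python dictionary keyed by Flow objects, the Flow class must support hashing and equality; here is a hypothetical sketch (the real class may differ):

class Flow:
    def __init__(self, psrc, pdst, ipsrc, ipdst, srcport, dstport):
        # Ethernet addresses, IPv4 addresses and TCP ports, as one tuple
        self.tuple = (psrc, pdst, ipsrc, ipdst, srcport, dstport)

    def reverse(self):          # the same connection, in the opposite direction
        psrc, pdst, ipsrc, ipdst, sport, dport = self.tuple
        return Flow(pdst, psrc, ipdst, ipsrc, dport, sport)

    def __eq__(self, other):
        return self.tuple == other.tuple

    def __hash__(self):
        return hash(self.tuple)
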
If we run everything, create two xterms on h1 , and then create two ssh connections to h2 , we can see the forwarding entries
using ovs-ofctl . Let us run

ovs-ofctl dump-flows s5

Restricting attention only to those flow entries matching protocol tcp , we get (with a little sorting)

cookie=0x0, …, tcp,dl_src=00:00:00:00:00:01,dl_dst=00:00:00:00:00:02,nw_src=10.0.0.1,nw_dst=10.0.0.2,tp_src=59404,tp_dst=22 actions=output:1
cookie=0x0, …, tcp,dl_src=00:00:00:00:00:01,dl_dst=00:00:00:00:00:02,nw_src=10.0.0.1,nw_dst=10.0.0.2,tp_src=59526,tp_dst=22 actions=output:2
cookie=0x0, …, tcp,dl_src=00:00:00:00:00:02,dl_dst=00:00:00:00:00:01,nw_src=10.0.0.2,nw_dst=10.0.0.1,tp_src=22,tp_dst=59404 actions=output:3
cookie=0x0, …, tcp,dl_src=00:00:00:00:00:02,dl_dst=00:00:00:00:00:01,nw_src=10.0.0.2,nw_dst=10.0.0.1,tp_src=22,tp_dst=59526 actions=output:3

The first two entries represent the h1 → h2 flows. The first connection has source TCP port 59404 and is routed via the s1 –
s3 trunk; we can see that the output port from s5 is port 1, which is indeed the port that s5 uses to reach s1 (the output of the
Mininet links command includes s5-eth1<->s1-eth2 ). Similarly, the output port used at s5 by the second
connection, with source TCP port 59526, is 2, which is the port s5 uses to reach s2 . The switch s5 reaches host h1 via
port 3, which can be seen in the last two entries above, which correspond to the reverse h2 → h1 flows.
The OpenFlow timeout here is infinite. This is not a good idea if the system is to be running indefinitely, with a steady stream of
short-term TCP connections. It does, however, make it easier to view connections with ovs-ofctl before they disappear. A
production implementation would need a finite timeout, and then would have to ensure that connections that were idle for longer
than the timeout interval were properly re-established when they resumed sending.
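Setting a finite timeout takes only an extra line or two when each flow entry is created; a sketch, with arbitrarily chosen values:

msg = of.ofp_flow_mod()
msg.idle_timeout = 60       # expire an entry idle for 60 seconds
msg.hard_timeout = 600      # expire an entry unconditionally after 10 minutes
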
The multitrunk strategy presented here can be compared to Equal-Cost Multi-Path routing, 9.7 ECMP. In both cases, traffic is
divided among multiple paths to improve throughput. Here, individual TCP connections are assigned a trunk by the controller (and
can be reassigned at will, perhaps to improve the load balance). In ECMP, it is common to assign TCP connections to paths via a
pseudorandom hash, in which case the approach here offers the potential for better control of the distribution of traffic among the
trunk links. In some configurations, however, ECMP may route packets over multiple links on a round-robin packet-by-packet
basis rather than a connection-by-connection basis; this allows much better load balancing.
OpenFlow has low-level support for this approach in the select group mechanism. A flow-table traffic-matching entry can forward
traffic to a so-called group instead of out via a port. The action of a select group is then to select one of a set of output actions
(often on a round-robin basis) and apply that action to the packet. In principle, we could implement this at s5 to have successive
packets sent to either s1 or s2 in round-robin fashion. In practice, Pox support for select groups appears to be insufficiently
developed at the time of this writing (2017) to make this practical.

18.9.5 loadbalance31.py
The next example demonstrates a simple load balancer. The topology is somewhat the reverse of the previous example: there are
now three hosts (N=3) at each end, and only one trunk line (K=1) (there are also no left- and right-hand entry/exit switches). The
right-hand hosts act as the “servers”, and are renamed t1 , t2 and t3 .

h1 10.0.1.1/24 ── 10.0.1.2/24 ─┐                        c
                               │                        │
h2 10.0.2.1/24 ── 10.0.2.2/24 ─ r ── 10.0.0.2/24 ────── s ─── 10.0.0.1/24  t1
                               │                        ├──── 10.0.0.1/24  t2
h3 10.0.3.1/24 ── 10.0.3.2/24 ─┘                        └──── 10.0.0.1/24  t3
The servers all get the same IPv4 address, 10.0.0.1. This would normally lead to chaos, but the servers are not allowed to talk to
one another, and the controller ensures that the servers are not even aware of one another. In particular, the controller makes sure
that the servers never all simultaneously reply to an ARP “who-has 10.0.0.1” query from r .
The Mininet file is loadbalance31.py and the corresponding Pox module is loadbalancepox.py.
The node r is a router, not a switch, and so its four interfaces are assigned to separate subnets. Each host is on its own subnet,
which it shares with r . The router r then connects to the only switch, s ; the connection from s to the controller c is
shown.
The idea is that each TCP connection from any of the hi to 10.0.0.1 is connected, via s , to one of the servers ti , but different
connections will connect to different servers. In this implementation the server choice is round-robin, so the first three TCP
connections will connect to t1 , t2 and t3 respectively, and the fourth will connect again to t1 .
The servers t1 through t3 are configured to all have the same IPv4 address 10.0.0.1; there is no address rewriting done to
packets arriving from the left. However, as in the preceding example, when the first packet of each new TCP connection from left
to right arrives at s , it is forwarded to c which then selects a specific ti and creates in s the appropriate forwarding rule
for that connection. As in the previous example, each TCP connection involves two Flow objects, one in each direction, and
separate OpenFlow forwarding entries are created for each flow.
There is no need for paths; the main work of routing the TCP connections looks like this:

server = pickserver(flow)
flow_to_server[flow] = server
addTCPrule(event.connection, flow, server+1) # ti is at port i+1
addTCPrule(event.connection, flow.reverse(), 1) # port 1 leads to r
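
A hypothetical sketch of the round-robin pickserver() is below; the actual version in loadbalancepox.py may differ in detail:

NUMSERVERS = 3
nextserver = 0                 # round-robin state

def pickserver(flow):
    global nextserver
    nextserver = nextserver % NUMSERVERS + 1    # cycles 1, 2, 3
    return nextserver          # server i is ti, reached via switch port i+1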

The biggest technical problem is ARP: normally, r and the ti would contact one another via ARP to find the appropriate
LAN addresses, but that will not end well with identical IPv4 addresses. So instead we create “static” ARP entries. We know (by
checking) that the MAC address of r-eth0 is 00:00:00:00:00:04, and so the Mininet file runs the following command on each
of the ti :

arp -s 10.0.0.2 00:00:00:00:00:04

This creates a static ARP entry on each of the ti , which leaves them knowing the MAC address for their default router 10.0.0.2.
As a result, none of them issues an ARP query to find r . The other direction is similar, except that r (which is not really in on
the load-balancing plot) must think 10.0.0.1 has a single MAC address. Therefore, we give each of the ti the same MAC
address (which would normally lead to even more chaos than giving them all the same IPv4 address); that address is
00:00:00:00:01:ff. We then install a permanent ARP entry on r with

arp -s 10.0.0.1 00:00:00:00:01:ff

Now, when h1 , say, sends a TCP packet to 10.0.0.1, r forwards it to MAC address 00:00:00:00:01:ff, and then s forwards
it to whichever of t1 .. t3 it has been instructed by the controller c to forward it to. The packet arrives at ti with the
correct IPv4 address (10.0.0.1) and correct MAC address (00:00:00:00:01:ff), and so is accepted. Replies are similar: ti sends to
r at MAC address 00:00:00:00:00:04.
As part of the ConnectionUp processing, we set up rules so that ICMP packets from the left are always routed to t1 . This
way we have a single responder to ping requests. It is entirely possible that some important ICMP message – eg
Fragmentation required but DF flag set – will be lost as a result.
If we run the programs and create xterm windows for h1, h2 and h3 and, from each, connect to 10.0.0.1 via ssh, we can tell that
we’ve reached t1 , t2 or t3 respectively by running ifconfig . The Ethernet interface on t1 is named t1-eth0, and
similarly for t2 and t3 . (Finding another way to distinguish the ti is not easy.) An even simpler way to see the connection
rotation is to run h1 ssh 10.0.0.1 ifconfig at the mininet> prompt several times in succession, and note the
successive interface names.
If we create three connections and then run ovs-ofctl dump-flows s and look at tcp entries with destination address
10.0.0.1, we get this:

cookie=0x0, …, tcp,dl_src=00:00:00:00:00:04,dl_dst=00:00:00:00:01:ff,nw_src=10.0.1.1,nw_dst=10.0.0.1,tp_src=35110,tp_dst=22 actions=output:2
cookie=0x0, …, tcp,dl_src=00:00:00:00:00:04,dl_dst=00:00:00:00:01:ff,nw_src=10.0.2.1,nw_dst=10.0.0.1,tp_src=44014,tp_dst=22 actions=output:3
cookie=0x0, …, tcp,dl_src=00:00:00:00:00:04,dl_dst=00:00:00:00:01:ff,nw_src=10.0.3.1,nw_dst=10.0.0.1,tp_src=55598,tp_dst=22 actions=output:4
The three different flows take output ports 2, 3 and 4 on s , corresponding to t1 , t2 and t3 .

18.9.6 l2_multi.py
This final Pox controller example takes an arbitrary Mininet network, learns the topology, and then sets up OpenFlow rules so that
all traffic is forwarded by the shortest path, as measured by hopcount. OpenFlow packet-forwarding rules are set up on demand,
when traffic between two hosts is first seen.
This module is compatible with topologies with loops, provided the spanning_tree module is also loaded.
We start with the spanning_tree module. This uses the openflow.discovery module, as in 18.9.4 multitrunk.py, to build a
map of all the connections, and then runs the spanning-tree algorithm of 2.5 Spanning Tree Algorithm and Redundancy. The result
is a list of switch ports on which flooding should not occur; flooding is then disabled by setting the OpenFlow NO_FLOOD
attribute on these ports. We can see the ports of a switch s that have been disabled via NO_FLOOD by using
ovs-ofctl show s .
One nicety is that the spanning_tree module is never quite certain when the network is complete. Therefore, it recalculates the
spanning tree after every LinkEvent.
We can see the spanning_tree module in action if we create a Mininet network of four switches in a loop, as in exercise 9.0 below,
and then run the following:

./pox.py openflow.discovery openflow.spanning_tree forwarding.l2_pairs

If we run ovs-ofctl show for each switch, we get something like the following:

s1: (s1-eth2): … NO_FLOOD
s2: (s2-eth2): … NO_FLOOD
We can verify with the Mininet links command that s1-eth2 and s2-eth2 are connected interfaces. We can verify with
tcpdump -i s1-eth2 that no packets are endlessly circulating.
We can also verify, with ovs-ofctl dump-flows , that the s1 – s2 link is not used at all, not even for s1 – s2
traffic. This is not surprising; the l2_pairs learning strategy ultimately learns source addresses from flooded ARP
packets, which are not sent along the s1 – s2 link. If s1 hears nothing from s2 , it will never learn to send anything to
s2 .
The l2_multi module, on the other hand, creates a full map of all network links (separate from the map created by the
spanning_tree module), and then calculates the best route between each pair of hosts. To calculate the routes, l2_multi uses
the Floyd-Warshall algorithm (outlined below), which is a form of the distance-vector algorithm optimized for when a full
network map is available. (The shortest-path algorithm of 9.5.1 Shortest-Path-First Algorithm might be a faster choice.) To avoid
having to rebuild the forwarding map on each LinkEvent, l2_multi does not create any routes until it sees the first packet (not
counting LLDP packets). By that point, usually the network is stable.
If we run the example above using the Mininet rectangle topology, we again find that the spanning tree has disabled flooding on the
s1 – s2 link. However, if we have h1 ping h2 , we see that h1 → h2 traffic does take the s1 – s2 link. Here is
part of the result of ovs-ofctl dump-flows s1 :

cookie=0x0, …, priority=65535,icmp,in_port=1,…,dl_src=00:00:00:00:00:01,dl_dst=00:00:00:00:00:02,nw_src=10.0.0.1,nw_dst=10.0.0.2,…,icmp_type=8… actions=output:2
cookie=0x0, …, priority=65535,icmp,in_port=1,…,dl_src=00:00:00:00:00:01,dl_dst=00:00:00:00:00:02,nw_src=10.0.0.1,nw_dst=10.0.0.2,…,icmp_type=0… actions=output:2
Note that l2_multi creates separate flow-table rules not only for ARP and ICMP, but also for ping (icmp_type=8) and ping
reply (icmp_type=0). Such fine-grained matching rules are a matter of preference.

Here is a brief outline of the Floyd-Warshall algorithm. We assume that the switches are numbered {1,…,N}. The outer loop has
the form for k<=N: ; at the start of stage k, we assume that we’ve found the best path between every i and j for which every
intermediate switch on the path is less than k. For many (i,j) pairs, there may be no such path.
At stage k, we examine, with an inner loop, all pairs (i,j). We look to see if there is a path from i to k and a second path from k to j.
If there is, we concatenate the i-to-k and k-to-j paths to create a new i-to-j path, which we will call P. If there was no previous i-to-j
path, then we add P. If there was a previous i-to-j path Q that is longer than P, we replace Q with P. At the end of the k=N stage, all
paths have been discovered.
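Here is a sketch of the algorithm in Python, using a hypothetical dictionary path that maps each pair (i,j) to a list of switches and that is seeded with the two-element path [i, j] for every direct link; l2_multi's actual code is organized differently:

def floyd_warshall(N, path):
    # switches are numbered 1..N; a shorter path list means fewer hops
    for k in range(1, N+1):
        for i in range(1, N+1):
            for j in range(1, N+1):
                if (i,k) in path and (k,j) in path:
                    P = path[(i,k)] + path[(k,j)][1:]     # join at k, listing k once
                    if (i,j) not in path or len(P) < len(path[(i,j)]):
                        path[(i,j)] = P
    return path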

18.10 Exercises
Exercises are given fractional (floating point) numbers, to allow for interpolation of new exercises. Exercise 2.5 is distinct, for
example, from exercises 2.0 and 3.0. Exercises marked with a ♢ have solutions or hints at 24.13 Solutions for Mininet.
1.0. In the RIP implementation of 18.5 IP Routers With Simple Distance-Vector Implementation, add Split Horizon (9.2.1.1 Split
Horizon).
2.0. In the RIP implementation of 18.5 IP Routers With Simple Distance-Vector Implementation, add support for link failures (the
third rule of 9.1.1 Distance-Vector Update Rules).
3.0. Explain why, in the example of 18.9.3 l2_nx.py, table 0 and table 1 will always have the same entries.
4.0. Suppose we try to eliminate the source addresses from the l2_pairs implementation:

• By default, all switches report all packets to the controller, and the controller then tells the switch to flood the packet.
• If a packet from ha to hb arrives at switch S, and S reports the packet to the controller, and the controller knows how to reach hb from S, then the controller installs forwarding rules into S for destination hb. The controller then tells S to re-forward the packet. In the future, S will not report packets to hb to the controller.
• When S reports to the controller a packet from ha to hb, then the controller notes that ha is reachable via the port on S by which the packet arrived.
Why does this not work? Hint: consider the switchline example (18.3 Multiple Switches in a Line), with h1 sending to h4, h4
sending to h1, h3 sending to h1, and finally h1 sending to h3.
5.0. Suppose we make the following change to the above strategy:

• If a packet from ha to hb arrives at switch S, and S reports the packet to the controller, and the controller knows how to reach both ha and hb from S, then the controller installs forwarding rules into S for destinations ha and hb. The controller then tells S to re-forward the packet. In the future, S will not report packets to ha or hb to the controller.
Show that this still does not work for the switchline example.
6.0. Suppose we try to implement an Ethernet switch as follows:

• The default switch action for an unmatched packet is to flood it and send it to the controller.
• If a packet from ha to hb arrives at switch S, and S reports the packet to the controller, and the controller knows how to reach both ha and hb from S, then the controller installs forwarding rules into S for destinations ha and hb. In the future, S will not report packets with these destinations to the controller.
• Unlike in exercise 4.0, the controller then tells S to flood the packet from ha to hb, even though it could be forwarded directly.
Traffic is sent in the network below:

h1 h2 h3
│ │ │
s1─────s2─────s3

(a)♢. Show that, if the traffic is as follows: h1 pings h2, h3 pings h1, then all three switches learn where h3 is.
(b). Show that, if the traffic is as follows: h1 pings h2, h1 pings h3, then none of the switches learn where h3 is.
Recall that each ping for a new destination starts with a broadcast ARP. Broadcast packets are always sent to the controller, as
there is no destination match.

7.0. In 18.9.5 loadbalance31.py, we could have configured the ti to have default router 10.0.0.3, say, and then created the
appropriate static ARP entry for 10.0.0.3:

ip route add to default via 10.0.0.3 dev ti-eth0
arp -s 10.0.0.3 00:00:00:00:00:04

Everything still works, even though the ti think their router is at 10.0.0.3 and it is actually at 10.0.0.2. Explain why. (Hint: how
is the router IPv4 address actually used by the ti ?)
8.0. As discussed in the text, a race condition can arise in the example of 18.9.4 multitrunk.py, where at the time the first TCP
packet arrives the controller still does not know where h2 is, even though it should learn that after processing the first ARP packet.
Explain why a similar race condition cannot occur in 18.9.5 loadbalance31.py.
9.0. Create a Mininet network with four hosts and four switches as below:

h1────s1────────s2────h2
│ │
│ │
h4────s4────────s3────h3

The switches should use an external controller. Now let Pox be that controller, with

./pox.py openflow.discovery openflow.spanning_tree forwarding.l2_pairs

10.0. Create the topology below with Mininet. Run the l2_multi Pox module as controller, with the
openflow.spanning_tree option, and identify the spanning tree created. Also identify the path taken by icmp traffic from
h1 to h2 .

h1───s1─────s2─────s3
      │      │      │
     s4─────s5─────s6
      │      │      │
     s7─────s8─────s9───h2

19: Queuing and Scheduling

20: Quality of Service

21: Network Management and SNMP



Network management, broadly construed, consists of all the administrative actions taken to keep a network running efficiently.
This may include a number of non-technical considerations, eg staffing the help desk and negotiating contracts with vendors, but
we will restrict attention exclusively to the technical aspects of network management.
The ISO and the International Telecommunications Union have defined a formal model for telecommunications and network
management. The original model defined five areas of concern, and was sometimes known as FCAPS after the first letter of each
area:
fault
configuration
accounting
performance
security
Most non-ISP organizations have little interest in network accounting (the A in FCAPS is often replaced with “administration” for
that reason, but that is a rather vague category). Network security is arguably its own subject entirely. As for the others, we can
identify some important subcategories:
fault:
device management: monitoring of all switches, routers, servers and other network hardware to make sure they are running
properly.
server management: monitoring of the network’s application layer, that is, all network-based software services; these include
login authentication, email, web servers, business applications and file servers.
link management: monitoring of long-haul links to ensure they are working.
configuration:
network architecture: the overall design, including topology, switching vs routing and subnet layout.
configuration management: arranging for the consistent configuration of large numbers of network devices.
change management: how does a site roll out new changes, from new IP addresses to software updates and patches?
performance:
traffic management: using the techniques of 19 Queuing and Scheduling to allocate bandwidth shares (and perhaps bucket
sizes) among varying internal or external clients or customers.
service-level management: making sure that agreed-upon service targets – eg bandwidth – are met (depending on the focus,
this could also be placed in the fault category).
While all these aspects are important, technical network management narrowly construed often devolves to an emphasis on fault
management and its companion, reliability management: making sure nothing goes wrong, and, when it does, fixing it promptly. It
is through fault management that some network providers achieve the elusive availability goal of 99.999% uptime.
SNMP versus Management
While SNMP is a very important tool for network management, it is just a tool. Network management is the process of making
decisions to achieve the goals outlined above, subject to resource constraints. SNMP simply provides some input for those
decisions.
By far the most common device-monitoring protocol, and the primary focus for this chapter, is the Simple Network Management
Protocol or SNMP (21.2 SNMP Basics). This protocol allows a device to report information about its current operational state; for
example, a switch or router may report the configuration of each interface and the total numbers of bytes and packets sent via each
interface.
Implicit in any device-monitoring strategy is initial device discovery: the process by which the monitor learns of new devices. The
ping protocol (7.11 Internet Control Message Protocol) is common here, though there are other options as well; for example, it is
possible to probe the UDP port on a node used for SNMP – usually 161. As was the case with router configuration (9 Routing-
Update Algorithms), manual entry is simply not a realistic alternative.
SNMP and the Application Layer
SNMP can be studied entirely from a network-management perspective, but it also makes an excellent self-contained case study of
the application layer. Like essentially all applications, SNMP defines rules for client and server roles and for the format of requests
and responses. SNMP also contains its own authentication mechanisms (21.11 SNMPv1 communities and security and 21.15
SNMPv3), generally unrelated to any operating-system-based login authentication.
It is a practical necessity, for networks of even modest size, to automate the job of checking whether everything is working
properly. Waiting for complaints is not an option. Such a monitoring system is known as a network management system or NMS;
there are a wide range of both proprietary and open-source NMS’s available. At its most basic, an NMS consists of a library of
scripts to discover new network devices and then to poll each device (possibly but not necessarily using SNMP) at regular
intervals. Generally the data received is recorded and analyzed, and alarms are sounded if a failure is detected.
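As an illustration, a toy version of the polling core of such a system might look like the following; the device list, the use of ping, the one-minute interval and the print-based alarm are all placeholders:

import subprocess, time

devices = ["10.0.0.1", "10.0.0.2"]      # previously discovered devices (hypothetical)

while True:
    for dev in devices:
        # one ICMP echo request with a 2-second timeout (Linux ping options)
        status = subprocess.call(["ping", "-c", "1", "-W", "2", dev],
                                 stdout=subprocess.DEVNULL)
        if status != 0:
            print("ALARM: no response from", dev)   # a real NMS would log and notify
    time.sleep(60)                                  # poll once a minute
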
When SNMP was first established, there was a common belief that it would soon be replaced by the OSI’s Common Management
Interface Protocol. CMIP is defined in the International Telecommunication Union’s X.711 protocol and companion protocols.
CMIP uses the same ASN.1 syntax as SNMP, but has a richer operations set. It remains the network management protocol of
choice for OSI networks, and was once upon a time believed to be destined for adoption in the TCP/IP world as well.
But it was not to be. TCP/IP-based network-equipment vendors never supported CMIP widely, and so any network manager had to
support SNMP as well. With SNMP support essentially universal, there was never a need for a site to bother with CMIP.
CMIP, unlike SNMP, is supported over TCP, using the CMIP Over TCP (CMOT) protocol. One advantage of using TCP is the
elimination of uncertainty as to whether a request or a reply was received.

21.1 Network Architecture


Before turning to SNMP in depth, we offer a few references to other parts of this book relating to network architecture. At the
LAN and Internetwork layers local to a site, perhaps the main issues are redundancy, bandwidth and cost. Cabling between
buildings, in particular, needs to provide redundancy. See 2.5 Spanning Tree Algorithm and Redundancy and 2.6 Virtual LAN
(VLAN) for some considerations at the Ethernet level, and 7.6.3 Subnets versus Switching.
Next, a site must determine what sort of connection to the Internet it will have. ISP contracts vary greatly in terms of bandwidth,
burst bandwidth, and agreed-upon responses in the event of an outage. Some aspects of service-level specification appear in 19.9
Token Bucket Filters and 20.7.2 Assured Forwarding.
Organizations with geographically dispersed internal networks – ISPs and larger corporations – must decide how their internal sites
should be connected. Should they communicate over the public Internet? Should a VPN be used (3.1 Virtual Private Networks)? Or
should private lines (such as SONET, 4.2.2 SONET, or carrier Ethernet, 3.2 Carrier Ethernet) be leased between sites? If private
lines are used, link monitoring becomes essential.
One increasingly important architectural decision at the application layer is the extent to which network software services are
outsourced to the cloud, and run on remote servers managed by third parties.

21.2 SNMP Basics


SNMP is far and away the most popular protocol for supporting network device monitoring. At its most basic level, SNMP allows
polling of individual designated device attributes, such as the system name or the number of packets received via interface eth0
. Attributes may, however, be organized into records, sets and tables. Tables may be indexed contiguously, like an array – eg
interface[1], interface[2], interface[3], etc – or sparsely – eg interface[1], interface[32767].

While the simplest routers and switches may have quite limited provisions for SNMP, virtually all “serious” networking hardware
provides extensive support, for both standard sets of “basic” device attributes and for proprietary attributes as well. Devices with
significant SNMP support are sometimes referred to as managed devices.
An SNMP node that replies to requests for information is known as an SNMP agent. The network node doing the SNMP querying
is known as the manager; it may be part of an NMS or – more simply – be a standalone tool known as an SNMP browser or MIB
browser (where MIB stands for Management Information Base, below). While most MIB browsers are understood to have a
graphical user interface, there are also command-line tools to make SNMP queries, such as the snmpget and snmpwalk
commands of the Net-SNMP project at net-snmp.org [http://www.net-snmp.org]. These can be invoked by scripting languages to
build a simple if rudimentary NMS.
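As a hypothetical example, the following snmpget invocation retrieves the MIB-2 sysName attribute (OID 1.3.6.1.2.1.1.5.0), assuming an agent at 10.0.0.1 configured with read-only community string “public” (21.11 SNMPv1 communities and security):

snmpget -v 2c -c public 10.0.0.1 1.3.6.1.2.1.1.5.0
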
SNMP runs exclusively over UDP. The choice of UDP was made to avoid the connection overhead of what was envisioned to be a
simple request-reply protocol; if a manager polls 1,000 devices once a minute, that is 2,000 packets in all over UDP but might
easily be 8,000 packets with TCP and the necessary SYN/FIN packets. This may be especially significant when the network is
congested to near the point of failure.
The use of UDP does raise two problems: lost packets and having more data than will fit in one packet. For the simplest case of
manager-initiated data requests, a manager can handle packet loss by polling a device until a response is received. If a response (or
even a request) is too big, the usual strategy is to use IP-layer fragmentation (7.4 Fragmentation and 8.5.4 IPv6 Fragment Header).
This is not ideal, but as most SNMP data stays within local and organizational networks it is at least workable.
Another consequence of the use of UDP is that every SNMP device needs an IP address. For devices acting at the IP layer and
above, this is not an issue, but an Ethernet switch or hub has no need of an IP address for its basic function. Devices like switches
and hubs that are to use SNMP must be provided with a suitable IP interface and address. Sometimes, for security, these IP
addresses for SNMP management are placed on a private, more-or-less hidden subnet.
SNMP also supports the writing of attributes to devices, thus implementing a remote-configuration mechanism. As writing to
devices has rather more security implications than reading from them, this mechanism must be used with care. Even for read-only
SNMP, however, security is an important issue; see 21.11 SNMPv1 communities and security and 21.15 SNMPv3.
Writing to agents may be done either to configure the network behavior of the device – eg to bring an interface up or down, or to
set an IP address – or specifically to configure the SNMP behavior of the agent.
Finally, SNMP supports asynchronous notification through its traps mechanism: a device can be configured to report an error
immediately rather than wait until asked. While traps are quite important at many sites, we will largely ignore them here.
SNMP is sometimes used for server monitoring as well as device monitoring; alternatively, simple scripts can initiate connections
to services and discern at least something about their readiness. It is one thing, however, to verify that an email server is listening
on TCP port 25 and responds with the correct RFC 5321 [https://tools.ietf.org/html/rfc5321.html] (originally RFC 821
[https://tools.ietf.org/html/rfc821.html]) EHLO message; it is another to verify that messages are accepted and then actually
delivered.

21.2.1 SNMP versions


SNMP has three official versions, SNMPv1, SNMPv2 and SNMPv3. SNMPv1 made its first appearance in 1988 in a collection of
RFCs starting with RFC 1065 [https://tools.ietf.org/html/rfc1065.html] (updated in RFC 1155
[https://tools.ietf.org/html/rfc1155.html]); the current definition for “core” attribute reporting was released as RFC 1213
[https://tools.ietf.org/html/rfc1213.html] in 1991. We will return to this below in 21.10 MIB-2.
SNMPv2 was introduced in 1993 with RFC 1441 [https://tools.ietf.org/html/rfc1441.html]. Loosely speaking, SNMPv2 expanded
on the basic information, starting with RFC 1442 [https://tools.ietf.org/html/rfc1442.html] (currently RFC 2578
[https://tools.ietf.org/html/rfc2578.html]), and also introduced improved techniques for managing tables.
SNMPv2 also included a proposed security mechanism, but it was largely rejected by the marketplace. Ultimately, a version of
SNMPv2 that used the SNMPv1 “community” security mechanism (21.11 SNMPv1 communities and security) was introduced; see
RFC 1901 [https://tools.ietf.org/html/rfc1901.html]. This “community”-security version became known as SNMPv2c.
SNMPv3 then finally delivered a model for reasonably effective security. The “User-based Security Model” or USM was first
proposed in 1998 in RFC 2264 [https://tools.ietf.org/html/rfc2264.html]; the final 2002 version is in RFC 3414
[https://tools.ietf.org/html/rfc3414.html].

22: Security

How do we keep intruders out of our computers? How do we keep them from impersonating us, or from listening in to our
conversations or downloading our data? Computer security problems are in the news on almost a daily basis. In this chapter we
take a look at just a few of the issues involved in building secure networks.
For our limited overview here, we will divide attacks into three categories:
1. Attacks that execute the intruder’s code on the target computer
2. Attacks that extract data from the target, without code injection
3. Eavesdropping on or interfering with computer-to-computer communications
The first category is arguably the most serious; this usually amounts to a complete takeover, though occasionally the attacker’s
code is limited by operating-system privileges. A computer taken over this way is sometimes said to have been “owned”. We
discuss these attacks below in 22.1 Code-Execution Intrusion. Perhaps the simplest form of such an attack is through stolen or
guessed passwords to a system that offers remote login to command-shell accounts. More technical forms of attack may involve a
virus, a buffer overflow (22.2 Stack Buffer Overflow and 22.3 Heap Buffer Overflow), a protocol flaw (22.1.2 Christmas Day
Attack), or some other software flaw (22.1.1 The Morris Worm).
In the second category are intrusions that do not involve remote code execution; a server application may be manipulated to give up
data in ways that its designers did not foresee.
For example, in 2008 David Kernell gained access to the Yahoo [http://yahoo.com] email account of then-vice-presidential
candidate Sarah Palin, by guessing or looking up the answers to the forgotten-password security questions for the account. One
question was Palin’s birthdate. Another was “where did you meet your spouse?”, which, after some research and trial-and-error,
Kernell was able to guess was “Wasilla High”; Palin grew up in and was at one point mayor of Wasilla, Alaska. Much has been
made since of the idea that the answers to many security questions can be found on social-networking sites.
As a second example in this category, in 2010 Andrew “weev” Auernheimer and Daniel Spitler were charged in the “AT&T iPad
hack”. IPad owners who signed up for network service with AT&T had their iPad’s ICC-ID recorded along with their
other information. If one of these owners later
revisited the AT&T website, the site would automatically request the iPad’s ICC-ID and then populate the web form with the user’s
information. If a randomly selected ICC-ID were presented to the AT&T site that happened to match a real account, that user’s
name, phone number and email address would be returned. ICC-ID strings contain 20 decimal digits, but the individual-device
portion of the identifier is much smaller and this brute-force attack yielded 114,000 accounts.
This attack is somewhat like a password intrusion, except that there was no support for running commands via the “compromised”
accounts.
Auernheimer was convicted for this “intrusion” in November 2012, but his sentence was set aside on appeal in April 2014.
Auernheimer’s conviction remains controversial as the AT&T site never requested a password in the usual sense, though the site
certainly released information not intended by its designers.
Finally, the third category here includes any form of eavesdropping. If the password for a login-shell account is obtained this way, a
first-category attack may follow. The usual approach to prevention is the use of encryption. Encryption is closely related to secure
authentication; encryption and authentication are addressed below in 22.6 Secure Hashes through 22.10 SSH and TLS.
Encryption does not always work as desired. In 2006 intruders made off with 40 million credit-card records from TJX Corp by
breaking the WEP Wi-Fi encryption (22.7.7 Wi-Fi WEP Encryption Failure) used by the company, and thus gaining access to
account credentials and to file servers. Albert Gonzalez pleaded guilty to this intrusion in 2009. This was the largest retail credit-
card breach until the Target hack of late 2013.

22.1 Code-Execution Intrusion
The most serious intrusions are usually those in which a vulnerability allows the attacker to run executable code on the target
system.
The classic computer virus is broadly of this form, though usually without a network vulnerability: the user is tricked – often
involving some form of social engineering – into running the attacker’s program on the target computer; the program then makes
itself at home more or less permanently. In one form of this attack, the user receives a file
interesting_picture.jpg.exe or IRS_deficiency_notice.pdf.exe . The attack is made slightly easier by
the default setting in Windows of not displaying the final file extension .exe .
Early viruses had to be completely self-contained, but, for networked targets, once an attacker is able to run some small initial
executable then that program can in turn download additional malware. The target can also be further controlled via the network.
The reach of an executable-code intrusion may be limited by privileges on the target operating system; if I am operating a browser
on my computer as user “pld” and an intruder takes advantage of a flaw in that browser, then the intruder’s code will also run as
“pld” and not as “root” or “Administrator”. This may prevent the intruder from rewriting my kernel, though that is small comfort to
me if my files are encrypted and held for ransom.
On servers, it is standard practice to run network services with the minimum privileges practical, though see 22.2.3 Defenses
Against Buffer Overflows.
Exactly what is “executable code” is surprisingly hard to state. Scripting languages usually qualify. In 2000, the ILOVEYOU virus
began spreading on Windows systems; users received a file LOVE-LETTER.TXT.vbs (often with an enticing Subject: line
such as “love letter for you”). The .vbs extension, again not displayed by default, meant that when the file was opened it was
automatically run as a visual basic script. The ILOVEYOU virus was later attributed to Reonel Ramones and Onel de Guzman of
the Philippines, though they were never prosecuted. The year before, the Melissa virus spread as an emailed Microsoft Word
attachment; the executable component was a Word macro.
Under Windows, a number of configuration-file formats are effectively executable; among these are the program-information-file
format .PIF and the screen-saver format .SCR .

22.1.1 The Morris Worm


The classic Morris Worm was launched on the infant Internet in 1988 by Robert Tappan Morris
[en.Wikipedia.org/wiki/Robert_Tappan_Morris]. Once one machine was infected, it would launch attacks against other machines,
either on the same LAN or far away. The worm used a number of techniques, including taking advantage of implementation flaws
via stack buffer overflows (22.2 Stack Buffer Overflow). Two of the worm’s techniques, however, had nothing to do with code
injection. One worm module contained a dictionary of popular passwords that were used to try against various likely system
accounts. Another module relied on a different kind of implementation vulnerability: a (broken) diagnostic feature of the
sendmail email server. Someone could connect to the sendmail TCP port 25 and send the command
WIZ <password> ; that person would then get a shell and be able to execute arbitrary commands. It was the intent to require a
legitimate sendmail -specific password, but an error in sendmail ’s frozen-configuration-file processing meant that an
empty password often worked.

22.1.2 Christmas Day Attack


The 1994 “Christmas day attack” (12.10.1 ISNs and spoofing) used a TCP protocol weakness combined with a common
computer-trust arrangement to gain privileged login access to several computers at UCSD. Implementations can be fixed
immediately, once the problem is understood, but protocol changes require considerable negotiation and review.
The so-called “rlogin” trust arrangement meant that computer A might be configured to trust requests for remote-command
execution from computer B, often on the same subnet. But the ISN-spoofing attack meant that an attacker M could send a
command request to A that would appear to come from the trusted peer B, at least until it was too late. The command might be as
simple as “open up a shell connection to M”. At some point the spoofed connection would fail, but by then the harmful command
would have been executed. The only fix is to stop using rlogin. (Ironically, the ISN spoofing attack was discovered by Morris but
was not used in the Morris worm above; see [RTM85].)
Note that, as with the sendmail WIZ attack of the Morris worm, this attack did not involve network delivery of an executable
fragment (a “shellcode”).

23: Selected Solutions

Bibliography

Note that RFCs are not included here.

ABDGHSTVWZ15
David Adrian, Karthikeyan Bhargavan, Zakir Durumeric, Pierrick Gaudry, Matthew Green, J. Alex Halderman, Nadia
Heninger, Drew Springall, Emmanuel Thomé, Luke Valenta, Benjamin VanderSloot, Eric Wustrow, Santiago Zanella-Béguelin,
Paul Zimmermann, “Imperfect Forward Secrecy: How Diffie-Hellman Fails in Practice”, preprint at weakdh.org [weakdh.org/],
May 2015.
AEH75
Eralp Akkoyunlu, Kattamuri Ekanadham and R V Huber, “Some constraints and tradeoffs in the design of network
communications”, SOSP ‘75: Proceedings of the fifth ACM symposium on Operating systems and principles, November 1975.

AO96
Aleph One (Elias Levy), “Smashing The Stack For Fun And Profit”, Phrack volume 7 number 49, 1996, available at
http://insecure.org/stf/smashstack.html.
AGMPS10
Mohammad Alizadeh, Albert Greenberg, David Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta and
Murari Sridharan, “Data Center TCP (DCTCP)”, Proceedings of ACM SIGCOMM 2010, September 2010.
AP99
Mark Allman and Vern Paxson, “On Estimating End-to-End Network Path Properties”, Proceedings of ACM SIGCOMM 1999,
August 1999.
ACPRT16
Joël Alwen, Binyi Chen, Krzysztof Pietrzak, Leonid Reyzin and Stefano Tessaro, “Scrypt is Maximally Memory-Hard”,
Cryptology ePrint Archive, Report 2016/989, 2016, available at https://eprint.iacr.org/2016/989.
AHLR07
Chris Anley, John Heasman, Felix “FX” Linder and Gerardo Richarte, “The Shellcoder’s Handbook”, second edition, Wiley,
2007.
AKM04
Guido Appenzeller, Isaac Keslassy and Nick McKeown, “Sizing Router Buffers”, ACM SIGCOMM Computer Communication
Review, October 2004

JA05
John Assalone, “Exploiting the GDI+ JPEG COM Marker Integer Underflow Vulnerability”, Global Information Assurance
Certification technical note, January 2005, available at www.giac.org/paper/gcih/679/exploiting-gdi-plus-jpeg-marker-integer-underflow-vulnerability/106878.
PB62
Paul Baran, “On Distributed Computing Networks”, Rand Corporation Technical Report P-2626, 1962.
BCL09
Steven Bauer, David Clark and William Lehr, “The Evolution of Internet Congestion”, Telecommunications Policy Research
Conference (TPRC) 2009, August 2009, available at ssrn.com/abstract=1999830 [ssrn.com/abstract=1999830].
MB06
Mihir Bellare, “New Proofs for NMAC and HMAC: Security without Collision-Resistance”, Advances in Cryptology -
CRYPTO ‘06 Proceedings, Lecture Notes in Computer Science volume 4117, Springer-Verlag, 2006.

BCK96
Mihir Bellare, Ran Canetti and Hugo Krawczyk, “Keying Hash Functions for Message Authentication”, Advances in
Cryptology - CRYPTO ‘96 Proceedings, Lecture Notes in Computer Science volume 1109, Springer-Verlag, 1996.
BN00
Mihir Bellare and Chanathip Namprempre, “Authenticated Encryption: Relations among notions and analysis of the generic
composition paradigm”, Advances in Cryptology — ASIACRYPT 2000 / Lecture Notes in Computer Science volume 1976,
Springer-Verlag, 2000; updated version July 2007.
BZ97
Jon Bennett and Hui Zhang, “Hierarchical Packet Fair Queueing Algorithms”, IEEE/ACM Transactions on Networking,
volume 5, October 1997.
DB08
Daniel Bernstein, “The Salsa20 family of stream ciphers”, in New Stream Cipher Designs, Matthew Robshaw and
Olivier Billet, editors, Springer-Verlag, 2008.
JB05
John Bicket, “Bit-rate Selection in Wireless Networks”, MS Thesis, Massachusetts Institute of Technology, 2005.
BCTCU16
Timm Böttger, Felix Cuadrado, Gareth Tyson, Ignacio Castro and Steve Uhlig, “Open Connect Everywhere: A Glimpse at the
Internet Ecosystem through the Lens of the Netflix CDN”, ARXIV, arXiv e-print (arXiv:1606.05519
[arxiv.org/abs/1606.05519]), June 2016.
BP95
Lawrence Brakmo and Larry Peterson, “TCP Vegas: End to End Congestion Avoidance on a Global Internet”, IEEE Journal on
Selected Areas in Communications, volume 13 number 8, 1995.
BBGO08
Vladimir Brik, Suman Banerjee, Marco Gruteser and Sangho Oh, “Wireless Device Identification with Radiometric
Signatures”, Proceedings of the 14th ACM International Conference on Mobile Computing and Networking (MobiCom ‘08),
September 2008.
AB03
Andries Brouwer, “Hackers Hut”, http://www.win.tue.nl/~aeb/linux/hh/hh.html, April 1, 2003.
CF04
Carlo Caini and Rosario Firrincieli, “TCP Hybla: a TCP enhancement for heterogeneous networks”, International Journal of
Satellite Communications and Networking, volume 22, pp 547-566, 2004.
CGYJ16
Neal Cardwell, Yuchung Cheng, C. Stephen Gunn, Soheil Hassas Yeganeh and Van Jacobson, “BBR Congestion-Based
Congestion Control”, ACM Queue, volume 14 number 5, September-October 2016.
CM03
Zehra Cataltepe and Prat Moghe, “Characterizing Nature and Location of Congestion on the Public Internet”, Proceedings of
the Eighth IEEE International Symposium on Computers and Communication, 2003.
CDLMR00
Stefania Cavallar, Bruce Dodson, Arjen K Lenstra, Walter Lioen, Peter Montgomery, Brian Murphy, Herman te Riele, Karen
Aardal, Jeff Gilchrist, Gerard Guillerm, Paul Leyland, Joel Marchand, François Morain, Alec Muffett, Craig Putnam, Chris
Putnam and Paul Zimmermann, “Factorization of a 512-bit RSA Modulus”, Advances in Cryptology — EUROCRYPT 2000,
Lecture Notes in Computer Science volume 1807, Springer-Verlag, 2000.
CK74

Vinton G Cerf and Robert E Kahn, “A Protocol for Packet Network Intercommunication”, IEEE Transactions on
Communications, volume 22 number 5, May 1974.
CJ89
Dah-Ming Chiu and Raj Jain, “Analysis of the Increase/Decrease Algorithms for Congestion Avoidance in Computer
Networks”, Computer Networks and ISDN Systems, volume 17, pp 1-14, 1989.
CJ91
David Clark and Van Jacobson, “Flexible and efficient resource management for datagram networks”, Presentation, September
1991.
CSZ92
David Clark, Scott Shenker and Lixia Zhang, “Supporting Real-Time Applications in an Integrated Services Packet Network:
Architecture and Mechanism”, Proceedings of ACM SIGCOMM 1992, August 1992.
CBcDHL14
David Clark, Steven Bauer, kc claffy, Amogh Dhamdhere, Bradley Huffaker, William Lehr, and Matthew Luckie,
“Measurement and Analysis of Internet Interconnection and Congestion”, Telecommunications Policy Research Conference
(TPRC) 2014, September 2014.
DR02
Joan Daemen and Vincent Rijmen, “The Design of Rijndael: AES – The Advanced Encryption Standard”, Springer-Verlag,
2002.
DS78
Yogen Dalal and Carl Sunshine, “Connection Management in Transport Protocols”, Computer Networks 2, 1978.
ID89
Ivan Damgård, “A Design Principle for Hash Functions”, Advances in Cryptology - CRYPTO ‘89 Proceedings, Lecture Notes
in Computer Science volume 435, Springer-Verlag, 1989.
DKS89
Alan Demers, Srinivasan Keshav and Scott Shenker, “Analysis and Simulation of a Fair Queueing Algorithm”, ACM
SIGCOMM Proceedings on Communications Architectures and Protocols, 1989.
DH76
Whitfield Diffie and Martin Hellman, “New Directions in Cryptography”, IEEE Transactions on Information Theory, volume
IT-22, November 1976.
EGMR05
Mihaela Enachescu, Ashish Goel, Yashar Ganjali, Nick McKeown and Tim Roughgarden, “Part III: Routers with Very Small
Buffers”, ACM SIGCOMM Computer Communication Review, volume 35 number 2, July 2005.
FF96
Kevin Fall and Sally Floyd, “Simulation-based Comparisons of Tahoe, Reno and SACK TCP”, ACM SIGCOMM Computer
Communication Review, July 1996.
FRZ13
Nick Feamster, Jennifer Rexford and Ellen Zegura, “The Road to SDN: An Intellectual History of Programmable Networks”,
ACM Queue, December 2013.

FGMPC02
Roberto Ferorelli, Luigi Grieco, Saverio Mascolo, G Piscitelli, P Camarda, “Live Internet Measurements Using Westwood+
TCP Congestion Control”, IEEE Global Telecommunications Conference, 2002.
F91

Sally Floyd, “Connections with Multiple Congested Gateways in Packet-Switched Networks, Part 1”, ACM SIGCOMM
Computer Communication Review, October 1991.
FJ92
Sally Floyd and Van Jacobson, “On Traffic Phase Effects in Packet-Switched Gateways”, Internetworking: Research and
Experience, volume 3, pp 115-156, 1992.
FJ93
Sally Floyd and Van Jacobson, “Random Early Detection Gateways for Congestion Avoidance”, IEEE/ACM Transactions on
Networking, volume 1, August 1993.
FJ95
Sally Floyd and Van Jacobson, “Link-sharing and Resource Management Models for Packet Networks”, IEEE/ACM
Transactions on Networking, volume 3, June 1995.
FP01
Sally Floyd and Vern Paxson, “Difficulties in Simulating the Internet”, IEEE/ACM Transactions on Networking, volume 9,
August 2001.
FMS01
Scott Fluhrer, Itsik Mantin and Adi Shamir, “Weaknesses in the Key Scheduling Algorithm of RC4”, SAC ‘01 Revised Papers
from the 8th Annual International Workshop on Selected Areas in Cryptography, Springer-Verlag, 2001.
FL03
Cheng Fu and Soung Liew, “TCP Veno: TCP Enhancement for Transmission over Wireless Access Networks”, IEEE Journal on
Selected Areas in Communications, volume 21 number 2, February 2003.

LG01
Lixin Gao, “On Inferring Autonomous System Relationships in the Internet”, IEEE/ACM Transactions on Networking, volume
9, December 2001.
GR01
Lixin Gao and Jennifer Rexford, “Stable Internet Routing without Global Coordination”, IEEE/ACM Transactions on
Networking, volume 9, December 2001.
JG93
Jose J Garcia-Luna-Aceves, “Loop-Free Routing Using Diffusing Computations”, IEEE/ACM Transactions on Networking,
volume 1, February 1993.
GP11
L Gharai and C Perkins, “RTP with TCP Friendly Rate Control”, Internet Draft,
http://tools.ietf.org/html/draft-gharai-avtcore-rtp-tfrc-00.
GV02
Sergey Gorinsky and Harrick Vin, “Extended Analysis of Binary Adjustment Algorithms”, Technical Report TR2002-39,
Department of Computer Sciences, University of Texas at Austin, 2002.
GM03
Luigi Grieco and Saverio Mascolo, “End-to-End Bandwidth Estimation for Congestion Control in Packet Networks”,
Proceedings of the Second International Workshop on Quality of Service in Multiservice IP Networks, 2003.
GM04
Luigi Grieco and Saverio Mascolo, “Performance Evaluation and Comparison of Westwood+, New Reno, and Vegas TCP
Congestion Control”, ACM SIGCOMM Computer Communication Review, volume 34 number 2, April 2004.
GG03

Marco Gruteser and Dirk Grunwald, “Enhancing Location Privacy in Wireless LAN Through Disposable Interface Identifiers: A
Quantitative Analysis”, Proceedings of the 1st ACM International Workshop on Wireless Mobile Applications and Services on
WLAN Hotspots (WMASH ‘03), September 2003.
HRX08
Sangtae Ha, Injong Rhee and Lisong Xu, “CUBIC: A New TCP-Friendly High-Speed TCP Variant”, ACM SIGOPS Operating
Systems Review - Research and developments in the Linux kernel, volume 42 number 5, July 2008.
SH04
Steve Hanna, “Shellcoding for Linux and Windows Tutorial”, http://www.vividmachines.com/shellcode/shellcode.html, July
2004.
DH08
Dan Harkins, “Simultaneous Authentication of Equals: A Secure, Password-Based Key Exchange for Mesh Networks”, 2008
Second International Conference on Sensor Technologies and Applications (sensorcomm 2008), August 2008.
MH04
Martin Hellman, “Oral history interview with Martin Hellman”, Charles Babbage Institute, 2004. Retrieved from the University
of Minnesota Digital Conservancy, http://purl.umn.edu/107353.
JH96
Janey Hoe, “Improving the Start-up Behavior of a Congestion Control Scheme for TCP”, ACM SIGCOMM Symposium on
Communications Architectures and Protocols, August 1996.
HVB01
Gavin Holland, Nitin Vaidya and Paramvir Bahl, “A rate-adaptive MAC protocol for multi-hop wireless networks”, MobiCom
‘01: Proceedings of the 7th annual International Conference on Mobile Computing and Networking, 2001.
CH99
Christian Huitema, Routing in the Internet, second edition, Prentice Hall, 1999.
HBT99
Paul Hurley, Jean-Yves Le Boudec and Patrick Thiran, “A Note on the Fairness of Additive Increase and Multiplicative
Decrease”, Proceedings of ITC-16, 1999.
JK88
Van Jacobson and Michael Karels, “Congestion Avoidance and Control”, Proceedings of the SIGCOMM ‘88 Symposium, volume
18(4), 1988.
JKMOPSVWZ13
Sushant Jain, Alok Kumar, Subhasree Mandal, Joon Ong, Leon Poutievski, Arjun Singh, Subbaiah Venkata, Jim Wanderer,
Junlan Zhou, Min Zhu, Jonathan Zolla, Urs Hölzle, Stephen Stuart and Amin Vahdat, “B4: Experience with a Globally-
Deployed Software Defined WAN”, Proceedings of the ACM SIGCOMM 2013 conference on SIGCOMM, August 12-16,
2013.
JWL04
Cheng Jin, David Wei and Steven Low, “FAST TCP: Motivation, Architecture, Algorithms, Performance”, IEEE INFOCOM
2004 Proceedings, March 2004.
KM97
Ad Kamerman and Leo Monteban, “WaveLAN-II: A high-performance wireless LAN for the unlicensed band”, Bell Labs
Technical Journal, volume 2 number 3, 1997.
SK88
Srinivasan Keshav, “REAL: A Network Simulator” (Technical Report), University of California at Berkeley, 1988.
KKCQ06

Jongseok Kim, Seongkwan Kim, Sunghyun Choi and Daji Qiao, “CARA: Collision-Aware Rate Adaptation for IEEE 802.11
WLANs”, IEEE INFOCOM 2006 Proceedings, April 2006.
LK78
Leonard Kleinrock, “On Flow Control in Computer Networks”, Proceedings of the International Conference on
Communications, June 1978.
LH06
Mathieu Lacage and Thomas Henderson, “Yet Another Network Simulator”, Proceedings of WNS2 ‘06: Workshop on ns-2: the
IP network simulator, 2006.
LM91
Xuejia Lai and James L. Massey, “A Proposal for a New Block Encryption Standard”, EUROCRYPT ‘90 Proceedings of the
workshop on the theory and application of cryptographic techniques on Advances in cryptology, Springer-Verlag, 1991.
LKCT96
Eliot Lear, Jennifer Katinsky, Jeff Coffin and Diane Tharp, “Renumbering: Threat or Menace?”, Tenth USENIX System
Administration Conference, Chicago, 1996.
LSL05
DJ Leith, RN Shorten and Y Lee, “H-TCP: A framework for congestion control in high-speed and long-distance networks”,
Hamilton Institute Technical Report, August 2005.
LSM07
DJ Leith, RN Shorten and G McCullagh, “Experimental evaluation of Cubic-TCP”, Extended version of paper presented at
Proc. Protocols for Fast Long Distance Networks, 2007.

LBS06
Shao Liu, Tamer Basar and R Srikant, “TCP-Illinois: A Loss and Delay-Based Congestion Control Algorithm for High-Speed
Networks”, Proceedings of the 1st international conference on performance evaluation methodologies and tools, 2006.
AM90
Allison Mankin, “Random Drop Congestion Control”, ACM SIGCOMM Symposium on Communications Architectures and
Protocols, 1990.
MCGSW01
Saverio Mascolo, Claudio Casetti, Mario Gerla, MY Sanadidi, Ren Wang, “TCP Westwood: Bandwidth estimation for enhanced
transport over wireless links”, MobiCom ‘01: Proceedings of the 7th annual International Conference on Mobile Computing and
Networking, 2001.
McK90
Paul McKenney, “Stochastic Fairness Queueing”, IEEE INFOCOM ‘90 Proceedings, June 1990.
MABPPRST08
Nick McKeown, Tom Anderson, Hari Balakrishnan, Guru Parulkar, Larry Peterson, Jennifer Rexford, Scott Shenker and
Jonathan Turner, “OpenFlow: Enabling innovation in campus networks”, ACM SIGCOMM Computer Communications
Review, April 2008.
RM78
Ralph Merkle, “Secure Communications over Insecure Channels”, Communications of the ACM, volume 21, April 1978.
RM88
Ralph Merkle, “A Digital Signature Based on a Conventional Encryption Function”, Advances in Cryptology — CRYPTO ‘87,
Lecture Notes in Computer Science volume 293, Springer-Verlag, 1988.
MH81

Ralph Merkle and Martin Hellman, “On the Security of Multiple Encryption”, Communications of the ACM, volume 24, July
1981.
MB76
Robert Metcalfe and David Boggs, “Ethernet: Distributed Packet Switching for Local Computer Networks”, Communications
of the ACM, volume 19 number 7, 1976.
MW00
Jeonghoon Mo and Jean Walrand, “Fair End-to-End Window-Based Congestion Control”, IEEE/ACM Transactions on
Networking, volume 8 number 5, October 2000.

JM92
Jeffrey Mogul, “Observing TCP Dynamics in Real Networks”, ACM SIGCOMM Symposium on Communications
Architectures and Protocols, 1992.
MM94
Mart Molle, “A New Binary Logarithmic Arbitration Method for Ethernet”, Technical Report CSRI-298, Computer Systems
Research Institute, University of Toronto, 1994.
RTM85
Robert T Morris, “A Weakness in the 4.2BSD Unix TCP/IP Software”, AT&T Bell Laboratories Technical Report, February
1985.
NJ12
Kathleen Nichols and Van Jacobson, “Controlling Queue Delay”, ACM Queue, May 2012.
OKM96
Teunis Ott, JHB Kemperman and Matt Mathis, “The stationary behavior of ideal TCP congestion avoidance”, Technical Report,
1996.
PFTK98
Jitendra Padhye, Victor Firoiu, Don Towsley and Jim Kurose, “Modeling TCP Throughput: A Simple Model and its Empirical
Validation”, ACM SIGCOMM conference on Applications, technologies, architectures, and protocols for computer
communication, 1998.
PG93
Abhay Parekh and Robert Gallager, “A Generalized Processor Sharing Approach to Flow Control in Integrated Services
Networks - The Single-Node Case”, IEEE/ACM Transactions on Networking, volume 1 number 3, June 1993.
PG94
Abhay Parekh and Robert Gallager, “A Generalized Processor Sharing Approach to Flow Control in Integrated Services
Networks - The Multiple Node Case”, IEEE/ACM Transactions on Networking, volume 2 number 2, April 1994.
VP97
Vern Paxson, “End-to-End Internet Packet Dynamics”, ACM SIGCOMM conference on Applications, technologies,
architectures, and protocols for computer communication, 1997.
PWZMTQ17
Changhua Pei, Zhi Wang, Youjian Zhao, Zihan Wang, Yuan Meng, Dan Pei, Yuanquan Peng, Wenliang Tang and Xiaodong Qu,
“Why It Takes So Long to Connect to a WiFi Access Point”, IEEE International Conference on Computer Communications,
May 2017.
CP09
Colin Percival, “Stronger Key Derivation Via Sequential Memory-Hard Functions”, BSDCan - The Technical BSD Conference,
May 2009.
PB94

Charles Perkins and Pravin Bhagwat, “Highly Dynamic Destination-Sequenced Distance-Vector Routing (DSDV) for Mobile
Computers”, ACM SIGCOMM Computer Communications Review, volume 24 number 4, October 1994.
PR99
Charles Perkins and Elizabeth Royer, “Ad-hoc On-Demand Distance Vector Routing”, Proceedings of the Second IEEE
Workshop on Mobile Computing Systems and Applications, February 1999.
RP85
Radia Perlman, “An Algorithm for Distributed Computation of a Spanning Tree in an Extended LAN”, ACM SIGCOMM
Computer Communication Review 15(4), 1985.
RP04
Radia Perlman, “Rbridges: Transparent Routing”, IEEE INFOCOM 2004 Proceedings, March 2004.
JP88
John M Pollard, “Factoring with Cubic Integers”, unpublished manuscript circulated 1988; included in “The Development of
the Number Field Sieve”, Lecture Notes in Mathematics volume 1554, Springer-Verlag, 1993.
PDG12
Balaji Prabhakar, Katherine N Dektar and Deborah M Gordon, “The Regulation of Ant Colony Foraging Activity without
Spatial Information”, PLoS Computational Biology 8(8),
http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002670
PN98
Thomas Ptacek and Timothy Newsham, “Insertion, Evasion, and Denial of Service: Eluding Network Intrusion Detection”,
Technical report, Secure Networks Inc, January 1998.

RJ90
Kadangode Ramakrishnan and Raj Jain, “A Binary Feedback Scheme for Congestion Avoidance in Computer Networks”, ACM
Transactions on Computer Systems, volume 8 number 2, May 1990.
RX05
Injong Rhee and Lisong Xu, “Cubic: A new TCP-friendly high-speed TCP variant”, 3rd International Workshop on Protocols
for Fast Long-Distance Networks, February 2005.
RR91
Ronald Rivest, “The MD4 message digest algorithm”, Advances in Cryptology - CRYPTO ‘90 Proceedings, Springer-Verlag,
1991.
RSA78
Ronald Rivest, Adi Shamir and Leonard Adleman, “A Method for Obtaining Digital Signatures and Public-Key
Cryptosystems”, Communications of the ACM, volume 21, February 1978.

SRC84
Jerome Saltzer, David Reed and David Clark, “End-to-End Arguments in System Design”, ACM Transactions on Computer
Systems, volume 2 number 4, November 1984.
BS93
Bruce Schneier, “Description of a New Variable-Length Key, 64-Bit Block Cipher (Blowfish)”, Fast Software Encryption,
Cambridge Security Workshop Proceedings (December 1993), Springer-Verlag, 1994.
SM90
Nachum Shacham and Paul McKenney, “Packet recovery in high-speed networks using coding and buffer management”, IEEE
INFOCOM ‘90 Proceedings, June 1990.
SP03

Umesh Shankar and Vern Paxson, “Active Mapping: Resisting NIDS Evasion without Altering Traffic”, Proceedings of the
2003 IEEE Symposium on Security and Privacy, 2003.

SV96
M Shreedhar and George Varghese, “Efficient Fair Queuing Using Deficit Round Robin”, IEEE/ACM Transactions on
Networking, volume 4 number 3, June 1996.
JS05
John Ivan Simon, “John Simon on Music: Criticism, 1979-2005”, Applause Books, 2005.
SKS06
Rade Stanojević, Christopher Kellett and Robert Shorten, “Adaptive Tuning of Drop-Tail Buffers for Reducing Queuing
Delays”, IEEE Communications Letters, volume 10 number 7, August 2006.
TSZS06
Kun Tan, Jingmin Song, Qian Zhang and Murari Sridharan, “Compound TCP: A Scalable and TCP-friendly Congestion Control
for High-speed Networks”, 4th International Workshop on Protocols for Fast Long-Distance Networks (PFLDNet), 2006.
TWHL05
Ao Tang, Jintao Wang, Sanjay Hegde and Steven Low, “Equilibrium and Fairness of Networks Shared by TCP Reno and
Vegas/FAST”, Telecommunications Systems special issue on High Speed Transport Protocols, 2005.
TWP07
Erik Tews, Ralf-Philipp Weinmann and Andrei Pyshkin, “Breaking 104-bit WEP in less than 60 seconds”, WISA‘07
Proceedings of the 8th International Conference on Information Security Applications, Springer-Verlag, 2007.
VMCCP16
Mathy Vanhoef, Célestin Matte, Mathieu Cunche, Leonardo Cardoso and Frank Piessens, “Why MAC Address Randomization
is not Enough: An Analysis of Wi-Fi Network Discovery Mechanisms”, ACM Asia Conference on Computer and
Communications Security, May 2016.
VP17
Mathy Vanhoef and Frank Piessens, “Key Reinstallation Attacks: Forcing Nonce Reuse in WPA2”, Proceedings of the 2017
ACM SIGSAC Conference on Computer and Communications Security, October 2017.
VGE00
Kannan Varadhan, Ramesh Govindan and Deborah Estrin, “Persistent Route Oscillations in Inter-domain Routing”, Computer
Networks, volume 32, January 2000.
SV02
Serge Vaudenay, “Security Flaws Induced by CBC Padding – Applications to SSL, IPSEC, WTLS…”, EUROCRYPT ‘02
Proceedings, 2002.
WJLH06
David Wei, Cheng Jin, Steven Low and Sanjay Hegde, “FAST TCP: Motivation, Architecture, Algorithms, Performance”,
IEEE/ACM Transactions on Networking, December 2006.
WM05
Damon Wischik and Nick McKeown, “Part I: Buffer Sizes for Core Routers”, ACM SIGCOMM Computer Communication
Review, volume 35 number 2, July 2005.
LZ89
Lixia Zhang, “A New Architecture for Packet Switching Network Protocols”, PhD Thesis, Massachusetts Institute of
Technology, 1989.
ZSC91

Lixia Zhang, Scott Shenker and David Clark, “Observations on the Dynamics of a Congestion Control Algorithm: The Effects
of Two-Way Traffic”, ACM SIGCOMM Symposium on Communications Architectures and Protocols, 1991.

Index
C
circuit switching
4.2: Time-Division Multiplexing
collision domain
2.5: Ethernet Switches
F
Firewalls
1.13: Firewalls
I
IPv6 Header
8.2: The IPv6 Header
L
LTE
3.8: WiMAX and LTE
T
Token Bus
3.3: Token Ring
Token Ring
3.3: Token Ring
V
Virtual LAN
2.7: Virtual LAN (VLAN)
VLAN
2.7: Virtual LAN (VLAN)
W
WiMAX
3.8: WiMAX and LTE
Detailed Licensing
Overview
Title: An Introduction to Computer Networks (Dordal)
Webpages: 172
Applicable Restrictions: Noncommercial, No Derivatives
All licenses found:
CC BY-NC-ND 4.0: 60.5% (104 pages)
Undeclared: 39.5% (68 pages)

By Page
An Introduction to Computer Networks (Dordal) - CC BY-NC-ND 4.0
  Front Matter - CC BY-NC-ND 4.0
    Preface - CC BY-NC-ND 4.0
    TitlePage - Undeclared
    InfoPage - Undeclared
    Table of Contents - Undeclared
    Licensing - Undeclared
  1: An Overview of Networks - CC BY-NC-ND 4.0
    Front Matter - Undeclared
      TitlePage - Undeclared
      InfoPage - Undeclared
    1.1: Layers - CC BY-NC-ND 4.0
    1.2: Data Rate, Throughput and Bandwidth - CC BY-NC-ND 4.0
    1.3: Packets - CC BY-NC-ND 4.0
    1.4: Datagram Forwarding - CC BY-NC-ND 4.0
    1.5: Topology - CC BY-NC-ND 4.0
    1.6: Routing Loops - CC BY-NC-ND 4.0
    1.7: Congestion - CC BY-NC-ND 4.0
    1.8: Packets Again - CC BY-NC-ND 4.0
    1.9: LANs and Ethernet - CC BY-NC-ND 4.0
    1.10: IP - Internet Protocol - CC BY-NC-ND 4.0
    1.11: DNS - CC BY-NC-ND 4.0
    1.12: Transport - CC BY-NC-ND 4.0
    1.13: Firewalls - CC BY-NC-ND 4.0
    1.14: Some Useful Utilities - CC BY-NC-ND 4.0
    1.15: IETF and OSI - CC BY-NC-ND 4.0
    1.16: Berkeley Unix - CC BY-NC-ND 4.0
    1.E: An Overview of Networks (Exercises) - CC BY-NC-ND 4.0
    Back Matter - Undeclared
      Index - Undeclared
  2: Ethernet - CC BY-NC-ND 4.0
    Front Matter - Undeclared
      TitlePage - Undeclared
      InfoPage - Undeclared
    2.1: Prelude to Ethernet - CC BY-NC-ND 4.0
    2.2: 10-Mbps Classic Ethernet - CC BY-NC-ND 4.0
    2.3: 100 Mbps (Fast) Ethernet - CC BY-NC-ND 4.0
    2.4: Gigabit Ethernet - CC BY-NC-ND 4.0
    2.5: Ethernet Switches - CC BY-NC-ND 4.0
    2.6: Spanning Tree Algorithm and Redundancy - CC BY-NC-ND 4.0
    2.7: Virtual LAN (VLAN) - CC BY-NC-ND 4.0
    2.8: TRILL and SPB - CC BY-NC-ND 4.0
    2.9: Software-Defined Networking - CC BY-NC-ND 4.0
    2.E: Ethernet (Exercises) - CC BY-NC-ND 4.0
    Back Matter - Undeclared
      Index - Undeclared
  3: Other LANs - CC BY-NC-ND 4.0
    Front Matter - Undeclared
      TitlePage - Undeclared
      InfoPage - Undeclared
    3.1: Virtual Private Networks - CC BY-NC-ND 4.0
    3.2: Carrier Ethernet - CC BY-NC-ND 4.0
    3.3: Token Ring - CC BY-NC-ND 4.0
    3.4: Virtual Circuits - CC BY-NC-ND 4.0
    3.5: Asynchronous Transfer Mode - ATM - CC BY-NC-ND 4.0
    3.6: Adventures in Radioland - CC BY-NC-ND 4.0
    3.7: Wi-Fi - CC BY-NC-ND 4.0
    3.8: WiMAX and LTE - CC BY-NC-ND 4.0
    3.9: Fixed Wireless - CC BY-NC-ND 4.0
    3.10: Epilog and Exercises - CC BY-NC-ND 4.0
    Back Matter - Undeclared
      Index - Undeclared
  4: Links - CC BY-NC-ND 4.0
    Front Matter - Undeclared
      TitlePage - Undeclared
      InfoPage - Undeclared
    4.1: Encoding and Framing - CC BY-NC-ND 4.0
    4.2: Time-Division Multiplexing - CC BY-NC-ND 4.0
    4.E: Links (Exercises) - CC BY-NC-ND 4.0
    Back Matter - Undeclared
      Index - Undeclared
  5: Packets - CC BY-NC-ND 4.0
  6: Abstract Sliding Windows - CC BY-NC-ND 4.0
    Front Matter - Undeclared
      TitlePage - Undeclared
      InfoPage - Undeclared
    6.1: Building Reliable Transport - Stop-and-Wait - CC BY-NC-ND 4.0
    6.2: Sliding Windows - CC BY-NC-ND 4.0
    6.3: Linear Bottlenecks - CC BY-NC-ND 4.0
    6.4: Epilog and Exercises - CC BY-NC-ND 4.0
    Back Matter - Undeclared
      Index - Undeclared
  7: IP version 4 - CC BY-NC-ND 4.0
    Front Matter - Undeclared
      TitlePage - Undeclared
      InfoPage - Undeclared
    7.1: Prelude to IP version 4 - CC BY-NC-ND 4.0
    7.2: The IPv4 Header - CC BY-NC-ND 4.0
    7.3: Interfaces - CC BY-NC-ND 4.0
    7.4: Special Addresses - CC BY-NC-ND 4.0
    7.5: Fragmentation - CC BY-NC-ND 4.0
    7.6: The Classless IP Delivery Algorithm - CC BY-NC-ND 4.0
    7.7: IPv4 Subnets - CC BY-NC-ND 4.0
    7.8: Network Address Translation - CC BY-NC-ND 4.0
    7.9: DNS - CC BY-NC-ND 4.0
    7.10: Address Resolution Protocol - ARP - CC BY-NC-ND 4.0
    7.11: Dynamic Host Configuration Protocol (DHCP) - CC BY-NC-ND 4.0
    7.12: Internet Control Message Protocol - CC BY-NC-ND 4.0
    7.13: Unnumbered Interfaces - CC BY-NC-ND 4.0
    7.14: Mobile IP - CC BY-NC-ND 4.0
    7.15: Epilog and Exercises - CC BY-NC-ND 4.0
    Back Matter - Undeclared
      Index - Undeclared
  8: IP version 6 - CC BY-NC-ND 4.0
    Front Matter - Undeclared
      TitlePage - Undeclared
      InfoPage - Undeclared
    8.1: Prelude to IP version 6 - CC BY-NC-ND 4.0
    8.2: The IPv6 Header - CC BY-NC-ND 4.0
    8.3: IPv6 Addresses - CC BY-NC-ND 4.0
    8.4: Network Prefixes - CC BY-NC-ND 4.0
    8.5: IPv6 Multicast - CC BY-NC-ND 4.0
    8.6: IPv6 Extension Headers - CC BY-NC-ND 4.0
    8.7: Neighbor Discovery - CC BY-NC-ND 4.0
    8.8: IPv6 Host Address Assignment - CC BY-NC-ND 4.0
    8.9: Globally Exposed Addresses - CC BY-NC-ND 4.0
    8.10: ICMPv6 - CC BY-NC-ND 4.0
    8.11: IPv6 Subnets - CC BY-NC-ND 4.0
    8.12: Using IPv6 and IPv4 Together - CC BY-NC-ND 4.0
    8.13: IPv6 Examples Without a Router - CC BY-NC-ND 4.0
    8.14: IPv6 Connectivity via Tunneling - CC BY-NC-ND 4.0
    8.15: IPv6-to-IPv4 Connectivity - CC BY-NC-ND 4.0
    8.16: Epilog and Exercises - CC BY-NC-ND 4.0
    Back Matter - Undeclared
      Index - Undeclared
  9: Routing-Update Algorithms - CC BY-NC-ND 4.0
    Front Matter - Undeclared
      TitlePage - Undeclared
      InfoPage - Undeclared
    9.1: Prelude to Routing-Update Algorithms - Undeclared
    9.2: Distance-Vector Routing-Update Algorithm - Undeclared
    9.3: Distance-Vector Slow-Convergence Problem - Undeclared
    9.4: Observations on Minimizing Route Cost - Undeclared
    9.5: Loop-Free Distance Vector Algorithms - Undeclared
    9.6: Link-State Routing-Update Algorithm - Undeclared
    9.7: Routing on Other Attributes - Undeclared
    9.8: ECMP - Undeclared
    9.9: Epilog and Exercises - Undeclared
    Back Matter - Undeclared
      Index - Undeclared
  10: Large-Scale IP Routing - CC BY-NC-ND 4.0
    Front Matter - Undeclared
      TitlePage - Undeclared
      InfoPage - Undeclared
    10.1: Classless Internet Domain Routing - CIDR - Undeclared
    10.2: Hierarchical Routing - Undeclared
    10.3: Legacy Routing - Undeclared
    10.4: Provider-Based Routing - Undeclared
    10.5: Geographical Routing - Undeclared
    10.6: Border Gateway Protocol, BGP - Undeclared
    10.7: Epilog and Exercises - Undeclared
    Back Matter - Undeclared
      Index - Undeclared
  11: UDP Transport - CC BY-NC-ND 4.0
  12: TCP Transport - CC BY-NC-ND 4.0
  13: TCP Reno and Congestion Management - CC BY-NC-ND 4.0
  14: Dynamics of TCP - CC BY-NC-ND 4.0
  15: Newer TCP Implementations - CC BY-NC-ND 4.0
  16: Network Simulations - ns-2 - CC BY-NC-ND 4.0
  17: The ns-3 Network Simulator - CC BY-NC-ND 4.0
  18: Mininet - CC BY-NC-ND 4.0
  19: Queuing and Scheduling - CC BY-NC-ND 4.0
  20: Quality of Service - CC BY-NC-ND 4.0
  21: Network Management and SNMP - CC BY-NC-ND 4.0
  22: Security - CC BY-NC-ND 4.0
  23: Selected Solutions - CC BY-NC-ND 4.0
  Back Matter - CC BY-NC-ND 4.0
    Bibliography - CC BY-NC-ND 4.0
    Index - CC BY-NC-ND 4.0
    Index - Undeclared
    Glossary - Undeclared
    Detailed Licensing - Undeclared
