CNC UNIT-2 6thSEM
7.1.1 Packetizing
The first duty of the network layer is definitely packetizing: encapsulating the payload (data
received from the upper layer) in a network-layer packet at the source and decapsulating the
payload from the network-layer packet at the destination. In other words, one duty of the network
layer is to carry a payload from the source to the destination without changing it or using it.
The network layer is doing the service of a carrier such as the postal office, which is responsible
for delivery of packages from a sender to a receiver without changing or using the contents.
The source host receives the payload from an upper-layer protocol, adds a header that
contains the source and destination addresses and some other information that is required by
the network-layer protocol (as discussed later), and delivers the packet to the data-link layer.
The destination host receives the network-layer packet from its data-link layer, decapsulates
the packet, and delivers the payload to the corresponding upper-layer protocol. If the packet is
fragmented at the source or at routers along the path, the network layer is responsible for waiting
until all fragments arrive, reassembling them, and delivering them to the upper-layer protocol.
7.1.2 Routing
The network layer is responsible for routing a network-layer packet from its source to the
destination. A physical network is a combination of networks (LANs and WANs) and routers that
connect them. This means that there is more than one route from the source to the destination. The
network layer is responsible for finding the best one among these possible routes. The network
layer needs to have some specific strategies for defining the best route.
7.1.7 Security
Another issue related to communication at the network layer is security. Security was not a concern
when the Internet was originally designed because it was used by a small number of users at
universities for research activities; other people had no access to the Internet. The network layer
was designed with no security provision. Today, however, security is a big concern. To provide
security for a connectionless network layer, we need to have another virtual level that
changes the connectionless service to a connection-oriented service. This virtual layer, called
IPSec, is discussed in Chapter 13.
7.3 PERFORMANCE
The upper-layer protocols that use the service of the network layer expect to receive an ideal
service, but the network layer is not perfect. The performance of a network can be measured
in terms of delay, throughput, and packet loss. Congestion control is an issue that can improve
the performance.
7.3.1 Delay
All of us expect an instantaneous response from a network, but a packet, from its source to its
destination, encounters delays. The delays in a network can be divided into four types:
transmission delay, propagation delay, processing delay, and queuing delay. Let us first
discuss each of these delay types and then show how to calculate a packet delay from the source
to the destination.
Transmission Delay
A source host or a router cannot send a packet instantaneously. A sender needs to put the bits in
a packet on the line one by one. If the first bit of the packet is put on the line at time t1 and the
last bit is put on the line at time t2, the transmission delay of the packet is (t2 − t1). The
transmission delay is longer for a longer packet and shorter if the sender can transmit faster. In
other words, the transmission delay is

Delaytr = (Packet length) / (Transmission rate)
Propagation Delay
Propagation delay is the time it takes for a bit to travel from point A to point B in the
transmission media. The propagation delay for a packet-switched network depends on the
propagation delay of each network (LAN or WAN). The propagation delay depends on the
propagation speed of the media, which is 3 × 10^8 m/s in a vacuum and normally much less in a
wired medium; it also depends on the distance of the link. In other words, the propagation delay is

Delaypg = (Distance) / (Propagation speed)
Processing Delay
The processing delay is the time required for a router or a destination host to receive a packet
from its input port, remove the header, perform an error-detection procedure, and deliver
the packet to the output port (in the case of a router) or deliver the packet to the upper-layer
protocol (in the case of the destination host). The processing delay may be different for each
packet, but normally is calculated as an average.
Queuing Delay
Queuing delay can normally happen in a router. As we discuss in Chapter 8, a router has an
input queue connected to each of its input ports to store packets waiting to be processed; the router
also has an output queue connected to each of its output ports to store packets waiting to be
transmitted. The queuing delay for a packet in a router is measured as the time a packet waits
in the input queue and output queue of a router. We can compare the situation with a busy
airport. Some planes may need to wait to get the landing band (input delay); some planes may need
to wait to get the departure band (output delay).
Total Delay
Assuming equal delays for the sender, routers, and receiver, the total delay (source-to-destination
delay) a packet encounters can be calculated if we know the number of routers, n, in the whole
path.

Total delay = (n + 1) (Delaytr + Delaypg + Delaypr) + (n) (Delayqu)
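To make the calculation concrete, the following is a minimal Python sketch of the total-delay formula above. All numeric values are hypothetical examples, and the sketch assumes every link has the same length, rate, and per-device delays.

def total_delay(n_routers, packet_bits, rate_bps, distance_m, speed_mps,
                processing_s, queuing_s):
    # Source-to-destination delay, assuming equal delays at the sender,
    # every router, and the receiver; n_routers is n in the formula above.
    delay_tr = packet_bits / rate_bps        # transmission delay
    delay_pg = distance_m / speed_mps        # propagation delay per link
    delay_pr = processing_s                  # processing delay per device
    delay_qu = queuing_s                     # queuing delay per router
    return (n_routers + 1) * (delay_tr + delay_pg + delay_pr) + n_routers * delay_qu

# Hypothetical example: 3 routers, 10,000-bit packets, 1 Mbps links,
# 2000 km per link, 2 x 10^8 m/s in cable, 1 ms processing, 2 ms queuing.
print(total_delay(3, 10_000, 1_000_000, 2_000_000, 2e8, 0.001, 0.002))   # 0.09 s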
7.3.2 Throughput
Throughput at any point in a network is defined as the number of bits passing through the point in
a second, which is actually the transmission rate of data at that point. In a path from source to
destination, a packet may pass through several links (networks), each with a different transmission
rate. How, then, can we determine the throughput of the whole path? To see the situation, assume
that we have three links, each with a different transmission rate, as shown in Figure 7.2.
Figure 7.2 Throughput in a path with three links in a series
In Figure 7.2, the data can flow at the rate of 200 kbps in link 1. However, when the data
arrives at router R1, it cannot pass at this rate. Data need to be queued at the router and sent at 100
kbps. When data arrive at router R2, they could be sent at the rate of 150 kbps, but there is not
enough data to be sent. In other words, the average rate of the data flow in link 3 is also 100 kbps.
We can conclude that the average data rate for this path is 100 kbps, the minimum of the three
different data rates. Figure 7.2 also shows that we can simulate the behavior of each link with pipes
of different sizes; the average throughput is determined by the bottleneck, the pipe with the
smallest diameter. In general, in a path with n links in series, we have

Throughput = minimum {TR1, TR2, . . . , TRn}
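As a quick illustration, here is a tiny Python sketch of the bottleneck rule for the three links of Figure 7.2; the rates are the example values from the figure.

link_rates_kbps = [200, 100, 150]        # link 1, link 2, link 3 in Figure 7.2
path_throughput = min(link_rates_kbps)   # the bottleneck link decides
print(path_throughput)                   # 100 kbps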
Although the situation in Figure 7.2 shows how to calculate the throughput when the data are
passed through several links, the actual situation in the Internet is that the data normally pass
through two access networks and the Internet backbone, as shown in Figure 7.3.
Figure 7.3 A path through the Internet backbone
Address Space
A protocol like IPv4 that defines addresses has an address space. An address space is the total
number of addresses used by the protocol. If a protocol uses b bits to define an address, the
address space is 2^b because each bit can have two different values (0 or 1). IPv4 uses 32-bit
addresses, which means that the address space is 2^32 or 4,294,967,296 (more than 4 billion).
If there were no restrictions, more than 4 billion devices could be connected to the Internet.
Notation
There are three common notations to show an IPv4 address: binary notation (base 2), dotted-
decimal notation (base 256), and hexadecimal notation (base 16). In binary notation, an IPv4
address is displayed as 32 bits. To make the address more readable, one or more spaces are usually
inserted between each octet (8 bits). Each octet is often referred to as a byte. To make the IPv4
address more compact and easier to read, it is usually written in decimal form with a decimal point
(dot) separating the bytes. This format is referred to as dotted-decimal notation. Note that because
each byte (octet) is only 8 bits, each number in the dotted-decimal notation is between 0 and
255. We sometimes see an IPv4 address in hexadecimal notation. Each hexadecimal digit is
equivalent to 4 bits. This means that a 32-bit address has eight hexadecimal digits. This
notation is often used in network programming. Figure 7.4 shows an IP address in the three
discussed notations.
Figure 7.4 Three different notations in IPv4 addressing
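The conversion between the three notations is mechanical; the following Python sketch converts a hypothetical dotted-decimal address to binary and hexadecimal notation.

address = "192.168.10.1"                                   # dotted-decimal (base 256)
octets = [int(part) for part in address.split(".")]

binary = " ".join(f"{octet:08b}" for octet in octets)      # 32 bits, grouped per octet
hexadecimal = "".join(f"{octet:02X}" for octet in octets)  # 8 hexadecimal digits

print(binary)        # 11000000 10101000 00001010 00000001
print(hexadecimal)   # C0A80A01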
Hierarchy in Addressing
In any communication network that involves delivery, such as a telephone network or a postal
network, the addressing system is hierarchical. In a postal network, the postal address (mailing
address) includes the country, state, city, street, house number, and the name of the mail recipient.
Similarly, a telephone number is divided into the country code, area code, local exchange, and the
connection.
A 32-bit IPv4 address is also hierarchical but is divided only into two parts. The first part
of the address, called the prefix, defines the network; the second part of the address, called
the suffix, defines the node (connection of a device to the Internet). Figure 7.5 shows the prefix
and suffix of a 32-bit IPv4 address. The prefix length is n bits, and the suffix length is (32 − n)
bits.
A prefix can be fixed length or variable length. The network identifier in the IPv4 was
first designed as a fixed-length prefix. This scheme, which is now obsolete, is referred to as
classful addressing. The new scheme, which is referred to as classless addressing, uses a
variable-length network prefix. First, we briefly discuss classful addressing; then we concentrate
on classless addressing.
Figure 7.5 Hierarchy in addressing
Classful Addressing
When the Internet started, an IPv4 address was designed with a fixed-length prefix, but to
accommodate both small and large networks, three fixed-length prefixes were designed instead
of one (n = 8, n = 16, and n = 24). The whole address space was divided into five classes (classes
A, B, C, D, and E), as shown in Figure 7.6. This scheme is referred to as classful addressing.
Although classful addressing belongs to the past, it helps us to understand classless addressing.
Figure 7.6 Occupation of the address space in classful addressing
In class A, the network length is 8 bits, but because the first bit, which is 0, defines the class,
we can have only 7 bits as the network identifier. This means there are only 2^7 = 128 networks in
the world that can have a class A address.
In class B, the network length is 16 bits, but because the first 2 bits, which are (10)2, define
the class, we can have only 14 bits as the network identifier. This means there are only 2^14 = 16,384
networks in the world that can have a class B address.
All addresses that start with (110)2 belong to class C. In class C, the network length is 24
bits, but because 3 bits define the class, we can have only 21 bits as the network identifier. This
means there are 2^21 = 2,097,152 networks in the world that can have a class C address.
Class D is not divided into prefix and suffix. It is used for multicast addresses.
All addresses that start with 1111 in binary belong to class E. As in class D, class E is
not divided into prefix and suffix and is used as reserve.
Address Depletion
The reason that classful addressing has become obsolete is address depletion. Because the
addresses were not distributed properly, the Internet was faced with the problem of the addresses
being rapidly used up, resulting in no more addresses being available for organizations and
individuals that needed to have an Internet connection. To understand the problem, let us think
about class A. This class can be assigned to only 128 organizations in the world, but each
organization would need to have one single network (seen by the rest of the world) with 16,777,216
nodes (computers in this single network). Because there were only a few organizations that are
this large, most of the addresses in this class were wasted (unused). Class B addresses were
designed for midsize organizations, but many of the addresses in this class also remained unused.
Class C addresses have a completely different design flaw. The number of addresses that can be
used in each network (256) was so small that most companies were not comfortable using a block
in this class. Class E addresses were almost never used, wasting the whole class.
Classless Addressing
With the growth of the Internet, it was clear that a larger address space was needed as a long-term
solution. The larger address space, however, requires that the length of IP addresses also be
increased, which means the format of the IP packets needs to be changed. Although the long-range
solution has already been devised and is called IPv6 (discussed in Section 7.5), a short-term
solution was also devised to use the same address space but to change the distribution of addresses
to provide a fair share to each organization. The short-term solution still uses IPv4 addresses, but
it is called classless addressing. In other words, the class privilege was removed from the
distribution to compensate for the address depletion.
There was another motivation for classless addressing. During the 1990s, Internet Service
Providers (ISPs) came into prominence. An ISP is an organization that provides Internet access
and services for individuals, small businesses, and midsize organizations that do not want to create
an Internet site and become involved in providing Internet services (such as electronic mail) for
their employees. An ISP is granted a large range of addresses and then subdivides the addresses
(in groups of 1, 2, 4, 8, 16, and so on), giving a range of addresses to a household or a small
business. The customers are connected via a dial-up modem, DSL, or cable modem to the ISP.
However, each customer needs some IPv4 addresses.
In 1996, the Internet authorities announced a new architecture called classless addressing. In
classless addressing, variable-length blocks are used that belong to no classes. We can have a block
of 1 address, 2 addresses, 4 addresses, 128 addresses, and so on.
In classless addressing, the whole address space is divided into variable-length blocks. The
prefix in an address defines the block (network); the suffix defines the node (device). Theoretically,
we can have a block of 2^0, 2^1, 2^2, …, 2^32 addresses. One of the restrictions is that the number of
addresses in a block needs to be a power of 2. An organization can be granted one block of
addresses. Figure 7.7 shows the division of the whole address space into non-overlapping blocks.
Figure 7.7 Variable-length blocks in classless addressing
In other words, an address in classless addressing does not, per se, define the block or network
to which the address belongs; we need to give the prefix length also.
The first address can be found by keeping the first 27 bits and changing the rest of the bits to 0s.
Address: 167.199.170.82/27 → 10100111 11000111 10101010 01010010
First address: 167.199.170.64/27 → 10100111 11000111 10101010 01000000
The last address can be found by keeping the first 27 bits and changing the rest of the bits to 1s.
Address: 167.199.170.82/27 → 10100111 11000111 10101010 01010010
Last address: 167.199.170.95/27 → 10100111 11000111 10101010 01011111
Address Mask
Another way to find the first and last addresses in the block is to use the address mask. The
address mask is a 32-bit number in which the n leftmost bits are set to 1s and the rest of the
bits (32 − n) are set to 0s. A computer can easily find the address mask because it is the
complement of (2^(32−n) − 1). The reason for defining a mask in this way is that it can be used by a
computer program to extract the information in a block, using the three bit-wise operations NOT,
AND, and OR.
1. The number of addresses in the block N = NOT (Mask) + 1.
2. The first address in the block = (Any address in the block) AND (Mask).
3. The last address in the block = (Any address in the block) OR [NOT (Mask)].
Example 7.2
We repeat Example 7.1 (167.199.170.82/27) using the mask. The mask in dotted-decimal notation
is 255.255.255.224. The AND, OR, and NOT operations can be applied to individual bytes using
calculators and applets at the book website.
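As a sketch of how a program might apply these three rules, the following Python fragment treats addresses as 32-bit integers and recomputes the block information for 167.199.170.82/27; the helper functions are illustrative, not part of any standard library.

def to_int(dotted):
    a, b, c, d = (int(x) for x in dotted.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def to_dotted(value):
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))

n = 27
mask = (0xFFFFFFFF << (32 - n)) & 0xFFFFFFFF     # n leftmost bits set to 1
addr = to_int("167.199.170.82")

count = (~mask & 0xFFFFFFFF) + 1                 # rule 1: N = NOT(mask) + 1
first = addr & mask                              # rule 2: address AND mask
last = addr | (~mask & 0xFFFFFFFF)               # rule 3: address OR NOT(mask)

print(count, to_dotted(first), to_dotted(last))  # 32 167.199.170.64 167.199.170.95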
Network Address
The preceding examples show that, given any address, we can find all information about the block.
The first address, the network address, is particularly important because it is used in routing a
packet to its destination network. For the moment, let us assume that an internet is made up
of m networks and a router with m interfaces. When a packet arrives at the router from any source
host, the router needs to know to which network the packet should be sent and from which interface
the packet should be sent out. When the packet arrives at the network, it reaches its destination
host using link-layer addressing, which was discussed in Chapter 3 (Section 3.4). Figure 7.9 shows
the idea.
Figure 7.9 Network address
After the network address has been found, the router consults its forwarding table to find the
corresponding interface from which the packet should be sent out. The network address is actually
the identifier of the network; each network is identified by its network address.
Block Allocation
The next issue in classless addressing is block allocation. How are the blocks allocated? The
ultimate responsibility of block allocation is given to a global authority called the Internet
Corporation for Assigned Names and Numbers (ICANN). However, ICANN does not normally
allocate addresses to individual Internet users. It assigns a large block of addresses to an ISP (or a
larger organization that is considered an ISP in this case).
For the proper operation of the CIDR, three restrictions need to be applied to the allocated block.
1. The number of requested addresses, N, needs to be a power of 2. The reason is that N = 2^(32−n) or n = 32 −
log2 N. If N is not a power of 2, we cannot have an integer value for n.
2. The requested block needs to be allocated where there is a contiguous range of available
addresses in the address space. However, there is a restriction on choosing the first address
in the block.
3. The first address needs to be divisible by the number of addresses in the block. The reason is
that the first address needs to be the prefix followed by (32 − n) 0s. The decimal value of the
first address is then

First address = (prefix in decimal) × 2^(32−n)
Example 7.4
An ISP has requested a block of 1000 addresses. Because 1000 is not a power of 2, 1024 addresses
are granted. The prefix length is calculated as n = 32 − log2 1024 = 22. An available block,
18.14.12.0/22, is granted to the ISP. It can be seen that the first address in decimal is 302,910,464,
which is divisible by 1024.
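The following short Python check mirrors the three restrictions for this example; rounding the request up to a power of 2 and testing divisibility of the first address are the only computations involved.

import math

requested = 1000
granted = 1 << math.ceil(math.log2(requested))   # round up to a power of 2 -> 1024
n = 32 - int(math.log2(granted))                 # prefix length -> 22

first = (18 << 24) | (14 << 16) | (12 << 8) | 0  # 18.14.12.0 as a 32-bit integer
print(granted, n, first, first % granted == 0)   # 1024 22 302910464 True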
Subnetting
More levels of hierarchy can be created using subnetting. An organization (or an ISP) that is
granted a range of addresses may divide the range into several subranges and assign each
subrange to a subnetwork (or subnet). Note that nothing stops the organization from creating more
levels. A subnetwork can be divided into several sub-subnetworks. A sub-subnetwork can be
divided into several sub-sub-subnetworks, and so on.
Designing Subnets The subnetworks in a network should be carefully designed to enable the
routing of packets. We assume the total number of addresses granted to the organization is N, the
prefix length is n, the assigned number of addresses to each subnetwork is Nsub, and the prefix
length for each subnetwork is nsub.
Then the following steps need to be carefully followed to guarantee the proper operation of
the subnetworks.
The number of addresses in each subnetwork should be a power of 2.
The prefix length for each subnetwork should be found using the following formula:

nsub = 32 − log2 Nsub
The starting address in each subnetwork should be divisible by the number of addresses in that
subnetwork. This can be achieved if we first assign addresses to larger subnetworks.
Finding Information about Each Subnetwork After designing the subnetworks, the information
about each subnetwork, such as first and last address, can be found using the process we described
to find the information about each network in the Internet.
Example 7.5
An organization is granted a block of addresses with the beginning address 14.24.74.0/24. The
organization needs to have three subblocks of addresses to use in its three subnets: one subblock
of 10 addresses, one subblock of 60 addresses, and one subblock of 120 addresses. Design the
subblocks.
Solution
There are 2^(32−24) = 256 addresses in this block. The first address is 14.24.74.0/24; the last address is
14.24.74.255/24. To satisfy the third requirement, we assign addresses to subblocks, starting with
the largest and ending with the smallest one.
a. The number of addresses in the largest subblock, which requires 120 addresses, is not a
power of 2. We allocate 128 addresses. The subnet mask for this subnet can be found as n1 =
32 − log2 128 = 25. The first address in this block is 14.24.74.0/25; the last address is
14.24.74.127/25.
b. The number of addresses in the second largest subblock, which requires 60 addresses, is not a power of
2 either. We allocate 64 addresses. The subnet mask for this subnet can be found as n2 = 32 − log2 64 =
26. The first address in this block is 14.24.74.128/26; the last address is 14.24.74.191/26.
c. The number of addresses in the smallest subblock, which requires 10 addresses, is not a power of 2
either. We allocate 16 addresses. The subnet mask for this subnet can be found as n3 = 32 − log2 16 = 28.
The first address in this block is 14.24.74.192/28; the last address is 14.24.74.207/28.
If we add all addresses in the previous subblocks, the result is 208 addresses, which means 48
addresses are left in reserve. The first address in this range is 14.24.74.208. The last address is
14.24.74.255. We don’t know about the prefix length yet. Figure 7.10 shows the configuration of
blocks. It shows the first address in each block.
Figure 7.10 Solution to Example 7.5
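A small Python sketch of the design procedure used in Example 7.5 follows: each request is rounded up to a power of 2 and the largest subblocks are assigned first, so every starting address is divisible by its block size. The helper function is only for display.

import math

def dotted(value):
    return ".".join(str((value >> s) & 0xFF) for s in (24, 16, 8, 0))

base = (14 << 24) | (24 << 16) | (74 << 8) | 0   # 14.24.74.0
requests = [120, 60, 10]

start = base
for need in sorted(requests, reverse=True):
    size = 1 << math.ceil(math.log2(need))       # 128, 64, 16 addresses
    prefix = 32 - int(math.log2(size))           # 25, 26, 28
    print(f"{dotted(start)}/{prefix} .. {dotted(start + size - 1)}/{prefix}")
    start += size
# 14.24.74.0/25 .. 14.24.74.127/25
# 14.24.74.128/26 .. 14.24.74.191/26
# 14.24.74.192/28 .. 14.24.74.207/28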
Address Aggregation
One of the advantages of the CIDR strategy is address aggregation (sometimes called address
summarization or route summarization). When blocks of addresses are combined to create a larger
block, routing can be done based on the prefix of the larger block. ICANN assigns a large block
of addresses to an ISP. Each ISP in turn divides its assigned block into smaller subblocks and
grants the subblocks to its customers.
Example 7.6
Figure 7.11 shows how four small blocks of addresses are assigned to four organizations by an
ISP. The ISP combines these four blocks into one single block and advertises the larger block to
the rest of the world. Any packet destined for this larger block should be sent to this ISP. It is the
responsibility of the ISP to forward the packet to the appropriate organization. This is similar to
the routing we find in a postal network. All packages coming from outside a country are sent first
to the capital and then distributed to the corresponding destination.
Figure 7.11 Example of address aggregation
IPv4 is an unreliable datagram protocol—a best-effort delivery service. The term best
effort means that IPv4 packets can be corrupted, be lost, arrive out of order, be delayed or create
congestion for the network. If reliability is important, IPv4 must be paired with a reliable transport-
layer protocol such as TCP. An example of a common best-effort delivery service is the post office.
The post office does its best to deliver the regular mail but does not always succeed. If an
unregistered letter is lost or damaged in the mail, the would-be recipient will not receive the
correspondence and the sender will need to re-create it.
IPv4 is also a connectionless protocol that uses the datagram approach. This means that
each datagram is handled independently, and each datagram can follow a different route to the
destination. This implies that datagrams sent by the same source to the same destination could
arrive out of order. Again, IPv4 relies on a higher-level protocol to take care of all these problems.
Datagram Format
In this section, we begin by discussing the first service provided by IPv4, packetizing. We show
how IPv4 defines the format of a packet in which the data coming from the upper layer or other
protocols are encapsulated. Packets used by the IP are called datagrams. Figure 7.13 shows the
IPv4 datagram format.
A datagram is a variable-length packet consisting of two parts: the header and payload (data).
The header is 20 to 60 bytes in length and contains information essential to routing and delivery.
The first 20 bytes are essential and together are called the base header. The remaining bytes (up
to 40) hold options and padding, which may or may not be present. It is customary in TCP/IP to
show the header in 4-byte sections.
Discussing the meaning and rationale for the existence of each field is essential to
understanding the operation of IPv4; a brief description of each field is in order.
Version number. The 4-bit version number (VER) field defines the version of the IPv4 protocol,
which, obviously, has the value of 4.
Header length. The 4-bit header length (HLEN) field defines the total length of the datagram
header in 4-byte words. The IPv4 datagram has a variable-length header. When a device receives a
datagram, it needs to know when the header stops and the data, which are encapsulated in the packet,
start. However, to make the value of the header length (number of bytes) fit in a 4-bit header length,
the total length of the header is calculated as 4-byte words. The total length is divided by 4, and the
value is inserted in the field. The receiver needs to multiply the value of this field by 4 to find the
total length.
Service type. In the original design of the IP header, this field was referred to as type of service
(TOS), which defined how the datagram should be handled. In the late 1990s, the Internet
Engineering Task Force (IETF) redefined the field to provide differentiated services (DiffServ),
which divide applications into different classes according to their priority. (The use of 4-byte words
for the header length is also logical because the IP header always needs to be aligned on 4-byte
boundaries.)
Total length. This 16-bit field defines the total length (header plus data) of the IP datagram in bytes.
A 16-bit number can define a total length of up to 65,535 (when all bits are 1s). However, the size
of the datagram is normally much less than this. This field helps the receiving device know when
the packet has completely arrived. To find the length of the data coming from the upper layer,
subtract the header length from the total length. The header length can be found by multiplying the
value in the HLEN field by 4.
Example 7.7
An IPv4 packet has arrived with the first 8 bits as (01000010)2. The receiver discards the packet. Why?
Solution
There is an error in this packet. The 4 leftmost bits (0100)2 show the version, which is correct. The
next 4 bits (0010)2 show an invalid header length (2 × 4 = 8). The minimum number of bytes in
the header must be 20. The packet has been corrupted in transmission.
Example 7.8
In an IPv4 packet, the value of HLEN is (1000)2. How many bytes of options are being carried by
this packet?
Solution
The HLEN value is 8, which means the total number of bytes in the header is 8 × 4, or 32 bytes.
The first 20 bytes are the base header, the next 12 bytes are the options.
Example 7.9
In an IPv4 packet, the value of HLEN is 5, and the value of the total length field is (0028)16. How
many bytes of data are being carried by this packet?
Solution
The HLEN value is 5, which means the total number of bytes in the header is 5 × 4, or 20 bytes
(no options). The total length is (0028)16 or 40 bytes, which means the packet is carrying 20 bytes
of data (40 − 20).
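The HLEN and total-length arithmetic in Examples 7.8 and 7.9 can be expressed in a few lines of Python; the function names below are only for illustration.

def header_bytes(hlen):
    return hlen * 4                     # HLEN counts 4-byte words

def option_bytes(hlen):
    return header_bytes(hlen) - 20      # bytes beyond the 20-byte base header

def data_bytes(hlen, total_length):
    return total_length - header_bytes(hlen)

print(option_bytes(0b1000))             # Example 7.8: 12 bytes of options
print(data_bytes(5, 0x0028))            # Example 7.9: 20 bytes of data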
Example 7.10
An IPv4 packet has arrived with the first few hexadecimal digits as shown.
(45000028000100000102 … )16
How many hops can this packet travel before being dropped? To which upper-layer protocol
do the data belong?
Solution
The first 4 bytes (8 hexadecimal digits) hold the version, HLEN, service type, and total length fields;
the next 4 bytes hold the identification, flags, and fragmentation offset fields; the ninth and tenth
bytes hold the time-to-live and protocol fields.
To find the time-to-live field, we skip 8 bytes (16 hexadecimal digits). The time-to-live field is the
ninth byte, which is (01)16. This means the packet can travel only one hop. The protocol field is the
next byte (02)16, which means that the upper-layer protocol is IGMP.
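The byte counting in Example 7.10 can be checked with a couple of lines of Python; the hex digits are the ones given in the example.

header = bytes.fromhex("45000028000100000102")   # first ten bytes of the datagram
ttl = header[8]                                  # ninth byte: time to live
protocol = header[9]                             # tenth byte: protocol
print(ttl, protocol)                             # 1 2 -> one more hop, protocol 2 (IGMP)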
Example 7.11
Figure 7.15 shows an example of a checksum calculation for an IPv4 header without options. The
header is divided into 16-bit sections. All the sections are added, and the sum is complemented
after wrapping the leftmost digit. The result is inserted in the checksum field.
Figure 7.15 Example of checksum calculation in IPv4
Note that the calculation of wrapped sum and checksum can also be done as follows in
hexadecimal:
Wrapped Sum = Sum mod FFFF
Checksum = FFFF − Wrapped Sum
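A minimal Python sketch of this checksum procedure is shown below. The sample header bytes are hypothetical (with the checksum field set to zero before the calculation), and the sketch assumes a header without options and an even number of bytes.

def ipv4_checksum(header):
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]   # add the 16-bit sections
    while total > 0xFFFF:
        total = (total & 0xFFFF) + (total >> 16)    # wrap the carry back in
    return (~total) & 0xFFFF                        # one's complement of the wrapped sum

sample = bytes.fromhex("4500003c1c46400040060000ac100a63ac100a0c")
print(hex(ipv4_checksum(sample)))                   # 0xb1e6 for this sample header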
Fragmentation
A datagram can travel through different networks. Each router decapsulates the IP datagram from
the frame it receives, processes it, and then encapsulates it in another frame. The format and size
of the received frame depend on the protocol used by the physical network through which the
frame has just traveled. The format and size of the sent frame depend on the protocol used by the
physical network through which the frame is going to travel. For example, if a router connects a
LAN to a WAN, it receives a frame in the LAN format and sends a frame in the WAN format.
The value of the maximum transfer unit (MTU) differs from one physical network protocol to
another. For example, the value for a LAN is normally 1500 bytes, but for a WAN it can be larger
or smaller.
To make the IP protocol independent of the physical network, the designers decided to make
the maximum length of the IP datagram equal to 65,535 bytes. This makes transmission more
efficient if one day we use a link-layer protocol with an MTU of this size. However, for other
physical networks, we must divide the datagram to make it possible for it to pass through these
networks. This is called fragmentation.
When a datagram is fragmented, each fragment has its own header with most of the fields
repeated, but some have been changed. A fragmented datagram may itself be fragmented if it
encounters a network with an even smaller MTU. In other words, a datagram may be fragmented
several times before it reaches the final destination.
A datagram can be fragmented by the source host or any router in the path. The reassembly of
the datagram, however, is done only by the destination host, because each fragment becomes an
independent datagram. Whereas the fragmented datagram can travel through different routes, and
we can never control or guarantee which route a fragmented datagram may take, all the fragments
belonging to the same datagram should finally arrive at the destination host. So it is logical to do
the reassembly at the final destination. An even stronger objection for reassembling packets during
the transmission is the loss of efficiency it incurs.
When we talk about fragmentation, we mean that the payload of the IP datagram is
fragmented. However, most parts of the header, with the exception of some options, must be copied
by all fragments. The host or router that fragments a datagram must change the values of three
fields: flags, fragmentation offset, and total length. The rest of the fields must be copied. Of course,
the value of the checksum must be recalculated regardless of fragmentation.
Example
An IP datagram of 4020 bytes (20-byte header and 4000-byte payload) arrives at a router and
must be forwarded on a link with an MTU of 1420 bytes. How many fragments are generated?
Give the MF bit, fragmentation offset, and total length values for all the fragments.
Figure 7.17 shows a datagram with a data size of 4000 bytes fragmented into three fragments.
The bytes in the original datagram are numbered 0 to 3999. The first fragment carries bytes 0 to
1399. The offset for this datagram is 0/8 = 0. The second fragment carries bytes 1400 to 2799; the
offset value for this fragment is 1400/8 = 175. Finally, the third fragment carries bytes 2800 to
3999. The offset value for this fragment is 2800/8 = 350.
Figure 7.17 Fragmentation example
Remember that the value of the offset is measured in units of 8 bytes. This is done because
the length of the offset field is only 13 bits long and cannot represent a sequence of bytes greater
than 8191. This forces hosts or routers that fragment datagrams to choose the size of each fragment
so that the first byte number is divisible by 8.
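The fragmentation arithmetic above can be sketched in Python as follows; the function is illustrative and assumes a fixed 20-byte header that is copied into every fragment.

def fragment(payload_len, header_len, mtu):
    max_data = ((mtu - header_len) // 8) * 8             # keep every offset divisible by 8
    fragments, offset = [], 0
    while offset < payload_len:
        data = min(max_data, payload_len - offset)
        more = 1 if offset + data < payload_len else 0   # MF bit
        fragments.append((offset // 8, more, header_len + data))
        offset += data
    return fragments                                     # (offset, MF, total length)

for frag in fragment(4000, 20, 1420):
    print(frag)
# (0, 1, 1420)
# (175, 1, 1420)
# (350, 0, 1220)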
Figure 7.18 shows an expanded view of the fragments in Figure 7.17. The original packet
starts at the client; the fragments are reassembled at the server. The value of the identification field
is the same in all fragments, as is the value of the flags field with the more fragment bit (M bit) set
for all fragments except the last. Also, the value of the offset field for each fragment is shown.
Note that although the fragments arrived out of order at the destination, they can be correctly
reassembled.
Figure 7.18 also shows what happens if a fragment itself is fragmented. In this case the value
of the offset field is always relative to the original datagram. For example, in the figure, the second
fragment is itself fragmented later into two fragments of 800 bytes and 600 bytes, but the offset
shows the relative position of the fragments to the original data.
It is obvious that even if each fragment follows a different path and arrives out of order, the
final destination host can reassemble the original datagram from the fragments received (if none
of them is lost) using the following strategy:
a. The first fragment has an offset field value of zero.
b. Divide the length of the first fragment by 8. The second fragment has an offset value equal to that result.
c. Divide the total length of the first and second fragment by 8. The third fragment has an offset
value equal to that result.
d. Continue the process. The last fragment has an M bit value of 0.
Example 7.12
A packet has arrived with an M bit value of 0. Is this the first fragment, the last fragment, or a
middle fragment? Do we know if the packet was fragmented?
Solution
If the M bit is 0, it means that there are no more fragments; the fragment is the last one. However,
we cannot say if the original packet was fragmented or not. A nonfragmented packet is considered
the last fragment.
Example 7.13
A packet has arrived with an M bit value of 1. Is this the first fragment, the last fragment, or a
middle fragment? Do we know if the packet was fragmented?
Solution
If the M bit is 1, it means that there is at least one more fragment. This fragment can be the first
one or a middle one, but not the last one. We don’t know if it is the first one or a middle one; we
need more information (the value of the fragmentation offset).
Example 7.14
A packet has arrived with an M bit value of 1 and a fragmentation offset value of 0. Is this the first
fragment, the last fragment, or a middle fragment?
Solution
Because the M bit is 1, it is either the first fragment or a middle one. Because the offset value is 0,
it is the first fragment.
Example 7.15
A packet has arrived in which the offset value is 100. What is the number of the first byte? Do we
know the number of the last byte?
Solution
To find the number of the first byte, we multiply the offset value by 8. This means that the first
byte number is 800. We cannot determine the number of the last byte unless we know the length
of the data.
Example 7.16
A packet has arrived in which the offset value is 100, the value of HLEN is 5, and the value of the
total length field is 100. What are the numbers of the first byte and the last byte?
Solution
The first byte number is 100 × 8 = 800. The total length is 100 bytes, and the header length is 20
bytes (5 × 4), which means that there are 80 bytes in this datagram. If the first byte number is 800,
the last byte number must be 879.
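A short Python sketch of the arithmetic in Examples 7.15 and 7.16 follows; the function name is only illustrative.

def byte_range(offset, hlen, total_length):
    first = offset * 8                       # the offset field counts 8-byte units
    data_len = total_length - hlen * 4       # data bytes carried in this fragment
    return first, first + data_len - 1

print(byte_range(100, 5, 100))               # (800, 879)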
Representation
A computer normally stores the address in binary, but it is clear that 128 bits cannot easily be
handled by humans. Several notations have been proposed to represent IPv6 addresses when they
are handled by humans. The following shows two of these notations: binary and colon
hexadecimal.
Binary (128 bits) 11111110111101101011 … 1111111100000000
Binary notation is used when the addresses are stored in a computer. The colon hexadecimal
notation (or colon hex for short) divides the address into eight sections, each made of four
hexadecimal digits separated by colons.
Abbreviation
Although an IPv6 address, even in hexadecimal format, is very long, many of the digits are zeros.
In this case, we can abbreviate the address. The leading zeros of a section can be omitted. Using
this form of abbreviation, 0074 can be written as 74, 000F as F, and 0000 as 0. Note that 3210
cannot be abbreviated. Further abbreviation, often called zero compression, can be applied to
colon hex notation if there are consecutive sections consisting of zeros only. We can remove all
the zeros altogether and replace them with a double colon.
FDEC:0:0:0:0:BBFF:0:FFFF → FDEC::BBFF:0:FFFF
Note that this type of abbreviation is allowed only once per address. If there is more than one
run of zero sections, only one of them can be compressed.
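Python's standard ipaddress module applies the same abbreviation rules; the short sketch below shows the compressed and the fully expanded forms of the address used above.

import ipaddress

addr = ipaddress.IPv6Address("FDEC:0:0:0:0:BBFF:0:FFFF")
print(addr.compressed)   # fdec::bbff:0:ffff  (only one run of zero sections compressed)
print(addr.exploded)     # fdec:0000:0000:0000:0000:bbff:0000:ffff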
Mixed Notation
Sometimes we see a mixed representation of an IPv6 address: colon hex and dotted-decimal
notation. This is appropriate during the transition period in which an IPv4 address is embedded in
an IPv6 address (as the rightmost 32 bits). We can use the colon hex notation for the leftmost six
sections and 4-byte dotted-decimal notation instead of the rightmost two sections. However, this
happens when all or most of the leftmost sections of the IPv6 address are zeros. For example, the
address (::130.24.24.18) is a legitimate address in IPv6, in which the zero compression shows that
all 96 leftmost bits of the address are zeros.
CIDR Notation
As we will see shortly, IPv6 uses hierarchical addressing. For this reason, IPv6 allows slash or
CIDR notation. For example, the following shows how we can define a prefix of 60 bits using
CIDR. We will later show how an IPv6 address is divided into a prefix and a suffix.
FDEC::BBFF:0:FFFF/60
Unicast Address A unicast address defines a single interface (computer or router). The packet sent
to a unicast address will be routed to the intended recipient.
Anycast Address An anycast address defines a group of computers that all share a single address.
A packet with an anycast address is delivered to only one member of the group, the most reachable
one. An anycast communication is used, for example, when there are several servers that can
respond to an inquiry. The request is sent to the one that is most reachable. The hardware and
software generate only one copy of the request; the copy reaches only one of the servers. IPv6 does
not designate a block for anycasting; the addresses are assigned from the unicast block.
Multicast Address A multicast address also defines a group of computers. However, there is a
difference between anycasting and multicasting. In anycasting, only one copy of the packet is sent
to one of the members of the group; in multicasting each member of the group receives a copy. As
we will see shortly, IPv6 has designated a block for multicasting from which the same address is
assigned to the members of the group. It is interesting that IPv6 does not define broadcasting, even
in a limited version. IPv6 considers broadcasting as a special case of multicasting.
Flow label. The flow label is a 20-bit field that is designed to provide special handling for a particular flow
of data. We will discuss this field later in the section.
Payload length. The 2-byte payload length field defines the length of the IP datagram excluding the header.
Note that IPv4 defines two fields related to the length: header length and total length. In IPv6, the length
of the base header is fixed (40 bytes); only the length of the payload needs to be defined.
Next header. The next header is an 8-bit field defining the type of first extension header (if present) or the
type of the data that follows the base header in the datagram. This field is similar to the protocol field in
IPv4, but we talk more about it when we discuss the payload.
Hop limit. The 8-bit hop limit field serves the same purpose as the TTL field in IPv4.
Source and destination address. The source address field is a 16-byte (128-bit) Internet address that
identifies the original source of the datagram. The destination address field is a 16-byte (128-bit) Internet
address that identifies the destination of the datagram.
Payload. Compared to IPv4, the payload field in IPv6 has a different format and meaning, as shown in Figure
7.47. The payload in IPv6 means a combination of zero or more extension headers (options) followed by
the data from other protocols (UDP, TCP, and so on). In IPv6, options, which are part of the header in IPv4,
are designed as extension headers. The payload can have as many extension headers as required by the
situation. Each extension header has two mandatory fields, next header and the length, followed by
information related to the particular option. Note that each next header field value (code) defines the type
of the next header (hop-by-hop option, source-routing option, …); the last next header field defines the
protocol (UDP, TCP, …) that is carried by the datagram.
Figure 7.47 Payload in an IPv6 datagram
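As a rough sketch of the fixed 40-byte base header described above, the following Python fragment packs the fields into bytes; all field values (next header 6 for TCP, hop limit 64, the two addresses) are hypothetical examples.

import ipaddress
import struct

version, traffic_class, flow_label = 6, 0, 0
payload_length, next_header, hop_limit = 20, 6, 64           # 6 = TCP (example value)
src = ipaddress.IPv6Address("fdec::bbff:0:ffff").packed      # 16-byte source address
dst = ipaddress.IPv6Address("fdec::1").packed                # 16-byte destination address

first_word = (version << 28) | (traffic_class << 20) | flow_label
base_header = struct.pack("!IHBB", first_word, payload_length,
                          next_header, hop_limit) + src + dst
print(len(base_header))                                      # 40 bytes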
Extension Header
An IPv6 packet is made up of a base header and some extension headers. The length of the base
header is fixed at 40 bytes. However, to give more functionality to the IP datagram, the base header
can be followed by up to six extension headers. Many of these headers are options in IPv4. Six
types of extension headers have been defined. These are hop-by-hop option, source routing,
fragmentation, authentication, encrypted security payload, and destination option (see Figure
7.48).
Figure 7.48 Extension header types
Hop-by-Hop Option
The hop-by-hop option is used when the source needs to pass information to all routers visited
by the datagram. For example, perhaps routers must be informed about certain management,
debugging, or control functions. Or, if the length of the datagram is more than the usual 65,535
bytes, routers must have this information. So far, only three hop-by-hop options have been
defined: Pad1, PadN, and jumbo payload.
Pad1. This option is 1 byte long and is designed for alignment purposes. Some options need to start at a
specific bit of the 32-bit word. If an option falls short of this requirement by exactly 1 byte, Pad1 is
added to fill the gap.
PadN. PadN is similar in concept to Pad1. The difference is that PadN is used when 2 or more bytes are
needed for alignment.
Jumbo payload. Recall that the length of the payload in the IP datagram can be a maximum of 65,535 bytes.
However, if for any reason a longer payload is required, we can use the jumbo payload option to define this
longer length.
Destination Option
The destination option is used when the source needs to pass information to the destination only.
Intermediate routers are not permitted access to this information. The format of the destination
option is the same as the hop-by-hop option. So far, only the Pad1 and PadN options have been
defined.
Source Routing
The source routing extension header combines the concepts of the strict source route and the loose
source route options of IPv4.
Fragmentation
The concept of fragmentation in IPv6 is the same as that in IPv4. However, the place where
fragmentation occurs differs. In IPv4, the source or a router is required to fragment if the size of
the datagram is larger than the MTU of the network over which the datagram travels. In IPv6,
only the original source can fragment. A source must use a Path MTU Discovery technique to
find the smallest MTU supported by any network on the path. The source then fragments using
this knowledge.
If the source does not use a Path MTU Discovery technique, it fragments the datagram to a
size of 1280 bytes or smaller. This is the minimum size of MTU required for each network
connected to the Internet.
Authentication
The authentication extension header has a dual purpose: It validates the message sender and
ensures the integrity of data. The former is needed so the receiver can be sure that a message is
from the genuine sender and not from an imposter. The latter is needed to check that the data are
not altered in transition by some hacker. We discuss more about authentication in Chapter 13.
CHAPTER 8
Network Layer: Routing of
Packets
8.1 INTRODUCTION
Unicast routing in the Internet, with a large number of routers and a huge number of hosts, can be
done only by using hierarchical routing: routing in several steps using different routing algorithms.
In this section, we first discuss the general concept of unicast routing in an internet: an
internetwork made up of networks connected by routers. After the routing concepts and algorithms
are understood, we show how we can apply them to the Internet using hierarchical routing.
An Internet as a Graph
To find the best route, an internet can be modeled as a graph. A graph in computer science is a set
of nodes and edges (lines) that connect the nodes. To model an internet as a graph, we can think
of each router as a node and each network between a pair of routers as an edge. An internet is, in
fact, modeled as a weighted graph, in which each edge is associated with a cost. If a weighted
graph is used to represent a geographical area, the nodes can be cities and the edges can be roads
connecting the cities; the weights, in this case, are distances between cities. In routing, however,
the cost of an edge has a different interpretation in different routing protocols, which we discuss
when we discuss that routing protocol. For the moment, we assume that there is a cost associated
with each edge. If there is no edge between the nodes, the cost is infinity. Figure 8.1 shows how
an internet can be modeled as a graph.
Least-Cost Trees
If there are N routers in an internet, there are (N − 1) least-cost paths from each router to any other
router. This means we need N × (N − 1) least-cost paths for the whole internet. If we have only 10
routers in an internet, we need 90 least-cost paths. A better way to see all these paths is to combine
them into a least-cost tree.
A least-cost tree is a tree with the source router as the root that spans the whole graph (visits
all other nodes) and in which the path between the root and any other node is the shortest.
In this way, we can have only one shortest-path tree for each node; we have N least-cost path
trees for the whole internet. Figure 8.2 shows the seven least-cost trees for the internet in Figure
8.1. This means that each router needs to find the least-cost route between itself and all
the other routers to be able to route a packet using this criterion.
The least-cost trees for a weighted graph can have several properties if they are created using
consistent criteria.
1. The least-cost route from X to Y in X’s tree is the inverse of the least cost route from Y to X in Y’s tree;
the cost in both directions is the same. For example, in Figure 8.2, the route from A to F in A’s tree is
(A → B → E → F), but the route from F to A in F’s tree is (F → E → B → A), which is the inverse of the first
route. The cost is 8 in each case.
2. Instead of traveling from X to Z using X’s tree, we can travel from X to Y using X’s tree and continue from
Y to Z using Y’s tree. For example, in Figure 8.2, we can go from A to G in A’s tree using the route
(A → B → E → F → G). We can also go from A to E in A’s tree (A → B → E) and then continue in E’s tree
using the route (E → F → G). The combination of the two routes in the second case is the same route as
in the first case. The cost in the first case is 9; the cost in the second case is also 9 (6 + 3).
Bellman-Ford Equation
The heart of distance-vector routing is the famous Bellman-Ford equation. This equation is used
to find the least cost (shortest distance) between a source node, x and a destination node, y, through
some intermediary nodes (a, b, c, …) when the costs between the source and the intermediary
nodes and the least costs between the intermediary nodes and the destination are given. The
following shows the general case in which Dij is the shortest distance and cij is the cost between
nodes i and j.

Dxy = min {(cxa + Day), (cxb + Dby), (cxc + Dcy), …}
In distance-vector routing, normally we want to update an existing least cost with a least cost
through an intermediary node, such as z, if the latter is shorter. In this case, the equation becomes
simpler:

Dxy = min {Dxy, (cxz + Dzy)}
We can say that the Bellman-Ford equation enables us to build a new least-cost path from
previously established least-cost paths. In Figure 8.3, we can think of (a → y), (b → y), and
(c → y) as previously established least-cost paths and (x → y) as the new least-cost path. We can
even think of this equation as the builder of a new least-cost tree from previously established least-
cost trees if we use the equation repeatedly. In other words, the use of this equation in distance-
vector routing is a witness that this method also uses least-cost trees, but this use may be in the
background.
We will shortly show how we use the Bellman-Ford equation and the concept of distance
vectors to build least-cost paths for each node in distance-vector routing, but first we need to
discuss the concept of a distance vector.
Distance Vectors
The concept of a distance vector is the rationale for the name distance-vector routing. A least-
cost tree is a combination of least-cost paths from the root of the tree to all destinations. These
paths are graphically glued together to form the tree. Distance-vector routing unglues these paths
and creates a distance vector, a one-dimensional array to represent the tree. Figure 8.4 shows the
tree for node A in the internet in Figure 8.1 and the corresponding distance vector.
Figure 8.4 The distance vector corresponding to a tree
Note that the name of the distance vector defines the root, the indexes define the destinations,
and the value of each cell defines the least cost from the root to the destination. A distance vector
does not give the path to the destinations as the least-cost tree does; it gives only the least costs to
the destinations. Later we show how we can change a distance vector to a forwarding table, but
we first need to find all distance vectors for an internet.
We know that a distance vector can represent least-cost paths in a least-cost tree, but the
question is how each node in an internet originally creates the corresponding vector. Each node in
an internet, when it is booted, creates a very rudimentary distance vector with the minimum
information the node can obtain from its neighborhood. The node sends some greeting messages
out of its interfaces and discovers the identity of the immediate neighbors and the distance between
itself and each neighbor. It then makes a simple distance vector by inserting the discovered
distances in the corresponding cells and leaves the value of other cells as infinity. Do these distance
vectors represent least-cost paths? They do, considering the limited information a node has. When
we know only one distance between two nodes, it is the least cost. Figure 8.5 shows all distance
vectors for our internet. However, we need to mention that these vectors are made asynchronously,
when the corresponding node has been booted; the existence of all of them in a figure does not
mean synchronous creation of them.
These rudimentary vectors cannot help the internet to effectively forward a packet. For
example, node A thinks that it is not connected to node G because the corresponding cell shows
the least cost of infinity. To improve these vectors, the nodes in the internet need to help each other
by exchanging information. After each node has created its vector, it sends a copy of the vector to
all its immediate neighbors. After a node receives a distance vector from a neighbor, it updates its
distance vector using the Bellman-Ford equation (second case). However, we must understand that
we need to update, not only one least cost, but N of them, in which N is the number of the nodes
in the internet. If we are using a program, we can do this using a loop; if we are showing the
concept on paper, we can show the whole vector instead of the N separate equations. In Figure 8.6,
we show the whole vector instead of seven equations for each update. The figure shows two
asynchronous events, happening one after another with some time in between. In the first event,
node A has sent its vector to node B. Node B updates its vector using the cost cBA = 2. In the second
event, node E has sent its vector to node B. Node B updates its vector using the cost cBE = 4.
After the first event, node B has one improvement in its vector; its least cost to node D has
changed from infinity to 5 (via node A). After the second event, node B has one more improvement
in its vector; its least cost to node F has changed from infinity to 6 (via node E). We hope that we
have convinced the reader that exchanging vectors eventually stabilizes the system and allows all
nodes to find the ultimate least cost between themselves and any other node. We need to remember
that after updating a node, it immediately sends its updated vector to all neighbors. Even if its
neighbors have received the previous vector, the updated one may help more.
Distance-Vector Routing Algorithm
Now we can give a simplified pseudocode for the distance-vector routing algorithm, as shown
in Table 8.1. The algorithm is run by each node independently and asynchronously.
Lines 4 to 11 initialize the vector for the node. Lines 14 to 23 show how the vector can be
updated after receiving a vector from the immediate neighbor. The for loop in lines 17 to 20 allows
all entries (cells) in the vector to be updated after receiving a new vector. Note that the node sends
its vector in line 12, after being initialized, and in line 22, after it is updated.
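The following is a simplified Python sketch of the same idea (not the book's Table 8.1 verbatim): when a node receives a neighbor's vector, every entry is updated with the rule Dxy = min {Dxy, (cxz + Dzy)}, and the vector is re-sent if anything improved. The node names and link costs are chosen so that the first event described above (node B learning a cost of 5 to node D via node A) is reproduced.

INF = float("inf")

def init_vector(self_name, nodes, neighbor_costs):
    # Rudimentary vector: 0 to itself, link cost to each neighbor, infinity elsewhere.
    vector = {y: neighbor_costs.get(y, INF) for y in nodes}
    vector[self_name] = 0
    return vector

def update(vector, cost_to_neighbor, neighbor_vector):
    # Bellman-Ford update; returns True if any entry improved (vector should be re-sent).
    changed = False
    for y, d in neighbor_vector.items():
        if cost_to_neighbor + d < vector[y]:
            vector[y] = cost_to_neighbor + d
            changed = True
    return changed

nodes = ["A", "B", "D", "E"]
b_vector = init_vector("B", nodes, {"A": 2, "E": 4})
a_vector = init_vector("A", nodes, {"B": 2, "D": 3})
print(update(b_vector, 2, a_vector), b_vector)   # True {'A': 2, 'B': 0, 'D': 5, 'E': 4}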
Link-State Routing
In link-state routing, each node builds a link-state database (LSDB), a map of the whole internet.
Now the question is, “How can each node create this LSDB that contains information about
the whole internet?” This can be done by a process called flooding. Each node can send some
greeting messages to all its immediate neighbors (those nodes to which it is connected directly) to
collect two pieces of information for each neighboring node: the identity of the node and the
cost of the link. The combination of these two pieces of information is called the LS packet (LSP);
the LSP is sent out of each interface, as shown in Figure 8.9 for our internet in Figure 8.1. When
a node receives an LSP from one of its interfaces, it compares the LSP with the copy it may already
have. If the newly arrived LSP is older than the one it has (found by checking the sequence
number), it discards the LSP. If it is newer or the first one received, the node discards the old LSP
(if there is one) and keeps the received one. It then sends a copy of it out of each interface except
the one from which the packet arrived. This guarantees that flooding stops somewhere in the
network (where a node has only one interface). We need to convince ourselves that, after receiving
all new LSPs, each node creates the comprehensive LSDB as shown in Figure 8.9. This LSDB is
the same for each node and shows the whole map of the internet. In other words, a node can make
the whole map if it needs to, using this LSDB.
Figure 8.9 LSPs created and sent out by each node to build the LSDB
We can compare the link-state routing algorithm with the distance-vector routing algorithm.
In the distance-vector routing algorithm, each router tells its neighbors what it knows about the
whole internet; in the link-state routing algorithm, each router tells the whole internet what it
knows about its neighbors.
Figure 8.11 shows a small internet with only five nodes. Each source has created its own
spanning tree that meets its policy. The policy imposed by all sources is to use the minimum number
of nodes to reach a destination. The spanning tree selected by A and E is such that the
communication does not pass through D as a middle node. Similarly, the spanning tree selected by
B is such that the communication does not pass through C as a middle node.
Figure 8.11 Spanning trees in path-vector routing
Path(x, y) = best {Path(x, y), [(x + Path(v, y))]} for all v’s in the internet
In this equation, the operator (+) means to add x to the beginning of the path. We also need
to be cautious to avoid adding a node to an empty path because an empty path means one that does
not exist.
The policy is defined by selecting the best of multiple paths. Path-vector routing also imposes
one more condition on this equation: If path (v, y) includes x, that path is discarded to avoid a loop
in the path. In other words, x does not want to visit itself when it selects a path to y.
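A hedged Python sketch of this update rule is given below. The dictionary-based path vectors and the better() policy function are assumptions for illustration; the code adds x to the front of each path received from neighbor v, skips empty paths, and discards any path that already contains x, exactly as the loop-avoidance condition requires.

def update_path_vector(x, my_paths, v, neighbor_paths, better):
    """Update node x's path vector after receiving neighbor v's vector.

    my_paths:       dict mapping destination -> list of nodes (the path), or None
    neighbor_paths: dict of the same form received from neighbor v
    better:         policy function better(new_path, old_path) -> True if new wins
    """
    changed = False
    for dest, v_path in neighbor_paths.items():
        if v_path is None or x in v_path:
            continue                     # empty path, or a path that already visits x (loop)
        candidate = [x] + v_path         # the (+) operator: add x to the beginning of the path
        if my_paths.get(dest) is None or better(candidate, my_paths[dest]):
            my_paths[dest] = candidate
            changed = True
    return changed

A policy such as better = lambda new, old: len(new) < len(old) reproduces the minimum-number-of-nodes policy used by the sources in Figure 8.11.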
Figure 8.13 shows the path vector of node C after two events. In the first event, node C
receives a copy of B’s vector, which improves its vector: Now it knows how to reach node A. In
the second event, node C receives a copy of D’s vector, which does not change its vector. As a
matter of fact, the vector for node C after the first event is stabilized and serves as its forwarding
table.
Figure 8.13 Updating path vectors
Path-Vector Algorithm
Based on the initialization process and the equation used in updating each forwarding table after
receiving path vectors from neighbors, we can write a simplified version of the path vector
algorithm as shown in Table 8.3.
Table 8.3 Path-vector algorithm for a node
Lines 4 to 12 show the initialization for the node. Lines 17 to 24 show how the node updates its
vector after receiving a vector from the neighbor. The update process is repeated forever. We can
see the similarities between this algorithm and the DV algorithm.
Metric
In OSPF, like RIP, the cost of reaching a destination from the host is calculated from the source
router to the destination network. However, each link (network) can be assigned a weight based
on the throughput, round-trip time, reliability, and so on. An administration can also decide to use
the hop count as the cost. An interesting point about the cost in OSPF is that different service types
(TOSs) can have different weights as the cost. Figure 8.19 shows the idea of the cost from a router
to the destination host network. We can compare Figure 8.19 with Figure 8.15 for the RIP.
Forwarding Tables
Each OSPF router can create a forwarding table after finding the shortest-path tree between
itself and the destination using Dijkstra’s algorithm, described earlier in Section 8.2.2. Figure
8.20 shows the forwarding tables for the simple AS in Figure 8.19. Comparing the forwarding
tables for the OSPF and RIP in the same AS, we find that the only difference is the cost values. In
other words, if we use the hop count for OSPF, the tables will be exactly the same. The reason for
this consistency is that both protocols use the shortest-path trees to define the best route from a
source to a destination.
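As an illustration (not the actual OSPF code), the following sketch shows how a router might run Dijkstra's algorithm over an LSDB, represented here as a simple dictionary of link costs, and derive both the least costs and the next-hop entries of its forwarding table.

import heapq
import itertools

def dijkstra_forwarding_table(lsdb, root):
    """Compute least costs and next hops from root using Dijkstra's algorithm.

    lsdb: dict mapping node -> {neighbor: link cost}, built from the LSDB.
    Returns two dicts: destination -> least cost, and destination -> next hop.
    """
    counter = itertools.count()               # tie-breaker so heap entries never compare nodes
    cost = {root: 0}
    next_hop = {}
    visited = set()
    heap = [(0, next(counter), root, None)]   # (cost so far, tie-breaker, node, first hop)
    while heap:
        d, _, node, first = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if first is not None:
            next_hop[node] = first            # the neighbor the root should forward to
        for nbr, w in lsdb.get(node, {}).items():
            if nbr in visited:
                continue
            nd = d + w
            if nd < cost.get(nbr, float("inf")):
                cost[nbr] = nd
                # when leaving the root, the first hop is the neighbor itself
                heapq.heappush(heap, (nd, next(counter), nbr, nbr if first is None else first))
    return cost, next_hop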
Link-State Advertisement
OSPF is based on the link-state routing algorithm, which requires that a router advertise the state
of each link to all neighbors for the formation of the LSDB. When we discussed the link-state
algorithm, we used graph theory and assumed that each router is a node and each network
between two routers is an edge. The situation is different in the real world, in which we need to
advertise the existence of different entities as nodes, the different types of links that connect each
node to its neighbors, and the different types of cost associated with each link. This means we need
different types of advertisements, each capable of advertising different situations.
We can have five types of link-state advertisements: router link, network link, summary link
to network, summary link to AS border router, and external link. Figure 8.22 shows these five
advertisements and their uses.
Router link. A router link advertises the existence of a router as a node. In addition to giving the
address of the announcing router, this type of advertisement can define one or more types of links
that connect the advertising router to other entities. A transient link announces a link to a transient
network, a network that is connected to the rest of the networks by one or more routers. This type
of advertisement should define the address of the transient network and the cost of the link. A stub
link advertises a link to a stub network, a network that is not a through network. Again, the
advertisement should define the address of the network and the cost. A point-to-point link should
define the address of the router at the end of the point-to-point line and the cost to get there.
Network link. A network link advertises the network as a node. However, because a network
cannot make announcements itself (it is a passive entity), one of the routers is assigned as the
designated router and does the advertising. In addition to the address of the designated router, this
type of LSP announces the IP addresses of all attached routers (including the designated router, as a
router and not as the speaker of the network), but no cost is advertised because each router announces the cost to
the network when it sends a router link advertisement.
Summary link to network. This is done by an area border router; it advertises the summary of
links collected by the backbone to an area or the summary of links collected by the area to the
backbone. This type of information exchange is needed to glue the areas together.
Summary link to AS border router. This is done by an AS router that advertises the summary links from other
ASs to the backbone area of the current AS, information which later can be disseminated to the
areas so that they will know about the networks in other ASs. The need for this type of information
exchange is better understood when we discuss inter-AS routing (BGP).
External link. This is also done by an AS router to announce the existence of a single network
outside the AS to the backbone area to be disseminated into the areas.
OSPF Implementation
OSPF is implemented as a program in the network layer that uses the service of the IP for
propagation. An IP datagram that carries a message from OSPF sets the value of the protocol
field to 89. This means that, although OSPF is a routing protocol to help IP to route its datagrams
inside an AS, the OSPF messages are encapsulated inside IP datagrams. OSPF has gone through two
versions: version 1 and version 2. Most implementations use version 2.
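Because OSPF messages travel directly inside IP datagrams with the protocol field set to 89, a receiver can recognize them by looking at the protocol byte of the IPv4 header (the byte at offset 9). The short check below is only a demonstration of this encapsulation rule, not part of any OSPF implementation.

OSPF_PROTOCOL = 89   # IP protocol number assigned to OSPF

def is_ospf_datagram(ipv4_header: bytes) -> bool:
    """Return True if an IPv4 header indicates that the payload is an OSPF message.

    The protocol field is the tenth byte (offset 9) of the IPv4 header.
    """
    return len(ipv4_header) >= 20 and ipv4_header[9] == OSPF_PROTOCOL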
OSPF Messages
OSPF is a very complex protocol; it uses five different types of messages: hello, database description, link-state request, link-state update, and link-state acknowledgment.
In Figure 8.23, we first show the format of the OSPF common header (which is used in all
messages) and the link-state general header (which is used in some messages).
Authentication
As Figure 8.23 shows, the OSPF common header has the provision for authentication of the
message sender. As we will discuss in Chapter 13, this prevents a malicious entity from sending
OSPF messages to a router and causing the router to become part of the routing system to which
it actually does not belong.
OSPF Algorithm
OSPF implements the link-state routing algorithm we discussed in Section 8.2.2. However,
some changes and augmentations need to be added to the algorithm:
After each router has created the shortest-path tree, the algorithm needs to use it to create the
corresponding forwarding table.
The algorithm needs to be augmented to handle sending and receiving all five types of messages.
Performance
Before ending this section, let us briefly discuss the performance of OSPF:
Update messages. The link-state messages in OSPF have a somewhat complex format. They also are
flooded to the whole area. If the area is large, these messages may create heavy traffic and use a lot of
bandwidth.
Convergence of forwarding tables. When the flooding of LSPs is completed, each router can create its own
shortest-path tree and forwarding table; convergence is fairly quick. However, each router needs to run
Dijkstra's algorithm, which may take some time.
Robustness. The OSPF protocol is more robust than RIP because, after receiving the completed LSDB, each
router is independent and does not depend on other routers in the area. Corruption or failure in one router
does not affect other routers as seriously as in RIP.
Introduction
Border Gateway Protocol (BGP), and in particular BGP4, is a complex protocol. In this section,
we introduce the basics of BGP and its relationship with intradomain routing protocols (RIP or
OSPF). Figure 8.24 shows an example of an internet with four autonomous systems. AS2, AS3,
and AS4 are stub autonomous systems; AS1 is a transient one. In our example, data exchange
between AS2, AS3, and AS4 should pass through AS1.
Each autonomous system in this figure uses one of the two common intradomain protocols,
RIP or OSPF. Each router in each AS knows how to reach a network that is in its own AS, but it
does not know how to reach a network in another AS.
To enable each router to route a packet to any network in the internet, we first install a
variation of BGP4, called external BGP (eBGP), on each border router (the one at the edge of
each AS that is connected to a router at another AS). We then install the second variation of BGP,
called internal BGP (iBGP), on all routers. This means that the border routers will be running
three routing protocols (intradomain, eBGP, and iBGP), but other routers are running two
protocols (intradomain and iBGP). We discuss the effect of each BGP variation separately.
The reader may have noticed that the messages exchanged during three eBGP sessions help
some routers know how to route packets to some networks in the internet, but the reachability
information is not complete. There are two problems that need to be addressed:
1. Some border routers do not know how to route a packet destined for non-neighbor ASs. For example,
R5 does not know how to route packets destined for networks in AS3 and AS4. Routers R6 and R9 are
in the same situation as R5: R6 does not know about networks in AS4, and R9 does not know about
networks in AS3.
2. None of the nonborder routers know how to route a packet destined for any networks in other ASs.
To address these two problems, we need to allow all pairs of routers (border or nonborder)
to run the second variation of the BGP protocol, iBGP.
Note that we have not shown the physical networks inside ASs because a session is made on
an overlay network (TCP connection), possibly spanning more than one physical network as
determined by the route dictated by the intradomain routing protocol. Also note that in this stage
only four messages are exchanged. The first message (numbered 1) is sent by R1 announcing that
networks N8 and N9 are reachable through the path AS1-AS2, but the next router is R1. This
message is sent, through separate sessions, to R2, R3, and R4. Routers R2, R4, and R6 do the same
thing but send different messages to different destinations. The interesting point is that, at this
stage, R3, R7, and R8 create sessions with their peers, but they actually have no message to send.
The updating process does not stop here. For example, after R1 receives the update message
from R2, it combines the reachability information about AS3 with the reachability information it
already knows about AS1 and sends a new update message to R5. Now R5 knows how to reach
networks in AS1 and AS3. The process continues when R1 receives the update message from R4.
The point is that, at some point in time, no further changes occur in the updates and all reachability
information has been propagated through all ASs. At this time, each router combines
the information received from eBGP and iBGP and creates what we may call a path table after
applying the criteria for finding the best path. To demonstrate, we show the path tables in Figure
8.27 for the routers in Figure 8.24. For example, router R1 now knows that any packet destined
for networks N8 or N9 should go through AS1 and AS2 and the next router to deliver the packet
to is router R5. Similarly, router R4 knows that any packet destined for networks N10, N11, or
N12 should go through AS1 and AS3 and the next router to deliver this packet to is router R1, and
so on.
Injection of Information into Intradomain Routing
The role of an interdomain routing protocol such as BGP is to help the routers inside the AS to
augment their routing information. In other words, the path tables collected and organized by BGP
are not used, per se, for routing packets; they are injected into intradomain forwarding tables (RIP
or OSPF) for routing packets. This can be done in several ways depending on the type of AS.
In the case of a stub AS, the only area border router adds a default entry at the end of
its forwarding table and defines the next router to be the speaker router at the end of the
eBGP connection. In Figure 8.24, router R5 in AS2 defines R1 as the default router for all
networks other than N8 and N9. The situation is the same for router R9 in AS4 with the default
router to be R4. In AS3, R6 sets its default router to be R2, but R7 and R8 set their default routers
to be R6. These settings are in accordance with the path tables shown in Figure 8.27 for these
routers. In other words, the path tables are injected into intradomain forwarding tables by adding
only one default entry.
In the case of a transient AS, the situation is more complicated. Router R1 in AS1 needs
to inject the whole contents of the path table for R1 in Figure 8.27 into its intradomain
forwarding table. The situation is the same for R2, R3, and R4.
One issue to be resolved is the cost value. We know that RIP and OSPF use different metrics.
One solution, which is very common, is to set the cost to the foreign networks at the same cost
value as to reach the first AS in the path. For example, the cost for R5 to reach all networks in
other ASs is the cost to reach N5. The cost for R1 to reach networks N10 to N12 is the cost to
reach N6, and so on. The cost is taken from the intradomain forwarding tables (RIP or OSPF).
Figure 8.28 shows the interdomain forwarding tables. For simplicity, we assume that all ASs
are using RIP as the intradomain routing protocol. The shaded areas are the augmentation injected
by the BGP protocol; the default destinations are indicated as zero.
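The injection rule described above can be sketched as follows. The table layouts (simple dictionaries mapping a network to a cost and a next router) and the cost_to_first_as mapping are assumptions made for this illustration; the sketch copies each foreign destination from the BGP path table into the intradomain forwarding table, using the cost of reaching the first AS on the path, as suggested above.

def inject_bgp_routes(forwarding_table, path_table, cost_to_first_as):
    """Inject BGP path-table entries into an intradomain forwarding table.

    forwarding_table: dict mapping network -> (cost, next router), built by RIP or OSPF
    path_table:       dict mapping foreign network -> (AS path as a list, next router)
    cost_to_first_as: dict mapping an AS (e.g. "AS1") -> the intradomain cost of
                      reaching the network through which that AS is entered
    """
    for network, (as_path, next_router) in path_table.items():
        if network in forwarding_table:
            continue                       # already reachable through the intradomain protocol
        first_as = as_path[0]              # the first AS on the path to this network
        forwarding_table[network] = (cost_to_first_as[first_as], next_router)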
The first byte in each attribute defines the four attribute flags (as shown in Figure 8.29). The
next byte defines the attribute type, assigned by ICANN (only seven types have been assigned,
as explained next). The attribute value length defines the length of the attribute value field (not the
length of the whole attributes section). The following gives a brief description of each attribute.
ORIGIN (type 1). This is a well-known mandatory attribute, which defines the source of the routing
information. This attribute can be defined by one of the three values: 1, 2, and 3. Value 1 means
that the information about the path has been taken from an intradomain protocol (RIP or OSPF).
Value 2 means that the information comes from BGP. Value 3 means that it comes from an
unknown source.
AS-PATH (type 2). This is a well-known mandatory attribute, which defines the list of autonomous
systems through which the destination can be reached. We have used this attribute in our
examples. The AS-PATH attribute, as we discussed in path-vector routing in Section 8.2.3, helps
prevent a loop. Whenever an update message arrives at a router that lists the current AS as the
path, the router drops that path. The AS-PATH can also be used in route selection.
NEXT-HOP (type 3). This is a well-known mandatory attribute, which defines the next router to
which the data packet should be forwarded. We have also used this attribute in our examples. As
we have seen, this attribute helps to inject path information collected through the operations of
eBGP and iBGP into the intradomain routing protocols such as RIP or OSPF.
MULT-EXIT-DISC (type 4). The multiple-exit discriminator is an optional nontransitive attribute,
which discriminates among multiple exit paths to a destination. The value of this attribute is
normally defined by the metric in the corresponding intradomain protocol (an attribute value of a
4-byte unsigned integer). For example, if a router has multiple paths to the destination with
different values for this attribute, the one with the lowest value is selected. Note that
this attribute is nontransitive, which means that it is not propagated from one AS to another.
LOCAL-PREF (type 5). The local preference attribute is a well-known discretionary attribute. It is
normally set by the administrator, based on an organization’s policy. The routes the
administrator prefers are given a higher local preference value (an attribute value of a 4-byte
unsigned integer). For example, in an internet with five ASs, the administrator of AS1 can set the
local preference value of 400 to the path AS1-AS2-AS5, the value of 300 to AS1-AS3-AS5, and the
value of 50 to AS1-AS4-AS5. This means that the administrator prefers the first path to the second
one and prefers the second one to the third one. This may be a case where AS2 is the most secure
and AS4 is the least secure AS from the point of view of the administration of AS1. The last route should be selected if
the other two are not available.
ATOMIC-AGGREGATE (type 6). This is a well-known discretionary attribute, which defines the
destination prefix as not aggregate; it only defines a single destination network. This attribute
has no value field, which means the value of the length field is zero.
AGGREGATOR (type 7). This is an optional transitive attribute, which emphasizes that the
destination prefix is an aggregate. The attribute value gives the number of the last AS that did
the aggregation followed by the IP address of the router that did so.
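To make the attribute layout described at the beginning of this discussion concrete, the following sketch parses one path attribute according to the standard BGP encoding: one flags byte, one type byte, then a length field of one byte, or two bytes when the Extended Length flag is set. This is an illustration of the wire format, not part of any particular BGP implementation.

def parse_attribute_header(data: bytes, offset: int = 0):
    """Parse one BGP path attribute starting at data[offset].

    Returns (flags dict, type code, value bytes, offset of the next attribute).
    """
    flags_byte = data[offset]
    type_code = data[offset + 1]
    flags = {
        "optional":        bool(flags_byte & 0x80),
        "transitive":      bool(flags_byte & 0x40),
        "partial":         bool(flags_byte & 0x20),
        "extended_length": bool(flags_byte & 0x10),
    }
    if flags["extended_length"]:
        length = int.from_bytes(data[offset + 2:offset + 4], "big")
        value_start = offset + 4
    else:
        length = data[offset + 2]
        value_start = offset + 3
    value = data[value_start:value_start + length]
    return flags, type_code, value, value_start + length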
Route Selection
So far in this section, we have been silent about how a route is selected by a BGP router mostly
because our simple example has one route to a destination. In the case where multiple routes are
received to a destination, BGP needs to select one among them. The route selection process in
BGP is not as easy as the one in an intradomain routing protocol, which is based on the shortest-
path tree. A route in BGP has some attributes attached to it, and it may come from an eBGP session
or an iBGP session.
Figure 8.30 shows the flow diagram used by common implementations.
The router extracts the routes that meet the criteria in each step. If only one route is extracted,
it is selected and the process stops; otherwise, the process continues with the next step. Note that
the first choice is related to the LOCAL-PREF attribute, which reflects the policy imposed by the
administration on the route.
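One common way to realize such an elimination process is a cascade of filters, sketched below. The attribute names on the route objects (local_pref, as_path, learned_from) and the ordering of the later steps are assumptions for illustration; only the first step, preferring the highest LOCAL-PREF, is dictated by the discussion above, and real implementations apply several more tie-breakers.

def select_route(routes):
    """Pick one route from a non-empty list of candidate routes to the same destination.

    Each route is assumed to carry .local_pref, .as_path (a list of ASs), and
    .learned_from ("eBGP" or "iBGP"). At each step we keep only the routes that
    meet the criterion; if a single route survives, it is selected.
    """
    criteria = [
        # 1. highest LOCAL-PREF (the policy imposed by the administration)
        lambda rs: [r for r in rs if r.local_pref == max(x.local_pref for x in rs)],
        # 2. shortest AS-PATH
        lambda rs: [r for r in rs if len(r.as_path) == min(len(x.as_path) for x in rs)],
        # 3. prefer routes learned through eBGP over those learned through iBGP
        lambda rs: [r for r in rs if r.learned_from == "eBGP"] or rs,
    ]
    candidates = list(routes)
    for keep in criteria:
        candidates = keep(candidates)
        if len(candidates) == 1:
            return candidates[0]
    return candidates[0]   # fall back to the first surviving route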
Messages
BGP uses four types of messages for communication between the BGP speakers across the
ASs and inside an AS: open, update, keepalive, and notification
(see Figure 8.31). All BGP packets share the same common header.
Open message. To create a neighborhood relationship, a router running BGP opens a TCP
connection with a neighbor and sends an open message.
Update message. The update message is the heart of the BGP protocol. It is used by a router to
withdraw destinations that have been advertised previously, to announce a route to a new
destination, or both. Note that BGP can withdraw several destinations that were advertised before,
but it can only advertise one new destination (or multiple destinations with the same path attributes)
in a single update message.
Keepalive message. BGP peers exchange keepalive messages regularly
(before their hold time expires) to tell each other that they are alive.
Notification. A notification message is sent by a router whenever an error condition is detected
or a router wants to close the session.
Performance
BGP performance can be compared with RIP. BGP speakers exchange a lot of messages to create
forwarding tables, but BGP is free from loops and the count-to-infinity problem. The same weakness we
mentioned for RIP, the propagation of failure and corruption, also exists in BGP.
8.4 MULTICAST ROUTING
Communication in the Internet today is not only unicasting; multicast communication is
growing fast. In this section, we first discuss the general ideas behind unicasting, multicasting, and
broadcasting. We then talk about some basic issues in multicast routing. Finally, we discuss
multicast routing protocols in the Internet.
From the previous chapters, we have learned that forwarding a datagram by a router is
normally based on the prefix of the destination address in the datagram, which defines the
network to which the destination host is connected. Understanding the above forwarding
principle, we can now define unicasting, multicasting, and broadcasting. Let us clarify these terms
as they relate to the Internet.
8.4.1 Unicasting
In unicasting, there is one source and one destination network. The relationship between the source
and the destination network is one to one. Each router in the path of the datagram tries to forward
the packet to one and only one of its interfaces. Figure 8.32 shows a small internet in which a
unicast packet needs to be delivered from a source computer to a destination computer attached to
N6. Router R1 is responsible for forwarding the packet only through interface 3; router R4 is
responsible for forwarding the packet only through interface 2. When the packet arrives at N6, the
delivery to the destination host is the responsibility of the network; either it is broadcast to all
hosts, or the Ethernet switch delivers it only to the destination host.
8.4.2 Multicasting
In multicasting, there is one source and a group of destinations. The relationship is one to many.
In this type of communication, the source address is a unicast address, but the destination address
is a group address, a group of one or more destination networks in which there is at least one
member of the group that is interested in receiving the multicast datagram. The group address
defines the members of the group. Figure 8.33 shows the same small internet as in Figure 8.32, but
the routers have been changed to multicast routers (or previous routers have been configured to do
both types of jobs).
In multicasting, a multicast router may have to send out copies of the same datagram through
more than one interface. In Figure 8.33, router R1 needs to send out the datagram through
interfaces 2 and 3. Similarly, router R4 needs to send out the datagram through both its interfaces.
Router R3, however, knows that there is no member belonging to this group in the area reached
by interface 2; it only sends out the datagram through interface 1.
Multicast Applications
Multicasting has many applications today, such as access to distributed databases, information
dissemination, teleconferencing, and distance learning.
Access to distributed databases. Most of the large databases today are distributed. That is, the information
is stored in more than one location, usually at the time of production. The user who needs to access the
database does not know the location of the information. A user’s request is multicast to all the database
locations, and the location that has the information responds.
Information dissemination. Businesses often need to send information to their customers. If the nature of
the information is the same for each customer, it can be multicast. In this way, a business can send one
message that can reach many customers. For example, a software update can be sent to all purchasers of
a particular software package. In a similar manner, news can be easily disseminated through multicasting.
Teleconferencing. Teleconferencing involves multicasting. The individuals attending a teleconference all
need to receive the same information at the same time. Temporary or permanent groups can be formed
for this purpose.
Distance learning. One growing area in the use of multicasting is distance learning. Lessons taught by one
professor can be received by a specific group of students. This is especially convenient for those students
who find it difficult to attend classes on campus.
8.4.3 Distance Vector Multicast Routing Protocol
The Distance Vector Multicast Routing Protocol (DVMRP) is the extension of the Routing
Information Protocol (RIP) that is used in unicast routing. It uses the source-based tree approach
to multicasting. It is worth mentioning that each router in this protocol that receives a
multicast packet to be forwarded implicitly creates a source-based multicast tree in three
steps:
1. The router uses an algorithm called reverse path forwarding (RPF) to simulate creation of part of the
optimal source-based tree between the source and itself (a short sketch of this check follows the list).
2. The router uses an algorithm called reverse path broadcasting (RPB) to create a broadcast (spanning)
tree whose root is the router itself and whose leaves are all networks in the internet.
3. The router uses an algorithm called reverse path multicasting (RPM) to create a multicast tree by
cutting some branches of the tree that end in networks with no member in the group.
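A minimal sketch of the RPF check mentioned in step 1 is given below, assuming the router keeps a table that maps each source to the interface its unicast routing would use to reach that source; these names are illustrative only.

def rpf_accept(packet_source, arriving_interface, unicast_next_hop_interface):
    """Reverse path forwarding (RPF) check.

    Accept (and forward copies of) a multicast packet only if it arrived on the
    interface this router would itself use to send unicast traffic back toward
    the packet's source; otherwise the copy is a duplicate or has travelled on a
    non-optimal path and is dropped.
    """
    return arriving_interface == unicast_next_hop_interface[packet_source]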
8.5.1 Messages
There are only two types of messages in IGMP, version 3: query and report messages, as shown
in Figure 8.39. A query message is periodically sent by a router to all hosts attached to it to ask
them to report their interests about membership in groups. A report message is sent by a host as
a response to a query message.
Figure 8.39 IGMP operation
Query Message
The query message is sent by a router to all hosts in each interface to collect information about
their membership. There are three versions of a query message:
a. A general query message is sent about membership in any group. It is encapsulated in a datagram with
the destination address 224.0.0.1 (all hosts and routers). Note that all routers attached to the same
network receive this message; it informs them that the query has already been sent and that they should
refrain from resending it.
b. A group-specific query message is sent from a router to ask about the membership related to a specific
group. This is sent when a router does not receive a response about a specific group and wants to be sure
that there is no active member of that group in the network. The group identifier (multicast address) is
mentioned in the message. The message is encapsulated in a datagram with the destination address set
to the corresponding multicast address. Although all hosts receive this message, those not interested drop
it.
c. A source-and-group-specific query message is sent from a router to ask about the membership related
to a specific group when the message comes from a specific source or sources. Again the message is sent
when the router does not hear about a specific group related to a specific host or hosts. The message is
encapsulated in a datagram with the destination address set to the corresponding multicast address.
Although all hosts receive this message, those not interested drop it.
Report Message
A report message is sent by a host as a response to a query message. The message contains a list
of records in which each record gives the identifier of the corresponding group (multicast address)
and the addresses of all sources that the host is interested in receiving messages from (inclusion).
The record can also mention the source addresses from which the host does not desire to receive a
group message (exclusion). The message is encapsulated in a datagram with the multicast address
224.0.0.22 (multicast address assigned to IGMPv3). In IGMPv3, if a host needs to join a group, it
waits until it receives a query message and then sends a report message. If a host needs to leave a
group, it does not respond to a query message. If no other host responds to the corresponding
message, the group is purged from the router database.
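The structure of such a report can be pictured with a small data model, sketched below. The class and field names are assumptions for illustration and do not follow the exact IGMPv3 record encoding; they only show that each record names a group and either an inclusion or an exclusion list of sources.

from dataclasses import dataclass, field
from typing import List

@dataclass
class GroupRecord:
    """One record inside an IGMPv3 report message."""
    group: str                      # multicast group address, e.g. "232.1.1.1"
    mode: str                       # "include" or "exclude"
    sources: List[str] = field(default_factory=list)   # unicast source addresses

@dataclass
class ReportMessage:
    """An IGMPv3 report: a list of group records, sent to 224.0.0.22."""
    records: List[GroupRecord]

# a host interested in group 232.1.1.1, but only from source 10.0.0.5
report = ReportMessage(records=[GroupRecord("232.1.1.1", "include", ["10.0.0.5"])])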
8.5.3 Encapsulation
The IGMP message is encapsulated in an IP datagram with the value of the protocol field set to
2 and the TTL field set to 1. The destination IP address of the datagram, however, depends on the
type of the message, as shown in Table 8.4.
Table 8.4 Destination IP addresses of IGMP messages
Message type                              Destination IP address
General query                             224.0.0.1 (all hosts and routers)
Group-specific query                      The multicast address of the group
Source-and-group-specific query           The multicast address of the group
Report                                    224.0.0.22 (assigned to IGMPv3)