
Internet Routing With BGP


Iljitsch van Beijnum

Internet Routing with BGP
Introduction

The internet is “a network of networks”. It’s made up of tens of
thousands of largely independent networks, but somehow the users of one
network can communicate with the users of any of the other networks.
The Border Gateway Protocol (BGP) is the glue that binds these
disparate networks together.

BGP is a routing protocol: its main job is to allow each network to learn
which ranges of IP addresses are used where, so packets can flow
along the correct route.

However, BGP has a more difficult job to do than other routing
protocols. Yes, it has to make the packets reach their destination, but BGP
also has to pay attention to the business side: those packets only get to
flow over a network link if either the sender or the receiver pays for
the privilege.

This book covers the fundamentals of the technical side of BGP, and
also looks at the intersection between the technical and business
aspects of internet routing.

The book contains 40 configuration examples that readers can try out
on their own computer in a “BGP minilab”.

Table of contents

Introduction
About this book
Intended audience
Conventions used in this book
Internet routing
IP addresses
Classes
Subnet masks
Classless Inter-Domain Routing (CIDR)
IPv6
The BGP protocol
The IETF
Distance vector vs link state
BGP versions
Autonomous Systems
BGP neighbor relationships
BGP messages
Path attributes
Multiprotocol BGP
BGP states and finite-state machine
BGP operation
BGP prerequisites
Connectivity
Router hardware
IP addresses and AS numbers
BGP configuration 101
Filtering BGP
AS path filters
Prefix filters
Community-based filters
Consistency between filters
Transit and peering
Internet exchanges
The business of peering: peering policies
Hot potato routing
Valley-freeness
BGP peering configuration
Peer groups
Internet exchange route servers
Traffic engineering
The BGP path selection algorithm
Route maps
Setting the local preference
AS path prepending
Setting and adjusting the MED
Influencing neighboring networks with communities
Announcing more specific prefixes
Multipath BGP
ECMP load balancing strategies
iBGP
iBGP and internal routing protocols
Loopback addresses for iBGP
Route reflectors
BGP security
MD5 passwords
The “TTL hack”: GTSM
Some scary stories
Internet Routing Registries
RPKI
BGPsec
So how secure is BGP?
Making BGP faster
Adjusting the BGP timers
BFD: bidirectional forwarding detection
Graceful restart
Best practices
“Black starts”
Shutdown for maintenance
Setting a maximum prefix limit
Flap damping and MRAI
Limiting AS path length
Best practices documents
Martian and bogon filters
Tools and resources
PeeringDB
Meetings and Network Operator Groups
Other resources
Appendix: the router command line
Cisco, Quagga, FRR configuration differences
Appendix: BGP minilab
Installing the minilab and running examples
Appendix: a non-converging BGP configuration
Appendix: IP address notes
IPv4 subnetting cheat sheet
Special addresses
About the author
Copyright and acknowledgments

About this book

I already wrote a book about BGP back in 2002. So why another one?

What I’ve learned over the years is that at its core, BGP is quite simple.
However, there are many hidden nuances and caveats that people
usually only begin to understand when they run into them in practice.
But learning those things the hard way on a live network is less than
ideal.

So what I want to do here is provide examples that are as close to
real-world BGP internet routing as possible, allowing you, the reader, to
start understanding the forest better by looking at some individual
trees. All of this is based on running BGP training courses for almost
two decades.

To keep both the writing (for me) and reading (for you) of this book
manageable, the book only covers the BGP protocol and BGP
configuration for connecting a network to the internet. There is a lot more to
running a network; please find that information in other books and
online resources. BGP is also extensively used in data centers and
enterprise networks. This is also not covered in this book.

You’ll get the most out of this book by running the virtual example
network yourself and trying out the examples. With today’s technology,
it's possible to use Docker to run a bunch of virtual routers on a
regular Windows, macOS or Linux system. The examples are based on Free
Range Routing, open source routing software that is configured very
similarly to “classic IOS” Cisco routers. However, the exact configuration
language isn’t the point; once you understand the concepts, looking up
the right keywords in the vendor documentation is the easy part.

That said, if you’d like to see configuration examples for other types of
routers, please let me know and I may be able to add those to a future
version of this book.

Intended audience
This book is intended to be useful for anyone who wants or needs to
know more about BGP, and how BGP is used for internet routing. A
large part of the book discusses configuration examples, but even if
you skip those, you should still get a good feel for the problems BGP
solves. (And sometimes creates!)

The book is especially intended for network engineers who’ve just
started using BGP to connect to the internet, and those who are
considering doing that. Trying out the examples should give you a good
feel for what that’s like, and enable you to decide whether that’s
something you’ll feel comfortable doing yourself after some study, or
whether it's better to hire someone else to guide you through the process
and then take over yourself, or perhaps outsource configuring and
maintaining your BGP setup.

Conventions used in this book


Links, both within the book and to websites, are in dark blue. Links to
Wikipedia articles that provide further detail about a subject are
tagged with [W]. IETF protocol specifications in the form of Request
For Comments documents (RFCs) are linked like [RFC 4271].

The addresses in examples are documentation addresses and
addresses in the private use 10.x.x.x range. In the text, sometimes real
addresses are used. Please don’t put those addresses in any configurations, as
this could be problematic for the holders of those addresses.

Similarly, AS numbers in examples are mostly private AS numbers and
in a few cases AS numbers set aside for documentation. Real AS
numbers may appear in the text when relevant.

Information that may help avoid or resolve problems will be
marked with a warning sign to the left of the paragraph.

If you're reading the (reflowable) e-book version of this book, try to
adjust your font size and/or window size settings so that this text will fit
on a single line without wrapping:
This is how long some lines in the example router output may get...

This way, the example output will be formatted as intended and easier
to read. If you find that setting the font size so the line fits makes the
text too small, try turning the reader into landscape mode, possibly
only for looking at the examples.

Internet routing

The internet consists of tens of thousands of networks that are owned
and run by different companies/organizations. And yet, users of any
of those networks can communicate with users of any of the other
networks. It’s an amazing thing.

To make this possible, at some point each network is connected to one
or more other networks, creating network paths between any two
locations. When two networks connect directly, routing decisions are
simple: just hand off the packets to the destination network over that
direct connection.

Things get more complicated when there are one or more networks in
the middle that connect the source and destination network. Typically,
there will be several paths that go through different intermediate
networks, making routing decisions somewhat more complex. But that’s
nothing any routing protocol worth its salt can’t handle.

However, the real complication with internet routing is that the job is
not simply finding the shortest path between any two locations, but
also taking into consideration the business aspects of running a
network. What if networks A and B both connect to Microsoft? After all,
users of both networks A and B want to be able to download their
Windows updates and work on their Office365 documents with the
highest possible performance.

So in theory, a user at network A can send packets to a user at network
B through Microsoft. The physical connections are there, and left to
their own devices, the routers will see those paths and use them if
they’re shorter than alternative paths.

But Microsoft is not an Internet Service Provider (ISP): they’re not in
the business of providing connectivity between their users. So
Microsoft will want to hide such paths in order to make sure that their
network isn’t used for traffic that falls outside the scope of the services
they provide.

As a result, the Border Gateway Protocol (BGP), the routing protocol
that’s used between the networks that collectively make up the
internet, must do everything that’s normally expected from a routing
protocol, but in addition to that, apply policy restrictions to conform with
business realities. We’ll see what this means a little later in the book in
the chapter Transit and peering.

IP addresses

Especially the first part of this chapter is pretty basic, but in order to
make sure that everyone is on the same page later in the book, I’m
going over this material anyway. Feel free to skip this chapter; you can
always come back to it later if necessary.

Modern networks are connectionless packet switched networks. That
means that all data that flows across the network is put into relatively
small packets that then find their way across the network based on the
destination address that’s put on every packet. So every computer
that’s connected to a network needs an address. There are two ways of
naming or addressing things: flat and hierarchical. Local addresses,
such as Ethernet MAC (media access control) addresses, are usually the
flat type: there’s no logic to which address goes where.

For larger networks, this doesn’t work very well, as the overhead of
keeping track of where in the network a given MAC address is used
quickly becomes problematic. So protocols intended for larger
internetworks [W] use addresses that consist of a network part and a local
part (or “host” part). This way, routing tables remain manageable:
routers only need to know which network is used where, rather than
keep track of individual addresses.

Classes
Of course we want to have both a large number of networks, as well as
a large number of local addresses per network. However, that way,
addresses get rather large. So the designers of the Internet Protocol (IP)
used a trick and allowed for a small number of very large networks
(class A), a medium-sized number of medium-sized networks (class B)
and a large number of small networks (class C).

• Class A networks are numbered from 0 to 127, with local
addresses numbered from 0.0.0 to 255.255.255

• Class B networks are numbered from 128.0 to 191.255, with local
addresses numbered from 0.0 to 255.255

• Class C networks are numbered from 192.0.0 to 223.255.255,
with local addresses numbered from 0 to 255

(There are also class D addresses from 224.0.0.0 to
239.255.255.255 used for multicast [W] and class E networks from 240.0.0.0
to 255.255.255.254, which were reserved for future use, but couldn’t
be un-reserved when it was time for that future use. 255.255.255.255
is the broadcast address [W]. See the IP address notes appendix for
more details.)
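
As a minimal sketch of the class boundaries listed above, the following Python snippet derives the class from the first number of an address. The function name `classful_class` is made up for this illustration and is not part of any router software.

```python
def classful_class(address: str) -> str:
    """Return the historical class (A to E) of a dotted-decimal IPv4 address."""
    first = int(address.split(".")[0])
    if first <= 127:
        return "A"
    if first <= 191:
        return "B"
    if first <= 223:
        return "C"
    if first <= 239:
        return "D"
    return "E"

for addr in ("10.1.2.3", "128.2.30.2", "192.0.2.1", "224.0.0.5", "240.0.0.1"):
    print(addr, "is class", classful_class(addr))
```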

In IP networking, there can also be subnets. Suppose a university has
network 128.2. A class B network allows for some 65,000 local
addresses, but one big Ethernet with tens of thousands of devices
connected to it is not a very good idea. So departments, buildings, labs,
floors and so on may have their own local network. Together those
make up the university network. A convenient way to subnet a class B
network is to use the third number in the IP address to number
subnets. The result could look like this:

• 128.2.10.1: PC 1, geology department

• 128.2.10.2: PC 2, geology department

• 128.2.30.1: PC 1, computer science department

• 128.2.30.2: PC 2, computer science department

So the hierarchy in IP addresses is network / subnet / local. However,
the subnet part is often not stated explicitly: the external world simply
sees 128.2, and doesn’t need to know about how the local part is
divvied up. Local systems just see that (for instance) 128.2.30 is the
fixed part, and the remaining number identifies devices connected to
the local network.

Subnet masks
Without subnetting, the class tells us which part of an IP address is the
network part and which is the local part. When using subnets, we need
an explicit mechanism to determine the boundary between the subnet
and local parts of the address. This is what a subnet mask does.

A subnet mask indicates which of the 32 bits in an IP address belong to
the network or subnet part, and which to the local / host part. In the
subnet mask, the bits that correspond to the network or subnet bits in
an address are set to 1, the bits that correspond to the local part are set
to 0. The subnet mask is then written down as four decimal numbers
that correspond to eight bits each, with dots between them, the same
as IP addresses.

For example, our PC 2 in the CS department mentioned above has the
address 128.2.30.2, with the first two numbers being the class B
network part and the third number the subnet part. So this address in binary:

10000000 00000010 00011110 00000010

has the following subnet mask:

11111111 11111111 11111111 00000000

See the subnet cheat sheet at the end of the book for a full table of
subnet sizes. Computers and routers can now easily use binary operations
to derive the network and host parts of the address.
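
The binary operations in question are a bitwise AND with the mask and with its inverse. Here is a small sketch using Python's standard `ipaddress` module and the example address from above; the variable names are only for illustration.

```python
import ipaddress

address = int(ipaddress.IPv4Address("128.2.30.2"))
mask = int(ipaddress.IPv4Address("255.255.255.0"))

# AND with the mask keeps the network/subnet bits;
# AND with the inverted mask keeps the local (host) bits.
network_part = address & mask
host_part = address & ~mask & 0xFFFFFFFF

print(f"{address:032b}")                    # the address in binary
print(f"{mask:032b}")                       # the subnet mask in binary
print(ipaddress.IPv4Address(network_part))  # 128.2.30.0
print(host_part)                            # 2
```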

Classless Inter-Domain Routing (CIDR)


Having three classes of IP addresses is a pretty nifty way to
accommodate three network sizes with addresses that are only 32 bits long.
However, in practice the three-class structure created some limitations.

Back in the late 1980s and the early 1990s, many universities started to
connect to the internet. So they needed address space. For a university,
a class A network is way too big; a university is not going to connect
millions of devices. (Well, UCL, MIT and Stanford at some point held
networks 11, 18 and 36, respectively [RFC 790].)

On the other hand, a class C network with only 256 addresses is
generally not enough for a university. So they tended to get class B address
blocks. But as there are only 16384 of those, they started to run out
pretty quickly. A university would perhaps need 4000 addresses,
wasting more than 60,000 when using a class B address block. The
alternative was to use a number of class C blocks instead. For instance, 16
class Cs adds up to just over 4000 addresses, so that would be a good
fit.

But when this new policy was adopted, the routing tables started to
grow much faster than before: a class B network takes up one entry in
the routing table, but 16 class C networks take 16 entries in the routing
table. So the number of entries in the BGP table started to outgrow the
capacity of early 1990s routers very rapidly.

This problem was solved by the introduction of classless inter-domain
routing (CIDR) [RFC 4632]. As the name suggests, CIDR abandons the
class system for inter-domain routing. And inter-domain as in:
between the separate networks that collectively make up the internet.

Rather than having IP addresses fall into separate classes that
implicitly specify how many address bits are used for the network part, CIDR
explicitly indicates the number of address bits.

In 192.0.2.0 there are 24 address bits, as addresses starting with 192
belong to class C, and class C networks have 24 address bits. With
CIDR, we write this down as 192.0.2.0/24. Of course it’s also
perfectly possible to have a network like 169.254.2.0/24, even though
169 is class B and thus previously, that would have indicated 16
address bits.

We call address ranges in CIDR notation “prefixes”. The full CIDR /
prefix notation specifies the base (lowest) address in the block,
followed by a slash and the number of bits in the network part (the
prefix length). A variation leaves out the unnecessary part of the
address. So 192.0.2/24 or 169.254/16 rather than 192.0.2.0/24 or
169.254.0.0/16. Routers typically require the full version.

Prefix notation can get somewhat unintuitive when the number of bits
isn’t an even 8, 16 or 24. For instance, how many addresses are in the
range 172.16.0.0/12? After all, 172.16 specifies 16 bits, not 12. This
gets clearer in binary:

172.16 = 10101100 00010000

So if we just take the 12 bits, we get:

172.16 = 10101100 0001

We get to fill in the remaining 32 - 12 = 20 bits ourselves, giving us the
address range:

10101100 0001 0000 00000000 00000000 = 172.16.0.0

to

10101100 0001 1111 11111111 11111111 = 172.31.255.255

Again, have a look at the subnet cheat sheet at the end of the book.
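
To double-check this kind of arithmetic, Python's standard `ipaddress` module can do the work. This is just a quick sketch; note that the module calls the highest address in a block the `broadcast_address`.

```python
import ipaddress

prefix = ipaddress.ip_network("172.16.0.0/12")

print(prefix.network_address)    # 172.16.0.0 (lowest address in the range)
print(prefix.broadcast_address)  # 172.31.255.255 (highest address in the range)
print(prefix.num_addresses)      # 1048576, which is 2 ** 20
```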

Basically, a prefix is a way of saying “all IP addresses where the first
<prefix length> bits are the same as in the <network part of the
prefix>”. There are two special prefix lengths: /0 and /32. /0 only
occurs for 0.0.0.0/0 (or 0/0): this is the CIDR representation of the
default route. So to a router, this means “this prefix matches all the
addresses for which the first zero bits are zero”. And that would be all
addresses. Trust me, to the routers this makes perfect sense.

The /32 prefix length simply means “the entire address”. So
192.0.2.31/32 is simply the single address 192.0.2.31.

A crucial concept with CIDR is that of longest match first. Unlike with
classful routing, with CIDR it’s possible to have overlapping prefixes.
For each address, there’s a prefix at every possible prefix length that
matches that address. So that’s 0-32. Three examples of prefixes that
match the address 172.22.1.1 are:

• 172.16.0.0/12

• 172.22.0.0/16

• 172.22.1.0/24

So how do we resolve this ambiguity? The same way we do in real life.
Suppose you’re on a road trip to San Francisco, and there’s a sign
pointing to the left that says “California” and a sign pointing to the
right that says “San Francisco”. So you would turn left because San
Francisco is in California, right? While that’s not completely
unreasonable, it makes more sense to follow the signs to San Francisco. After all,
if there wasn’t a more direct path to San Francisco than just following the
general path towards California, why would there be a separate sign?

Same thing for overlapping prefixes: we use the most specific match, the
one with the largest number following the slash. That’s the longest
match first rule. Note that the longest match first rule supersedes the BGP
path selection algorithm: the path selection algorithm only applies to
routes towards the exact same prefix, while longest match first decides
between different but overlapping prefixes.
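
The longest match first rule is easy to sketch in a few lines of Python. The function `longest_match` and the small table below are hypothetical; a real router uses much more efficient lookup structures, but the outcome is the same.

```python
import ipaddress

def longest_match(address, prefixes):
    """Return the most specific prefix that contains address, or None."""
    addr = ipaddress.ip_address(address)
    candidates = [ipaddress.ip_network(p) for p in prefixes]
    matches = [net for net in candidates if addr in net]
    # The "longest" match is the one with the largest prefix length.
    return max(matches, key=lambda net: net.prefixlen, default=None)

table = ["172.16.0.0/12", "172.22.0.0/16", "172.22.1.0/24", "0.0.0.0/0"]

print(longest_match("172.22.1.1", table))  # 172.22.1.0/24
print(longest_match("172.22.9.9", table))  # 172.22.0.0/16
print(longest_match("192.0.2.1", table))   # 0.0.0.0/0
```

The last lookup shows the default route at work: 0.0.0.0/0 matches everything, but only wins when nothing more specific is present.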

IPv6
IPv6 was introduced around 1995, a few years after the deployment of
CIDR. The point of IPv6 is to allow for more IP addresses, hence the
much larger size: while IPv4 addresses are 32 bits, IPv6 addresses are
four times as long at 128 bits.

IPv6 address notation is rather different from IPv4 address notation.
The 32-bit IPv4 addresses are represented as four 8-bit values
written down in decimal, separated by periods. The 128-bit IPv6 addresses
are represented as eight 16-bit values written down in hexadecimal
[W], separated by colons.

Although 2001:0DB8:0000:0000:0000:0000:0000:0001 is a valid
representation of an IPv6 address, the recommended IPv6 address
notation [RFC 5952] for that address is 2001:db8::1. The letters are in
lower case, leading zeros (the 000 in 0001) are left out, and where
applicable, the longest consecutive series of 0: sequences is replaced by a
single :: sequence.

This means that the address 2001:db8:0:0:0:0:0:0 is simply written
as 2001:db8::. The IPv6 loopback address is 127 zero bits followed by
a single one bit, so that’s just ::1.
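
Python's standard `ipaddress` module applies the same shortening rules, which makes it a convenient way to check a notation you're unsure about. A brief sketch:

```python
import ipaddress

addr = ipaddress.IPv6Address("2001:0DB8:0000:0000:0000:0000:0000:0001")
print(addr.compressed)  # 2001:db8::1
print(addr.exploded)    # 2001:0db8:0000:0000:0000:0000:0000:0001

print(ipaddress.IPv6Address("2001:db8:0:0:0:0:0:0"))  # 2001:db8::

# 127 zero bits followed by a single one bit is the integer value 1
print(ipaddress.IPv6Address(1))  # ::1
```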

IPv6 prefixes work the same way as IPv4 prefixes, for instance
2001:db8::/32. The IPv6 default route is ::/0. IPv6 is always
classless, and subnet sizes are thus written in prefix notation. However,
there is a strong convention that IPv6 subnets are /64 in size. And /48
is a very common network size, leaving 16 bits for subnet numbering,
providing room for 65,536 subnets.

There are many special purpose IPv6 addresses and address types
(see [RFC 4291] and the appendix IP address notes), but “global
unicast” IPv6 addresses are the most relevant to BGP routing. Currently,
2000::/3 is set aside for global unicast use. That’s all IPv6 addresses
that start with 2xxx: and 3xxx:. Global unicast means regular
addresses for one-to-one communication, as opposed to multicast
(one-to-many) and anycast (one-to-any) addresses.
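
The following sketch, again using Python's `ipaddress` module, checks a few addresses against the global unicast range and redoes the subnet arithmetic for a /48. The prefix 2001:db8:1234::/48 is an arbitrary example taken from the documentation range.

```python
import ipaddress

global_unicast = ipaddress.ip_network("2000::/3")
print(ipaddress.ip_address("2001:db8::1") in global_unicast)  # True
print(ipaddress.ip_address("3fff::1") in global_unicast)      # True
print(ipaddress.ip_address("ff02::1") in global_unicast)      # False (multicast)

# A /48 network split into /64 subnets leaves 64 - 48 = 16 subnet bits
site = ipaddress.ip_network("2001:db8:1234::/48")
subnet_bits = 64 - site.prefixlen
print(subnet_bits, 2 ** subnet_bits)      # 16 65536
print(next(site.subnets(new_prefix=64)))  # 2001:db8:1234::/64 (the first subnet)
```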

The BGP protocol

Most of this chapter provides background information about BGP that
doesn't directly impact operation. If you want to skip this for now,
please skip ahead to the last section in this chapter, BGP operation.

“BGP” stands for “border gateway protocol”. Back in 1989, when the
first BGP specification was published, the word “gateway” was used
for what we now call a router. So BGP really means “border router
protocol”. A border router is, of course, the last router in your network,
which connects to the first router in the next network. BGP is the
protocol these two border routers in neighboring networks use to
exchange routing information.

This makes BGP an “exterior gateway protocol” (EGP), not to be
confused with the exterior gateway protocol that’s actually called EGP
[RFC 904], which has long been obsolete. All other routing protocols
are “interior gateway protocols” (IGPs), meant for handling routing
within a single network. Networks that run BGP almost always also
run one of the IGPs to handle their internal routing.

The IETF
Internet protocols such as BGP are developed and maintained by the
Internet Engineering Task Force (IETF). The IETF is an unusual
standards organization, as it doesn’t have members: everyone can
participate simply by joining the mailing lists for the different working
groups. Three times a year, there are IETF meetings. The meeting fee
(currently $875) is the main source of revenue for the IETF. As there is
no formal membership, IETF decision making is done by “rough
consensus”. This means a decision must be supported by a large majority
of those who express an opinion, but it doesn’t have to be completely
unanimous.

IETF standards and other documents are published as a “request for
comments” (RFC). Each RFC has a number. A new version of a
document is published under a new RFC number. RFCs start their life as a
working group “internet-draft”, which is iterated until the document is
ready for publication as an RFC. Individuals may also write drafts,
which may or may not be adopted by a working group and progress to
an RFC.

Not every RFC is a standard. RFCs that specify a protocol that is
intended to become an official standard at some point are published as
“standards track”. Within standards track, a document starts as a
“proposed standard”. There used to be an intermediate “draft standard”
stage, but no new documents enter it. After significant operational
experience and refinement, a protocol specification may become an
official internet standard.

BGP-4 [RFC 4271] is still a draft standard: moving protocols along
through the standards track process isn’t always given the highest
priority within the IETF.

Protocol specifications may also be published as “experimental”.
Documents of various kinds are published as “informational” and
operational guidance may become “best current practice” and receive a BCP
number. When a document is no longer relevant it is given the status
“historic”.

The best way to read RFCs online is as the HTML version at the RFC
Editor website www.rfc-editor.org. Originally, RFCs were published in
a very simple text-only format. The HTML versions add information
about a document’s status at the top, as well as links to related RFCs.

Distance vector vs link state


And now it's time for some routing protocol theory. There are two
ways to distribute routing information through a network: distance
vector and link state. The idea behind distance vector is that a router
collects routing information (paths towards each prefix) from its
neighbors, then chooses the best path towards each prefix, and tells its
neighbors that best path. Alternative paths that are not considered
“best” at this time thus remain hidden from other routers.

With link state protocols, a router doesn’t tell its neighbors about the
conclusions of its path calculations, but rather, the data it used to reach
those conclusions. So each router independently calculates the best path
to reach each destination.

Link state protocols have the advantage that they’re faster than
distance vector protocols. With a link state protocol, whenever a router
detects that it has lost the connection to a neighboring router, it will
send out an update to its remaining neighbors, which is quickly
“flooded” throughout the network. Then each router recalculates the
best paths. With a distance vector protocol, a router first has to
recompute all paths, and only then it can inform its remaining neighbors of
the change.

A limitation of link state protocols is that all routers must use the same
algorithm and the same parameters to calculate paths. If they didn’t,
routing loops would be possible.

The main example of a distance vector protocol is RIP [W]. RIP is a very
simple protocol that uses a hop count as a way to determine which
path is best. That can mean that one 1 Gbps hop is preferred over two
10 Gbps hops, which is usually not what you’d want. A big downside
of RIP is that it's very slow to react to lost connectivity due to the
count-to-infinity problem [W]. The current IPv4 version of RIP is
RIPv2, the IPv6 version is RIPng.

Cisco built its own more advanced distance vector routing protocols:
IGRP and EIGRP [W].

OSPF is the most widely used example of a link state protocol.
With OSPF, each link between two routers has a “cost” associated with
it, and OSPF then uses the “Dijkstra” a.k.a. “shortest path first” (SPF)
algorithm to calculate the best path between any two points in the
network. The current version for IPv4 is OSPFv2 [RFC 2328] and for
IPv6 OSPFv3 [RFC 5340].
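
To make the difference with hop counting concrete, here is a minimal sketch of the Dijkstra algorithm in Python. The three-router topology and the link costs (10 for a 1 Gbps link, 1 for a 10 Gbps link) are assumptions chosen for the illustration; they are not OSPF defaults.

```python
import heapq

def shortest_path_costs(graph, source):
    """Dijkstra: return the lowest total cost from source to every reachable node."""
    costs = {source: 0}
    queue = [(0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if cost > costs.get(node, float("inf")):
            continue  # stale queue entry, a cheaper path was already found
        for neighbor, link_cost in graph[node].items():
            new_cost = cost + link_cost
            if new_cost < costs.get(neighbor, float("inf")):
                costs[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor))
    return costs

# Hypothetical topology: A-C is a single 1 Gbps link (cost 10),
# A-B and B-C are 10 Gbps links (cost 1 each).
graph = {
    "A": {"B": 1, "C": 10},
    "B": {"A": 1, "C": 1},
    "C": {"A": 10, "B": 1},
}

print(shortest_path_costs(graph, "A"))  # {'A': 0, 'B': 1, 'C': 2}
```

With these costs, the path from A to C over two fast hops (total cost 2) beats the single slow hop (cost 10), where a pure hop count would have picked the direct link.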

IS-IS [W] is a link state protocol created for the OSI CLNP protocol. It
was later extended to also support routing IPv4 and IPv6. IS-IS is
mainly used in very large IP networks.

Which brings us to BGP: is it a distance vector or a link state protocol?

As we’ll discuss in the chapter Transit and peering, internet routing


requires using policies that limit the propagation of routing
information. This makes it impossible to use a link state routing protocol for
inter-domain routing. So BGP is mostly a distance vector protocol, but
unlike other distance vector protocols, BGP carries path information in
its updates. This makes it possible to detect routing loops much faster,
so BGP can reroute more quickly after a failure than a simple distance
vector routing protocol such as RIP.
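
The loop detection itself is simple: a router rejects any route whose AS path already contains its own AS number. A rough sketch, with a hypothetical function name and AS numbers from the range set aside for documentation:

```python
def accept_route(local_as, as_path):
    """Reject a route whose AS path already contains our own AS number."""
    return local_as not in as_path

# AS 64500 is "us" in this example
print(accept_route(64500, [64501, 64502, 64503]))  # True
print(accept_route(64500, [64501, 64500, 64503]))  # False: the route looped back to us
```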

BGP versions
BGP version 1 was published in 1989 [RFC 1105]. Versions 2
[RFC 1163] and 3 [RFC 1267] quickly followed over the next two
years. With version 3, BGP looked a lot like the BGP we know today,
except that it still only supported classful addressing. BGP-4 added
support for classless inter-domain routing. BGP-4 was first published
in 1994 [RFC 1654]. There have been two revisions of the specification
(not of the protocol), with the most recent one published in 2006
[RFC 4271].

Amazingly, we still use BGP version 4 today, 28 years after the protocol
specification was first published. There are two reasons for this:

1. It's really hard to change the routing protocol that's used
internet-wide.

2. BGP-4 is designed to be extended in backward compatible ways,
so new features could be added without having to create a new
version of the protocol.

Autonomous Systems
Networks that run BGP are called autonomous systems (ASes). The idea
is that each AS presents a consistent view of itself to the outside world,
and what happens inside an AS is irrelevant to other ASes, as far as
BGP is concerned.

One definition of an AS is “all routers under common administrative
control”. However, that definition doesn’t work for service provider
networks, as the service provider only has administrative control over
its own routers; many customers administer their own routers
themselves. But if these customer routers don’t run BGP themselves, they’re
still part of the service provider’s AS.

Each AS has an AS number. These used to be 16-bit numbers, but BGP
was extended to support 32-bit (sometimes called “4-byte” or
“4-octet”) AS numbers. As of the middle of the 2010s, all BGP routers
support 32-bit AS numbers. But if a router doesn’t understand 32-bit
AS numbers, it will simply see AS number 23456 any time an AS
number shows up that’s not 16-bit compatible.
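
In other words, 23456 acts as a placeholder. The sketch below illustrates the substitution; the function name is made up, and the real mechanism involves additional path attributes that aren't shown here.

```python
AS_TRANS = 23456  # placeholder seen by routers limited to 16-bit AS numbers

def as_seen_by_16bit_router(as_number):
    """Return the AS number a router without 32-bit AS support would see."""
    return as_number if as_number <= 0xFFFF else AS_TRANS

print(as_seen_by_16bit_router(64500))   # 64500 (fits in 16 bits)
print(as_seen_by_16bit_router(196608))  # 23456 (needs more than 16 bits)
```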

BGP neighbor relationships


Like all routing protocols, BGP maintains relationships with
neighboring routers. Unlike other routing protocols, BGP doesn’t discover
neighboring routers automatically. Instead, BGP neighbor relationships
must be explicitly set up on both sides through administrative
configuration. I.e., you’ll have to tell the router the IP addresses of its
neighbors along with the remote AS number and other information that’s
relevant to that specific neighbor relationship. We’ll start doing that in
the chapter BGP configuration 101.

BGP routers communicate with their neighbors over TCP port 179.
Both neighbors try to connect to the other on port 179. This means that
sometimes router A is the “client” and router B is the “server”, and
sometimes the other way around. After the TCP session has been
established, the two routers start to exchange BGP messages. The TCP
session stays connected indefinitely. So it’s not unusual to see BGP
TCP sessions that have been up for weeks or even months.

When the TCP session goes away, the BGP routers on both sides throw
out all the routing information they’ve learned over that BGP session
and then try to set up a new TCP session.

BGP messages
When a BGP TCP session connects, the two routers will start to
exchange BGP messages. The following is a brief description of each
message type; for detailed information see section 4 of RFC 4271.

All BGP messages start with a “marker” for compatibility with older
BGP versions, with the rest of the message following the
type-length-value model [W]. There are five BGP messages:

1. Open

2. Update

3. Notification

4. Keepalive

5. Route-refresh

The Open message contains a version field, which was useful during
the transition from BGP-3 to BGP-4. The router also puts its AS
number, its router ID and its configured hold time in the Open message.
The router ID is a 32-bit value that’s unique for a router (usually one of
its IPv4 addresses) and the hold time is how long the router will wait
before declaring the BGP session dead when it doesn’t see any
incoming BGP messages.

Last but not least, there’s room for optional parameters. These are
typically used to negotiate the use of BGP extensions.

The Update message does most of BGP’s heavy lifting. An Update
message can carry withdrawn routes, new routes or both. Any
withdrawn routes simply go in the “withdrawn routes” field. New routes,
if they're included in the update, use two fields: path attributes and
NLRI.

The withdrawn routes are routes (prefixes) that the neighbor had
previously told us we could reach through them, but now this is no longer
the case. So the local router removes those paths from its BGP table.
See the section BGP operation later this chapter for how this works.

Path attributes are different kinds of information that BGP associates
with each prefix. The two most important ones are the AS path, which
shows all the ASes between the local router and the destination prefix,
and the next hop address, which is the address we have to send
packets to in order for those packets to reach the destination in
question.

NLRI stands for network layer reachability information, which is just a
fancy way of saying “one or more IP prefixes”. There’s only one set of
path attributes, so if the NLRI field contains multiple prefixes, those all
have the same path attributes. Prefixes with different path attributes
are transmitted in separate Update messages.

The Keepalive message contains no information: it just has the fixed
marker, the type is 4, indicating a Keepalive message, and the length is
19, which is just the size of the message header itself. Keepalive
messages are sent periodically in order to make sure that the neighbor
sees we're still alive and thus the session’s hold timer at the
neighbor’s side doesn’t reach zero. See the Making BGP faster chapter
for more information.
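
To make the message layout concrete, here is a minimal Python sketch that builds a Keepalive message. It assumes the header layout from RFC 4271: a 16-byte marker of all ones, a two-byte length that counts the whole message including the 19-byte header, and a one-byte type, with type code 4 for Keepalive. The function name is only for illustration.

```python
import struct

MARKER = b"\xff" * 16  # 16 bytes of all ones
HEADER_LENGTH = 19     # marker (16) + length (2) + type (1)
KEEPALIVE = 4          # message type code for Keepalive in RFC 4271


def bgp_message(msg_type: int, body: bytes = b"") -> bytes:
    """Prepend the BGP message header to a message body."""
    length = HEADER_LENGTH + len(body)
    # "!HB": network byte order, 16-bit length, 8-bit type
    return MARKER + struct.pack("!HB", length, msg_type) + body


keepalive = bgp_message(KEEPALIVE)
print(keepalive.hex())
```

A Keepalive has an empty body, so the whole message is just the header.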

Routers send a Notification message when they need to tear down the
BGP session. This is usually because an error has occurred, but also
when the session needs to be terminated because of maintenance, or as
part of capabilities negotiation. The Notification message has an error
code and an error subcode as well as room for optional additional
data.

The Route-refresh message is an addition to BGP [RFC 2918] to
allow a router to ask a neighbor to send all BGP updates again. This way,
new filters can be applied to those updates. See the chapter Filtering
BGP for more information.

Path attributes
There are four types of path attributes:

1. Well-known mandatory: all prefixes must carry this path
attribute.

2. Well-known discretionary: all BGP implementations must be
able to process this path attribute, but prefixes may or may not
carry this attribute.

3. Optional transitive: BGP implementations aren't required to
process these. If a router encounters an optional transitive path
attribute that it doesn't understand, it has to propagate the
attribute to its neighbors unchanged.

4. Optional non-transitive: BGP implementations aren't required to
process these. If a router encounters an optional non-transitive
path attribute that it doesn't understand, it removes the
attribute.
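
The rules in this list can be summarized in a few lines of Python. This is only a sketch of the decision a router makes for an attribute it doesn't recognize; the function name and the returned strings are invented for the example, and treating an unrecognized well-known attribute as an error is an assumption based on the requirement that all implementations understand those.

```python
def handle_unrecognized_attribute(optional: bool, transitive: bool) -> str:
    """Decide what to do with a path attribute the router doesn't understand."""
    if not optional:
        # Well-known attributes must be understood by every implementation,
        # so not recognizing one is treated as an error in this sketch.
        return "error"
    if transitive:
        # Optional transitive: pass it on to the neighbors.
        return "propagate"
    # Optional non-transitive: remove the attribute.
    return "drop"


print(handle_unrecognized_attribute(optional=True, transitive=True))
print(handle_unrecognized_attribute(optional=True, transitive=False))
```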

IANA is the organization that keeps track of internet-related protocol
numbers. The IANA BGP attributes registry currently lists nearly 40
path attributes. These are the ones defined in the BGP specification:

1. ORIGIN (well-known mandatory): indicates whether a path was
learned from an IGP, from the EGP protocol or is “incomplete”,
meaning it was learned through some other means. The ORIGIN
attribute doesn't seem to perform any function.

2. AS_PATH (well-known mandatory): the list of ASes that have
“seen” this path. Used for loop suppression and may also be
used for filtering and policy.

3. NEXT_HOP (well-known mandatory): the address of the next hop
router, which is normally the address of the BGP neighbor that
sent the update.

4. MULTI_EXIT_DISC (optional non-transitive): the multi exit
discriminator (MED) is also often called “metric”. It is used to choose
between paths learned from the same neighboring AS.

5. LOCAL_PREF (well-known mandatory): the local preference
carries a path’s degree of preference. This attribute must be present
on updates within an AS (iBGP), but not on updates that go to
external ASes (eBGP).

6. ATOMIC_AGGREGATE (well-known discretionary): used when
routers perform aggregation. This was relevant in the transition
from BGP-3 to BGP-4, but is rarely used today.

7. AGGREGATOR (optional transitive): also used for aggregation.

The following are path attributes that were added later to BGP, and are
thus optional.

• COMMUNITY (transitive, [RFC 1997]): carries one or more 32-bit
labels that can be used for various purposes. See the Filtering BGP
and Traffic engineering chapters for more information.

• ORIGINATOR_ID and CLUSTER_LIST (non-transitive, [RFC 4456]):
used by BGP route reflectors, see the chapter iBGP.

• MP_REACH_NLRI and MP_UNREACH_NLRI (non-transitive, [RFC
4760]): carry multiprotocol extensions, see the section
Multiprotocol BGP later this chapter.

• EXTENDED COMMUNITIES (transitive, [RFC 4360]): supports larger
communities of different types. Not very widely used (for internet
routing) because each of the different types of extended
communities needs to be supported explicitly by a BGP
implementation.

• AS4_PATH (transitive, [RFC 6793]): carries the 32-bit version of
the AS path. 32-bit capable routers update both the AS4_PATH as
well as AS_PATH, inserting “23456” as a placeholder for 32-bit AS
numbers. 16-bit capable routers of course only update the AS_PATH,
but the next 32-bit router will add any AS hops missing
from the AS4_PATH using the AS_PATH.

• LARGE_COMMUNITY (transitive, [RFC 8092]): larger communities.

• BGPsec_Path (non-transitive, [RFC 8205]): path attribute that
carries the protected AS path as per the BGPsec security
mechanism. See the section on BGPsec in the BGP security chapter.

Multiprotocol BGP
The routing protocols we still use today were all initially created in the
1980s, long before IPv6 saw the light of day. Of course, once IPv6
arrived, it also needed routing protocols. For RIP and OSPF, new
versions of those protocols were built from the ground up. This is the
“ships in the night” concept: RIPv2 and OSPFv2 handle IPv4 routing
while RIPng and OSPFv3 handle IPv6 routing. Other than their basic
design, the IPv4 and IPv6 versions of these routing protocols are
completely separate and they don’t interact at all.

IS-IS uses the opposite approach: the one IS-IS protocol handles IPv4
and/or IPv6 routing alongside the OSI CLNP for which it was created.

Like IS-IS and unlike RIP and OSPF, there’s just one BGP that handles
both IPv4 and IPv6. This is made possible by the BGP multiprotocol
extensions [RFC 4760].

Rather than just add support for IPv6, multiprotocol BGP adopts the
“address family” concept, with IPv4 and IPv6 being different address
families, along with other address families such as Ethernet VPN
(EVPN).

A set of related protocols, such as TCP/IP, is called a protocol stack or a
protocol family. At some point, the idea was that a protocol family like
TCP/IP would support multiple address families, but that never
worked out, so in practice there is no difference between a protocol
family and an address family. The term “address family” is the one
used with multiprotocol BGP, although “protocol family” would
probably be clearer.

In BGP, address families are identified with the address family
identifier (AFI). There’s also a subsequent address family identifier (SAFI)
that’s used to differentiate between (for instance) prefixes used for
unicast (one-to-one) and multicast (one-to-many) communication. IANA
maintains AFI and SAFI registries.

Multiprotocol BGP is implemented through two new path attributes
already mentioned earlier this chapter: MP_REACH_NLRI and
MP_UNREACH_NLRI. The MP_UNREACH_NLRI replaces the “withdrawn routes”
field in the BGP Update message, containing an AFI and SAFI and
then NLRI formatted in the way specified for that AFI and SAFI. For
instance, IPv4 NLRI is encoded as a one-byte length field that holds
the prefix length and a variable length prefix field.

So for instance a /20 prefix is encoded as a length byte holding the
value 20, followed by three bytes that hold the 20-bit prefix padded
to 24 bits. Interestingly, the RFC that describes the use of the
multiprotocol extensions for IPv6 [RFC 2545] doesn’t even bother
specifying the same for IPv6. Obviously the only difference is that
the prefix length can now be up to 128 rather than 32.
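
The prefix encoding described above can be sketched in a few lines of Python using the standard ipaddress module. The function name is made up for this example, and the prefixes are just sample values.

```python
import ipaddress


def encode_nlri(prefix: str) -> bytes:
    """Encode one prefix as a length byte followed by the prefix bytes."""
    net = ipaddress.ip_network(prefix)
    # Only as many bytes as needed to hold the prefix bits are sent,
    # so the prefix is padded to a whole number of bytes.
    n_bytes = (net.prefixlen + 7) // 8
    return bytes([net.prefixlen]) + net.network_address.packed[:n_bytes]


print(encode_nlri("10.1.16.0/20").hex())   # length byte + 3 prefix bytes
print(encode_nlri("2001:db8::/32").hex())  # same format, IPv6 prefix
```

The same function works for IPv4 and IPv6 because the only difference is the maximum prefix length.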

The MP_REACH_NLRI attribute replaces the NLRI field in the Update
message. Like MP_UNREACH_NLRI it holds an AFI, SAFI and NLRI, but
in addition to those fields, also a next hop length field and a variable
length next hop field.

All interfaces that have IPv6 enabled must have a link local address in
addition to any regular global unicast addresses. Link local addresses
are addresses that are only used locally on a subnet, and thus don’t
have to be globally unique. They fall within the prefix fe80::/64.
Routes in routing tables typically point to the link-local addresses of
routers.

Because IPv6 requires routing protocols to carry link local addresses,
when using the IPv6 AFI, multiprotocol BGP carries a link local next
hop address as well as a global next hop address where appropriate.
When only a global next hop address is present, the next hop length is
16 (128 bits), when there’s also a link-local next hop address, it’s 32 (2 ×
128 bits).

MP_REACH_NLRI and MP_UNREACH_NLRI replace the withdrawn routes
and NLRI fields in the Update message, so these remain empty in
multiprotocol BGP operation. Other path attributes are included as usual.

With multiprotocol BGP, it’s possible to run the BGP TCP session either
over IPv4 or over IPv6, and the session can carry IPv4 and/or IPv6
prefixes. Routers will announce the AFIs/SAFIs they want to enable on
a new session in the Open message. To avoid problems with next hop
address processing, it’s best to use an IPv4 BGP session to exchange
IPv4 prefixes with neighboring ASes and an IPv6 BGP session for IPv6
prefixes. For iBGP this is slightly different, as we’ll see in the iBGP
chapter.

BGP states and finite-state machine


A BGP session can be in one of six states. The relationship between
these states and the 28 events that can move the session from one state
to another are modeled using a finite-state machine (FSM) [W]. The
BGP RFC describes the FSM in detail; this is a simplified version:

[Figure 1. A simplified version of the BGP finite-state machine,
showing the states Idle, Active, Connect / OpenSent / OpenConfirm
and Established]

BGP sessions start in the Idle state. In the Idle state, the router doesn’t
try to connect to the neighbor in question, and incoming connection
attempts are rejected. It is possible to move directly from the Idle to the
Connect state, but usually, when the router is ready to start a BGP
session, the session first moves to Active.

In the Active state, there is no active connection yet, but the router
actively tries to connect to its neighbor. From Active, the connection can
move to Connect, OpenSent or OpenConfirm.

Usually, a BGP session progresses quickly through the Connect,
OpenSent and OpenConfirm states, so in figure 1 above those states
are collapsed into a single one. Upon various error conditions, a
connection may revert to Idle or Active. But if everything goes
according to plan, the session moves into the Established state.

In the Established state, the two routers on opposite sides of the BGP
session are ready to exchange routing information in the form of BGP
Update messages. It may take some time for the initial set of updates
to be exchanged after a session enters the Established state. If an error
occurs the session returns to the Idle state.
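
The simplified state machine can be modeled as a lookup table, as in the following Python sketch. This is not the full FSM from the BGP RFC: the three handshake states are collapsed into one, and the event names are hypothetical labels chosen for this example rather than the 28 events the RFC defines.

```python
from enum import Enum


class State(Enum):
    IDLE = "Idle"
    ACTIVE = "Active"
    OPENING = "Connect / OpenSent / OpenConfirm"  # collapsed handshake states
    ESTABLISHED = "Established"


# (current state, event) -> next state; event names are made up
TRANSITIONS = {
    (State.IDLE, "start"): State.ACTIVE,
    (State.ACTIVE, "tcp_connected"): State.OPENING,
    (State.OPENING, "handshake_done"): State.ESTABLISHED,
    (State.ACTIVE, "error"): State.IDLE,
    (State.OPENING, "error"): State.IDLE,
    (State.ESTABLISHED, "error"): State.IDLE,
}


def next_state(state: State, event: str) -> State:
    """Return the next state, staying put on events that don't apply."""
    return TRANSITIONS.get((state, event), state)


def run(events, state=State.IDLE):
    """Feed a sequence of events through the state machine."""
    for event in events:
        state = next_state(state, event)
    return state


print(run(["start", "tcp_connected", "handshake_done"]).value)
```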

BGP operation
In this section, we'll have a look at how BGP routers exchange prefixes.
An important rule is that a router may only propagate (announce) to
its neighbors paths that it actually uses itself. So if a router has a choice
of multiple paths towards a given destination prefix, it must first select
the best one out of these paths.

BGP best path selection is somewhat complex, and we’ll discuss it in
more detail in the Traffic engineering chapter. For now, we’ll just look
at the AS path length, and consider the path with the smallest number
of AS hops in the AS path best.

We’ll look at the flow of BGP updates between two autonomous
systems, AS 10 to the left and AS 40 to the right. At this point, AS 10 and
AS 40 don’t have a BGP session established between them yet:

AS 10                          AS 40
  Network       Path             Network       Path
> 192.0.2.0     20 30 82       > 192.0.2.0     82
> 198.51.100.0  4206           > 198.51.100.0  4206

Both ASes have two prefixes in their BGP table: the 192.0.2.0/24 and
the 198.51.100.0/24 prefixes. (The /24 prefix length is implied for
these class C networks.) AS 10 can reach the 192-prefix through two
intermediate hops and is directly connected to AS 4206, the origin of
the 198-prefix. For AS 40, both prefixes are reachable directly over one
hop paths.

Assuming no filters, when the BGP session between AS 10 and AS 40
establishes, they each send a copy of their full BGP table to their
neighbor:

AS 10                                AS 40
  Network       Path                   Network       Path
> 192.0.2.0     40 82        <=      > 192.0.2.0     82
                20 30 82     =>                      10 20 30 82
  198.51.100.0  40 4206      <=      > 198.51.100.0  4206
>               4206         =>                      10 4206

So at this point, both ASes now have two paths towards each prefix:
the one they already had, and the new one from the other AS. By
sending each other copies of these prefixes, the routers in both ASes invite
the other to send traffic to these destinations through them.

What we see, as indicated by the > character, is that the AS 10 router
takes AS 40 up on its offer to send traffic to the 192-prefix through AS
40, as that path is two hops (40 82) while the path AS 10 already had
is three hops (20 30 82). (When a router propagates a path, it adds its
own AS number to the left of the existing AS path.)

However… moments earlier the AS 10 router had invited AS 40 to
send traffic to the 192-network through AS 10. Should AS 40 want to take
AS 10 up on that offer, we’d be in the situation where AS 10 tries to
reach 192.0.2.0/24 through AS 40, while AS 40 tries to reach
192.0.2.0/24 through AS 10. This means we have a routing loop on
our hands and packets will ping-pong between AS 10 and AS 40.

So in order to prevent this eventuality, the AS 10 router sends an
Update message to the AS 40 router withdrawing the 192-prefix. When AS
40 has processed this update and removed from its BGP table the path
towards the 192-prefix through AS 10, BGP has reached a stable state:

AS 10                                AS 40
  Network       Path                   Network       Path
> 192.0.2.0     40 82                > 192.0.2.0     82
                20 30 82     =>                      x
  198.51.100.0  40 4206              > 198.51.100.0  4206
>               4206                                 10 4206

Note that for the 198-network, each router keeps the new path in its
BGP table. For both, their original path is shorter and therefore
preferred over the new path learned from the other AS. So in this case,
there is no conflict. Should one of the routers decide to start using the
path through the other, it will send a withdraw at that point.

In this stable state, no further updates are sent, just periodic Keepalive
messages to make sure the BGP session is still operational. When a
router stops receiving Keepalive messages or loses the BGP TCP
session, it removes all paths learned from the neighbor from its BGP table,
selecting new best paths as necessary, and starts trying to re-establish
the BGP session. When it does, prefixes are exchanged again as
described above.
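
The exchange between AS 10 and AS 40 can be replayed in a short Python sketch. It only models what this section describes: the best path is the one with the fewest AS hops, a router prepends its own AS number when announcing, and AS 10 withdraws a prefix once its best path goes through AS 40. The data structures and function names are invented for the example, and real BGP best path selection involves many more steps.

```python
def best_path(paths):
    """Pick the path with the fewest AS hops (first one wins on a tie)."""
    return min(paths, key=len)


def announce(sender_as, table):
    """Announce each prefix's best path with the sender's AS prepended."""
    return {prefix: [sender_as] + best_path(paths)
            for prefix, paths in table.items()}


# BGP tables before the session comes up: prefix -> list of AS paths
as10 = {"192.0.2.0/24": [[20, 30, 82]], "198.51.100.0/24": [[4206]]}
as40 = {"192.0.2.0/24": [[82]], "198.51.100.0/24": [[4206]]}

# Session establishes: both sides send their table to the other
from_10, from_40 = announce(10, as10), announce(40, as40)
for prefix, path in from_40.items():
    as10[prefix].append(path)
for prefix, path in from_10.items():
    as40[prefix].append(path)

# AS 10 withdraws prefixes for which it now prefers the path through AS 40
for prefix, paths in as10.items():
    if best_path(paths)[0] == 40:
        as40[prefix] = [p for p in as40[prefix] if p[0] != 10]

for name, table in (("AS 10", as10), ("AS 40", as40)):
    for prefix, paths in table.items():
        print(name, prefix, "best:", best_path(paths), "all:", paths)
```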

BGP prerequisites

In this chapter we’ll have a look at what you need to have in place
before you can get started setting up BGP. Those prerequisites are:

1. Connectivity

2. Routers

3. IP addresses

4. An autonomous system number

Connectivity
It’s a good idea to first select one or more internet service providers
(ISPs) that will provide you with connectivity to the internet with BGP.
They may want to provide input on which routers you should get,
they may be able to help you with obtaining IP addresses, and you
need to list two ASes that you’re going to connect to over BGP in order
to get an AS number.

Using BGP with one connection to one ISP doesn’t make much sense,
as the added complexity of running BGP doesn’t provide any direct
benefits. (Unless you’re starting with one ISP and will be adding
another later.)

However, you may want to use multiple connections to the same ISP.
There are several ways to do this, such as using static routes, VRRP,
RIP or OSPF. It is of course also possible to use a full BGP setup, with
your own IP addresses and a “real” AS number. But it may prove
difficult to obtain an AS number in this situation.

A good solution for multiple connections to the same ISP is using BGP
with IP addresses provided by the ISP and a private AS number. BGP
can then handle distributing traffic over the two (or more) connections
and rerouting when failures occur. However, the ISP will not
propagate your BGP updates to the rest of the world. Talk to your ISP about
how to set this up.

In most situations, you’ll want to have two ISPs. Starting with more
than two would be unnecessarily complex, but adding more ISPs later
is certainly possible. The idea is that both ISPs send you a full copy of
the global BGP table, so you’ll have two paths towards each prefix.
Your routers will then select the best path of the two for each prefix.

If you don’t mind that all outgoing traffic goes through just one ISP,
you may be tempted to simply accept a default route through BGP
rather than the full BGP table. However, this has the downside that if a
certain destination is not reachable through ISP A (which sends you
the higher priority default route) but that destination is reachable
through ISP B (which sends you the backup default route), then your
router will send the traffic to ISP A and it won’t reach the destination.
With full tables from both ISP A and B, the prefix in question will
simply be rerouted through ISP B and it will still be reachable.

One or more destinations becoming unreachable through one ISP but
remaining reachable through another ISP is relatively rare, but
certainly not unheard of. It even happens on purpose with some regularity as
networks may “depeer” [W].

We’ll talk much more about peering in the chapter Transit and peering,
but for now, it’s important to know that there is a small group of tier 1
networks [W] which solely rely on direct peering links between them.
Sometimes they disagree about the conditions for interconnecting
directly and may temporarily suspend the connectivity between their
networks (depeer). When this happens, customers of one network can't
reach customers of the other network.

All ISPs other than the 15 or so tier 1 ISPs depend on one or more of
the tier 1 ISPs for at least some of their connectivity. So when selecting
ISPs, make sure it’s not two ISPs that both depend solely on one of the
tier 1s. Most smaller ISPs buy service from multiple larger ones, so this
is unlikely to be an issue, but it's definitely a good idea to ask both
about this.
Router hardware
First of all, you need at least two routers, so if one fails or needs
maintenance, you have a second one that keeps you connected to the
internet. An important question is if your two BGP routers should be the
same, or different from each other.

Two routers that are the same or similar models from the same
vendor are much easier to work with than two very different models
or especially routers from different vendors. However, routers running
the same software are susceptible to the same bugs. So a complete
monoculture is less than ideal. However, being hit by show-stopping
bugs is relatively rare, so diversifying your router portfolio is probably
something that can wait until your network grows beyond two or
three routers. Another thing to keep in mind is that if you buy two
identical routers, they’re likely to reach the end of their useful lifetime
at the same time, so you may need to buy two new ones at the same
time again at some point in the future.

Try to avoid getting “customer premises equipment” routers from
your ISPs, as best case, these don't add anything, and worst case,
they'll make your BGP setup more complex and less reliable. However,
pay attention to the advice about router hardware and requirements
from your ISPs to make sure everything works well together. If an ISP
has esoteric requirements, that probably means they’re not used to
dealing with BGP customers, so proceed with caution in this situation.

There are basically two types of routers: devices that are sold as
“routers” or perhaps “switches”, and general purpose computer
hardware running router software. The advantage of special purpose
devices is that you don't have to worry about the parts working well
together. The advantage of software routers is the added flexibility.

As IBM itself mentions in the history section of its website, back in the
day there was a catch phrase in the industry: “Nobody ever got fired
for buying IBM”. In other words: there’s safety in going with the
market leader. When it comes to routers, that would be Cisco and Juniper.
Which is of course not to say that other vendors, such as Extreme,
Arista, Nokia or even maker of budget BGP-capable routers Mikrotik,
don’t build quality products.

As for software BGP implementations, the most well-known is the
open source Quagga routing suite. Quagga was forked [W] from its
predecessor Zebra when Zebra development stalled. Quagga itself
later suffered the same problem and was forked as FRRouting. BIRD is
another open source routing suite, which is especially popular for route
servers at internet exchanges. OpenBGPD is an open source BGP daemon
that’s part of the OpenBSD project.

There used to be a sharp dividing line between routers and switches,
but that line has blurred a lot over the years. Today, most higher end
switches can do IP routing. So when looking for router hardware, don't
limit yourself to devices with “router” in the name, but check for the
actual capabilities. However, many switches that can do IP routing and
BGP can’t handle a full BGP table.

By the end of 2021, the IPv4 BGP table reached 900k (900,000) prefixes.
The growth rate of the IPv4 BGP table has been declining from an
average 10% per year in the 2010s to 6% in 2021. The IPv6 BGP table was
about 145k by the end of 2021, but has been growing much faster at
31% per year the past few years and 37% in 2021. Based on the 2021
growth rates, we can predict the following table sizes over the next
years:

• End of 2022: 955k IPv4, 200k IPv6

• End of 2023: 1010k IPv4, 275k IPv6

• End of 2024: 1070k IPv4, 375k IPv6
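
These predictions are simple compound growth, which the following Python sketch reproduces from the end-of-2021 table sizes and the 2021 growth rates given above. The function name is made up for the example. The results are rounded to the nearest thousand, so they differ slightly from the figures in the list, which are rounded more coarsely.

```python
def project(size_k: float, rate: float, years: int) -> list:
    """Project a table size (in thousands of prefixes) with compound growth."""
    sizes = []
    for _ in range(years):
        size_k *= 1 + rate
        sizes.append(round(size_k))
    return sizes


ipv4 = project(900, 0.06, 3)  # 900k prefixes, growing 6% per year
ipv6 = project(145, 0.37, 3)  # 145k prefixes, growing 37% per year

for year, v4, v6 in zip((2022, 2023, 2024), ipv4, ipv6):
    print(f"End of {year}: {v4}k IPv4, {v6}k IPv6")
```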

Routers have three tables that hold BGP routing information:

1. The BGP RIB (routing information base): this table holds all BGP
information received from all neighbors. So with two ISPs
sending full IPv4 and IPv6 tables, that's 2 × (900k + 145k) = 2.1
million prefixes, along with their path attributes.

2. The main IPv4 and IPv6 RIBs or just “routing table”: these tables
hold the best route from each routing protocol. So if your
network has 5000 IPv4 and 5000 IPv6 OSPF routes in addition to the
full IPv4 and IPv6 BGP tables, the main routing tables hold 900k
+ 5k = 905k IPv4 and 145k + 5k = 150k IPv6 routes.

3. The Forwarding Information Base (FIB), which is used for
forwarding packets. The IPv4 FIB is a copy of the IPv4 main routing
table (905k in our example) and the IPv6 FIB a copy of the IPv6
main routing table (150k).
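
The arithmetic for the three tables can be written out as a small Python sketch. The numbers are the ones from the example above (two full feeds, 5000 OSPF routes per address family); the function and key names are only for illustration.

```python
def table_sizes(ipv4_bgp, ipv6_bgp, feeds, igp_v4, igp_v6):
    """Estimate the number of entries in the BGP RIB, main RIBs and FIBs."""
    main_v4 = ipv4_bgp + igp_v4  # best BGP route per prefix plus IGP routes
    main_v6 = ipv6_bgp + igp_v6
    return {
        # The BGP RIB holds every path from every neighbor
        "bgp_rib": feeds * (ipv4_bgp + ipv6_bgp),
        "main_rib_v4": main_v4,
        "main_rib_v6": main_v6,
        # The FIBs are copies of the main routing tables
        "fib_v4": main_v4,
        "fib_v6": main_v6,
    }


sizes = table_sizes(ipv4_bgp=900_000, ipv6_bgp=145_000, feeds=2,
                    igp_v4=5_000, igp_v6=5_000)
for name, entries in sizes.items():
    print(f"{name}: {entries:,}")
```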

A router with several Gigabit Ethernet interfaces may need to forward
millions of packets per second. For each of these packets, the router
must look up the next hop address/interface in the FIB, so the FIB is
heavily optimized to be searched as fast as possible. A FIB can either be
stored in RAM and be searched with a special-purpose ASIC (chip), or
the FIB can be implemented using a TCAM [W]. A TCAM is a memory
with built-in search capability. As such, it is expensive and runs hot, so
TCAMs are always limited in size. Often, a TCAM can be partitioned
into parts for an IPv4 FIB, an IPv6 FIB and firewall rules. Often, IPv6
FIB entries use more TCAM space than IPv4 FIB entries.

The BGP and main RIBs are stored in RAM. These days, RAM is
usually not a constrained resource in routers, and a few gigabytes will hold
a lot of RIB entries. But check anyway, especially if you plan to have
more than a couple of IPv4 full table BGP feeds.

The gating factor for the number of prefixes a router can handle is
usually the maximum FIB size. Currently, a router that can handle a
million prefixes can hold just the IPv4 table, with no room for IPv6 or
much growth. A router that can handle 1.5 million FIB entries will be
usable for a few more years. 2 million is the minimum for new routers
with an intended economic lifespan of five years.

IP addresses and AS numbers
Five regional internet registries (RIRs) are responsible for giving out IP
addresses and AS numbers. These are the RIRs and their service
regions:

• AfriNIC, Africa

• APNIC, Asia-Pacific

• ARIN, North America

• LACNIC, Latin America and Caribbean

• RIPE NCC, Europe, Middle East and former Soviet Union

In order to obtain address space and AS numbers directly from your
region's RIR, you need to become a “local internet registry” (LIR) and
pay one-time and yearly fees. It may also be possible to obtain address
space and AS numbers from an RIR through a service provider or
intermediary that is an LIR.

However, the supply of IPv4 address space has run out in all five
regions. AfriNIC and APNIC operate under final /11 and final /8
policies, respectively. Each LIR in the respective regions will be able to
obtain one last block of at most /22 (1024 addresses) from those RIRs.
The RIPE NCC and LACNIC even used up their respective final /8
and final /10. So along with ARIN, the RIPE NCC and LACNIC now
have a waiting list.

When LIRs request IPv4 addresses, they go on the waiting list, and
they’re given addresses as they become available after address space is
returned to the RIRs and kept in quarantine for some time. The RIPE
NCC allows for /24s, while ARIN and LACNIC allow for larger
requests but these may of course lead to longer wait times.

IPv4 address space can also be transferred from its existing holder to a
new one, usually through a broker or the RIR. In other words: you can
buy IPv4 addresses on the open market. The going rate is about 50 to
60 US$ per address at the time of this writing. Be careful when buying
IPv4 addresses: it has happened that a network sold some IPv4
addresses but then kept using those addresses itself.

There is of course more than enough IPv6 address space available. ISPs
usually get at least a /32, or more if they can document that they’ll
need the additional address space in the foreseeable future. These are
provider aggregatable (PA) address blocks, which can be used to provide
address space to customers. Networks that aren't service providers can
get provider independent (PI) address blocks, those are usually /48.
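
To get a feel for the block sizes mentioned in this section, the following Python sketch uses the standard ipaddress module to count the addresses in a /22 and the number of /48 blocks that fit in a /32. The prefixes are sample values chosen for the example, not actual allocations.

```python
import ipaddress

# A final IPv4 block of at most /22
last_block = ipaddress.ip_network("198.51.100.0/22")
print(last_block, "holds", last_block.num_addresses, "addresses")

# An ISP-sized IPv6 /32 and the number of /48 blocks that fit inside it
isp_block = ipaddress.ip_network("2001:db8::/32")
customer_blocks = 2 ** (48 - isp_block.prefixlen)
print(isp_block, "holds", customer_blocks, "/48 blocks")
```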

All five RIRs require a network to be multihomed (connect to two or
more ISPs or peers) and/or have a “unique routing policy” to be able
to request an autonomous system number. Usually these requirements
can be shown to be met by providing the AS numbers of networks
you're going to connect to.

BGP configuration 101

With all the preparations out of the way, we’re now ready to start
configuring a router to speak BGP!

The assumption is that the router is already set up, has connectivity to
two ISPs, and that the router interfaces towards those ISPs are
configured with the right IP addresses. Example 1 shows the simplest
possible BGP configuration with two ISPs.

Example 1: A very simple BGP configuration


!
router bgp 65082
 network 192.0.2.0/24
 neighbor 192.0.2.21 remote-as 65030
 neighbor 192.0.2.21 description ISP 30
 neighbor 192.0.2.41 remote-as 65040
 neighbor 192.0.2.41 description ISP 40
!

If you want to try this example and the other examples for yourself,
have a look at Installing the minilab and running examples at the end
of the book. If you've never configured a router using a Cisco-like
command line interface (CLI), have a look at Appendix: the router CLI
for a short introduction.

To avoid issues with other examples and to keep consistency
between the examples, the addresses for both BGP neighbors
(192.0.2.21 and 192.0.2.41) fall within our own prefix
192.0.2.0/24. In reality, ISPs normally provide a /30 or /29
prefix to number the link subnet between the ISP and the
customer.

The router bgp 65082 line tells the router that we want to configure
the BGP protocol, and that this router belongs to AS 65082. The next
line tells the router that we want to originate the prefix 192.0.2.0/24.
Originate means that this router injects this prefix into BGP and tells
the rest of the world that these addresses are used in our AS.

On Cisco routers, we can't specify our prefix or prefixes
using CIDR notation. Instead, we'll have to use a mask. In this
case that would be network 192.0.2.0 mask
255.255.255.0. But when displaying the configuration, the
mask part will be left out, as the mask that corresponds
to /24 is implied for class C networks. With the FRRouting
software for Linux, either prefix notation or a mask is
accepted.
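
If you have a list of prefixes in CIDR notation and need the mask form, the conversion is easy to script. The following Python sketch uses the standard ipaddress module; the function name is made up for this example and the output simply follows the network ... mask ... form shown above.

```python
import ipaddress


def network_statement(prefix: str) -> str:
    """Turn a prefix in CIDR notation into a network statement with a mask."""
    net = ipaddress.ip_network(prefix)
    return f"network {net.network_address} mask {net.netmask}"


print(network_statement("192.0.2.0/24"))
print(network_statement("198.51.100.0/22"))
```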

We can monitor the progress of the BGP session establishment with the
show ip bgp summary command. This is what an older router would
show if we asked it what's going on with BGP:

Router# show ip bgp summary
BGP router identifier 192.0.2.251, local AS number 65082
RIB entries 1, using 112 bytes of memory
Peers 2, using 40 KiB of memory

Neighbor    AS     MsgRcvd  MsgSent  TblVer  InQ  OutQ  Up/Down   State
192.0.2.21  65030       81       81       6    0     0  00:00:04  2
192.0.2.41  65040        0        0       0    0     0  never     Active

This will look a little different when you try the example yourself, as
the output of the router commands sometimes has to be edited so the
lines don't get too long and some less relevant information is left out.
Also, different routers will show slightly different output, but they will
largely show the same information.

For the first neighbor, the state is a number. This means the BGP
session is in the Established state, and the number is the number of
prefixes received and accepted from the neighbor. (I.e., prefixes filtered
out don’t count.)

Should the InQ or OutQ numbers be higher than zero, this means the
routers are still busy exchanging prefixes. However, a zero here
doesn’t necessarily mean they’re not exchanging prefixes currently.

The second neighbor is in the Active state, and has never been up (in
the Established state). If this persists or if the state goes to Idle, there’s
likely a problem that warrants talking to someone who can check the
other end of the BGP session. But in the case above we were just a bit
impatient and the second BGP session came up a few moments later.

What happened in this example is that immediately after we entered the
neighbor ... remote-as ... line, the router started trying to set up
a BGP session to the specified neighbor, even before we set up filters
or other restrictions. This way, our AS will happily propagate the
information it learns from AS 65030 to AS 65040 and vice versa.

That is not good. So some newer routers will not send any outgoing
updates until an outgoing filter or policy is configured and not accept
incoming updates until an incoming filter or policy is configured, as
per [RFC 8212]. So with FRRouting version 8 (which is used if you want
to run the examples yourself using the Docker BGP minilab), you'll get
the following results with the example 1 configuration in effect:

Router# show ip bgp summary

IPv4 Unicast Summary (VRF default):


BGP router identifier 192.0.2.251, local AS number 65082 vrf-id 0
BGP table version 1
RIB entries 1, using 192 bytes of memory
Peers 2, using 1433 KiB of memory

Neighbor AS MsgRcvd MsgSent Up/Down State/PfxRcd PfxSnt


192.0.2.21 65030 20 16 00:13:41 (Policy) (Policy)
192.0.2.41 65040 18 16 00:13:41 (Policy) (Policy)

For the moment, let's work around that by entering:


Router# conf t
Router(config)# router bgp 65082
Router(config-router)# no bgp ebgp-requires-policy
Router(config-router)# exit
Router(config)# exit
Router# clear ip bgp *

So first we add no bgp ebgp-requires-policy to the configuration,
and then issue the clear ip bgp * command to restart all the BGP
sessions so we can be sure that we're not looking at stale information.
We now get:

Router# show ip bgp summary

IPv4 Unicast Summary (VRF default):


BGP router identifier 192.0.2.251, local AS number 65082 vrf-id 0
BGP table version 4
RIB entries 9, using 1728 bytes of memory
Peers 2, using 1433 KiB of memory

Neighbor AS MsgRcvd MsgSent Up/Down State/PfxRcd PfxSnt


192.0.2.21 65030 16 12 00:00:05 2 4
192.0.2.41 65040 14 14 00:00:05 2 4

The fact that the router sends four prefixes to each neighbor is a bit
unexpected. So let's see which prefixes it's sending to neighbor
192.0.2.21:

Router# show ip bgp neighbors 192.0.2.21 advertised-routes


Status codes: s suppressed, d damped, h history, * valid, > best,
= multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop Metric LocPrf Weight Path


*> 10.0.10.0/23 0.0.0.0 0 65040 65010 i
*> 10.0.20.0/22 0.0.0.0 0 65030 65020 i
*> 10.0.30.0/23 0.0.0.0 0 65030 i
*> 10.0.40.0/21 0.0.0.0 0 65040 i

Total number of prefixes 4

It's a bit odd that FRRouting sends prefixes it just learned from AS
65030 back to AS 65030, but that shouldn't cause problems. And indeed
it sends the AS 65040 prefixes to AS 65030 (as well as the other way
around), so in the next chapter we're going to add some filters to keep
that from happening.

However, our own prefix 192.0.2.0/24 is not advertised to this
neighbor. The reason for that is simple:

Router# show ip route 192.0.2.0/24
% Network not in table

So our own prefix is not in our router's IP routing table. In that
situation, the logic is that if the router itself doesn't know where to
send packets for this prefix, how can it advertise this prefix to the
rest of the world? We can fix this using a static route:

Example 2: a static route to enable prefix origination


!
ip route 192.0.2.0 255.255.255.0 Null0 250
!

The Null0 interface is a special interface that makes packets forwarded
to it disappear. The effect of this static route is that packets
towards 192.0.2.x that don't match a more specific route are dropped.
This has an added benefit if parts of the prefix aren't in use and
there's a default route: without the Null0 route, packets for unused
addresses would follow the default route back to the ISP, which would
send them to us again, and they would keep ping-ponging back and forth
until their time to live reaches zero.

The 250 is the priority of the static route (its administrative
distance). Any other routes for that same prefix with a lower priority
value will override the Null0 route. With this route in effect, the
router advertises the prefix to its neighbors:

Router# show ip bgp 192.0.2.0/24


BGP routing table entry for 192.0.2.0/24, version 6
Paths: (1 available, best #1, table Default-IP-Routing-Table)
Advertised to non peer-group peers:
192.0.2.21 192.0.2.41
Local
0.0.0.0 from 0.0.0.0 (192.0.2.255)
Origin IGP, metric 0, localpref 100, weight 32768, valid,
sourced, local, best
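To make the role of the 250 in the static route a bit more concrete,
here is a minimal Python sketch of how a router chooses between routes
for the same prefix: the lowest value wins. The OSPF route, its distance
of 110 and its next hop 10.0.0.1 are assumptions for the sake of
illustration; they are not part of the example configurations.

```python
# Sketch: among routes for the same prefix, the lowest priority value
# (administrative distance) wins. The OSPF entry is hypothetical.

def best_route(candidates):
    """Return the candidate with the lowest distance, or None if empty."""
    if not candidates:
        return None
    return min(candidates, key=lambda route: route["distance"])

null_route = {"source": "static", "next_hop": "Null0", "distance": 250}
ospf_route = {"source": "ospf", "next_hop": "10.0.0.1", "distance": 110}

# With only the static route, packets for the prefix go to Null0.
only_static = best_route([null_route])

# When another protocol also has a route for the prefix, it overrides Null0.
with_ospf = best_route([null_route, ospf_route])

print(only_static["next_hop"], with_ospf["next_hop"])
```

Using a high value such as 250 means the Null0 route only acts as a
fallback, so it doesn't get in the way of real routes for the prefix.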

Of course no BGP configuration is complete without some IPv6. Example 3
below is the IPv6 equivalent of examples 1 and 2, except that we’re
only configuring an IPv6 BGP session towards ISP 30 and not ISP 40.

Example 3: The IPv6 version of examples 1 and 2
!
router bgp 65082
neighbor 2001:db8:30:8201::1 remote-as 65030
neighbor 2001:db8:30:8201::1 description ISP 30
no neighbor 2001:db8:30:8201::1 activate
!
address-family ipv6
network 2001:db8:82::/48
neighbor 2001:db8:30:8201::1 activate
exit-address-family
!
ipv6 route 2001:db8:82::/48 Null0 250
!

The obvious difference is that the neighbor address is an IPv6 address.
However, even for IPv6 neighbors, the assumption is that we’re going
to exchange IPv4 prefixes, not IPv6 prefixes. So what we do in the first
part of the configuration is disable IPv4 for this BGP session with the
no neighbor ... activate command.

Next, we tell the router we want to configure parameters related to
address family IPv6. (We could have said address-family ipv6
unicast to be more precise.) Here we can specify the IPv6 prefix(es)
we want to originate, 2001:db8:82::/48 in this case. Last but not
least, we activate our neighbor for the IPv6 unicast address family.
And we add the static route to the Null0 interface. To see the results,
note that the command is slightly reordered: show bgp ipv6 ... rather
than show ip bgp ...:

Router# show bgp ipv6 unicast summary
IPv6 Unicast Summary (VRF default):
BGP router identifier 192.0.2.251, local AS number 65082 vrf-id 0
BGP table version 2
RIB entries 3, using 576 bytes of memory
Peers 1, using 716 KiB of memory

Neighbor AS Up/Down State/PfxRcd PfxSnt Desc


2001:db8:30:8201::1 65030 00:03:11 1 2 ISP 30

And:

Router# show bgp ipv6 unicast
BGP table version is 2, local router ID is 192.0.2.251, vrf id 0
Default local pref 100, local AS 65082
Status codes: s suppressed, d damped, h history, * valid, > best,
= multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop Metric LocPrf Weight Path


*> 2001:db8:30::/44 fe80::42:acff:fe11:4
0 0 65030 i
*> 2001:db8:82::/48 :: 0 32768 i

Displayed 2 routes and 2 total paths

The fe80:: next hop address is an IPv6 link local address. All IPv6
routing protocols are required to use link local next hop addresses, as
these are needed so routers can send ICMPv6 redirect [RFC 4861]
messages when necessary. However, for iBGP to work properly, regular
global unicast next hop addresses are required. Such an address is also
present, as we can see if we further inspect the prefix in question:

Router# show bgp ipv6 unicast 2001:db8:30::/44


BGP routing table entry for 2001:db8:30::/44, version 2
Paths: (1 available, best #1, table default)
Advertised to non peer-group peers:
2001:db8:30:8201::1
65030
2001:db8:30:8201::1 from 2001:db8:30:8201::1 (198.51.100.223)
(fe80::42:acff:fe11:4) (used)
Origin IGP, metric 0, valid, external, best (First path
received)

Filtering BGP

In the previous chapter, we set up a very simple BGP configuration
towards two ISPs in example 2. That configuration will actually work,
but it has a big problem, which we’ll see if we look at a prefix that’s
originated by ISP 30:

Router# show ip bgp 10.0.30.0/23
BGP routing table entry for 10.0.30.0/23
Paths: (1 available, best #1, table Default-IP-Routing-Table)
Advertised to non peer-group peers:
192.0.2.41
65030
192.0.2.21 from 192.0.2.21 (198.51.100.223)
Origin IGP, metric 0, localpref 100, valid, external, best

So the router is advertising a prefix received from ISP 30 to ISP 40. This
means that ISP 40 may start sending traffic towards ISP 30 through our
AS. That's certainly not something we want—after all, we pay our ISPs
so they handle our traffic, not the other way around!

Also, we probably don't have enough bandwidth to handle this traffic,
and there may be filters in place that block the traffic. As a result,
it's unlikely that this undesired path will actually work. So what we
need to do is set up filters to make sure only our own prefix(es) are
announced to neighboring ASes. There are three ways to do this type of
filtering: with an AS path filter, with a prefix filter, or with
filters based on BGP communities.

In small/simple networks, it’s highly recommended to apply both AS
path and prefix filters. Each by themselves is enough to make sure we
only announce the right prefix(es), but it's a very good idea to have
the extra safety that a second filter provides. Accidentally announcing
the wrong prefixes at best won't enhance your standing in the BGP
community and can lead to significant outages if certain safety
features are triggered. If things get really bad, your name, or at
least your AS number, will live in infamy. The section Some scary
stories in the BGP security chapter has some examples of such
incidents.

In larger and/or more complex networks, AS path filters and prefix
filters aren’t always appropriate. In those cases, it's better to filter using
BGP communities. This type of filtering is more complex to set up, but
after that, it’s easier to manage when there are changes.

AS path filters
AS path filters use regular expressions (regex or regexp [W]) to allow
or deny routes based on what's in the AS path. A regex is a pattern that
will match or not match a line of text. For our purposes, that line of
text is the textual representation of the AS path. Regular expressions
used by Cisco routers (and other routers with a similar command line
interface, such as Quagga and FRRouting) are more limited than regular
expressions found elsewhere. This is the syntax that is supported:

.    any character
[]   enclose a choice/range of characters, such as [2345] or [2-5]
()   enclose a string of characters
+    the preceding character or ()-enclosed string must occur one or
     more times
*    the preceding character or ()-enclosed string may occur zero or
     more times
^    start of the line
$    end of the line
_    a comma, left brace ({), right brace (}), a space or the start or
     end of the line

With these, we can make filters such as the following:

650        Any AS path with the sequence 650 in it. That includes
           1000 650 2000 but also 1000 65082 2000.
_650_      Any AS path with AS 650 in it. That includes 1000 650 2000
           and just 650 but not 1000 65082 2000.
^650_      Any AS path that starts with AS 650, including just 650.
_650$      Any AS path that ends with AS 650, including just 650.
^(650_)+$  Any AS path that only contains AS 650 one or more times.
^(650_)*$  An empty AS path or AS paths that only contain AS 650 one
           or more times.
^$         An empty AS path.
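If you'd like to experiment with these patterns without a router at
hand, the following Python sketch may help. It assumes that the
underscore can be emulated by expanding it to exactly the characters
listed in the syntax table above; real router regex engines may differ
in the details, so treat this as an approximation. The function name
as_path_matches is made up for this sketch.

```python
import re

# Assumption: "_" expands to exactly the characters listed in the syntax
# table (comma, braces, space) or the start/end of the line.
UNDERSCORE = r"(?:[,{} ]|^|$)"

def as_path_matches(pattern, as_path):
    """Check a router-style AS path regex against the text of an AS path."""
    translated = pattern.replace("_", UNDERSCORE)
    return re.search(translated, as_path) is not None

# Try the simpler patterns from the table against a few AS paths.
paths = ["1000 650 2000", "1000 65082 2000", "650", "650 1000", ""]
for pattern in ["650", "_650_", "^650_", "_650$", "^$"]:
    matching = [p for p in paths if as_path_matches(pattern, p)]
    print(f"{pattern:6} matches {matching}")
```

The main thing to take away is the difference between 650 and _650_:
without the underscores, the pattern also matches the digits 650 inside
a longer AS number such as 65082.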

What we want to accomplish is that we only announce prefixes with our
own AS in the AS path to our BGP neighbors. So the obvious regular
expression to accomplish that would seem to be ^65082$. However, our
own AS number isn’t added to the path until after filters have been
applied, so when the filter sees the AS path it’s still empty. So that
would mean ^$.

There is a final caveat: as we’ll see in the chapter Traffic
engineering, it is sometimes useful to artificially lengthen the AS
path by “prepending” it with one or more additional instances of our
own AS number. So the best way to do AS path filtering for a simple BGP
setup is with a regular expression that allows the local AS zero or
more times. This is what example 4 does.

Example 4. Filtering outgoing BGP updates based on the AS path


!
router bgp 65082
network 192.0.2.0/24
neighbor 192.0.2.21 remote-as 65030
neighbor 192.0.2.21 description ISP 30
neighbor 192.0.2.21 filter-list 2 out
neighbor 192.0.2.41 remote-as 65040
neighbor 192.0.2.41 description ISP 40
neighbor 192.0.2.41 filter-list 2 out
!
bgp as-path access-list 2 permit ^(_65082)*$
!
There seems to be an issue in some FRRouting builds where
^(65082_)*$ doesn't work for a prepended path, but
^(_65082)*$ does. So put the underscore at the beginning
with FRR when in doubt.

Traditionally, AS path access lists are numbered, but today, in nearly
all cases a name can be used, too. AS path access list 2 permits
^(_65082)*$, the regular expression that allows our AS path through,
even if there are prepends present. After this single line, the
implicit deny comes into play. In router filters, the rule is that if
something isn't allowed, it’s denied. So all AS paths that don't match
^(_65082)*$ are implicitly denied. We apply the AS-path access-list to
both BGP neighbors with the neighbor ... filter-list 2 out
configuration command. We could of course have made two different
filters for the two neighbors, but that wasn't necessary. We can also
filter incoming BGP updates with an AS-path access-list by configuring
filter-list ... in for a neighbor.
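The way an AS path access list is evaluated, including the implicit
deny, can be sketched in a few lines of Python. This is only an
illustration: the expansion of the underscore is an assumption based on
the syntax table earlier in this chapter, and filter_list_permits is a
made-up name, not a router command.

```python
import re

# Assumed expansion of "_": comma, braces, space, or start/end of line.
UNDERSCORE = r"(?:[,{} ]|^|$)"

def filter_list_permits(entries, as_path):
    """First matching entry decides; no match at all means implicit deny."""
    for action, pattern in entries:
        if re.search(pattern.replace("_", UNDERSCORE), as_path):
            return action == "permit"
    return False

# bgp as-path access-list 2 permit ^(_65082)*$
access_list_2 = [("permit", "^(_65082)*$")]

# A locally originated prefix: the AS path is still empty at this point.
locally_originated = filter_list_permits(access_list_2, "")
# A path that consists only of instances of our own AS number.
prepended = filter_list_permits(access_list_2, "65082 65082")
# A prefix learned from ISP 30 falls through to the implicit deny.
learned_from_isp = filter_list_permits(access_list_2, "65030 65020")

print(locally_originated, prepended, learned_from_isp)
```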

The i, e or ? at the end of the AS path in the output of the show
ip bgp command represents the origin attribute. The origin is
not part of the AS path, so you can’t filter on it with AS path
filter regular expressions.

Unfortunately, if we change the config “live” on the router, by
starting up with the configuration from example 2 and then pasting the
configuration from example 4 on top of that, we may be in for a nasty
surprise. When we check which prefixes we are sending after this
configuration change with the show ip bgp neighbors ...
advertised-routes command, on many BGP implementations we'll still
see prefixes from ISP 30 being advertised to ISP 40:

Router# show ip bgp neighbors 192.0.2.41 advertised-routes


BGP table version is 0, local router ID is 192.0.2.251
Status codes: s suppressed, d damped, h history, * valid, > best, =
multipath,
i internal, r RIB-failure, S Stale, R Removed
Origin codes: i - IGP, e - EGP, ? - incomplete

Network Next Hop Metric Weight Path


*> 10.0.10.0/23 192.0.2.42 0 65030 65020 65010 i
*> 10.0.20.0/22 192.0.2.42 0 65030 65020 i
*> 192.0.2.0 192.0.2.42 0 32768 i

Total number of prefixes 3

The reason for this is that the new filter only applies to
advertisements that happen after the filter is created or changed. So
in order to see the effect of the new filter, we need to make the
router send all of its prefixes to its neighbor. Normally, this only
happens when a BGP session is established. So disconnecting the BGP TCP
session and then waiting for it to be reestablished will make sure the
filter takes effect. The command to do this is clear ip bgp <neighbor
address> to reset the BGP session towards the neighbor with that
address, clear ip bgp <neighbor AS> to reset the BGP sessions towards
all neighbors with that AS number, or clear ip bgp * to reset all BGP
sessions.

But such a hard reset is a very crude way to apply new or modified
filters. Depending on many factors, this may lead to a noticeable
interruption of your network's reachability. If there are multiple
resets in a short time and remote networks implement BGP flap damping
(see the section on flap damping later in the book), then the
unreachability towards those networks may persist for 30 minutes.
However, flap damping isn't widely used anymore.

In any event, it's much better to do a soft clear using the route
refresh mechanism. With this, the router simply goes through its entire
BGP table and sends updates as allowed by the outgoing filters that are
currently in effect. The neighbor will apply its current incoming
filters. We can do this with the clear ip bgp ... out command.

It’s also possible to perform a route refresh in the other direction,
by asking the neighbor to send over all of its prefixes again, thereby
applying the neighbor's outgoing filters and our incoming filters. This
is done with the clear ip bgp ... in command, as long as both we
and the neighboring router support the route refresh capability [RFC
2918]. So:

Router# clear ip bgp 192.0.2.41 out


Router# show ip bgp neighbors 192.0.2.41 advertised-routes
BGP table version is 0, local router ID is 192.0.2.251

Network Next Hop Metric LocPrf Weight Path


*> 192.0.2.0/24 0.0.0.0 0 32768 i

Here we have the desired result: only our own prefix 192.0.2.0/24 is
advertised to the neighboring AS. The same AS path filters can be
applied to the IPv6 as well as the IPv4 address family.

Prefix filters
Another way to make sure only our own prefix(es) are advertised is
with a prefix filter. It’s highly recommended to have both AS path and
prefix filters, as this way an issue with one of the filters doesn’t
immediately let incorrect prefixes escape your network. This is
especially relevant on Cisco and Cisco-like platforms, where
configuration changes apply immediately after each line is entered.
This means that while modifying a filter, there is a short time when
just part of the filter has been entered. And yes, prefixes have been
known to escape in the fraction of a second it took for the whole
filter to be pasted from a pre-prepared text file to the router's
command line.

Prefix filters are also helpful in making sure that if another AS
advertises (part of) our own address space to us, we don't listen to
that. Otherwise, we may end up sending traffic to our own addresses out
of the network, where other people can take a look at it or impersonate
our servers. So we want a prefix list that allows our own prefix out:
!
ip prefix-list out-prefixes permit 192.0.2.0/24
!

And another prefix list that blocks our own prefix in the incoming
direction, but lets everything else through:
!
ip prefix-list in-prefixes deny 192.0.2.0/24
ip prefix-list in-prefixes permit any
!

In example 5 we have these prefix lists in place, and a remote network
announces the prefix 192.0.2.13/32, the address of our mail server. As
per the longest match first rule, our routers would then send packets
for the mail server's address to AS 65030. As we'll see in the chapter
on BGP security, it’s difficult to completely stop such unauthorized
advertisements elsewhere, but we can at least reject them at our
borders so they don't impact our internal network.
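The longest match first rule is easy to try out with Python's ipaddress
module. In the sketch below, the small routing table and the function
name longest_match are assumptions for illustration: the /24 stands for
our own prefix with its Null0 route and the /32 for the unauthorized
announcement learned from the neighbor at 192.0.2.21.

```python
import ipaddress

def longest_match(routes, destination):
    """Pick the next hop of the most specific route for a destination."""
    addr = ipaddress.ip_address(destination)
    candidates = []
    for prefix, next_hop in routes.items():
        network = ipaddress.ip_network(prefix)
        if addr in network:
            candidates.append((network.prefixlen, next_hop))
    if not candidates:
        return None  # no route at all for this destination
    # The route with the longest prefix length wins.
    return max(candidates)[1]

# Hypothetical routing table: our own /24 plus the hijacked /32.
routes = {
    "192.0.2.0/24": "Null0",
    "192.0.2.13/32": "192.0.2.21",
}

mail_server = longest_match(routes, "192.0.2.13")
other_host = longest_match(routes, "192.0.2.14")
print(mail_server, other_host)
```

Only the mail server's address is affected: every other address in the
/24 still follows the less specific route.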

Example 5. Someone announces part of our address space to us
!
router bgp 65082
network 192.0.2.0/24
neighbor 192.0.2.21 remote-as 65030
neighbor 192.0.2.21 description ISP 30
neighbor 192.0.2.21 prefix-list in-prefixes in
neighbor 192.0.2.21 prefix-list out-prefixes out
neighbor 192.0.2.41 remote-as 65040
neighbor 192.0.2.41 description ISP 40
!
ip prefix-list out-prefixes seq 5 permit 192.0.2.0/24
ip prefix-list in-prefixes seq 5 deny 192.0.2.0/24
ip prefix-list in-prefixes seq 10 permit any
!

Unfortunately, our prefix filter doesn't block the nefarious
advertisement:

Router# show ip bgp


BGP table version is 6, local router ID is 192.0.2.251, vrf id 0

Network Next Hop Metric Weight Path


*> 192.0.2.0/24 0.0.0.0 0 32768 i
*> 192.0.2.13/32 192.0.2.21 0 65030 65020 i

The reason the /32 gets through is that the prefix list only blocks the
exact 192.0.2.0/24 block, but not any more specifics (sub-prefixes).
Fortunately, we don't have to list all possible longer prefixes, but we
can filter sub-prefixes based on prefix length instead. We can do that
with the le (less or equal) and/or ge (greater or equal) arguments. For
instance:

!
ip prefix-list test permit 192.0.2.0/24 ge 26 le 28
!

This will match sub-prefixes of 192.0.2.0/24 with a prefix length of
/26, /27 or /28. Example 6 has an updated in-prefixes prefix list that
will catch sub-prefixes.
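The matching rules for a single prefix list entry can be sketched in
Python as follows. This is an approximation under stated assumptions: an
entry without ge or le only matches the exact prefix, le on its own
allows lengths from the entry's own length up to le, and ge on its own
allows lengths from ge up to the maximum. The function name
entry_matches is made up for this sketch.

```python
import ipaddress

def entry_matches(prefix, base, ge=None, le=None):
    """Sketch of matching one prefix list entry with optional ge/le."""
    net = ipaddress.ip_network(prefix)
    base_net = ipaddress.ip_network(base)
    # The candidate must fall inside the address range of the entry.
    if net.version != base_net.version or not net.subnet_of(base_net):
        return False
    if ge is None and le is None:
        return net.prefixlen == base_net.prefixlen  # exact match only
    shortest = ge if ge is not None else base_net.prefixlen
    longest = le if le is not None else net.max_prefixlen
    return shortest <= net.prefixlen <= longest

# ip prefix-list test permit 192.0.2.0/24 ge 26 le 28
for candidate in ["192.0.2.0/25", "192.0.2.64/26", "192.0.2.16/28",
                  "192.0.2.13/32", "198.51.100.0/26"]:
    print(candidate, entry_matches(candidate, "192.0.2.0/24", ge=26, le=28))
```

The exact match case also shows why the /32 in example 5 slipped
through: entry_matches("192.0.2.13/32", "192.0.2.0/24") is false, so the
deny rule never applied to it.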

Example 6. An IPv4 prefix filter that catches sub-prefixes


!
ip prefix-list in-prefixes deny 192.0.2.0/24 le 32
ip prefix-list in-prefixes permit 0.0.0.0/0 le 24
!

The le 32 part in the deny 192.0.2.0/24 le 32 rule means “less than or
equal to /32”. This rule matches all prefixes where the first 24 bits
are equal to 192.0.2.0/24 and the prefix length is /32 or shorter.
Obviously prefixes shorter than /24 can't match 192.0.2.0/24, so this
means prefix lengths between /24 and /32. Or in other words:
192.0.2.0/24 itself and all more specific prefixes of 192.0.2.0/24.

Should you prefer doing the same with 192.0.2.0/24 ge 24, you'll
find that the router doesn't like that:

Router(config)# ip prefix-list in-filter deny 192.0.2.0/24 ge 24


% Invalid prefix range for 192.0.2.0/24, make sure: len < ge-value
<= le-value

The first filter line denies what we don't want. If we end the filter
here, the “implicit deny” will kick in and the filter will not allow
any prefixes through. We could finish the filter with a permit any.
(Any is the same as 0.0.0.0/0 le 32.)

However, in this case we'll finish the filter with permit 0.0.0.0/0 le
24. This allows all prefixes as long as the prefix length is no longer
than /24. This is fairly common practice on the internet, with the result
that /24 is the longest prefix that we can expect to be accepted by all
ASes throughout the internet.
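Putting the two lines of example 6 together, the prefix list is
evaluated from top to bottom and the first matching line decides. Here
is a minimal Python sketch of that logic, simplified to entries that
only use le. The names entry_matches and prefix_list_permits are
assumptions made for this illustration.

```python
import ipaddress

def entry_matches(prefix, base, le):
    """True if prefix lies inside base and is no longer than /le."""
    net = ipaddress.ip_network(prefix)
    base_net = ipaddress.ip_network(base)
    return net.subnet_of(base_net) and net.prefixlen <= le

def prefix_list_permits(entries, prefix):
    """First matching line decides; no match means implicit deny."""
    for action, base, le in entries:
        if entry_matches(prefix, base, le):
            return action == "permit"
    return False

# The in-prefixes list from example 6.
in_prefixes = [
    ("deny", "192.0.2.0/24", 32),
    ("permit", "0.0.0.0/0", 24),
]

hijacked_host = prefix_list_permits(in_prefixes, "192.0.2.13/32")
regular_prefix = prefix_list_permits(in_prefixes, "10.0.30.0/23")
too_specific = prefix_list_permits(in_prefixes, "10.0.30.128/25")
print(hijacked_host, regular_prefix, too_specific)
```

The third case is rejected not by the deny line but by the implicit
deny, because a /25 is longer than the /24 limit in the permit line.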

You may have noticed seq 5 and seq 10 in front of the filter rules. If
we display the configuration with show running-config we’ll also see
those sequence numbers; the router adds these automatically. We can
then later add new filter rules between existing ones.

Example 7 has the IPv6 versions of the IPv4 prefix lists in example 6.

Example 7. IPv6 prefix filters
!
router bgp 65082
neighbor 2001:db8:30:8201::1 remote-as 65030
neighbor 2001:db8:30:8201::1 description ISP 30
no neighbor 2001:db8:30:8201::1 activate
!
address-family ipv6
network 2001:db8:82::/48
neighbor 2001:db8:30:8201::1 activate
neighbor 2001:db8:30:8201::1 prefix-list in-filter in
neighbor 2001:db8:30:8201::1 prefix-list out-filter out
exit-address-family
!
ipv6 prefix-list in-filter seq 5 deny 2001:db8:82::/48 le 128
ipv6 prefix-list in-filter seq 10 permit ::/0 le 48
ipv6 prefix-list out-filter seq 5 permit 2001:db8:82::/48
!

The main difference between IPv4 and IPv6 prefix-lists is that for IPv6,
the neighbor ... prefix-list ... commands go under the
address-family ipv6 section of the BGP configuration. Note that the
names of the IPv4 and IPv6 prefix lists are the same. That’s not a
problem for the router, but it could be somewhat confusing.

In the out-filter we again permit just our own prefix, and in the
in-filter we deny our own prefix up to the maximum /128 prefix
length. The second line of the in-filter prefix-list allows all IPv6
prefixes with a prefix length of /48 or less. That’s similar to the
IPv4 convention that prefixes up to /24 are accepted, although with
IPv6, the /48 practice is less universal; some networks accept longer
prefixes. The outgoing filter performs as expected:

Router# show bgp ipv6 unicast neighbors 2001:db8:30:8201::1
advertised-routes
BGP table version is 2, local router ID is 192.0.2.251, vrf id 0

Network Next Hop Metric LocPrf Weight Path


*> 2001:db8:82::/48 :: 0 32768 i

Total number of prefixes 1

Community-based filters
The combination of an AS path filter and prefix filters works well.
However, AS path filters become hard to manage in networks that
have their own BGP customers, especially as the number of routers
increases. The reason for this is that when you add a BGP customer,
you'll have to update your AS path filter to allow the customer's AS
number, and then configure the new filter on all your BGP routers. As
such, AS path filters are not recommended for networks that have BGP
customers.

The same is true for prefix filters. Those also become harder to manage
if you’re a fast growing network that adds new IP address ranges fairly
regularly. So prefix filters aren't recommended for networks that have
BGP customers or networks that deploy new IP address blocks regularly.

An alternative to AS path and/or prefix filters is filtering based on
the BGP community attribute. This requires a certain amount of setup
initially, but after that, adding new prefixes and customer ASes is
much easier, because this now only requires changing filters on the
routers where the new prefixes/ASes enter the network.

Communities are not part of the core BGP specification; they were
added in 1996 [RFC 1997]. Communities are 32-bit values that can be
attached to prefixes. The IANA keeps a list of well-known communities.
The most relevant well-known communities are:

• NO_EXPORT (0xFFFFFF01): routes that carry this community must not
  be advertised to external ASes

• NO_ADVERTISE (0xFFFFFF02): routes that carry this community must
  not be advertised to any BGP neighbor, internal or external

Routers will generally automatically apply the specified behavior.
Should you wish to override that, you’ll need to remove the community
in question.

In addition to using well-known communities, networks can define
their own and attach desired behaviors to them. The convention is that
in this case, the first 16 bits of the 32-bit community value are the
AS number of the AS that defines the behavior. The two 16-bit parts are
displayed as numbers with a colon in between. For instance, 64499:13
is a community defined by AS 64499. Communities can be attached to
prefixes and acted upon in the same AS, but it’s also possible for
communities to trigger actions in external ASes. We’ll do the former
here and discuss the latter in the chapter Traffic engineering.
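The relationship between the 32-bit value and the two numbers with a
colon in between is simple bit arithmetic. The following Python sketch
converts in both directions; the function names are made up for this
illustration.

```python
def community_to_text(value):
    """Show a 32-bit community as two 16-bit numbers with a colon."""
    return f"{value >> 16}:{value & 0xFFFF}"

def text_to_community(text):
    """Turn the AS:number notation back into the 32-bit value."""
    high, low = (int(part) for part in text.split(":"))
    if not (0 <= high <= 0xFFFF and 0 <= low <= 0xFFFF):
        raise ValueError("both parts must fit in 16 bits")
    return (high << 16) | low

# The well-known communities in the colon notation.
no_export = community_to_text(0xFFFFFF01)
no_advertise = community_to_text(0xFFFFFF02)

# The example community defined by AS 64499.
example_value = text_to_community("64499:13")

print(no_export, no_advertise, hex(example_value))
```

Because each part is limited to 16 bits, this notation only has room
for 16-bit AS numbers in the first half of the community.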

What we're going to do in example 8 is have the router that injects a
prefix into BGP attach a community to that prefix, and then only allow
prefixes that have this community to be announced to other ASes.

Example 8. Filtering BGP announcements with communities


!
router bgp 65082
network 192.0.2.0/24 route-map originate
network 203.0.113.0/24
neighbor 192.0.2.21 remote-as 65030
neighbor 192.0.2.21 description ISP 30
neighbor 192.0.2.21 route-map in-rmap in
neighbor 192.0.2.21 route-map out-rmap out
neighbor 2001:db8:30:8201::1 remote-as 65030
neighbor 2001:db8:30:8201::1 description ISP 30
no neighbor 2001:db8:30:8201::1 activate
!
address-family ipv6
network 2001:db8:82::/48 route-map originate
neighbor 2001:db8:30:8201::1 activate
neighbor 2001:db8:30:8201::1 route-map in-rmap in
neighbor 2001:db8:30:8201::1 route-map out-rmap out
exit-address-family
!
bgp community-list standard localprefixes permit 65082:1
!
route-map originate permit 10
set community 65082:1
!
route-map in-rmap permit 10
set comm-list localprefixes delete
!
route-map out-rmap permit 10
match community localprefixes
set comm-list localprefixes delete
!

Right after router bgp 65082 there are two network statements. Our
usual suspect 192.0.2.0/24 is now followed by route-map originate.
There is also a second prefix 203.0.113.0/24 without the route map, so
we can easily see how the two prefixes are handled differently in a
moment. The originate route map shows up later in the configuration,
where it adds the community 65082:1 to any prefixes that have this
route map applied to them.

The IPv4 BGP session as well as the IPv6 one both have two other route
maps applied: in-rmap for incoming BGP updates and out-rmap for
outgoing BGP updates. The out-rmap matches prefixes that have the
community 65082:1 through the community list localprefixes. If that
community is present, we use the localprefixes community list to remove
the 65082:1 community, to avoid flooding the internet with irrelevant
communities.

Because the route map had a match, the prefix is allowed through.
Prefixes without the 65082:1 community reach the end of the route map,
where they’re subject to the implicit deny rule, so these prefixes are
not allowed through. These route maps work the same on IPv4 and IPv6
prefixes.

The reason we also have the in-rmap route map is to make sure that if
a BGP neighbor sends us prefixes with community 65082:1 on them, that
community is stripped off: the set comm-list localprefixes delete
line removes all communities from a prefix that match the localprefixes
community list. Communities that don’t match the community list are
left in place. Without this, any prefixes that we receive from external
networks would be announced to the rest of the world if that community
happens to be present. It is unlikely that this would happen by
accident, but someone could attach the community out of malicious
intent.
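The interplay between the two route maps can be modeled in a few lines
of Python. This is a simplified sketch, not how a router implements
route maps: a prefix is represented only by its set of communities, and
the community 65030:100 is a made-up value standing in for whatever
else a neighbor might attach.

```python
LOCAL = "65082:1"  # the community in the localprefixes community list

def in_rmap(communities):
    """in-rmap: accept everything, but strip our own community."""
    return communities - {LOCAL}

def out_rmap(communities):
    """out-rmap: permit only prefixes carrying LOCAL, then strip it.

    Returns None when the prefix hits the implicit deny.
    """
    if LOCAL not in communities:
        return None
    return communities - {LOCAL}

# A prefix we originate with the originate route map applied.
ours = out_rmap({LOCAL})

# A prefix originated without the route map, like 203.0.113.0/24.
untagged = out_rmap(set())

# A neighbor maliciously attaches our community; in-rmap removes it,
# so the prefix is not announced onward by out-rmap.
received = in_rmap({LOCAL, "65030:100"})
leaked = out_rmap(received)

print(ours, untagged, received, leaked)
```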

Let’s have a look to see if everything works as intended:

Router# show ip bgp 192.0.2.0


BGP routing table entry for 192.0.2.0/24, version 2
Paths: (1 available, best #1, table default)
Advertised to non peer-group peers:
192.0.2.21
Local
0.0.0.0 from 0.0.0.0 (192.0.2.251)
Origin IGP, metric 0, weight 32768, valid, sourced, local,
best (First path received)
Community: 65082:1

Router# show ip bgp 203.0.113.0


BGP routing table entry for 203.0.113.0/24, version 1
Paths: (1 available, best #1, table default)
Not advertised to any peer
Local
0.0.0.0 from 0.0.0.0 (192.0.2.251)
Origin IGP, metric 0, weight 32768, valid, sourced, local,
best (First path received)

Router# show bgp ipv6 unicast 2001:db8:82::/48


BGP routing table entry for 2001:db8:82::/48, version 1
Paths: (1 available, best #1, table default)
Advertised to non peer-group peers:
2001:db8:30:8201::1
Local
:: from :: (192.0.2.251)
Origin IGP, metric 0, weight 32768, valid, sourced, local,
best (First path received)
Community: 65082:1

We got the intended result: prefixes 192.0.2.0/24 and
2001:db8:82::/48 have community 65082:1 and are advertised over
the IPv4 and IPv6 BGP session, respectively, while 203.0.113.0/24
doesn’t have the community and isn’t advertised.

With this way of filtering, connecting a new customer with their own
AS and/or IP prefix(es) to the network doesn't require updating the
outgoing AS path and prefix filters on all routers.

In networks with more than a handful of routers, updating the
configurations on all of them manually is too time consuming and
quickly leads to out-of-sync filters, which invariably leads to time
consuming troubleshooting later. In a network that can automatically
deploy configuration changes, this is less of an issue, but even then,
updating all router configurations is not something you'd want to do
unless you really have to.

With community-based filtering of BGP announcements, the route maps
and community lists never have to change. In our example, it’s
enough to add a network ... route-map originate statement to
two routers (in case one of them fails) and all routers, be it two,
five or 500, will announce the new prefix. Things get a bit more
complex when adding BGP customers to the network, but the main
advantage of community-based filtering remains: no need to touch all
the routers in the network.

Consistency between filters


The point of having multiple filters is to make sure nothing bad
happens if there's a mistake in one of them. So it's useful to see if
their behavior is consistent. Example 9 adds the example 4 AS path
filters and the example 5 and 7 prefix filters to the configuration
from example 8.

Example 9. AS path and prefix filters in addition to community filtering


!
router bgp 65082
neighbor 192.0.2.21 remote-as 65030
neighbor 192.0.2.21 prefix-list in-prefixes in
neighbor 192.0.2.21 prefix-list out-prefixes out
neighbor 192.0.2.21 filter-list 2 out
!
ip prefix-list in-prefixes deny 192.0.2.0/24 le 32
ip prefix-list in-prefixes permit 0.0.0.0/0 le 24
ip prefix-list out-prefixes permit 192.0.2.0/24
!
bgp as-path access-list 2 permit ^(65082_)*$
!

We can then use the show ip bgp community-list localprefixes
command (or the show ip bgp community 65082:1 command) to
see which prefixes have this community attached:
Router# show ip bgp community-list localprefixes
BGP table version is 6, local router ID is 192.0.2.251, vrf id 0

Network Next Hop Metric LocPrf Weight Path


*> 192.0.2.0/24 0.0.0.0 0 32768 i

Displayed 1 routes and 6 total paths

And compare that to the output of show ip bgp filter-list 2 to
see if the AS path filter matches the same prefixes:

Router# show ip bgp filter-list 2


BGP table version is 6, local router ID is 192.0.2.251, vrf id 0

Network Next Hop Metric LocPrf Weight Path


*> 192.0.2.0/24 0.0.0.0 0 32768 i
*> 203.0.113.0/24 0.0.0.0 0 32768 i

Displayed 2 routes and 6 total paths

And show ip bgp prefix-list out-prefixes to see what matches
the prefix list:

Router# show ip bgp prefix-list out-prefixes


BGP table version is 6, local router ID is 192.0.2.251, vrf id 0

Network Next Hop Metric LocPrf Weight Path


*> 192.0.2.0/24 0.0.0.0 0 32768 i

Displayed 1 routes and 6 total paths

This quickly reveals that we didn't put the route map originate on
the 203.0.113.0/24 prefix and didn't add the 203 prefix to the prefix
list, so the community filter and the prefix list lead to the same result.
The AS path filter can't differentiate between 192.0.2.0/24 and
203.0.113.0/24, as they are both originated locally and thus have the
same AS path.
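
The comparison above can also be done mechanically. The following is a minimal Python sketch, not a router feature: it assumes the prefixes matched by each filter have already been copied from the three show commands above into sets, and the function name inconsistencies is made up for this illustration.

```python
# Prefixes matched by each filter, copied by hand from the show command
# output above. The dictionary keys are just labels for this sketch.
matches = {
    "community-list localprefixes": {"192.0.2.0/24"},
    "filter-list 2": {"192.0.2.0/24", "203.0.113.0/24"},
    "prefix-list out-prefixes": {"192.0.2.0/24"},
}


def inconsistencies(matches):
    """Return, per filter, the prefixes it matches that not all filters match."""
    common = set.intersection(*matches.values())
    return {
        name: sorted(prefixes - common)
        for name, prefixes in matches.items()
        if prefixes - common
    }


report = inconsistencies(matches)
for name, extra in report.items():
    print(f"{name} additionally matches: {', '.join(extra)}")
```

An empty report would mean that all three filters let through exactly the same prefixes.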

In the real world, it wouldn't make sense to use three different filters
together. A useful approach would be:

1. When first deploying BGP, use prefix filters + AS path filters.

2. As the network grows and/or there are frequent changes, such
as adding new prefixes, replace the prefix filters with community
filters. Keep the AS path filters.

3. When adding the third or so BGP customer, drop the AS path
filters after a round of rigorous consistency checking and just use
community filters.

And it may prove useful to have one or more filters that aren't actually
applied to BGP sessions, but can be used for consistency checking with
the commands discussed above.

Transit and peering

So far, we’ve mostly assumed that you buy service from an ISP and
that smaller ISPs buy service from bigger ISPs. This service, where the
ISP provides connectivity to all destinations, is called “transit”.

Of course, in a model of the internet based on smaller networks buying
transit service from larger networks, there must be one “apex ISP”
sitting at the top of the network hierarchy. Around 1990, that was pretty
much how the internet worked, with the US National Science
Foundation’s NSFNET [W] long-distance network through the United States
functioning as the internet’s “backbone”. Everyone simply connected
to the NSFNET Backbone and was thus connected to the rest of the
internet. Nice and simple.

However, even in the days of the NSFNET Backbone, some regional
networks found it useful to connect directly to each other, bypassing the
NSFNET Backbone. So traffic from regional network A to regional
network B would go over a direct line between A and B. We call this
“peering”.

With regional networks A and B peering with each other, traffic from A
to any other destination than B and traffic from B to any other
destination than A would still go over the NSFNET Backbone.

By the mid-1990s, commercial ISPs in the US could no longer use the
NSFNET Backbone. This meant they had to operate their own
backbone networks to get traffic from one part of the United States to
another, and they had to set up and maintain peering relationships with
other commercial ISPs to get traffic from a customer of one commercial
ISP to a customer of another commercial ISP.

Internet exchanges
The obvious way to peer with another network is for the two peers to
install a direct connection between them. This is known as a private
network interconnect (PNI) or simply “private peering”. To keep costs
to a minimum, the preferred way to do this is with an in-house
connection within the same facility (data center) where the two networks are
present.

However, even if a fiber connection (or possibly a UTP cable) is cheap,
it’s still not efficient to maintain a very large number of PNIs to many
other networks, because each PNI requires its own port on a router or
a switch, and setting up new peerings is a lot of work.

Another way of peering is “public peering” through an internet
exchange (IX). An internet exchange is basically just a big (Ethernet)
switch. All members or customers of the internet exchange connect one
or more routers to the internet exchange switch, and can then
exchange traffic with all other members of the exchange.

However, the internet exchange only provides layer-2 (Ethernet)
connectivity; connected networks still need to agree to peer with each
other and then set up a BGP session before traffic will flow between them.
This is “bilateral peering”: the BGP sessions are directly between each
pair of peers.

Alternatively, if this is compatible with their “peering policy”,
networks may opt to peer through the route servers that most internet
exchange operators run. (Usually two, for redundancy.)

For historical reasons, PNI is used more in the US while internet
exchanges are used more in the rest of the world, especially in Europe. In
the US, in the early/mid 1990s, the bandwidth required for peering
between the larger networks required using more complex and more
expensive technologies such as FDDI and ATM, which made
interconnecting directly more practical. In Europe, internet exchanges were
able to use Ethernet as Ethernet speed increases kept arriving just in
time as bandwidth requirements grew. However, PNI is certainly not
uncommon in Europe and elsewhere, and internet exchanges are also
available throughout the world, including the US [W].

The business of peering: peering policies
Smaller networks are generally happy to peer with anyone, as any
traffic that is handled by peering doesn’t use more costly transit capacity.
They thus have an “open” peering policy.

For larger networks, there are several reasons they may not want to
peer with smaller networks. For instance, for a network that operates
in a single country, it’s a very good deal to peer with a large
international network, as they basically get to use the large network’s
long distance network for free. For the large network, this is not a very
good deal, and they would rather see the small network buy transit
service from them. Peering with very many small networks in many
locations is also simply a lot of work for relatively little benefit,
especially if the large network already peers with the transit providers of
the small networks.

So larger networks usually have a “selective” or even “closed” peering
policy. Typically, a peering policy contains a set of requirements, and
any network that fulfills those requirements will be eligible for
peering. Common requirements are:

• A 24/7 reachable network operations center (NOC)

• A sufficiently wide geographic footprint

• Certain minimum traffic levels

• Roughly equal amounts of incoming and outgoing traffic

Balanced incoming and outgoing traffic is important because of “hot
potato” routing.

Hot potato routing


Large networks peer in many locations. This means that BGP not only
has to decide to which peer to send traffic, but also over which
connection to the selected peer. In the chapter Traffic engineering we'll look in
more detail at BGP’s path selection algorithm, but the short version is
that, all else being equal, BGP will apply “early exit” routing, also
called “hot potato” routing. In other words: BGP will use the external
connection to the AS in question that can be reached over the shortest
distance through the internal network.

This makes sense because a router in network A knows the distance or
cost towards the interconnect locations with network B by consulting
A’s internal routing protocol. The router in network A doesn’t have
access to information from B’s internal routing protocol, so it can't
compute the distance or cost for the entire path that spans networks A
and B. As such, quickly handing off the packet to network B is the only
reasonable choice.

As a result of BGP's early exit behavior, the receiving network has to
carry packets most of the way. For example, suppose a customer of ISP
A in London connects to a web server connected to ISP B in New York.
We'll assume networks A and B peer in both London and New York.

So the user in London clicks on the URL, generating a request that's
sent to ISP A. A immediately hands over the packet to B, which has to
carry it across the Atlantic to New York. The server then sends a
response, which ISP B hands over to ISP A in New York, so A now has to
carry the response back across the ocean.
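
The London / New York example can be sketched in a few lines of Python. This is only an illustration of the early exit idea under simplifying assumptions: the cost values are invented, and the function name early_exit is hypothetical. Each network looks only at its own internal cost towards each interconnect location.

```python
def early_exit(igp_cost_by_interconnect):
    """Return the interconnect location with the lowest internal (IGP) cost."""
    return min(igp_cost_by_interconnect, key=igp_cost_by_interconnect.get)


# Hypothetical IGP costs inside network A, seen from A's router in London.
a_costs_from_london = {"London": 1, "New York": 100}
# Hypothetical IGP costs inside network B, seen from B's router in New York.
b_costs_from_new_york = {"London": 100, "New York": 1}

# The request enters A in London; A picks the nearest exit towards B.
request_handoff = early_exit(a_costs_from_london)
# The response enters B in New York; B picks the nearest exit towards A.
response_handoff = early_exit(b_costs_from_new_york)

print(f"Request: A hands off to B in {request_handoff}")
print(f"Response: B hands off to A in {response_handoff}")
```

In both directions the sending network exits as early as possible, so the receiving network carries the packet across the Atlantic.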

Both networks have to carry traffic long distance, but as web servers
generate much more traffic than web users, ISP A needs a lot more
transatlantic capacity than network B. This means that without further
measures, content providers basically get a free ride because the
networks that mostly connect consumers end up handling far more long
distance traffic.

For this reason networks that have mostly “eyeballs” require that
incoming and outgoing traffic must be balanced. (Eyeballs as in users
that look at websites and videos and thus generate incoming traffic but
much less outgoing traffic.)

Valley-freeness
In the 1990s, it became clear that in some situations, BGP never
converges to a stable state. In this context, stable means that no AS wants
to make any changes. See Appendix: non-converging configurations
for an example where BGP never converges at all. A more realistic
example is that of “BGP wedgies” [RFC 4264], where a backup
configuration may get stuck in the backup state even when the primary path
becomes available again. In that case, BGP does converge to a stable
state, just not to the intended one.

In 2001, the seminal paper Stable Internet routing without global
coordination by Lixin Gao and Jennifer Rexford showed that if an AS
observes a set of guidelines, that AS will see BGP converge to a
stable state.

The main takeaway from these guidelines is that prefixes learned from
a customer must have a higher local preference than prefixes learned
from non-customers. So under normal circumstances, an ISP always
sends traffic to a customer over the link to that customer. Exceptions
are possible for backup links, but those require extra attention.

Closely related to the Gao/Rexford guidelines that ensure BGP
convergence is the valley-free model. Figure 2-1 shows a hierarchy of
networks, with ISPs 1 and 2 providing transit service to ISPs 30, 40 and 50.
These in turn serve networks 600 and 700. The two big ISPs peer with
each other, and 40 peers with both 30 and 50. Figure 2-2 shows the most
obvious path between 600 and 700 through 30 and 40.

[Figure: five numbered diagrams (1 to 5) of the network hierarchy, with
ASes 1 and 2 at the top, 30, 40 and 50 in the middle, and 600 and 700 at
the bottom]

Figure 2. Network paths that are valley-free

However, options 3, 4 and 5 are also unproblematic, although the path
is longer than necessary. If we now look at figure 3-6, the link between
ASes 40 and 700 is removed. Options 7 to 10 show invalid paths, where
AS 40 is transporting traffic between ASes 600 and 700, either over
peering or over transit that AS 40 pays for, even though neither AS 600
nor AS 700 is paying for the privilege. So unless AS 40 is prepared to
give away service for free, these are invalid paths.

[Figure: five numbered diagrams (6 to 10) of the same network hierarchy,
with ASes 1 and 2 at the top, 30, 40 and 50 in the middle, and 600 and
700 at the bottom]

Figure 3. Network paths that are not valley-free

In a hierarchical network diagram, such invalid paths are easily
identified by a “valley” along the way. In order for a path to be valley-free, a
path may go up the hierarchy (from customer to provider) until it
reaches a peering link or starts to go down the hierarchy (from
provider to customer). After that peering link or provider to customer
link, the path may only go down the hierarchy through additional
provider to customer links.

Option 7 violates the valley-free property by having two peering links,
option 8 by having a customer-provider link after a peering link,
option 9 by having a peering link after a provider-customer link and
option 10 by having a customer-provider link after a provider-customer
link.

It’s important to understand that neither the part of the path left of
the valley nor the part right of it is invalid in and of itself; it’s the
combination of those two halves that makes the path invalid.
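
The valley-free rule lends itself to a small check. The sketch below is an illustration, not something a router runs: it assumes a path is described as a list of link types, with the labels "up" (customer to provider), "peer" and "down" (provider to customer) chosen for this example, and the function name is_valley_free is made up.

```python
def is_valley_free(links):
    """Check a path given as a sequence of link types.

    "up"   = customer to provider
    "peer" = peering link
    "down" = provider to customer

    Valid shape: zero or more "up", at most one "peer", zero or more "down".
    """
    descending = False  # True once we've crossed a peer or down link
    for link in links:
        if link not in ("up", "peer", "down"):
            raise ValueError(f"unknown link type: {link}")
        if descending and link != "down":
            return False  # going up or sideways again creates a valley
        if link in ("peer", "down"):
            descending = True
    return True


# Illustrative link sequences for the four kinds of violation described above.
print(is_valley_free(["up", "peer", "down"]))          # up, across, down
print(is_valley_free(["up", "peer", "peer", "down"]))  # two peering links
print(is_valley_free(["up", "peer", "up", "down"]))    # up after peering
print(is_valley_free(["up", "down", "peer", "down"]))  # peering after down
print(is_valley_free(["up", "down", "up", "down"]))    # up after down
```

Only the first sequence passes; the other four correspond to the kinds of violations listed for options 7 to 10.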

Valley-freeness is accomplished through filtering BGP advertisements.
The filters on both sides of a BGP session reflect and enforce the
relationship between two autonomous systems:

Network A announces      Network B announces      Relationship
-----------------------  -----------------------  ---------------------
All prefixes             B’s prefixes and B’s     A is provider,
                         customer’s prefixes      B is customer
A’s prefixes and A’s     All prefixes             A is customer,
customer’s prefixes                               B is provider
A’s prefixes and A’s     B’s prefixes and B’s     A and B are peers
customer’s prefixes      customer’s prefixes
All prefixes             All prefixes             A and B provide
                                                  mutual backup (rare)
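
The first three rows of the table boil down to one export rule, sketched here in Python. The function name should_announce and the string labels are assumptions made for this illustration; the rare mutual backup arrangement is deliberately left out.

```python
ROUTE_SOURCES = ("self", "customer", "peer", "provider")
NEIGHBOR_TYPES = ("customer", "peer", "provider")


def should_announce(learned_from, announce_to):
    """Decide whether to announce a prefix to a neighbor.

    learned_from: where we got the prefix ("self" means we originate it).
    announce_to:  what the neighbor is to us.
    """
    if learned_from not in ROUTE_SOURCES:
        raise ValueError(f"unknown route source: {learned_from}")
    if announce_to not in NEIGHBOR_TYPES:
        raise ValueError(f"unknown neighbor type: {announce_to}")
    if announce_to == "customer":
        return True  # customers receive all prefixes
    # Peers and providers only receive our own and our customers' prefixes.
    return learned_from in ("self", "customer")


for source in ROUTE_SOURCES:
    row = {n: should_announce(source, n) for n in NEIGHBOR_TYPES}
    print(source, row)
```

Applying this rule on every session is what keeps paths valley-free: a prefix learned from a peer or provider is never passed on to another peer or provider.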

BGP peering configuration
As per the table above, the outgoing BGP filters on BGP sessions
towards transit providers and peers are the same. However,
incoming prefixes are treated differently, as is shown in example 10.

Example 10. A peering configuration


!
router bgp 65082
neighbor 203.0.113.83 remote-as 65083
neighbor 203.0.113.83 description IX peer 83
neighbor 203.0.113.83 maximum-prefix 10
neighbor 203.0.113.83 prefix-list in-prefixes in
neighbor 203.0.113.83 prefix-list out-prefixes out
neighbor 203.0.113.83 filter-list 2 out
neighbor 2001:db8:90::6:5083:1 remote-as 65083
no neighbor 2001:db8:90::6:5083:1 activate
!
address-family ipv6
neighbor 2001:db8:90::6:5083:1 activate
neighbor 2001:db8:90::6:5083:1 maximum-prefix 10
neighbor 2001:db8:90::6:5083:1 prefix-list in-ipv6-prefixes in
neighbor 2001:db8:90::6:5083:1 prefix-list out-ipv6-prefixes out
neighbor 2001:db8:90::6:5083:1 filter-list 2 out
exit-address-family
exit
!
ip prefix-list in-prefixes seq 5 deny 192.0.2.0/24 le 32
ip prefix-list in-prefixes seq 10 deny 203.0.113.0/24 le 32
ip prefix-list in-prefixes seq 15 permit 0.0.0.0/0 le 24
ip prefix-list out-prefixes seq 5 permit 192.0.2.0/24
!
ipv6 prefix-list in-ipv6-prefixes seq 5 deny 2001:db8:82::/48 le 128
ipv6 prefix-list in-ipv6-prefixes seq 10 deny 2001:db8:90::/64 le 128
ipv6 prefix-list in-ipv6-prefixes seq 15 permit ::/0 le 48
ipv6 prefix-list out-ipv6-prefixes seq 5 permit 2001:db8:82::/48
!

This example builds on previous examples 1 to 5; the duplicate parts
are not repeated here. The neighbor 203.0.113.83 is an internet
exchange peer. The BGP configuration for an internet exchange peer is
the same as for a private network interconnect peer.

We haven’t seen the maximum-prefix configuration command yet.
With this command, we can limit the number of prefixes we’ll accept
from a neighbor. This setting is useful for peers, but generally not for
transit providers and customers: transit providers send us all global
prefixes, so there’s no point in setting a limit on BGP sessions
towards transit providers. For customers, unless those are very large
networks in their own right, we’ll want to explicitly allow their
individual prefixes, so there’s no need to additionally limit the number of
prefixes we’ll accept.

In example 10, we limit the number of prefixes we accept from peer 83
to 10. In the Best practices chapter we’ll look at what would be a
good setting for the maximum prefix limit.

By default, when the number of prefixes received from the neighbor
exceeds 75% of the configured limit, the router will log a warning to its
logging buffer or other logging destination.
2020/05/10 12:06:07 BGP: %MAXPFX: No. of IPv4 Unicast prefix
received from 203.0.113.83 reaches 8, max 10

When the limit itself is exceeded, another log message is generated:

2020/05/10 12:07:40 BGP: %MAXPFXEXCEED: No. of IPv4 Unicast prefix
received from 203.0.113.83 11 exceed, limit 10

And the BGP session is disabled:


Router# show ip bgp summary
BGP router identifier 192.0.2.255, local AS number 65082

Neighbor V AS TblVer InQ OutQ Up/Down State/PfxRcd


203.0.113.83 4 65083 0 0 0 00:02:07 Idle (PfxCt)

The session stays down until restarted manually with the clear ip
bgp ... command. Additional arguments to the maximum-prefix
command can be one or more of the following: a percentage at which a
warning is generated; restart followed by a number of minutes after
which the session is restarted; and warning-only.
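
The default behavior described above can be summarized in a short Python sketch. This is a simplified model that follows the wording of the text, not the actual router implementation; the function name check_prefix_count and the returned labels are made up for this illustration.

```python
def check_prefix_count(received, limit, warning_percent=75):
    """Classify the number of prefixes received from a neighbor.

    Returns "exceeded" when the limit itself is exceeded (the session goes
    down), "warning" when more than warning_percent of the limit is in use,
    and "ok" otherwise.
    """
    if received > limit:
        return "exceeded"
    # Integer arithmetic avoids rounding issues with percentages.
    if received * 100 > limit * warning_percent:
        return "warning"
    return "ok"


# With maximum-prefix 10 as in example 10:
for count in (7, 8, 10, 11):
    print(count, check_prefix_count(count, 10))
```

With a limit of 10, the eighth prefix crosses the 75% threshold and the eleventh exceeds the limit, in line with the two log messages shown above.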

The other difference between the peer configuration and earlier transit
provider configurations, in addition to the maximum-prefix setting, is
the change to the in-prefixes and in-ipv6-prefixes filters. These
now filter out the address blocks 203.0.113.0/24 and
2001:db8:90::/64 and any sub-prefixes in incoming BGP updates.
Those are the prefixes used by the internet exchange, which contain all
the neighbor addresses for our IX peers:

Router# show bgp ipv6 summary


Neighbor V AS Up/Down State/PfxRcd PfxSnt
2001:db8:30:8201::1 4 65030 00:10:42 1 1
2001:db8:90::6:5083:1 4 65083 00:14:17 1 1

The next hop addresses for the routes learned from these IX peers also
fall within the IX “peering LAN” prefix. A BGP route for an IX prefix
may reroute these addresses, which may disrupt the BGP session and/
or disrupt the traffic flow towards peers. So it’s important to filter out
the IPv4 and IPv6 peering LAN prefixes of all internet exchanges that
your network connects to on all BGP sessions, not just the ones towards
the respective internet exchange peers.

In 2003, the Amsterdam Internet Exchange had to renumber its peering
LAN from a /24 to a /23 to accommodate the growing number of
connected routers. So everyone had to configure the new /23 on the
router interface that connects to the peering LAN. But someone
mistakenly typed /24 at the end of the new address. And they had BGP
set up such that prefixes configured on router interfaces were injected
into BGP and advertised to peers. These peers now saw half the
peering LAN's new /23 rerouted through the peer with the fat fingers.

The main effect was that BGP packets between other peers would now
flow through an extra router hop, which BGP doesn't allow for eBGP
sessions. So these sessions started to go down in large numbers. In
2014 the AMS-IX had to renumber again, this time from a /22 to a /21.
Same thing happened again.

It’s common for internet exchanges to encode the AS number in the IP
address, similar to how this is done in example 10. In the example, a
colon is placed between digits 8 and 9 and between digits 4 and 5
of the AS number, counting from the right. So for the 10-digit AS
number 4206508500 that’s 42:0650:8500. For the 5-digit AS number
65083 that would be 00:0006:5083, or just :6:5083 as per the IPv6
notation rules regarding leading zeros. The /64 prefix of the peering LAN
(2001:db8:90::/64) goes in front of the colon-separated AS number,
and :1 is added to the end for the AS’s first router connected to
the exchange, :2 for the second and so on. The results are
2001:db8:90::42:650:8500:1 and 2001:db8:90::6:5083:1,
respectively.
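
The numbering scheme can be reproduced with Python's standard ipaddress module. This is a sketch of the convention as described above, not a tool any exchange is known to use; the function name ix_address is hypothetical. Note that the decimal digits of the AS number are simply reused as hexadecimal digits in the address.

```python
import ipaddress


def ix_address(peering_lan, asn, router_index=1):
    """Build a peering LAN address that encodes the AS number.

    The AS number is zero-padded to 10 decimal digits and split into
    groups of 2, 4 and 4 digits; the router index forms the last group.
    """
    network = ipaddress.IPv6Network(peering_lan)
    digits = f"{asn:010d}"
    groups = [digits[0:2], digits[2:6], digits[6:10]]
    interface_id = 0
    for group in groups:
        # Read the decimal digits as if they were hexadecimal digits.
        interface_id = (interface_id << 16) | int(group, 16)
    interface_id = (interface_id << 16) | router_index
    return network[interface_id]


print(ix_address("2001:db8:90::/64", 65083))
print(ix_address("2001:db8:90::/64", 4206508500))
print(ix_address("2001:db8:90::/64", 65083, router_index=2))
```

Python may print a single zero group as :0: rather than ::, so the second address can look slightly different from the notation used in the text while being the same address.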

Peer groups
When peering at a large internet exchange, it’s not uncommon to have
BGP sessions with more than a hundred peers. That creates two
problems: when the router has to send a BGP update message, it has to do a
lot of work, and the configuration gets very long. Peer groups are
intended to solve both problems.

Although peer groups are especially useful for internet exchange
peering configurations, the word “peer” applies to the concept of a BGP
peer (neighbor); peer groups can be used for all types of BGP
neighbors.

When BGP neighbors are part of the same peer group, the router only
has to create one BGP update message for the group, and it can then
send copies of that message to all the members. Without peer groups,
the router has to go through the process of applying filters and policies
for each neighbor separately whenever it needs to send an update
message.

Peer groups also simplify the configuration because settings can be
applied to the group and then take effect for all the members, rather
than having to be specified for each member separately. Example 11
duplicates the IPv4 part of the peering configuration from example 10
and adds two new peers.

Example 11. An IPv4 peer group configuration
!
router bgp 65082
neighbor ix-ipv4-peers peer-group
neighbor ix-ipv4-peers description IPv4 IX peers, max 10 prefixes
neighbor ix-ipv4-peers maximum-prefix 10
neighbor ix-ipv4-peers prefix-list in-prefixes in
neighbor ix-ipv4-peers prefix-list out-prefixes out
neighbor ix-ipv4-peers filter-list 2 out
neighbor 203.0.113.83 remote-as 65083
neighbor 203.0.113.83 peer-group ix-ipv4-peers
neighbor 203.0.113.83 description IX peer 83
neighbor 203.0.113.84 remote-as 65084
neighbor 203.0.113.84 peer-group ix-ipv4-peers
neighbor 203.0.113.84 description IX peer 84
neighbor 203.0.113.85 remote-as 4206508500
neighbor 203.0.113.85 peer-group ix-ipv4-peers
neighbor 203.0.113.85 description IX peer 85
neighbor 203.0.113.85 maximum-prefix 100
!

Example 11 shows that peer group members don’t have to share all
settings: the remote AS is different for each peer. They do have to share
the settings that may influence outgoing updates, i.e., outbound filters
and route maps. Other settings, including inbound filters and route
maps, may differ. In the example, the last line specifies a maximum
prefix limit of 100 for IX peer 85, which overrules the limit of 10 that
would otherwise be inherited from the peer group.

For IPv6, peer groups get more complex, as we can see in example 12,
which adds a peer group version of the IPv6 part of example 10.

Example 12. An IPv6 peer group configuration


!
router bgp 65082
network 192.0.2.0/24
neighbor ix-ipv6-peers peer-group
neighbor ix-ipv6-peers description IPv6 IX peers, max 10 prefixes
no neighbor ix-ipv6-peers activate
neighbor 2001:db8:90::6:5083:1 remote-as 65083
neighbor 2001:db8:90::6:5083:1 description IX peer 83
no neighbor 2001:db8:90::6:5083:1 activate

neighbor 2001:db8:90::6:5084:1 remote-as 65084
neighbor 2001:db8:90::6:5084:1 description IX peer 84
no neighbor 2001:db8:90::6:5084:1 activate
neighbor 2001:db8:90:0:42:650:8500:1 remote-as 4206508500
neighbor 2001:db8:90:0:42:650:8500:1 description IX peer 85
no neighbor 2001:db8:90:0:42:650:8500:1 activate
!
address-family ipv6
network 2001:db8:82::/48
neighbor ix-ipv6-peers activate
neighbor ix-ipv6-peers maximum-prefix 10
neighbor ix-ipv6-peers prefix-list in-ipv6-prefixes in
neighbor ix-ipv6-peers prefix-list out-ipv6-prefixes out
neighbor ix-ipv6-peers filter-list 2 out
neighbor 2001:db8:90::6:5083:1 peer-group ix-ipv6-peers
neighbor 2001:db8:90::6:5084:1 peer-group ix-ipv6-peers
neighbor 2001:db8:90:0:42:650:8500:1 peer-group ix-ipv6-peers
exit-address-family
!

If a peer group is applied under the router bgp heading, the peer
group applies both to session related settings, such as the remote AS,
the description and neighbor ... shutdown / no neighbor ...
shutdown, and to settings specific to the IPv4 address family, such
as filters. The same peer group or another one can be applied under
the address-family ipv6 unicast heading and will then govern
IPv6 specific settings.

On Cisco routers, peer groups are no longer necessary for performance
reasons, as Cisco routers have supported automatically created
dynamic update peer groups for some time. The router will automatically
group peers with the same outgoing policies together in order to
reduce the overhead of generating update messages.
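
The idea behind grouping peers by outgoing policy can be illustrated with a few lines of Python. This is a conceptual sketch only and says nothing about how any router implements it; the function name update_groups is made up, the outbound policy is reduced to a tuple of prefix list and filter list names, and the neighbor 198.51.100.1 with its customer-out prefix list is hypothetical.

```python
from collections import defaultdict


def update_groups(neighbors):
    """Group neighbor addresses by their outbound policy.

    neighbors maps a neighbor address to a tuple describing its outbound
    policy. Neighbors with an identical policy can share one generated
    update message.
    """
    groups = defaultdict(list)
    for address, outbound_policy in neighbors.items():
        groups[outbound_policy].append(address)
    return dict(groups)


neighbors = {
    # (outbound prefix list, outbound filter list) as in example 11
    "203.0.113.83": ("out-prefixes", "2"),
    "203.0.113.84": ("out-prefixes", "2"),
    "203.0.113.85": ("out-prefixes", "2"),
    # Hypothetical neighbor with a different outbound policy
    "198.51.100.1": ("customer-out", None),
}

groups = update_groups(neighbors)
for policy, members in groups.items():
    print(policy, members)
```

Here the three IX peers end up in one group, so an update has to be built only once for all of them.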

Cisco also has a newer mechanism for managing configuration
complexity: BGP templates. Unlike peer groups, templates are split into
peer session templates and peer policy templates, so the configuration
of session settings and IPv4 settings is no longer commingled.

Internet exchange route servers
The advantage of internet exchanges is that they make it possible to
peer with a large number of other networks in one place. The
downside is that it still takes a lot of work to contact all these other networks
and set up BGP sessions with them. For this reason, internet exchanges
have route servers. Unlike normal peers, a route server propagates
paths learned from one peer to all its other peers. Example 13 adds
IPv4 and IPv6 BGP sessions with the internet exchange route server.

Example 13. Peering with a route server


!
router bgp 65082
neighbor 203.0.113.90 remote-as 65090
neighbor 203.0.113.90 peer-group ix-ipv4-peers
neighbor 203.0.113.90 description IX route server
neighbor 203.0.113.90 maximum-prefix 100
neighbor 2001:db8:90::6:5090:1 remote-as 65090
neighbor 2001:db8:90::6:5090:1 description IX route server
no neighbor 2001:db8:90::6:5090:1 activate
!
address-family ipv6
neighbor 2001:db8:90::6:5090:1 peer-group ix-ipv6-peers
exit-address-family
!

Route servers take advantage of the fact that in this situation, where
both of the route server’s neighbors are connected to the same layer 2
network, BGP is smart enough to keep the next hop address the same.
This means that paths learned through the route server have the same
next hop address as paths learned directly, as we can see for the
prefixes 10.0.83.0/24 and 10.0.84.0/24 that are learned both directly
and from the route server:
Router# show ip bgp
BGP table version is 4, local router ID is 192.0.2.251, vrf id 0
Default local pref 100, local AS 65082
Status codes: s suppressed, d damped, h history, * valid, > best,
= multipath,
i internal, r RIB-failure, S Stale, R Removed

Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop Metric Weight Path


* 10.0.83.0/24 203.0.113.83 0 65090 65083 i
*> 203.0.113.83 0 0 65083 i
* 10.0.84.0/24 203.0.113.84 0 65090 65084 i
*> 203.0.113.84 0 0 65084 i
* 10.0.85.0/24 203.0.113.85 0 65090 4206508500 i
*> 203.0.113.85 0 0 4206508500 i
*> 192.0.2.0/24 0.0.0.0 0 32768 i

Note that the next hop address is the same for the path learned directly
and for the path learned from the route server. When a BGP router
re-advertises a path from one router on a subnet to another router on the
same subnet, it doesn't update the next hop address, so packets can flow
directly between the two other routers without going through the one in
the middle.
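
The next hop rule described here can be written down as a small Python function. It is a simplified model of the behavior, not router code: the function name advertised_next_hop is made up, the route server address 203.0.113.90 comes from example 13, and 203.0.113.82 is an assumed address for our own router on the peering LAN.

```python
import ipaddress


def advertised_next_hop(learned_next_hop, receiver, own_address, shared_subnet):
    """Return the next hop to put in an update sent to receiver.

    If the learned next hop and the receiver are on the same subnet, the
    next hop is left unchanged; otherwise the advertising router inserts
    its own address.
    """
    subnet = ipaddress.ip_network(shared_subnet)
    same_subnet = (
        ipaddress.ip_address(learned_next_hop) in subnet
        and ipaddress.ip_address(receiver) in subnet
    )
    return learned_next_hop if same_subnet else own_address


# The route server passes on a path learned from peer 83 to our router.
print(advertised_next_hop(
    learned_next_hop="203.0.113.83",
    receiver="203.0.113.82",        # assumed address of our router
    own_address="203.0.113.90",     # the route server
    shared_subnet="203.0.113.0/24",
))
```

Because both routers sit on the peering LAN, the route server leaves 203.0.113.83 in place as the next hop, which is what the show ip bgp output above reflects.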

Paths through the route server are an extra AS hop longer, as the route
server adds its own AS to the AS path as required by the BGP
specification. This means the bilateral peering sessions (directly with the peer
in question) have a shorter AS path so these paths are preferred over
multilateral peering through a route server.

However, in practice route servers don’t include their own AS in the
AS path. This is accomplished with the command neighbor ...
attribute-unchanged as-path med in the route server’s configuration,
which we'll do in example 14. As the name of the command suggests,
with this configuration, the route server will pass along the AS path
and MED unchanged.

The router should automatically refrain from updating the next
hop address if both the source and the destination of an
update are on the same subnet, as explained above, but it looks
like there is a bug in the current version of FRRouting (and
Quagga) so this doesn’t happen for IPv6.

Many routers check if the first AS in AS paths of incoming updates is
indeed the neighbor AS, and if not, log one or more errors and tear
down the BGP session:
2020/05/14 14:01:21 BGP: 203.0.113.90 incorrect first AS (must be
65090)
2020/05/14 14:01:21 BGP: %NOTIFICATION: sent to neighbor
203.0.113.90 3/11 (UPDATE Message Error/Malformed AS_PATH) 0 bytes

FRRouting has this check turned off by default. If you turn it on
with neighbor ... enforce-first-as, FRRouting doesn't tear down the
BGP session the way a Cisco router does, as shown above. Rather,
prefixes with a different first AS than the neighbor AS in the AS path
are filtered out.
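
The first AS check and the two reactions described above can be modeled as follows. This Python sketch only mirrors the behavior as described in the text; the function name process_update and the labels "reset" and "filter" are invented for this illustration.

```python
def process_update(as_path, neighbor_as, enforce=True, on_violation="reset"):
    """Apply the first-AS check to the AS path of an incoming update.

    on_violation="reset"  models tearing down the session.
    on_violation="filter" models only dropping the offending prefix.
    """
    if on_violation not in ("reset", "filter"):
        raise ValueError(f"unknown reaction: {on_violation}")
    first_as_ok = bool(as_path) and as_path[0] == neighbor_as
    if not enforce or first_as_ok:
        return "accept"
    return "reset session" if on_violation == "reset" else "drop prefix"


# A route server in AS 65090 that leaves its own AS out of the AS path:
print(process_update([65083], 65090, on_violation="reset"))
print(process_update([65083], 65090, on_violation="filter"))
print(process_update([65083], 65090, enforce=False))
# A route server that does add its own AS passes the check:
print(process_update([65090, 65083], 65090))
```

This shows why the check has to be disabled when peering with a route server that keeps its AS out of the path.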

When peering with route servers that don't add their AS to the AS path,
routers that enforce this check by default need to have it turned off.
On Cisco routers, that’s done with the no bgp enforce-first-as
configuration command:

Example 14. Disabling the first AS in the AS path consistency check


!
router bgp 65082
no bgp enforce-first-as
!

With the route server now leaving out its AS number from the AS path,
the prefixes learned directly from peers and those same prefixes
learned from the route server look the same:
Router# show ip bgp
Network Next Hop Metric LocPrf Weight Path
* 10.0.83.0/24 203.0.113.83 0 0 65083 i
*> 203.0.113.83 0 0 65083 i
* 10.0.84.0/24 203.0.113.84 0 0 65084 i
*> 203.0.113.84 0 0 65084 i
* 10.0.85.0/24 203.0.113.85 0 0 4206508500 i
*> 203.0.113.85 0 0 4206508500 i

It is of course still possible to list all paths learned from the route
server using the show ip bgp neighbors ... routes command:

Router# show ip bgp neighbors 203.0.113.90 routes
Network Next Hop Metric LocPrf Weight Path
*> 10.0.83.0/24 203.0.113.83 0 0 65083 i
*> 10.0.84.0/24 203.0.113.84 0 0 65084 i
* 10.0.85.0/24 203.0.113.85 0 0 4206508500 i

For prefix 10.0.85.0/24 the direct path is preferred, so the path to
that prefix through the route server doesn’t have the “best” indicator.
For 10.0.83.0/24 and 10.0.84.0/24 the route server path is
preferred. The reason for this is that the direct BGP sessions toward
these neighbors were reset, so the path learned through the route server
is older than the direct path.

See the section on the multi-exit discriminator in the next chapter for
an example where route server paths are given lower priority than
direct paths.

Traffic engineering

Traffic engineering is the art and science of influencing routing
decisions to make sure that available network capacity is used as
effectively as possible. As BGP doesn’t know how much bandwidth is
available on the different paths towards a destination, it will often not
make optimal use of available network capacity.

The BGP path selection algorithm


First, let’s have a look at the rules for how BGP makes its routing
decisions when multiple paths are available. These rules are known as the
BGP path selection algorithm. The BGP specification [RFC 4271] has 7
steps, Cisco’s version has 13 steps:

Cisco  RFC 4271  Rule
-----  --------  ---------------------------------------------------
1                Prefer the path with the highest WEIGHT.
2      *         Prefer the path with the highest LOCAL_PREF.
3                Prefer the path that was locally originated.
4      a         Prefer the path with the shortest AS_PATH.
5      b         Prefer the path with the lowest origin type.
6      c         Prefer the path with the lowest multi-exit
                 discriminator (MED).
7      d         Prefer eBGP over iBGP paths.
8      e         Prefer the path with the lowest IGP metric to the
                 BGP next hop.
9                Determine if multiple paths require installation in
                 the routing table for BGP multipath.
10               When both paths are external, prefer the path that
                 was received first (the oldest one).
11     f         Prefer the route that comes from the BGP router
                 with the lowest router ID.
12               If the originator or router ID is the same for
                 multiple paths, prefer the path with the minimum
                 cluster list length.
13     g         Prefer the path that comes from the lowest
                 neighbor address.

Most of the text in the table above comes from Cisco’s description,
which contains additional notes for most steps. Steps 1, 3 and 5 usually
don’t come into play. Step 9 is only relevant with BGP multipath,
which we’ll discuss later in this chapter. Steps 10 and above are pure
tie breakers that make sure the router will eventually be able to select a
path even though there is no real reason to prefer one over the other.
This leaves steps 2, 4, 6, 7 and 8.

When a BGP router receives an update, it runs the path selection
algorithm for each prefix contained in the update. Obviously, if after the
update there is only one valid path towards a prefix, it selects that path
as the best one. If after the update there are multiple routes or paths
towards a prefix, the router evaluates the steps in the algorithm until a
single best path is left.

RFC 4271 really only specifies one rule to select the best path: Cisco’s
step 2, prefer the path with the highest LOCAL_PREF. RFC 4271
steps a - g are all considered tie breakers. The local preference attribute
is always present for paths learned from other routers within our own
AS over iBGP. When a router learns a path from an external AS, or
originates a path itself, there may not be a local preference value
available. In that case, 100, the default local preference value, will be used.

The router now identifies the maximum local preference value for all
the paths under consideration. Then all paths that have a lower local
preference than this maximum are removed from consideration. (They
remain in the BGP table.) So if there are three paths with local preferences
of 90, 100 and 110, respectively, the maximum is 110. The paths
with the local preference of 90 and 100 are removed from consideration.
This leaves the path with a local preference of 110, which is now
selected as the best path and the rest of the steps are skipped.

However, suppose the router learns a new path that also has a local
preference of 110. In this case, the maximum local preference is still
110, and the paths with 90 and 100 are still removed from consideration.
This leaves the two paths that have a local preference of 110,
which then both move on to the next step in the algorithm.
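
The local preference step described above can be sketched in a few lines of Python. This is only an illustration, not how any router implements it: the dictionary layout and the names local_pref_step and DEFAULT_LOCAL_PREF are assumptions made for this example.

```python
DEFAULT_LOCAL_PREF = 100  # the default value mentioned in the text


def local_pref_step(paths):
    """Keep only the paths that share the highest local preference."""
    def effective(path):
        value = path.get("local_pref")
        # A path without a local preference is treated as having the default.
        return DEFAULT_LOCAL_PREF if value is None else value

    highest = max(effective(p) for p in paths)
    return [p for p in paths if effective(p) == highest]


# Hypothetical paths with local preferences of 90, the default (100) and 110.
paths = [
    {"name": "A", "local_pref": 90},
    {"name": "B", "local_pref": None},
    {"name": "C", "local_pref": 110},
]
print([p["name"] for p in local_pref_step(paths)])

# A second path with local preference 110 arrives: both survive this step.
paths.append({"name": "D", "local_pref": 110})
print([p["name"] for p in local_pref_step(paths)])
```

When only one path survives, it is the best path. When several survive, they are handed to the next step.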

Cisco’s step 4 and the BGP specification’s tie breaker a is prefer the
path with the shortest AS_PATH. That is the AS path with the fewest
AS numbers in it. So the AS path 64999 65000 is shorter than the AS
path 12 34 56. Like with the local preference, if only one path/route
has the shortest AS path, that one is declared best and the remaining
steps in the algorithm are skipped. If multiple paths/routes share the
shortest AS path length, those move on to the next step.

Step 6 / c is prefer the path with the lowest multi-exit discriminator
(MED). The purpose of the MED (also known as “metric”) is to choose
between multiple paths towards a single neighboring AS. So the MED
step is only applied to paths learned from the same neighbor AS.
However, it is possible to specify bgp always-compare-med, and the
router will then also compare MEDs for paths learned from different
neighboring autonomous systems, if the algorithm reaches this step.

The MED is an optional, non-transitive attribute, so it only survives
one eBGP hop. As a result, a path often has no MED. If none of the paths
that reach the MED comparison step have the MED attribute, then all
paths move on to the next step. When comparing a path that has an
MED with a path that doesn’t have an MED, the path that doesn’t have
an MED should be considered to have the lowest possible MED, but
it’s dangerous to rely on that, as we’ll see in the section Setting and adjusting
the MED later this chapter.
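
As a rough sketch of the MED step, the Python below compares MEDs only within groups of paths that share the same neighboring AS, unless an always_compare_med flag is set. The function name, the flag name and the dictionary layout are assumptions for this illustration. A missing MED is treated as 0, the lowest possible value. Real implementations have more subtleties than this.

```python
from collections import defaultdict


def med_step(paths, always_compare_med=False):
    """Drop paths that lose the MED comparison within their group."""
    def med(path):
        # A missing MED counts as the lowest possible MED.
        return 0 if path.get("med") is None else path["med"]

    groups = defaultdict(list)
    for path in paths:
        # The neighboring AS is the first AS number in the AS path.
        key = None if always_compare_med else path["as_path"][0]
        groups[key].append(path)

    survivors = []
    for group in groups.values():
        lowest = min(med(p) for p in group)
        survivors.extend(p for p in group if med(p) == lowest)
    return survivors


# Hypothetical paths: two through AS 65030, one through AS 65040 without MED.
paths = [
    {"name": "a", "as_path": [65030, 65083], "med": 10},
    {"name": "b", "as_path": [65030, 65083], "med": 20},
    {"name": "c", "as_path": [65040, 65083], "med": None},
]
print(sorted(p["name"] for p in med_step(paths)))
print(sorted(p["name"] for p in med_step(paths, always_compare_med=True)))
```

By default, path b loses to path a, but path c is never compared with either. With the flag set, all three paths are compared with each other.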

Step 7 / d is to prefer eBGP over iBGP paths. At this point, interaction
with other routing protocols becomes relevant, as we’ll discuss in the
chapter iBGP.

Step 8/e is prefer the path with the lowest IGP metric to the BGP
next hop. If there are still multiple paths under consideration at this
point, those paths are equally preferred purely from a BGP viewpoint,
so at this step, the algorithm looks at information from the interior
gateway protocol such as OSPF. This step compares the IGP metrics for
the next hop addresses of the paths still under consideration, effectively
preferring the path that requires traveling the shortest distance
through the internal network.

This means that tuning of the cost values in the internal network will
influence BGP, but only if all the BGP attributes are the same or very
similar. That will usually be the case for paths learned from the same
neighboring AS in multiple locations. This step is the one that makes
BGP use hot potato / early exit routing.
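
A minimal sketch of this step in Python, assuming the IGP costs towards each BGP next hop are already known. The next hop addresses, the cost values and the function name igp_metric_step are made up for this illustration.

```python
def igp_metric_step(paths, igp_metric):
    """Keep the paths whose BGP next hop is closest according to the IGP."""
    lowest = min(igp_metric[p["next_hop"]] for p in paths)
    return [p for p in paths if igp_metric[p["next_hop"]] == lowest]


# Hypothetical IGP costs (for instance OSPF) towards two exit routers.
igp_metric = {"10.0.0.1": 10, "10.0.0.2": 30}

paths = [
    {"next_hop": "10.0.0.1", "as_path": [65030, 65083]},
    {"next_hop": "10.0.0.2", "as_path": [65030, 65083]},
]
print(igp_metric_step(paths, igp_metric))
```

The nearest exit wins, which is exactly the hot potato behavior: traffic leaves the internal network as early as possible.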

In many cases, we’ll want to overrule or adjust the results of the path
selection algorithm. This is done by using a route map to set the local
preference to a certain value, make the AS path longer (prepending) or
change the MED.

Route maps
Changing the path attributes that are used by the BGP best path selection
algorithm is done with route maps. We’ve used route maps earlier
in the Filtering BGP chapter, but before we continue, it’s a good idea to
cover the workings of route maps in a bit more detail.

A route map is basically a simple computer program. Each route map
has a name, and at least one clause. The clauses are if-then constructions,
with both the if (called match) and the then (called set) being
optional. Each clause also has a permit or a deny.

When applied to outgoing or incoming updates to/from a BGP neighbor,
each prefix is passed through the route map clauses in order. We
normally use permit clauses. When those match, any set actions are
executed, and then the prefix is accepted into the BGP RIB (for route
maps applied in the incoming direction) or the prefix is propagated to
the BGP neighbor (for route maps applied in the outgoing direction).

However, if a route map clause is a deny clause, if there is a match, the
prefix is not allowed into the BGP RIB or propagated to the neighbor. If
there was no match, the next route map clause is applied. When applied
to BGP sessions, route maps have the usual “implicit deny” behavior.
So when a prefix progresses through the route map without
matching a permit clause, that prefix is not added to the BGP RIB or
sent to the BGP neighbor. (A clause without a match condition matches
everything.)
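
Since a route map behaves like a small program, its evaluation order can be mimicked in Python. The sketch below is an illustration under assumptions: clauses are plain dictionaries, match is an optional function, and the name apply_route_map is invented for this example.

```python
def apply_route_map(clauses, route):
    """Return the (possibly modified) route if permitted, None if denied."""
    for clause in clauses:
        match = clause.get("match")
        if match is not None and not match(route):
            continue  # no match: move on to the next clause
        if clause["action"] == "deny":
            return None  # a matching deny clause rejects the prefix
        updated = dict(route)
        updated.update(clause.get("set", {}))  # execute the set actions
        return updated
    return None  # implicit deny: no clause matched


# Hypothetical route map: deny anything originated by AS 64999,
# permit everything else and set the local preference to 110.
clauses = [
    {"action": "deny", "match": lambda r: r["as_path"][-1] == 64999},
    {"action": "permit", "set": {"local_pref": 110}},
]

print(apply_route_map(clauses, {"prefix": "10.0.83.0/24", "as_path": [65083]}))
print(apply_route_map(clauses, {"prefix": "10.9.9.0/24", "as_path": [65030, 64999]}))
print(apply_route_map([], {"prefix": "10.0.83.0/24", "as_path": [65083]}))
```

The first call returns the route with the local preference added, the second is rejected by the deny clause, and the third shows the implicit deny of a route map without any matching clause.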

In addition to changing BGP attributes, route maps can also be used
for more complex filtering behaviors. For instance, this route map allows
prefix lengths all the way up to /32 (IPv4) or /128 (IPv6) from AS
64999, but no longer than /24 and /48, respectively, from anyone else:

!
ip prefix-list more-than-24 seq 5 permit 0.0.0.0/0 ge 25
!
ipv6 prefix-list more-than-48 seq 5 permit ::/0 ge 49
!
ip as-path access-list as64999 permit ^64999$
!
route-map long-prefixes-from-64999 permit 10
match as-path as64999
match ip address prefix-list more-than-24
!
route-map long-prefixes-from-64999 permit 20
match as-path as64999
match ipv6 address prefix-list more-than-48
!
route-map long-prefixes-from-64999 deny 30
match ip address prefix-list more-than-24
!
route-map long-prefixes-from-64999 deny 40
match ipv6 address prefix-list more-than-48
!
route-map long-prefixes-from-64999 permit 50
!

The permit 10 and permit 20 clauses both match on the AS path
64999 through the AS path access list as64999. In addition, the permit
10 one also matches on any prefix longer than IPv4 /24 through the
more-than-24 prefix list. The permit 20 one does the same for IPv6 /48.
Having two match statements in one route map clause means that
both must match. When that’s the case, no set actions are performed,
but the prefix is permitted in.

IPv4 prefixes up to /24 and IPv6 prefixes up to /48, as well as any prefixes
with an AS path that isn’t exactly 64999, didn’t match, so they go
on to the deny 30 and deny 40 clauses. Those match all prefixes
longer than /24 and longer than /48, respectively. A match in a deny
route map clause means the prefix is denied.

IPv4 prefixes up to /24 and IPv6 prefixes up to /48 still haven’t
matched, so they reach the permit 50 clause. This one doesn’t have a
match, so it matches everything, and all the prefixes that make it to
here are permitted.
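
The intended outcome of the five clauses can be summarized as a small Python function. This is a sketch of the policy as described in the text, not a translation of how the router evaluates prefix lists. The function name is invented, and the AS path is assumed to be available as a list of AS numbers.

```python
import ipaddress


def long_prefixes_from_64999(prefix, as_path):
    """Mimic the outcome of the long-prefixes-from-64999 route map."""
    network = ipaddress.ip_network(prefix)
    limit = 24 if network.version == 4 else 48
    too_long = network.prefixlen > limit
    from_64999 = as_path == [64999]  # the AS path is exactly 64999

    if too_long and from_64999:
        return "permit"  # clauses permit 10 (IPv4) and permit 20 (IPv6)
    if too_long:
        return "deny"    # clauses deny 30 (IPv4) and deny 40 (IPv6)
    return "permit"      # clause permit 50 matches everything that is left


print(long_prefixes_from_64999("10.0.0.0/28", [64999]))
print(long_prefixes_from_64999("10.0.0.0/28", [65030, 64999]))
print(long_prefixes_from_64999("2001:db8::/48", [65030]))
```

Only the second call is denied: the prefix is longer than /24 and the AS path is not exactly 64999.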

An easy mistake to make is to leave out prefix-list in a
line like match ipv6 address prefix-list more-than-48.
In that case, a regular access list will be used for matching
prefixes. That’s probably not what you intended, and not
something you should want either: access lists are normally
used to filter packets rather than prefixes, so using one here
is just a more complex way to achieve the same thing as with
a prefix list.

Setting the local preference


By giving certain paths/routes a higher local preference, those are always
preferred. So the local preference is a useful tool when we know
exactly how we want traffic to flow. Good examples of this are paths
learned from customers, which should always be preferred over indirect
paths to reach a customer. Or paths learned from peers, which
should be preferred over indirect paths to reach a peer. After all, if the
indirect path is better, why bother peering in the first place? Example
15 extends the configuration from examples 11 and 12 to increase the
local preference for prefixes learned from peers to 110.

Example 15. Increasing the local preference
!
router bgp 65082
neighbor ix-ipv4-peers peer-group
neighbor ix-ipv4-peers description IPv4 IX peers, max 10 prefixes
neighbor ix-ipv4-peers maximum-prefix 10
neighbor ix-ipv4-peers prefix-list in-prefixes in
neighbor ix-ipv4-peers prefix-list out-prefixes out
neighbor ix-ipv4-peers route-map peers-in in
neighbor ix-ipv4-peers filter-list 2 out
neighbor ix-ipv6-peers peer-group
neighbor ix-ipv6-peers description IPv6 IX peers, max 10 prefixes
no neighbor ix-ipv6-peers activate
!
address-family ipv6
network 2001:db8:82::/48
neighbor ix-ipv6-peers activate
neighbor ix-ipv6-peers maximum-prefix 10
neighbor ix-ipv6-peers prefix-list in-ipv6-prefixes in
neighbor ix-ipv6-peers prefix-list out-ipv6-prefixes out
neighbor ix-ipv6-peers route-map peers-in in
neighbor ix-ipv6-peers filter-list 2 out
!
route-map peers-in permit 10
set local-preference 110
!

The neighbor ... route-map peers-in in command applied to
the IPv4 and IPv6 peer groups applies the peers-in route map to incoming
updates from peers. The route map has no match statement, so
the set statement applies to all updates, setting the local preference to
110. Remember to do a clear ip bgp ... in after such a configuration
change to make sure the BGP table is updated to reflect the new
policy. This gives the following result:

Router# show ip bgp


BGP table version is 12, local router ID is 192.0.2.251, vrf id 0

Network Next Hop LocPrf Weight Path


*> 10.0.83.0/24 203.0.113.83 110 0 65083 i
* 192.0.2.21 0 65030 65083 i
* 192.0.2.41 0 65040 65083 i
*> 10.0.84.0/24 203.0.113.84 110 0 65084 65084 65084 i
* 192.0.2.21 0 65030 65084 i
* 192.0.2.41 0 65040 65084 i
*> 10.0.85.0/24 203.0.113.85 110 0 4206508500 i
* 192.0.2.21 0 65030 4206508500 i
* 192.0.2.41 0 65040 4206508500 i
*> 192.0.2.0/24 0.0.0.0 32768 i

We get each peer’s prefix three times: through AS 65030, through AS
65040 and directly. In each case, the direct path is preferred, as per the
110 local preference. In the case of ASes 65083 and 4206508500, the
direct AS path is also shorter than the path with the extra hop
through AS 65030 or 65040. However, AS 65084 prepends its AS
twice towards us, making the AS path 65084 65084 65084 and thus
longer than 65030 65084 or 65040 65084. But the higher local preference
still makes the peering path preferred, despite the longer AS path.

As the name implies, the local preference is local: modifying it only changes
path selection by your own routers, which impacts outgoing traffic.
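
The interplay between local preference and AS path length for 10.0.84.0/24 can be reproduced with a sort key in Python. The sketch below only models these two steps, and the names best_path and the dictionary layout are assumptions for this illustration. The next hops and AS paths are taken from the BGP table above.

```python
def best_path(paths):
    """Highest local preference first, then the shortest AS path."""
    return min(
        paths,
        key=lambda p: (-p.get("local_pref", 100), len(p["as_path"])),
    )


# The three paths towards 10.0.84.0/24 from the BGP table.
paths = [
    {"next_hop": "203.0.113.84", "local_pref": 110, "as_path": [65084, 65084, 65084]},
    {"next_hop": "192.0.2.21", "as_path": [65030, 65084]},
    {"next_hop": "192.0.2.41", "as_path": [65040, 65084]},
]
print(best_path(paths)["next_hop"])

# Without the route map, the peering path falls back to the default of 100
# and loses on AS path length.
paths[0].pop("local_pref")
print(best_path(paths)["as_path"])
```

With the local preference of 110 in place, the prepended peering path wins. Once it is removed, one of the two-hop paths through an ISP is selected instead.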

AS path prepending
Changing the local preference is very useful when we want all traffic to
prefer paths learned from certain neighboring ASes over other neighboring
ASes, but it’s a rather crude tool when there are multiple network
paths, and we just want to move some traffic from one to another.
In this case, we don’t want to completely sidestep the entire BGP path
selection algorithm, we just want to nudge it a bit. We can do this by
making the AS path a bit longer. In example 16, the peering with networks
83, 84 and 85 has been shut down, so the prefixes those networks
originate come in through ISPs 30 and 40. 83 prepends the AS
path for the prefix it announces towards ISP 30; 84 prepends towards
ISP 40 and 85 doesn’t prepend.

Example 16. Network 83 prepends towards ISP 30


!
router bgp 65083
network 10.0.83.0/24
neighbor 198.51.100.5 remote-as 65030
neighbor 198.51.100.5 description ISP X connection 1
neighbor 198.51.100.5 route-map prepend1 out
neighbor 203.0.113.82 remote-as 65082
neighbor 203.0.113.82 description peer R
neighbor 203.0.113.82 shutdown
!
route-map prepend1 permit 10
set as-path prepend 65083
!

Example 16. Network 84 prepends towards ISP 40


!
router bgp 65084
network 10.0.84.0/24
neighbor 198.51.100.45 remote-as 65040
neighbor 198.51.100.45 description ISP Y connection 1
neighbor 198.51.100.45 route-map prepend1 out
neighbor 203.0.113.82 remote-as 65082
neighbor 203.0.113.82 description peer R
neighbor 203.0.113.82 shutdown
!
route-map prepend1 permit 10
set as-path prepend 65084
!

In both cases, the route map prepend1 is applied to outgoing updates
towards one ISP. AS path prepending can be done for outgoing BGP
updates, which influences incoming traffic, or for incoming BGP updates,
in which case it influences outgoing traffic. (Or combine both.)
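
What prepending does to the AS path can be shown with a short Python sketch. It assumes that a router adds its own AS number once when advertising a prefix over eBGP, plus one extra copy per prepend. The function name advertise is invented for this illustration.

```python
def advertise(as_path, local_as, prepends=0):
    """Return the AS path as seen by an eBGP neighbor of local_as."""
    # One copy of our own AS is always added; prepends add extra copies.
    return [local_as] * (1 + prepends) + list(as_path)


# AS 65083 originates its prefix and prepends once towards ISP 30.
towards_isp30 = advertise([], 65083, prepends=1)
# ISP 30 (AS 65030) passes the path on without prepending.
seen_by_82 = advertise(towards_isp30, 65030)
print(seen_by_82)

# Towards ISP 40, AS 65083 does not prepend.
print(advertise(advertise([], 65083), 65040))
```

The first path has three AS numbers in it and the second has two, so a router comparing only AS path lengths prefers the path through ISP 40.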

Also, the BGP session with peer 203.0.113.82 is shut down. (Use no
neighbor ... shutdown to bring the session back up.) The result is
the following BGP table entries in network 82:

Router# show ip bgp

Network Next Hop LocPrf Weight Path


* 10.0.83.0/24 192.0.2.21 0 65030 65083 65083 i
*> 192.0.2.41 0 65040 65083 i
* 10.0.84.0/24 192.0.2.41 0 65040 65084 65084 i
*> 192.0.2.21 0 65030 65084 i
* 10.0.85.0/24 192.0.2.21 0 65030 4206508500 i
*> 192.0.2.41 0 65040 4206508500 i

Each of the three prefixes can be reached over either AS 65030 or AS
65040. In each case, there is no local preference, which means that for
10.0.83.0/24 and 10.0.84.0/24 the route with the shorter AS path
is chosen. There is also no MED (metric), but that wouldn't have
changed the results. For the 10.0.85.0/24 prefix, the AS path doesn’t
provide any guidance, so let’s look at this prefix in more detail:
Router# show ip bgp 10.0.85.0/24
BGP routing table entry for 10.0.85.0/24, version 20
Paths: (2 available, best #2, table default)
Not advertised to any peer
65030 4206508500
192.0.2.21 from 192.0.2.21 (198.51.100.223)
Origin IGP, valid, external
Community: 65030:1
Last update: Tue Nov 1 15:29:56 2022
65040 4206508500
192.0.2.41 from 192.0.2.41 (198.51.100.255)
Origin IGP, valid, external, best (Older Path)
Community: 65040:2
Last update: Tue Nov 1 15:30:06 2022

This doesn’t seem to make much sense, as the path through AS 65030
should be preferred by step 10 (prefer the oldest, last update 15:29:56
is older than 15:30:06), step 11 (prefer the lowest router ID, with
198.51.100.223 being lower than 198.51.100.255) as well as step 13
(prefer the lowest neighbor address, with 192.0.2.21 being lower
than 192.0.2.41). However, the update at 15:30:06 was due to a soft
reset (clear ip bgp 192.0.2.41 in). So although the path was updated
at that time, the update didn’t change anything, so the path
through AS 65040 still counts as the oldest and is thus preferred by
step 10, as indicated by (Older Path).

It’s rather impolite to prepend using another network’s AS
number, with the exception of a service provider prepending
on behalf of a customer using that customer's AS number for
the prepend. In the Best practices chapter, we’ll discuss reasonable
limits on AS path lengths and number of AS path
prepends.

Setting and adjusting the MED


Example 17 uses the MED for its intended purpose: when there are
multiple connections between two ASes, the MED can be used to prefer
one of these paths over the other or others. In example 17, we first
add a second BGP session towards ISP 30, and configure the ISP 30
router to set an MED of 10 on the first session and an MED of 20 on the
second session.

Example 17-1. A second BGP session towards AS 65030 on router 82


!
router bgp 65082
neighbor 192.0.2.21 remote-as 65030
neighbor 192.0.2.21 description ISP 30, first connection
neighbor 192.0.2.21 prefix-list in-prefixes in
neighbor 192.0.2.21 prefix-list out-prefixes out
neighbor 192.0.2.21 filter-list 2 out
neighbor 192.0.2.25 remote-as 65030
neighbor 192.0.2.25 description ISP 30, second connection
neighbor 192.0.2.25 prefix-list in-prefixes in
neighbor 192.0.2.25 prefix-list out-prefixes out
neighbor 192.0.2.25 filter-list 2 out
!

Example 17-2. Setting different MEDs on two parallel BGP sessions on ISP
router 30
!
router bgp 65030
network 10.0.30.0/23
neighbor 192.0.2.22 route-map customer-in in
neighbor 192.0.2.22 route-map med10 out
neighbor 192.0.2.26 remote-as 65082
neighbor 192.0.2.26 route-map customer-in in
neighbor 192.0.2.26 route-map med20 out
!
route-map med10 permit 10
set metric 10
!
route-map med20 permit 10
set metric 20
!

We can see the MED values of 10 and 20 show up for prefixes learned
over the two BGP sessions with AS 65030:

Router# show ip bgp
Network Next Hop Metric LocPrf Weight Path
*> 10.0.40.0/21 192.0.2.41 0 0 65040 i
* 10.0.83.0/24 192.0.2.41 0 65040 65083 i
* 192.0.2.25 20 0 65030 65083 i
*> 192.0.2.21 10 0 65030 65083 i
* 10.0.84.0/24 192.0.2.41 0 65040 65084 i
* 192.0.2.25 20 0 65030 65084 i
*> 192.0.2.21 10 0 65030 65084 i
* 10.0.85.0/24 192.0.2.41 0 65040 4206508500 i
* 192.0.2.25 20 0 65030 4206508500 i
*> 192.0.2.21 10 0 65030 4206508500 i

Interestingly, of the two prefixes learned from AS 65040, one
(10.0.40.0/21) has an MED of 0, while the other one (10.0.83.0/24)
has no MED. The reason for this is probably that the router adds an
MED 0 to all the prefixes it originates itself, but, because the MED is
only propagated over one AS hop, the router removes any MED that’s
present from prefixes learned over eBGP as it re-advertises those over
eBGP.

In the output above, for the prefix 10.0.83.0/24 the path with MED
10 and AS path 65030 65083 is selected as best. This path obviously
wins over the one with MED 20 and the same AS path (learned
through the other BGP session with AS 65030), but it’s not immediately
clear why the paths through AS 65030 are preferred over the path
through AS 65040. This must come down to the last few tie breaker
steps in the BGP path selection algorithm. (As one of those is how recent
the last update was, running the example yourself may not produce
the same result.)

This result does clearly illustrate that the MED is only compared between
paths learned from the same neighboring AS. However, it may
be useful to be able to influence path selection when multiple paths
with the same AS path length are learned from different neighboring
ASes. In practice, AS path prepending tends to shift too much traffic
from one connection to another. Paths towards a given prefix through
different ISPs often have the same AS path length, so a prepend towards
one ISP may push traffic to and/or from as much as half of the
internet to another connection. Example 18 sets bgp always-compare-med.

Example 18. Always comparing the MED


!
router bgp 65082
bgp always-compare-med
!

The result is that now the path through AS 65040 that has no MED
wins over the paths through AS 65030 with MEDs of 10 and 20:
Router# show ip bgp
BGP table version is 7, local router ID is 192.0.2.251, vrf id 0
Default local pref 100, local AS 65082
Status codes: s suppressed, d damped, h history, * valid, > best,
= multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop Metric LocPrf Weight Path


* 10.0.83.0/24 192.0.2.25 20 0 65030 65083 i
* 192.0.2.21 10 0 65030 65083 i
*> 192.0.2.41 0 65040 65083 i

However, even though [RFC 4271] specifies that a path with no MED
should be considered to have the lowest possible MED, some implementations
may deviate from this, either by default or because they’re
configured to do so. Example 19 shows a bgp bestpath med missing-as-worst
configuration:

Example 19. Treating a missing MED as the worst MED


!
router bgp 65082
bgp always-compare-med
bgp bestpath med missing-as-worst
!

And now the path through AS 65040 is no longer preferred, as the
missing MED is considered worse than the 10 or 20 MEDs:

Router# show ip bgp
BGP table version is 9, local router ID is 192.0.2.251, vrf id 0
Default local pref 100, local AS 65082
Status codes: s suppressed, d damped, h history, * valid, > best,
= multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop Metric LocPrf Weight Path


* 10.0.83.0/24 192.0.2.25 20 0 65030 65083 i
*> 192.0.2.21 10 0 65030 65083 i
* 192.0.2.41 0 65040 65083 i
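
The difference between examples 18 and 19 comes down to the value that stands in for a missing MED. The Python sketch below assumes bgp always-compare-med is in effect, so all paths are compared with each other. The function name lowest_med and the missing_as_worst parameter are invented for this illustration. The MED values and next hops come from the output above.

```python
MAX_MED = 2**32 - 1  # the MED is a 32-bit unsigned value


def lowest_med(paths, missing_as_worst=False):
    """Return the next hops of the paths that win the MED comparison."""
    missing = MAX_MED if missing_as_worst else 0

    def med(path):
        return missing if path["med"] is None else path["med"]

    lowest = min(med(p) for p in paths)
    return [p["next_hop"] for p in paths if med(p) == lowest]


# The three paths towards 10.0.83.0/24.
paths = [
    {"next_hop": "192.0.2.25", "med": 20},
    {"next_hop": "192.0.2.21", "med": 10},
    {"next_hop": "192.0.2.41", "med": None},
]
print(lowest_med(paths))                         # as in example 18
print(lowest_med(paths, missing_as_worst=True))  # as in example 19
```

In the first case the path without an MED wins, in the second case the path with MED 10 does, matching the two BGP tables.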

In addition to setting the MED to a certain value, it’s also possible to
adjust an existing MED by adding to it or subtracting from it. For instance,
when peering with some networks directly as well as through a
route server, we may want to prefer the routes learned directly over
the ones learned through the route server, but we may not want to
completely overrule the preferences our peers express through the
MEDs they announce to us. Example 20 adjusts the MED for prefixes
learned from the route server, building on the configuration from examples
13 and 14.

Example 20. Adjusting the MED for route server paths


!
router bgp 65082
neighbor 203.0.113.90 remote-as 65090
neighbor 203.0.113.90 peer-group ix-ipv4-peers
neighbor 203.0.113.90 description IX route server
neighbor 203.0.113.90 route-map add-med-rserv in
neighbor 2001:db8:90::6:5090:1 remote-as 65090
neighbor 2001:db8:90::6:5090:1 description IX route server
no neighbor 2001:db8:90::6:5090:1 activate
!
address-family ipv6
neighbor 2001:db8:90::6:5090:1 peer-group ix-ipv6-peers
neighbor 2001:db8:90::6:5090:1 route-map add-med-rserv in
exit-address-family
!
route-map add-med-rserv permit 10
set metric +2
!

If we look at the paths for prefixes 2001:db8:83::/48 and
2001:db8:84::/48, we can now tell which ones are learned from the
peer directly and which ones are learned through the route server: for
the first prefix, we see a path with a MED of 2, and another one with a
MED of 0. So the first one was learned from the route server, the other
one directly.
Router# show bgp ipv6 unicast
BGP table version is 5, local router ID is 192.0.2.251, vrf id 0
Default local pref 100, local AS 65082
Status codes: s suppressed, d damped, h history, * valid, > best,
= multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop Metric Weight Path


* 2001:db8:83::/48 fe80::42:acff:fe11:5
2 0 65083 i
*> fe80::42:acff:fe11:2
0 0 65083 i
*> 2001:db8:84::/48 fe80::42:acff:fe11:5
102 0 65084 i
* fe80::42:acff:fe11:3
200 0 65084 i

For the second prefix, the MEDs are 102 and 200, respectively. What’s
happening here is that AS 65084 has the opposite preference from ours:
they are attaching a higher MED to their prefix when announced directly
to peers, and a lower MED when announcing their prefix to the
route server, which the route server subsequently propagates unmodified.

Our configuration still adds 2 to the MED attached to the prefix
learned from the route server, adding up to a MED of 102, which is still
a lot less than the 200 for the direct path. So in this case, our peer’s
preference wins.

We can also configure the router to subtract a value from the MED attached
to incoming prefixes with set metric -2, for instance. However,
many prefixes won’t have an MED, or have an MED of 0. Lowering
such a MED will result in a MED that’s still 0. So lowering the MED
will often not have the desired effect.
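
The arithmetic behind set metric +2 and set metric -2 can be sketched as follows. This assumes that a missing MED is treated as 0 before the adjustment and that the result is clamped to the valid range. The function name adjust_med is invented for this illustration, and actual router behavior may differ in the details.

```python
MAX_MED = 2**32 - 1  # the MED is a 32-bit unsigned value


def adjust_med(med, delta):
    """Add delta to the MED, never going below 0 or above the maximum."""
    current = 0 if med is None else med  # assumption: missing MED counts as 0
    return min(max(current + delta, 0), MAX_MED)


print(adjust_med(100, 2))    # the route server path for 2001:db8:84::/48
print(adjust_med(0, -2))     # lowering an MED of 0 has no effect
print(adjust_med(None, -2))  # the same goes for a missing MED
```

The first result is the 102 seen in the BGP table. The other two stay at 0, which is why subtracting from the MED often doesn't help.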

Influencing neighboring networks with communities
It can be hard to reach the desired traffic engineering results for incoming
traffic by prepending the AS path on outbound updates. At an average
AS path length of 4 hops, adding an extra hop by prepending the
AS path can move as much as 50% of traffic from the newly prepended
path to another, unprepended path.

For a network with many upstream ISPs and/or many peers, this is
less of a problem: they can prepend towards some ISPs or peers and
not others. Often, ISPs let their customers take advantage of this ability.
The usual way to do this is by attaching certain communities to a
prefix. Each network has their own system for this. For example, Telia
Carrier (AS 1299) allows setting communities for regions such as Europe
(1299:200x), North America (1299:500x) and Asia (1299:700x).
There are also communities for individual peers, such as 1299:566x
for Comcast and 1299:264x for Deutsche Telekom. The x denotes 0 - 3
for a number of prepends or 9 to not announce the prefix in question to
that network at all.

So if a Telia customer attaches the communities 1299:5001 and
1299:2649 to their prefix, Telia will prepend that prefix once to all
their peers in North America and not announce the prefix to Deutsche
Telekom.
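
To make the numbering scheme concrete, here is a Python sketch that decodes such a community. It only knows the five examples mentioned above, so it is in no way a complete or current description of the communities that AS 1299 supports. The names TARGETS and decode are invented for this illustration.

```python
# Only the targets named in the text. The real list is longer and may change.
TARGETS = {
    "200": "Europe",
    "500": "North America",
    "700": "Asia",
    "566": "Comcast",
    "264": "Deutsche Telekom",
}


def decode(community):
    """Translate a 1299:nnnx community into (target, action), or None."""
    asn, _, value = community.partition(":")
    if asn != "1299" or len(value) != 4 or not value.isdigit():
        return None
    target = TARGETS.get(value[:3])
    action = int(value[3])
    if target is None:
        return None
    if action == 9:
        return (target, "do not announce")
    if action <= 3:
        return (target, f"prepend {action}x")
    return None


print(decode("1299:5001"))
print(decode("1299:2649"))
```

The two communities from the example decode to one prepend towards North America and no announcement towards Deutsche Telekom.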

In example 21 we’re going to use communities to have ISP 30 prepend
towards networks 84 and 85, but not towards network 83. But before
that, let’s have a look at how networks 83, 84 and 85 see the paths towards
the 192.0.2.0/24 prefix that our test network (82) announces
based on the standard configuration from example 11 with neighbor
ix-ipv4-peers shutdown in effect. Network 83 sees:

Router83# show ip bgp
Network Next Hop LocPrf Weight Path
*> 192.0.2.0/24 198.51.100.5 0 65030 65082 i
* 198.51.100.13 0 65040 65082 i

Network 84:

Router84# show ip bgp


Network Next Hop LocPrf Weight Path
*> 192.0.2.0/24 198.51.100.37 0 65030 65082 i
* 198.51.100.45 0 65040 65082 i

Network 85:
Router85# show ip bgp
Network Next Hop LocPrf Weight Path
*> 192.0.2.0/24 198.51.100.101 0 65030 65082 i
* 198.51.100.109 0 65040 65082 i

So they all prefer the path through ISP 30 (AS 65030), possibly overburdening
the connection from the test network to ISP 30, while the
connection to ISP 40 (AS 65040) is left underutilized.

Prepending towards AS 65030 would make them all prefer the path
through AS 65040, so then that connection would be overburdened
and the connection to AS 65030 would be underutilized. By using
community mechanisms provided by ISPs, we can selectively prepend
and get a better balance.

For the purposes of our example, AS 65030 uses the following communities,
which are often stored in the AUT-NUM object of an internet
routing registry (IRR) in a format like this:

aut-num: AS65030
as-name: ISP30
descr: Internet Service Provider Three Zero
admin-c: ABCD
tech-c: EFGH-RIPE
remarks:
remarks: -----------------------------------------------------
remarks: Don't announce community
remarks: -----------------------------------------------------
remarks: 0:XXX - don't announce to AS XXX
remarks: -----------------------------------------------------
remarks: Customer traffic engineering communities - prepending
remarks: -----------------------------------------------------
remarks: 65001:0 - prepend once to all peers
remarks: 65001:XXX - prepend once to AS XXX
remarks: 65002:0 - prepend twice to all peers
remarks: 65002:XXX - prepend twice to AS XXX
remarks: 65003:0 - prepend 3 x to all peers
remarks: 65003:XXX - prepend 3 x to AS XXX
remarks: -----------------------------------------------------
remarks: Large communities for 32-bit (or 16-bit) AS numbers
remarks: -----------------------------------------------------
remarks: 65030:1:0 - prepend once to all peers
remarks: 65030:1:XXX - prepend once to AS XXX
remarks: 65030:2:0 - prepend twice to all peers
remarks: 65030:2:XXX - prepend twice to AS XXX
remarks: 65030:3:0 - prepend 3 x to all peers
remarks: 65030:3:XXX - prepend 3 x to AS XXX
remarks: -----------------------------------------------------

In example 21, we use communities to ask ISP 30 to prepend twice towards
AS 65084 and once towards AS 4206508500 and we ask the internet
exchange route server AS 65090 to not announce our prefix to
AS 65083 and AS 65084.

Example 21. Using communities to trigger prepending by an ISP


!
router bgp 65082
neighbor 192.0.2.21 remote-as 65030
neighbor 192.0.2.21 description ISP 30
neighbor 192.0.2.21 prefix-list in-prefixes in
neighbor 192.0.2.21 prefix-list out-prefixes out
neighbor 192.0.2.21 route-map isp30-out out
neighbor 192.0.2.21 filter-list 2 out
neighbor 203.0.113.90 remote-as 65090
neighbor 203.0.113.90 description IX route server
neighbor 203.0.113.90 prefix-list in-prefixes in
neighbor 203.0.113.90 prefix-list out-prefixes out
neighbor 203.0.113.90 route-map rserv90-out out
neighbor 203.0.113.90 filter-list 2 out
neighbor 2001:db8:30:8201::1 remote-as 65030
neighbor 2001:db8:30:8201::1 description ISP 30
no neighbor 2001:db8:30:8201::1 activate
!
address-family ipv6
neighbor 2001:db8:30:8201::1 activate
neighbor 2001:db8:30:8201::1 prefix-list in-prefixes in
neighbor 2001:db8:30:8201::1 prefix-list out-prefixes out
neighbor 2001:db8:30:8201::1 route-map isp30-out out
neighbor 2001:db8:30:8201::1 filter-list 2 out
exit-address-family
!
route-map isp30-out permit 10
set community 65002:65084
set large-community 65030:1:4206508500
!
route-map rserv90-out permit 10
set community 0:65083 0:65084
!

The 0:XXX community to block announcements of a prefix to AS XXX
is widely, but not universally supported by internet exchange route
servers.

We’re using the regular [RFC 1997] community 65002:65084 to ask
ISP 30 to prepend twice towards AS 65084. However, AS 4206508500 is
a 32-bit AS number, and we can’t put those inside either of the two 16-bit
halves of a regular community. Large communities [RFC 8092]
solve this issue. They consist of three 32-bit numbers. The first number
in a large community is the “global administrator”, in other words,
the AS number of the network defining the community. The two other
numbers are the local data part 1 and local data part 2. In this case, local
data part 1 is the number of prepends desired, while local data part
2 is the AS number to prepend towards.
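
The size limits are easy to see when packing the values into bytes. The Python sketch below uses the standard struct module to pack a regular community as two 16-bit numbers and a large community as three 32-bit numbers. The function names are invented for this illustration, and only the community values themselves are encoded, not the complete BGP path attribute.

```python
import struct


def encode_regular(high, low):
    """Pack a regular community as two unsigned 16-bit numbers."""
    return struct.pack("!HH", high, low)


def encode_large(global_admin, data1, data2):
    """Pack a large community as three unsigned 32-bit numbers."""
    return struct.pack("!III", global_admin, data1, data2)


print(len(encode_regular(65002, 65084)))           # 4 bytes
print(len(encode_large(65030, 1, 4206508500)))     # 12 bytes

# A 32-bit AS number doesn't fit in a 16-bit half of a regular community.
try:
    encode_regular(65001, 4206508500)
except struct.error as error:
    print("cannot encode:", error)
```

The last call fails because 4206508500 is larger than 65535, the highest value that fits in 16 bits.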

It would actually be cleaner to use large communities even for 16-bit
AS numbers, so 65030:1:65084 rather than 65002:65084, but although
large communities are starting to see a good level of adoption by
different vendors, they’re not as widely supported as regular communities.
So it’s likely that regular communities and large communities
will be used side by side for some time to come. Also, most transit
networks that participate in peering still have 16-bit AS numbers, so in
many cases there is no immediate pressure to make community actions
work with 32-bit AS numbers.

A slight advantage of using regular communities when possible is that
this way, no large community attribute needs to be added to the path,
and the large communities themselves take 12 bytes rather than 4 for a
regular community.

There are also extended communities [RFC 4360], but these are complex
and although many implementations support them to some extent,
they are not used in ways similar to regular communities or large
communities.

Let’s have a look at the BGP tables in networks 83, 84 and 85 to see
what the effect of example 21 has been. First, network 83. No
change:
Router83# show ip bgp
Network Next Hop LocPrf Weight Path
*> 10.0.84.0/24 203.0.113.84 110 0 65084 i
* 198.51.100.13 0 65040 65084 i
* 198.51.100.5 0 65030 65084 i
*> 192.0.2.0/24 198.51.100.13 0 65040 65082 i
* 198.51.100.5 0 65030 65082 i

Network 83 does receive prefix 10.0.84.0/24 from the route server,


but not 192.0.2.0/24 because of the 0:65083 community we attached
to this prefix as it was advertised to the route server.

Network 84. The path through AS 65030 now has two prepends: AS
65082 appears three times. AS 65030 uses set as-path prepend
last-as 2 to prepend the last AS in the path rather than its own AS.
The unprepended path through AS 65040 is now preferred:
Router84# show ip bgp
Network Next Hop LocPrf Weight Path
*> 192.0.2.0/24 198.51.100.45 0 65040 65082 i
* 198.51.100.37 0 65030 65082 65082 65082 i

Network 85. The path through AS 65030 now has one prepend (so AS
65082 appears twice) and the path through AS 65040 would now be
preferred, except that the direct path through peering has an even
shorter AS path and also a higher local preference:

Network Next Hop LocPrf Weight Path
*> 192.0.2.0/24 203.0.113.82 110 0 65082 i
* 198.51.100.109 0 65040 65082 i
* 198.51.100.101 0 65030 65082 65082 i

So our efforts have been successful.

Make sure that if you use communities like 0:XXX to block
advertising your prefixes to a certain AS, the network in
question still has another path to reach the prefix. This will
usually be the case if the prefix is also advertised through
another transit ISP and the community is used to suppress
advertisement to a peer. You should never use a community
to ask an ISP to suppress advertisement to a customer of
theirs, because it’s possible that the customer has no other
path to reach the prefix.

Announcing more specific prefixes


With communities, we can overcome the issue that using prepending
to engineer incoming traffic is too effective. However, prepending
may also prove not effective enough. The first prepend may move a
lot of traffic, but the second is much less effective, and more than
three prepends rarely make much of a difference. The reason for this
is that the networks still sending traffic over the prepended path are
probably not even reaching the AS path length step in the BGP path
selection algorithm, because they use a higher local preference for
the path in question.

As we’ve discussed before, transit providers should always use a
higher local preference for their customer’s prefixes, and, by extension,
their customer’s customer’s prefixes. There may also be other reasons
for remote networks to prefer a certain path despite a long AS path.
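
To make this concrete, here is a minimal Python sketch of just the two
comparisons involved: local preference first, AS path length second.
The Path class and the best_path function are made up for this
illustration, and a real BGP implementation goes through many more
steps. The local preference values 200 and 110 mirror the ones that
show up in the ISP 40 output in this chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    """Hypothetical, stripped-down view of a BGP path."""
    description: str
    local_pref: int
    as_path: tuple


def best_path(paths):
    # Highest local preference wins; AS path length is only looked at
    # when the local preferences are equal.
    return min(paths, key=lambda p: (-p.local_pref, len(p.as_path)))


# A customer path with three prepends versus a short path via another ISP.
customer = Path("direct customer link", 200, (65082, 65082, 65082, 65082))
other_isp = Path("via another ISP", 110, (65030, 65082))

print(best_path([customer, other_isp]).description)
```

The heavily prepended customer path still wins, because the AS path
length comparison is never reached.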

In those cases, if we still want to perform traffic engineering of
incoming traffic, we have to overrule the best path selection decisions
in remote networks. We can’t do this directly, but there’s one thing we
can still do: simply not announce a prefix over a certain path. In that
case, other networks have no other choice and can only send the traffic
to us over the path we prefer.

If we have several prefixes, we can announce some to one ISP and
some to another. But if that doesn’t give us the desired results, or we
only have a single prefix, we can break up a prefix into two or more
smaller ones and advertise those differently. This is how ISP 40 reaches
all of 192.0.2.0/24 before we split it up into two /25 more specific
prefixes:
Network Next Hop Metric LocPrf Weight Path
*> 192.0.2.0 192.0.2.42 0 200 0 65082 i

Example 22 splits our example prefix 192.0.2.0/24 into two /25
more specific prefixes, announcing one to ISP 30 and one to ISP 40.

Example 22. Announcing more specific prefixes


!
router bgp 65082
network 192.0.2.0/25
network 192.0.2.128/25
neighbor 192.0.2.21 remote-as 65030
neighbor 192.0.2.21 description ISP 30
neighbor 192.0.2.21 prefix-list out-prefixes-isp30 out
neighbor 192.0.2.41 remote-as 65040
neighbor 192.0.2.41 description ISP 40
neighbor 192.0.2.41 prefix-list out-prefixes-isp40 out
!
ip prefix-list out-prefixes-isp30 seq 5 permit 192.0.2.0/25
ip prefix-list out-prefixes-isp40 seq 5 permit 192.0.2.128/25
!

With this configuration in effect, ISP 40 now sees the following paths to
the two halves of 192.0.2.0/24:
Router40# show ip bgp
Network Next Hop LocPrf Path
*> 192.0.2.0/25 198.51.100.142 110 65010 65020 65030 65082 i
*> 192.0.2.128/25 192.0.2.42 200 65082 i

But remember, in reality many networks won’t accept prefixes longer
than /24, so in practice your more specifics can’t be longer than /24.
That means you need at least a /23 to be able to do this with any
measure of success. (I’m using 192.0.2.0/24 as the example prefix here
to maintain consistency with other examples.)
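
The arithmetic of such a split is easy to check with Python’s standard
ipaddress module. This is only a convenience sketch; the helper name
split_prefix is made up for this illustration.

```python
import ipaddress


def split_prefix(prefix, extra_bits=1):
    """Split a prefix into 2**extra_bits equally sized more specifics."""
    network = ipaddress.ip_network(prefix)
    return [str(subnet) for subnet in network.subnets(prefixlen_diff=extra_bits)]


print(split_prefix("192.0.2.0/24"))  # the two /25 halves used in example 22
print(split_prefix("10.0.30.0/23"))  # a /23 splits into two /24s
```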

But even when using “safe” prefix lengths of /24 or shorter for the
more specifics, these more specifics are now only available over one
path so there is a significant risk that they’ll become unavailable at
some point. For instance, if the connection between the example
network and ISP 30 goes down, 192.0.2.0/25 will be completely
unreachable. We can solve this issue by announcing the aggregate (the
full prefix) as well as the more specifics. This is what example 23 does.

Example 23. Announcing more specifics and a covering aggregate


!
router bgp 65082
network 192.0.2.0/24
network 192.0.2.0/25
network 192.0.2.128/25
neighbor 192.0.2.21 remote-as 65030
neighbor 192.0.2.21 description ISP 30
neighbor 192.0.2.21 prefix-list out-prefixes-isp30 out
neighbor 192.0.2.41 remote-as 65040
neighbor 192.0.2.41 description ISP 40
neighbor 192.0.2.41 prefix-list out-prefixes-isp40 out
!
ip prefix-list out-prefixes-isp30 seq 5 permit 192.0.2.0/24
ip prefix-list out-prefixes-isp30 seq 10 permit 192.0.2.0/25
ip prefix-list out-prefixes-isp40 seq 5 permit 192.0.2.0/24
ip prefix-list out-prefixes-isp40 seq 10 permit 192.0.2.128/25
!

ISP 40 now sees the following:

Router40# show ip bgp


Network Next Hop LocPrf Path
* 192.0.2.0/24 198.51.100.142 110 65010 65020 65030 65082 i
*> 192.0.2.42 200 65082 i
*> 192.0.2.0/25 198.51.100.142 110 65010 65020 65030 65082 i
*> 192.0.2.128/25 192.0.2.42 200 65082 i

As with example 22, 192.0.2.0/25 will be reached through ISP 30 (AS
65030) and 192.0.2.128/25 through the direct connection. We can
now disable the BGP session between AS 65082 and AS 65030 to
simulate an outage:

!
router bgp 65082
neighbor 192.0.2.21 shutdown
!

This leaves ISP 40 with the following paths:

Router40# show ip bgp


Network Next Hop LocPrf Path
*> 192.0.2.0/24 192.0.2.42 200 65082 i
*> 192.0.2.128/25 192.0.2.42 200 65082 i

So 192.0.2.0/24 remains reachable in its entirety, but we still get to
benefit from selectively advertising more specifics.
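
This works because routers forward packets based on the longest
matching prefix. The following Python sketch mimics that lookup using
the best paths from the ISP 40 output above. The longest_match function
and the dictionary layout are assumptions made for this illustration,
not how a router stores its tables.

```python
import ipaddress


def longest_match(table, address):
    """Return (prefix, next_hop) for the most specific prefix covering address."""
    addr = ipaddress.ip_address(address)
    candidates = [
        (ipaddress.ip_network(prefix), next_hop)
        for prefix, next_hop in table.items()
        if addr in ipaddress.ip_network(prefix)
    ]
    if not candidates:
        return None
    network, next_hop = max(candidates, key=lambda item: item[0].prefixlen)
    return str(network), next_hop


# Best paths at ISP 40 while the session to ISP 30 is up (example 23).
table = {
    "192.0.2.0/24": "192.0.2.42",
    "192.0.2.0/25": "198.51.100.142",
    "192.0.2.128/25": "192.0.2.42",
}
print(longest_match(table, "192.0.2.10"))

# The /25 learned through ISP 30 is withdrawn; the aggregate takes over.
del table["192.0.2.0/25"]
print(longest_match(table, "192.0.2.10"))
```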

However, this configuration does have an issue: there is a
period of instability between the moment a more specific
prefix disappears and packets consistently flow according to
the covering aggregate. This happens because of BGP’s “path
hunting” behavior when a prefix is withdrawn.

The update containing the withdrawal first comes in over the shortest
path. Routers that receive the withdrawal will thus select the next
shortest path. However, in the meantime, the update containing the
withdrawal is traveling down that next shortest path, which is then
also withdrawn.

So routers select the third shortest path, which is also quickly
withdrawn. This goes on until finally, the longest path is withdrawn.

Because of BGP’s minimum route advertisement interval (MRAI),
which is 30 seconds by default on eBGP sessions, routers will generally
wait 30 seconds before propagating the next update after a previous
update for a prefix. So it's common to see each step in the path hunting
process take 30 seconds.

In my experience, withdrawing a more specific leads to two minutes of
instability before packets continuously flow towards an aggregate.
During that period of instability, reachability may come and go several
times, usually at 30-second intervals. However, two minutes is not a
hard and fast rule: this could be different depending on the levels of
interconnectivity between the autonomous systems involved.
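
As a back-of-the-envelope sketch of that arithmetic, assume every path
that gets explored costs one full MRAI. The number of paths explored is
an assumption here: it depends on the topology, and four is simply the
value that matches the two minutes mentioned above.

```python
MRAI_SECONDS = 30  # default eBGP minimum route advertisement interval


def path_hunting_estimate(paths_explored, mrai=MRAI_SECONDS):
    """Rough convergence time if every explored path costs one MRAI."""
    return paths_explored * mrai


for paths in (1, 2, 4, 6):
    print(paths, "paths explored:", path_hunting_estimate(paths), "seconds")
```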

Multipath BGP
Over the past two decades, Ethernet has almost completely taken over
as the layer-2 technology that underpins the internet and other
networks. And traditionally, increases in Ethernet speeds were by a
factor of ten. So when a Gigabit Ethernet link fills up, the next step
is a 10 Gigabit Ethernet link. However, such a big increase is often
relatively costly. Usually, it makes more sense to simply deploy a
second port rather than upgrade to hardware that’s ten times faster.

So for instance, if AS 65082 has a Gigabit Ethernet link to AS 65030 and
that link fills up, a second Gigabit Ethernet link may be set up between
the same router on the AS 65082 side and the same router on the AS
65030 side.

We can of course use the traffic engineering techniques discussed
earlier in this chapter over such parallel links. A better option is to deploy
equal cost multi-path (ECMP). ECMP can be used between switches,
often in the form of IEEE 802.3ad link aggregation [W]. In that case,
routing protocols such as BGP operate as usual and ECMP happens at
layer 2.

ECMP can also be used at layer 3. In that case, the routing protocol
used over the parallel links must be able to determine that multiple
paths can be used in parallel without risk of routing loops, and install
multiple routes for these multiple paths in the routing table. In
example 24, we add a second BGP session over a second connection
between our test network and ISP 30.

Example 24. Two parallel BGP sessions
!
router bgp 65082
network 192.0.2.0/24
neighbor 192.0.2.21 remote-as 65030
neighbor 192.0.2.21 description ISP 30, first connection
neighbor 192.0.2.21 prefix-list in-prefixes in
neighbor 192.0.2.21 prefix-list out-prefixes out
neighbor 192.0.2.25 remote-as 65030
neighbor 192.0.2.25 description ISP 30, second connection
neighbor 192.0.2.25 prefix-list in-prefixes in
neighbor 192.0.2.25 prefix-list out-prefixes out
!

As a result, we now get two copies of each prefix from ISP 30. For
instance:

Router# show ip bgp 10.0.30.0/23


BGP routing table entry for 10.0.30.0/23, version 4
Paths: (3 available, best #3, table default)
Not advertised to any peer
65040 65010 65020 65030
192.0.2.41 from 192.0.2.41 (198.51.100.255)
Origin IGP, valid, external
Last update: Wed Nov 2 11:56:44 2022
65030
192.0.2.25 from 192.0.2.25 (198.51.100.223)
Origin IGP, metric 0, valid, external
Community: 65030:1
Last update: Wed Nov 2 11:56:44 2022
65030
192.0.2.21 from 192.0.2.21 (198.51.100.223)
Origin IGP, metric 0, valid, external, best (Neighbor IP)
Community: 65030:1
Last update: Wed Nov 2 11:56:44 2022

For the two paths we get from AS 65030, all BGP attributes are the
same, except for the next hop address (192.0.2.21 or 192.0.2.25).
We can tell that the two BGP sessions are towards the same router at
the other end because the BGP identifier is the same: 198.51.100.223.
Even the last update time is at first glance the same, so selecting the
best path came down to the last tie breaker: prefer the path that comes
from the lowest neighbor address, as indicated by (Neighbor IP).

The corresponding routing table entry is as follows:

Router# show ip route 10.0.30.0/23
Routing entry for 10.0.30.0/23
Known via "bgp", distance 20, metric 0, best
Last update 00:14:25 ago
* 192.0.2.21, via eth0.1201, weight 1

In example 25, we enable multipath BGP.

Example 25. Multipath BGP


!
router bgp 65082
maximum-paths 2
!
address-family ipv6
maximum-paths 2
exit-address-family
!

The maximum-paths configuration command takes as its argument the
maximum number of paths that may be used concurrently. Note that
the maximum-paths setting applies per address family, and then
applies to all BGP sessions that are activated for that address family.

The show ip bgp output now looks slightly different:


Router# show ip bgp
Network Next Hop LocPrf Path
*> 10.0.10.0/23 192.0.2.41 65040 65010 i
* 192.0.2.25 65030 65020 65010 i
* 192.0.2.21 65030 65020 65010 i
*= 10.0.20.0/22 192.0.2.25 65030 65020 i
*> 192.0.2.21 65030 65020 i
* 10.0.30.0/23 192.0.2.41 65040 65010 65020 65030 i
*= 192.0.2.25 65030 i
*> 192.0.2.21 65030 i

When we zoom in on a specific prefix, we see this:

Router# show ip bgp 10.0.30.0/23


BGP routing table entry for 10.0.30.0/23, version 4
Paths: (3 available, best #3, table default)
Not advertised to any peer
65040 65010 65020 65030
192.0.2.41 from 192.0.2.41 (198.51.100.255)
Origin IGP, valid, external
65030
192.0.2.25 from 192.0.2.25 (198.51.100.223)
Origin IGP, metric 0, valid, external, multipath
Community: 65030:1
65030
192.0.2.21 from 192.0.2.21 (198.51.100.223)
Origin IGP, metric 0, valid, external, multipath, best (Neighbor IP)
Community: 65030:1

The most interesting change is to the routing table:

Router# show ip route 10.0.30.0/23


Routing entry for 10.0.30.0/23
Known via "bgp", distance 20, metric 0, best
Last update 00:05:12 ago
* 192.0.2.21, via eth0.1201, weight 1
* 192.0.2.25, via eth0.1202, weight 1

In other words: packets to destinations inside 10.0.30.0/23 will be
transmitted over two different interfaces. The maximum-paths setting
only has an effect when multiple paths are sufficiently equal; eBGP
learned paths are only eligible for multipath under the following
conditions:

• The paths have the same weight

• The paths have the same local preference

• The paths have the same AS path (the AS path must be identical, not
just the same length)

• The paths have the same origin

• The paths have the same MED

• The paths are learned over eBGP

• The paths have the same IGP metric to the BGP next hop

(Weight is an extra local BGP attribute that Cisco introduced which
may overrule the local preference.)

In other words: only the BGP path selection algorithm steps from 10
and up (f and up) are ignored when determining if paths can be used
for multipath. Some BGP implementations also support unequal cost
multipath where packets are distributed over multiple paths with
unequal costs, but this requires additional settings.
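
The conditions listed above can be expressed as a simple comparison.
The Python sketch below is an illustration only: the BgpPath class, its
field names and the default values (such as a local preference of 100)
are assumptions, and an actual BGP implementation may check more than
this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BgpPath:
    """Hypothetical subset of the attributes that matter for multipath."""
    next_hop: str
    as_path: tuple
    ebgp: bool = True
    weight: int = 0
    local_pref: int = 100
    origin: str = "IGP"
    med: int = 0
    igp_metric: int = 0


def multipath_eligible(best, other):
    """Check the conditions listed above for two eBGP-learned paths."""
    return (
        best.ebgp and other.ebgp
        and best.weight == other.weight
        and best.local_pref == other.local_pref
        and best.as_path == other.as_path  # identical, not just equally long
        and best.origin == other.origin
        and best.med == other.med
        and best.igp_metric == other.igp_metric
    )


# The three paths towards 10.0.30.0/23 from example 25.
best = BgpPath("192.0.2.21", (65030,))
second_link = BgpPath("192.0.2.25", (65030,))
via_isp40 = BgpPath("192.0.2.41", (65040, 65010, 65020, 65030))

print(multipath_eligible(best, second_link))
print(multipath_eligible(best, via_isp40))
```

Only the second path from AS 65030 qualifies; the path through ISP 40
has a different AS path and is left out.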

In example 25, the two paths are learned from the same router in AS
65030, but this is not a requirement: multipath can also be done for
paths learned from different routers in the same neighboring AS.

However, if there’s one router on the other side, there’s another option
to perform multipath with BGP: with just one BGP session instead of
two or more. This is done in example 26.

Example 26. Multipath with one BGP session


!
interface lo
ip address 192.0.2.251/32
!
router bgp 65082
network 192.0.2.0/24
neighbor 198.51.100.223 remote-as 65030
neighbor 198.51.100.223 description ISP 30 over two links
neighbor 198.51.100.223 ebgp-multihop 2
neighbor 198.51.100.223 update-source lo
!
ip route 198.51.100.223/32 192.0.2.21
ip route 198.51.100.223/32 192.0.2.25
!

The first two lines set up an address for the lo loopback interface.
Unlike hosts, which all use 127.0.0.1 and ::1 as the addresses for their
loopback interface, routers have “real” addresses on their loopback
interfaces. This way, the router has an address that is always “up” even
if physical interfaces may go down, which is useful for management,
and also for iBGP, as we’ll see in the iBGP chapter.

In this configuration, the BGP session is between the loopback
addresses of both routers. 198.51.100.223 is the loopback address of
the AS 65030 router. We wouldn’t normally have a route towards that
address, so we set up two static routes towards this address over the
two interfaces that we want to load balance across. We then configure
the BGP session towards this address with update-source lo and
ebgp-multihop 2. Pointing the update source to the lo interface
makes our router use the address of the lo interface as the source
address in the BGP TCP session towards this neighbor. However, this
makes it seem like there is an extra hop between the two routers,
which BGP normally doesn’t allow. With ebgp-multihop 2 we let the
router know that an extra hop is allowed for this BGP session.

(Note that the interface lo line and the line following it go in the
zebra.conf file and the two ip route lines in the staticd.conf file, not
the bgpd.conf file. If you’re using vtysh to talk to FRRouting, that will
happen automatically.)

With a corresponding configuration on the other side, this BGP session
comes up like any other, and the routes also look normal:
Router# show ip bgp 10.0.30.0/23
BGP routing table entry for 10.0.30.0/23, version 4
Paths: (2 available, best #2, table default)
Not advertised to any peer
65040 65010 65020 65030
192.0.2.41 from 192.0.2.41 (198.51.100.255)
Origin IGP, valid, external
65030
198.51.100.223 from 198.51.100.223 (198.51.100.223)
Origin IGP, metric 0, valid, external, best (AS Path)
Community: 65030:1

The only clue that something is going on under the surface is that we
get the same address as the next hop address, the neighbor address
and the neighbor’s router ID. However, the routing table does show
we’re load balancing traffic over multiple interfaces:
Router# show ip route 10.0.30.0/23
Routing entry for 10.0.30.0/23
Known via "bgp", distance 20, metric 0, best
Last update 00:03:53 ago
198.51.100.223 (recursive), weight 1
* 192.0.2.21, via eth0.1201, weight 1
* 192.0.2.25, via eth0.1202, weight 1

This configuration is a little more complex and unlike the
maximum-paths configuration, it can only be used if the multiple links
connect to the same router on the other end. The advantage is that
there are fewer BGP sessions and thus fewer paths in the BGP RIB,
preserving memory and CPU cycles on the router.

ECMP load balancing strategies


When using ECMP, be it through a layer 2 link aggregation mechanism
such as IEEE 802.3ad, using multiple BGP sessions and maximum-paths,
or by using a single BGP session routed over multiple interfaces,
packets are distributed over multiple links in a very similar way.
One obvious way to do this is to send packet 1 over link A, packet 2
over link B, packet 3 over link A, packet 4 over link B, and so on. This
is per-packet load balancing. The advantage of per-packet load
balancing is that a single TCP session can use the full aggregate
bandwidth of all links involved.

However, with this kind of load balancing some level of packet
reordering happens. This can make TCP think packets were lost, even
though they’re just delayed slightly because one link had a bit more
data to transmit than the other. If nothing else, the last packet in a TCP
session is usually smaller than earlier ones, which makes the packets
from other TCP sessions that follow it arrive a bit earlier. TCP reacts
to this suspected packet loss by slowing down. In other words:
per-packet load balancing kills TCP performance.

This means we need to do per-flow load balancing, where TCP
sessions (and other “flows”) are distributed over the available links in
their entirety. So all packets that belong to the same TCP session go
over the same link. It would be a lot of work to keep track of
individual TCP sessions, so recognizing flows is done by taking several
header fields in each packet and calculating a hash over those fields.
The hash then determines which link is used for a certain packet.

A simple way to do this would be to add up the values of the relevant
header fields and if the result is even, packets go over link A, and if the
result is odd, packets go over link B.

Best practice is to take the source and destination IP addresses, the
protocol number (TCP, UDP, ICMP et cetera) and the source and
destination port numbers and hash those. The hashing algorithm is
often a CRC function rather than a cryptographic hash. Then use the
hash to assign the packet to one of a fixed number of “buckets”. For
instance, there may be 16 buckets. And each bucket is assigned to a
certain network link or next hop address. So for instance, buckets
1 - 8 may be assigned to link A and 9 - 16 to link B. When link C is
added to the group, the buckets may be reassigned, with 1 - 5 to A,
6 - 10 to C and 11 - 16 to B. (So for the TCP sessions in buckets 1 - 5
and 11 - 16 there is no disruption.)
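
The bucket mechanism can be sketched in a few lines of Python. This is
an illustration under assumptions, not how any particular router does
it: the way the header fields are packed, the use of zlib’s CRC-32 and
the function names flow_bucket and bucket_table are all choices made
for this example. The bucket ranges are the ones from the text above.

```python
import ipaddress
import struct
import zlib

NUM_BUCKETS = 16


def flow_bucket(src_ip, dst_ip, protocol, src_port, dst_port):
    """Map a 5-tuple to a bucket number from 1 to NUM_BUCKETS using CRC-32."""
    key = (
        ipaddress.ip_address(src_ip).packed
        + ipaddress.ip_address(dst_ip).packed
        + struct.pack("!BHH", protocol, src_port, dst_port)
    )
    return zlib.crc32(key) % NUM_BUCKETS + 1


def bucket_table(ranges):
    """Expand {(first, last): link} ranges into a bucket -> link table."""
    return {
        bucket: link
        for (first, last), link in ranges.items()
        for bucket in range(first, last + 1)
    }


two_links = bucket_table({(1, 8): "A", (9, 16): "B"})
three_links = bucket_table({(1, 5): "A", (6, 10): "C", (11, 16): "B"})

# Every packet of this (made-up) TCP flow lands in the same bucket.
bucket = flow_bucket("192.0.2.10", "10.0.30.1", 6, 49152, 443)
print("bucket", bucket, "uses link", two_links[bucket])

# Buckets that keep their link when link C is added.
undisturbed = [b for b in sorted(two_links) if two_links[b] == three_links[b]]
print(undisturbed)
```

Because the bucket only depends on the header fields, all packets of a
flow follow the same link, and only the flows in buckets 6 - 10 move
when link C is added.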

Early ECMP load balancing strategies, especially in the case of layer 2
switches with no or limited layer 3 functionality, may not use the
5-tuple (source and destination IP addresses and port numbers plus
protocol number) but only a 3-tuple (source and destination IP
addresses and protocol number) or just the source and destination
MAC addresses. In these cases, traffic may not balance very well or
even at all.

Obviously, a single TCP session will never balance over multiple links
with per-flow load balancing. Small numbers of TCP sessions will also
often saturate one link but not the other or others. A rule of thumb is
that with 1000 TCP sessions or more, per-flow load balancing will use
all links roughly equally.

iBGP

So far, we’ve only looked at external BGP, or eBGP. Any BGP session
towards a router in another autonomous system is an eBGP session.
However, if an AS contains multiple routers with eBGP sessions,
then it’s very helpful if those routers that together handle BGP for the
AS in question coordinate their efforts. This is where internal BGP
(iBGP) comes in.

The idea is that every BGP router within an AS maintains an iBGP
session with every other BGP router in the AS. In service provider
networks, it’s common that all routers run iBGP, even routers that don’t
connect to external ASes. This way, all routers have a full view of all
BGP information so they can make the best routing decisions.

In the early days of BGP the assumption was that only border routers
would run BGP and would then “redistribute” the routing information
learned through eBGP into an internal routing protocol such as OSPF.
But with something like a million prefixes in BGP that practice would
be quite hard on an internal gateway protocol (IGP) such as OSPF
these days. It's easier to just run BGP on every router.

Unlike with eBGP sessions, on iBGP sessions routers don’t add their
own AS number to the AS path and they don’t update the next hop
address. And unlike with eBGP, there is no requirement that there is a
direct connection between two iBGP routers: it’s completely fine to
have additional hops in between. Last but not least, there are normally
no filters or route maps applied to iBGP sessions.
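
The difference in how a path is passed on can be summarized in a small
Python sketch. It only models the default behavior described above: the
function names, the dictionary layout and the local address 192.0.2.46
are made up for this illustration, and real implementations have more
rules for setting the next hop.

```python
def session_type(local_as, remote_as):
    """A session is iBGP when both ends are in the same AS."""
    return "iBGP" if local_as == remote_as else "eBGP"


def advertise(path, local_as, local_address, remote_as):
    """Return the AS path and next hop as they are sent to a neighbor."""
    if session_type(local_as, remote_as) == "eBGP":
        # eBGP: add our own AS number and point the next hop at ourselves.
        return {"as_path": (local_as,) + path["as_path"], "next_hop": local_address}
    # iBGP: AS path and next hop are passed along unchanged.
    return dict(path)


learned = {"as_path": (65083,), "next_hop": "203.0.113.83"}
print(advertise(learned, 65082, "192.0.2.121", 65082))  # to an iBGP neighbor
print(advertise(learned, 65082, "192.0.2.46", 65040))   # to an eBGP neighbor
```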

Example 27 is based on a very basic setup where there are two BGP
routers that each connect to a different ISP and the two routers connect
to each other using iBGP. Because these two routers are always directly
connected to each other, there is no need to run an internal routing
protocol. Presumably, the two routers run VRRP [W] so hosts on the
internal network have a virtual IP address they can use as their default
gateway, so they still have a working default gateway if one of the
routers goes down. However, a VRRP configuration is not part of the
example.

Example 27 shows an iBGP configuration on our test router, which
we’ll call R1 in this context, and a corresponding configuration for a
router in the same AS, which we’ll call R2.

Example 27-1. R1 iBGP configuration


!
hostname R1
!
router bgp 65082
neighbor 192.0.2.122 remote-as 65082
neighbor 192.0.2.122 description iBGP to R2
!
address-family ipv6
neighbor 192.0.2.122 activate
exit-address-family
!

Example 27-2. R2 configuration


!
hostname R2
!
router bgp 65082
network 192.0.2.0/24
neighbor 192.0.2.45 remote-as 65040
neighbor 192.0.2.45 description ISP 40
neighbor 192.0.2.45 prefix-list in-prefixes in
neighbor 192.0.2.45 prefix-list out-prefixes out
neighbor 192.0.2.45 filter-list 2 out
neighbor 192.0.2.121 remote-as 65082
neighbor 192.0.2.121 description iBGP to R1
neighbor 2001:db8:40:8202::1 remote-as 65040
neighbor 2001:db8:40:8202::1 description ISP 40
no neighbor 2001:db8:40:8202::1 activate
!
address-family ipv6
network 2001:db8:82::/48
neighbor 192.0.2.121 activate
neighbor 2001:db8:40:8202::1 activate
neighbor 2001:db8:40:8202::1 prefix-list in-prefixes in
neighbor 2001:db8:40:8202::1 prefix-list out-prefixes out
neighbor 2001:db8:40:8202::1 filter-list 2 out
exit-address-family
!

The configuration for the iBGP session between the two routers is
exceedingly simple, because there are no filters, route maps or other
settings on the iBGP session. There is no need to explicitly configure
a BGP session as an iBGP session; a session is an iBGP session when
the remote AS is the same as the router’s own AS: 65082 in the example.

In addition to the iBGP session towards R1, R2 also has an eBGP
session with ISP 40. With both R1 and R2 having at least one eBGP session
towards an ISP, each router can maintain connectivity to the internet
on its own when the other router is down. This means that both R1
and R2 use network 192.0.2.0/24 and network 2001:db8:82::/48
to originate our IPv4 and IPv6 prefixes.

If we now have a look at the BGP table in router R2, we see this:
R2# show ip bgp
BGP table version is 10, local router ID is 192.0.2.254, vrf id 0
Default local pref 100, local AS 65082
Status codes: s suppressed, d damped, h history, * valid, > best,
= multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop LocPrf Weight Path
*> 10.0.10.0/23 192.0.2.45 0 65040 65010 i
*>i10.0.20.0/22 192.0.2.21 100 0 65030 65020 i
* 10.0.30.0/23 192.0.2.45 0 65040 65010 65020 65030 i
*>i 192.0.2.21 100 0 65030 i
*> 10.0.40.0/21 192.0.2.45 0 65040 i
*> 10.0.83.0/24 192.0.2.45 0 65040 65083 i
i 203.0.113.83 100 0 65083 i
*> 10.0.84.0/24 192.0.2.45 0 65040 65084 i
* i 192.0.2.21 100 0 65030 65084 i
*>i192.0.2.0/24 192.0.2.121 100 0 i
0.0.0.0 32768 i

The routes learned over iBGP look different in two ways. First, the
prefix is preceded by an i, indicating that the path was learned over iBGP.
Second, one of the iBGP paths lacks the * indicating that the path is
valid. We can see this in more detail by looking at a particular prefix:

R2# show ip bgp 10.0.83.0/24


BGP routing table entry for 10.0.83.0/24, version 9
Paths: (2 available, best #1, table default)
Advertised to non peer-group peers:
192.0.2.121
65040 65083
192.0.2.45 from 192.0.2.45 (198.51.100.255)
Origin IGP, valid, external, best (First path received)
Community: 65040:2
Last update: Wed Nov 2 14:22:10 2022
65083
203.0.113.83 (inaccessible) from 192.0.2.121 (192.0.2.251)
Origin IGP, metric 0, localpref 100, invalid, internal

The reason this iBGP path is invalid is that its next hop address,
203.0.113.83, is on a subnet that is directly connected to R1, not to
R2. R2 has no route to that address:
R2# show ip route 203.0.113.0/24
% Network not in table

There are two ways to solve this. The easy way only works in very
simple networks, and our two-router AS is as simple as they come.
Example 28 tells R1 and R2 to update the next hop address in iBGP
updates and set that next hop address to their own address on the
interface used for the (i)BGP session in question.

Example 28-1. next-hop-self on R1
!
router bgp 65082
neighbor 192.0.2.122 remote-as 65082
neighbor 192.0.2.122 description iBGP to R2
neighbor 192.0.2.122 next-hop-self
!
address-family ipv6
neighbor 192.0.2.122 activate
neighbor 192.0.2.122 next-hop-self
exit-address-family
!

Example 28-2. next-hop-self on R2


!
router bgp 65082
neighbor 192.0.2.121 remote-as 65082
neighbor 192.0.2.121 description iBGP to R1
neighbor 192.0.2.121 next-hop-self
!
address-family ipv6
neighbor 192.0.2.121 activate
neighbor 192.0.2.121 next-hop-self
exit-address-family
!

iBGP paths in the BGP table now all have an asterisk indicating they’re
valid:
R2# show ip bgp
Network Next Hop LocPrf Weight Path
*> 10.0.10.0/23 192.0.2.45 0 65040 65010 i
*>i10.0.20.0/22 192.0.2.121 100 0 65030 65020 i
* 10.0.30.0/23 192.0.2.45 0 65040 65010 65020 65030 i
*>i 192.0.2.121 100 0 65030 i
*> 10.0.40.0/21 192.0.2.45 0 65040 i
* 10.0.83.0/24 192.0.2.45 0 65040 65083 i
*>i 192.0.2.121 100 0 65083 i
*> 10.0.84.0/24 192.0.2.45 0 65040 65084 i
* i 192.0.2.121 100 0 65030 65084 i
*>i192.0.2.0/24 192.0.2.121 100 0 i
0.0.0.0 32768 i

The same is true for IPv6 iBGP paths:

R2# show bgp ipv6 unicast
BGP table version is 3, local router ID is 192.0.2.254, vrf id 0
Default local pref 100, local AS 65082
Status codes: s suppressed, d damped, h history, * valid, > best,
= multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop LocPrf Weight Path


*> 2001:db8::/36 fe80::42:acff:fe11:5
0 65040 65010 i
*>i2001:db8:30::/44 ::ffff:c000:279 100 0 65030 i
*> 2001:db8:40::/44 fe80::42:acff:fe11:5
0 65040 i
2001:db8:82::/48 :: 32768 i

Because the next hop address isn’t modified when propagating
prefixes over iBGP, there are no complications distributing IPv6
prefixes over an IPv4 iBGP session, so we activate the iBGP neighbors
192.0.2.121 and 192.0.2.122, respectively, for the IPv6 address family.
This way, there is no need to maintain separate IPv6 iBGP sessions.
However, this does mean that if something goes wrong with internal
IPv4 routing so the iBGP sessions over IPv4 are disrupted, this will
impact external IPv6 routing.

iBGP and internal routing protocols


In example 28, replacing a next hop address that is unknown to R2
with an address that R2 does know through next-hop-self was our
quick and dirty solution. But what if we add a third router? Suppose
R1 connects to R2 and R2 connects to R3. However, there is no direct
connection between R1 and R3; they need to go through R2 to reach
each other. See figure 4.

Figure 4. An autonomous system with three routers

This means that none of R1’s addresses are reachable from R3, and the
other way around. We fix that by running an internal routing protocol
that makes sure all our internal routers know how to reach all the
address prefixes used in our internal network.

The IGP of choice is generally OSPF. OSPF [W] is an internet standard
so any router that can run BGP can also run OSPF and it performs well
in small and medium size networks without having to use multiple
areas or other complexity. In large networks, IS-IS [W], OSPF's cousin
from the CLNP side of the networking family, may be a better choice.
FRRouting calls OSPFv2 for IPv4 simply “ospf” and OSPFv3 for IPv6
“ospf6”.

So let's enable OSPF for IPv4 and IPv6 on our three test routers.
Example 29 shows just the configuration for R2, as the R1 and R3
configurations are identical except that R1 only has interface eth0.821
and R3 only eth0.822.

Example 29. OSPF on R2


!
router ospf
redistribute connected
network 192.0.2.0/24 area 0.0.0.0
!
router ospf6
redistribute connected
interface eth0.821 area 0.0.0.0
interface eth0.822 area 0.0.0.0
!

For IPv4, OSPF is enabled by specifying a prefix, and then all interfaces
with an address that falls within that prefix run OSPF. Area 0.0.0.0,
or simply area 0, is the backbone area. These days, it's rarely necessary
to use additional areas.

Zebra/Quagga/FRRouting copied the configuration language from
Cisco, and IPv4 OSPF is truly ancient. By the time IPv6 OSPF came
around, a new way of configuring routing protocols had become in
vogue: by specifying the interfaces you want to run the protocol on
explicitly.

redistribute connected means that address prefixes from interfaces
that don't run OSPF themselves are also injected into OSPF. The
result of our OSPF efforts is that address 203.0.113.83 that was
previously unreachable for R2 is now reachable:

R2# show ip route 203.0.113.83


Routing entry for 203.0.113.0/24
Known via "ospf", distance 110, metric 20, best
Last update 00:00:53 ago
* 192.0.2.121, via eth0.821, weight 1

It takes a bit longer than expected for FRRouting's OSPF
implementation to discover neighboring routers and sync up their
databases, so after up to a minute R3 knows all prefixes from all of
R1's and R2's interfaces. This includes the internet exchange peering
LAN prefix 203.0.113.0/24:

R3# show ip route ospf


Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
f - OpenFabric,
> - selected route, * - FIB route, q - queued, r - rejected, b - backup
t - trapped, o - offload failure

O>* 192.0.2.20/30 [110/30] via 192.0.2.232, eth0.822, 00:01:37
O>* 192.0.2.24/30 [110/30] via 192.0.2.232, eth0.822, 00:01:37
O>* 192.0.2.40/30 [110/30] via 192.0.2.232, eth0.822, 00:01:37
O>* 192.0.2.44/30 [110/20] via 192.0.2.232, eth0.822, 00:01:37
O>* 192.0.2.112/28 [110/20] via 192.0.2.232, eth0.822, 00:01:37
O>* 192.0.2.144/28 [110/30] via 192.0.2.232, eth0.822, 00:01:37
O 192.0.2.160/28 [110/10] is directly connected, eth0.823, 00:02:27
O 192.0.2.224/28 [110/10] is directly connected, eth0.822, 00:02:27
O>* 192.0.2.251/32 [110/20] via 192.0.2.232, eth0.822, 00:01:37
O>* 192.0.2.252/32 [110/10] via 192.0.2.232, eth0.822, 00:01:37
O 192.0.2.253/32 [110/0] is directly connected, lo, 00:02:27
O>* 203.0.113.0/24 [110/20] via 192.0.2.232, eth0.822, 00:01:37

(To make the output fit, I removed weight 1 from each line.)

The result is that BGP prefixes with next hop addresses in that
203.0.113.0/24 prefix are reachable without trouble:

R3# show ip bgp 10.0.85.0/24


BGP routing table entry for 10.0.85.0/24
Paths: (1 available, best #1, table Default-IP-Routing-Table)
Not advertised to any peer
4206508500
203.0.113.85 (metric 20) from 192.0.2.251 (192.0.2.251)
Origin IGP, metric 0, localpref 100, valid, internal, best

This does mean that reaching this prefix and others like it requires a
two-stage process: first look up the BGP next hop address, and then
turn that BGP next hop address into an actual next hop address and
output interface:

R3# show ip route

Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure
...
B> 10.0.85.0/24 [200/0] via 203.0.113.85 (recursive), 00:07:15
* via 192.0.2.232, eth0.822, 00:07:15
...
O>* 203.0.113.0/24 [110/20] via 192.0.2.232, eth0.822, 00:08:26
...
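
The two-stage lookup can be sketched in a few lines of Python. This is only an illustration, not how FRRouting implements it: the RIB dictionary is a hand-copied, simplified version of the two entries in the output above, and the function names longest_match and resolve are made up for this sketch.

```python
import ipaddress

# Simplified copy of two RIB entries: prefix -> (next hop, output interface).
# The iBGP route only knows its BGP next hop, so its interface is None.
RIB = {
    ipaddress.ip_network("10.0.85.0/24"): ("203.0.113.85", None),
    ipaddress.ip_network("203.0.113.0/24"): ("192.0.2.232", "eth0.822"),
}

def longest_match(address):
    """Return the most specific RIB entry that covers the address, or None."""
    addr = ipaddress.ip_address(address)
    candidates = [net for net in RIB if addr in net]
    if not candidates:
        return None
    return RIB[max(candidates, key=lambda net: net.prefixlen)]

def resolve(address, max_depth=8):
    """Follow next hops until an entry with an output interface is found."""
    for _ in range(max_depth):
        entry = longest_match(address)
        if entry is None:
            return None
        next_hop, interface = entry
        if interface is not None:
            return next_hop, interface
        address = next_hop  # second stage: look up the BGP next hop itself
    return None

print(resolve("10.0.85.1"))
```

For an address in 10.0.85.0/24 the first pass finds the BGP next hop 203.0.113.85, and the second pass finds the OSPF route that supplies the actual next hop and interface.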

The [200/0] and [110/20] numbers are the “administrative distance”
and the metric. The administrative distance determines which routing
protocol wins when the same prefix is known through more than one
routing protocol: the lowest distance is preferred.

For OSPF, the distance is 110 by default. For BGP it's 20 if BGP
delegates an eBGP path to the main routing table and 200 if it is an
iBGP path. So if the same prefix is known through OSPF, eBGP and iBGP,
the eBGP path is installed in the forwarding information base (FIB)
and thus used for forwarding packets.

In the example output above, the 10.0.85.0/24 prefix is learned over
iBGP and it has a MED (metric) of 0, so [200/0]. The OSPF route has a
metric of 20, so [110/20].
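
As a rough illustration of how the administrative distance settles the choice between protocols, here is a small Python sketch. The DISTANCE table holds the default values mentioned above; the select_for_fib function and the candidate tuples are hypothetical names used only here.

```python
# Default administrative distances as given in the text.
DISTANCE = {"ebgp": 20, "ospf": 110, "ibgp": 200}

def select_for_fib(candidates):
    """Pick the candidate with the lowest administrative distance.

    candidates: (protocol, metric) tuples that all describe the same prefix.
    Metrics are not compared across protocols, only the distance is.
    """
    return min(candidates, key=lambda candidate: DISTANCE[candidate[0]])

print(select_for_fib([("ospf", 20), ("ibgp", 0), ("ebgp", 0)]))
print(select_for_fib([("ospf", 20), ("ibgp", 0)]))
```

Note that the iBGP path loses to OSPF even though its metric is lower, because the metric only matters within a single routing protocol.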

Loopback addresses for iBGP


Of course our three router network has a big single point of failure [W]
because R2 needs to be in working order to let R1 and R3 communicate.
But there's actually an R4, which also connects R1 and R3. See
figure 5.

Figure 5. An autonomous system with four routers

If R4 has packets to send to R1, those normally go over the direct link
between R1 and R4. But if that link fails, the two can still communicate
through R2 and R3.

An important consequence that follows from such a configuration is
that iBGP sessions shouldn't be configured between IP addresses of any
of these hardware interfaces. When a hardware interface goes down, the
IP addresses configured on it become unreachable. Instead, we
configure an address for iBGP use on the interface that never goes
down: the loopback interface.

Example 30. iBGP over loopback interfaces on R1


!
interface lo
ip address 192.0.2.251/32
!
router bgp 65082
neighbor 192.0.2.252 remote-as 65082
neighbor 192.0.2.252 description iBGP to R2
neighbor 192.0.2.252 update-source lo
neighbor 192.0.2.253 remote-as 65082
neighbor 192.0.2.253 description iBGP to R3
neighbor 192.0.2.253 update-source lo
neighbor 192.0.2.254 remote-as 65082
neighbor 192.0.2.254 description iBGP to R4
neighbor 192.0.2.254 update-source lo
!
address-family ipv6
neighbor 192.0.2.252 activate
neighbor 192.0.2.253 activate
neighbor 192.0.2.254 activate
!

So on interface “lo” we configure a single IP address (a /32). All
routers in the network configure their iBGP sessions with these
loopback addresses as the neighbor address. And then we specify
update-source lo to point to the loopback interface as the place to
find the source address used in the packets that carry iBGP messages.

As we configured OSPF to “redistribute connected”, the addresses of
the loopback interfaces are injected into OSPF. Using R4 as our vantage
point:
R4# show ip route
O>* 192.0.2.251/32 [110/10] via 192.0.2.151, eth0.824, 00:01:42
O>* 192.0.2.252/32 [110/20] via 192.0.2.151, eth0.824, 00:00:42
* via 192.0.2.163, eth0.823, 00:00:42
O>* 192.0.2.253/32 [110/10] via 192.0.2.163, eth0.823, 00:01:17
O 192.0.2.254/32 [110/0] is directly connected, lo, 00:02:32

R4 uses the path through interface eth0.824 (its direct link to R1) to
reach R1's loopback address 192.0.2.251. There's also a path through
R3 and R2, but remember, each routing protocol only sends a copy of
its best path to the main routing table (RIB), so we don't see that path
in the show ip route output. This means that the iBGP packets
between R1 and R4 flow as shown in figure 6.

Figure 6. iBGP between R1 and R4

But what if the link between R1 and R4 goes down? Let's simulate that
by shutting down the interface:

R4# conf t
R4(config)# interface eth0.824
R4(config-if)# shutdown
R4(config-if)# ^Z

If we now look at the routing table, we see that 192.0.2.251 is
rerouted over interface eth0.823, which connects R4 to R3:

R4# show ip route


O>* 192.0.2.251/32 [110/30] via 192.0.2.163, eth0.823, 00:00:35

This is the status of the iBGP sessions:


R4# show ip bgp summary
IPv4 Unicast Summary (VRF default):
BGP router identifier 192.0.2.254, local AS number 65082 vrf-id 0
BGP table version 8
RIB entries 15, using 2880 bytes of memory
Peers 3, using 2149 KiB of memory

Neighbor     AS    MsgRcvd MsgSent Up/Down  State/PfxRcd PfxSnt
192.0.2.251  65082      17       9 00:04:18            5      1
192.0.2.252  65082      13       9 00:04:16            3      1
192.0.2.253  65082       9       9 00:04:16            1      1

In other words, the flow of the iBGP session between R1 and R4 is
rerouted as shown in figure 7 without any interruption. This is
especially important when R1 and R4 have exchanged many prefixes, as
having to remove all these prefixes and then retransmit and reinstall
them would create significant load on the router CPU and disrupt
connectivity to some degree.

Figure 7. iBGP between R1 and R4 rerouted after a link failure

However, the router still has to update the RIB/FIB because the BGP
next hop addresses are now reachable through a different interface.

Route reflectors
The iBGP full mesh requirement—i.e., having each router learn eBGP
prefixes directly from the router where those prefixes enter the AS—is
nice and simple, and avoids potential routing loops. However, in our
network with four routers, each router already has three iBGP sessions.
That number quickly rises as the number of routers in the network
increases. Not only does the extra overhead start to add up in a larger
network with dozens or even hundreds of routers, but adding a router
becomes a nightmare: every existing router has to be configured with
an iBGP session towards the new router.
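
To get a feeling for how quickly a full mesh grows, the following Python sketch counts sessions. The function name is made up; the formula simply counts every pair of routers once.

```python
def full_mesh_sessions(routers):
    """Total number of iBGP sessions when every pair of routers has one."""
    return routers * (routers - 1) // 2

for n in (4, 10, 50, 200):
    print(n, "routers:", n - 1, "sessions per router,",
          full_mesh_sessions(n), "sessions in total")
```

With four routers that is three sessions per router and six in total, which matches our example network.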

There are two mechanisms to get around the iBGP full mesh
requirement: confederations and route reflectors.

BGP confederations [RFC 5065] split an autonomous system into
multiple “member-ASes” that together form a confederation. The iBGP
full mesh requirement then only applies within each member-AS and a
modified version of eBGP is used between the member-ASes. This is
invisible to external ASes. Confederations are not in wide use, so we
won't discuss the details here.

Route reflectors [RFC 4456], the other mechanism to work around the
iBGP full mesh requirement, are on the other hand extensively used in
larger networks. What a route reflector does is propagate the prefixes it
learns over iBGP to its clients. So a client gets all the prefixes known to
the different eBGP routers in the AS over a single iBGP session with a
route reflector.

Setting up a route reflector is exceedingly simple. Example 31 turns R3
from our earlier examples into a route reflector with R4 as a route
reflector client.

Example 31. A route reflector configuration


!
router bgp 65082
neighbor 192.0.2.251 remote-as 65082
neighbor 192.0.2.251 description iBGP to R1
neighbor 192.0.2.251 update-source lo
neighbor 192.0.2.252 remote-as 65082
neighbor 192.0.2.252 description iBGP to R2
neighbor 192.0.2.252 update-source lo
neighbor 192.0.2.254 remote-as 65082
neighbor 192.0.2.254 description iBGP to R4
neighbor 192.0.2.254 update-source lo
neighbor 192.0.2.254 route-reflector-client
!
address-family ipv6
neighbor 192.0.2.251 activate
neighbor 192.0.2.252 activate
neighbor 192.0.2.254 activate
neighbor 192.0.2.254 route-reflector-client
exit-address-family
exit
!

Note that the route reflector client status is set for each address family
separately. There are no configuration changes on R4, the route
reflector client, except to remove the iBGP sessions towards R1 and R2
that are no longer needed.

R1 and R2 are not configured to be route reflector clients; they are
“non-client peers” of the route reflector. This means they still need to
maintain iBGP sessions with all other routers in the AS other than the
route reflector clients.

In example 30, before it became a route reflector client, R4 received
some paths from R1 and some from R2. For the prefix 10.0.83.0/24
the paths through R1 and R2 are almost identical, so the selection of
the best path came down to the OSPF metric:
R4# show ip bgp 10.0.83.0/24
BGP routing table entry for 10.0.83.0/24, version 10
Paths: (2 available, best #1, table default)
Not advertised to any peer
65030 65083
192.0.2.21 (metric 20) from 192.0.2.251 (192.0.2.251)
Origin IGP, localpref 100, valid, internal, best (IGP Metric)
Community: 65030:1
65040 65083
192.0.2.45 (metric 30) from 192.0.2.252 (192.0.2.252)
Origin IGP, localpref 100, valid, internal
Community: 65040:2

The path learned from R1 has an OSPF metric of 20 and is thus
preferred over the path learned from R2, which requires an extra hop
through R3 and thus has an OSPF metric of 30. However, now that R4
is a route reflector client in example 31, this is different:

R4# show ip bgp 10.0.83.0/24


BGP routing table entry for 10.0.83.0/24, version 5
Paths: (1 available, best #1, table default)
Not advertised to any peer
65040 65083
192.0.2.45 (metric 30) from 192.0.2.253 (192.0.2.252)
Origin IGP, metric 0, localpref 100, valid, internal, best
(First path received)
Community: 65040:2
Originator: 192.0.2.252, Cluster list: 192.0.2.253

The result is that the path with the higher IGP metric (30 vs 20) is used
by R4, as R4 now receives just the path that R3 considers best. And
R3 has a direct link to R2, while R1 requires an extra hop through R4.
As such, it's important to carefully consider the placement of route
reflectors within the topology of the network. Of course for redundancy
a route reflector client should always talk to at least two route
reflectors. If those are in different locations in the network, that will
reduce the incidence of less optimal routing that comes with the
deployment of route reflectors.

Another solution is to use the “add-path” capability [RFC 7911]. This
can be used to let a route reflector advertise multiple paths towards the
same prefix. Both sides of the BGP session have to support add-path
for it to be used. However, add-path may solve the less optimal
routing, but it brings back part of the iBGP scalability issues as now
again copies of all paths are distributed to the route reflector clients.

The originator and cluster list are new attributes that make sure route
reflectors don't cause routing loops. This would be a risk when route
reflectors are clients of other route reflectors. That can happen by
accident, but in very large networks, it's actually common to have a
hierarchy of route reflectors.

BGP security

BGP can be considered secure if we can be confident that we're able to
answer the following questions with “yes”:

1. Are BGP messages exchanged between the right BGP speakers
unmodified?

2. Are the BGP speakers saying the right things? Meaning:

a. Is the AS that originates a prefix authorized to do so by the
legitimate holder of the prefix?

b. Is the prefix further propagated in BGP in accordance with
the wishes of the legitimate holder of the prefix?

As it was created in the innocent days of the early internet, BGP wasn't
designed with security concerns in mind. Over the years, that changed,
for the most part because it turned out mistakes made by the operators
of one AS could severely impact large parts of the rest of the internet.
And to a lesser degree because the lack of defenses in the BGP protocol
and in BGP operation were exploited for malicious purposes.

In this chapter, we'll first look at answering question 1 by securing the
actual BGP sessions at the IP and TCP levels. Then, we'll look at
mechanisms that address question 2.

MD5 passwords
Because eBGP neighbors connect to each other over a shared layer-2
network (i.e., usually the same Ethernet), it's not really possible for
remote attackers to get between them. More extreme scenarios are
possible, albeit hard to pull off and/or hide. For instance, an attacker
physically interjects himself between two BGP routers and becomes a
“man in the middle”. Or perhaps at an internet exchange, an attacker
steals another member's IP address.

But the most likely attack vector is someone sending fake TCP reset
packets. When a computer (or router) receives a TCP packet for a TCP
session that it doesn't recognize, it sends back a TCP RST packet. When
the sender of the original TCP packet receives the RST packet, it knows
that the TCP session is dead and removes it. So for instance, if a system
reboots, this reset mechanism will make sure that TCP sessions that
were active before the reboot don't linger on the other side of that TCP
connection.

As resetting a BGP session (through resetting the TCP session it runs
on top of) is fairly disruptive, an attacker could perform a denial-of-
service (DoS) attack by sending fake TCP RST packets. This will make
the BGP routers tear down their BGP session, clear all the paths
learned over that BGP session from the routing tables and then initiate
a new BGP session and reinstall all those paths.

The attacker needs to be able to send packets with a spoofed source
address and would have to guess the IP addresses (usually discoverable
through traceroute), the port numbers on both sides (one side will use
179) and a TCP sequence number that falls inside the active window.
That can mean it's necessary to send hundreds of millions up to
several billions of packets. Which at 10 Gbps speeds of course only
takes a few minutes.
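
A back-of-the-envelope calculation shows where numbers like these come from. The Python sketch below is only an estimate under assumed values: a receive window of 65,535 bytes, 64,512 candidate ephemeral ports, and minimum-size Ethernet frames that take up 84 bytes on the wire. Real windows, port ranges and packet sizes vary, and the function names are made up.

```python
SEQUENCE_SPACE = 2 ** 32  # TCP sequence numbers are 32 bits wide

def rst_packets_needed(window, port_candidates):
    """Worst case: one RST per window-sized step through the sequence
    space, repeated for every candidate ephemeral port."""
    guesses_per_port = -(-SEQUENCE_SPACE // window)  # ceiling division
    return guesses_per_port * port_candidates

def seconds_at_line_rate(packets, link_bps=10e9, wire_bytes=84):
    """Time needed to send that many minimum-size frames (assumed to be a
    64-byte frame plus 20 bytes of preamble and inter-frame gap)."""
    return packets * wire_bytes * 8 / link_bps

packets = rst_packets_needed(window=65535, port_candidates=64512)
minutes = seconds_at_line_rate(packets) / 60
print(packets, "packets, about", round(minutes, 1), "minutes at 10 Gbps")
```

With a larger window or a known source port, the number of packets drops accordingly.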

How do we protect against these attacks? Ideally, we'd use IPsec [W],
which is designed to protect against exactly these kinds of attacks.
However, protecting BGP sessions with IPsec has never entered
common practice among network operators.

Instead, we use the TCP MD5 signature option [RFC 2385]. This
works by calculating an MD5 hash over the TCP segment that contains
a BGP message plus a password that both sides have agreed upon. The
MD5 hash is then placed in a TCP option and the TCP segment is
transmitted to the other side.

The receiver also calculates the MD5 hash in the same way, and then
checks if the resulting hash is the same as the one in the TCP option. If
so, we can be sure both ends used the same password and the segment
wasn't modified in transit. So the segment is accepted for further
processing. If not, it's discarded without further action.
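
In simplified form, the calculation looks like the Python sketch below. It follows the order of the fields that RFC 2385 describes, but it is an illustration rather than a usable implementation: the addresses, ports and sequence numbers are sample values, the helper names are made up, and a real segment would carry the digest in a TCP option, which is left out here.

```python
import hashlib
import hmac
import socket
import struct

def md5_signature(src_ip, dst_ip, tcp_header, payload, password):
    """Hash, in this order: pseudo-header, TCP header without options and
    with a zeroed checksum, segment data, and finally the shared password."""
    header = bytearray(tcp_header[:20])  # keep the fixed 20 bytes, drop options
    header[16:18] = b"\x00\x00"          # checksum field is treated as zero
    pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
              + struct.pack("!BBH", 0, 6, len(tcp_header) + len(payload)))
    return hashlib.md5(pseudo + bytes(header) + payload + password).digest()

def segment_ok(received_digest, *segment_fields):
    """Receiver side: recompute the digest and compare it with the option."""
    return hmac.compare_digest(received_digest, md5_signature(*segment_fields))

# Sample segment: ports 45219 -> 179, data offset 5, PSH+ACK flags.
tcp_header = struct.pack("!HHIIBBHHH", 45219, 179, 1000, 2000,
                         5 << 4, 0x18, 65535, 0, 0)
payload = b"sample BGP message"
digest = md5_signature("203.0.113.83", "203.0.113.82",
                       tcp_header, payload, b"verysecret")

print(segment_ok(digest, "203.0.113.83", "203.0.113.82",
                 tcp_header, payload, b"verysecret"))
print(segment_ok(digest, "203.0.113.83", "203.0.113.82",
                 tcp_header, payload, b"wrongpwd"))
```

A spoofed RST fails the same check, because the attacker can't produce a matching digest without knowing the password.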

In example 32 IX peers 83 and 84 have a TCP MD5 password
configured on their end, but right now, we've only configured the
password for peer 83 on our end.

Example 32. Setting up a TCP MD5 password


!
router bgp 65082
neighbor 203.0.113.83 remote-as 65083
neighbor 203.0.113.83 peer-group ix-ipv4-peers
neighbor 203.0.113.83 description IX peer 83
neighbor 203.0.113.83 password verysecret
neighbor 203.0.113.84 remote-as 65084
neighbor 203.0.113.84 peer-group ix-ipv4-peers
neighbor 203.0.113.84 description IX peer 84
neighbor 2001:db8:90::6:5083:1 remote-as 65083
neighbor 2001:db8:90::6:5083:1 description IX peer 83
neighbor 2001:db8:90::6:5083:1 password verysecret
no neighbor 2001:db8:90::6:5083:1 activate
neighbor 2001:db8:90::6:5084:1 remote-as 65084
neighbor 2001:db8:90::6:5084:1 description IX peer 84
no neighbor 2001:db8:90::6:5084:1 activate
!

What we see here is that although neighbors 203.0.113.83 and
203.0.113.84 belong to the same peer group, their TCP MD5 settings
can be different. And of course we need to set a password both for the
IPv4 and the IPv6 session with the same peer. Those don't have to be
the same, but making them different is of course asking for mistakes.
These settings are per session, not per address family. Let's look at the
status of our BGP sessions:
Router# show ip bgp summary
BGP router identifier 192.0.2.251, local AS number 65082
RIB entries 5, using 560 bytes of memory
Peers 6, using 53 KiB of memory
Peer groups 2, using 64 bytes of memory

Neighbor V AS MsgRcvd MsgSent TblVer Up/Down State


203.0.113.83 4 65083 6 9 0 00:03:10 1
203.0.113.84 4 65084 0 0 0 never Connect

The password-protected BGP session towards AS 65083 immediately
came up. But even after giving it some time, the session towards AS
65084 never came up. Also, the message counters are still zero, so no
BGP messages were exchanged at all. On a Cisco router, a message like
this will appear in the log:

%TCP-6-BADAUTH: No MD5 digest from 203.0.113.84(45219) to 203.0.113.82(179)

Or, if the other side has a password configured, but the passwords on
both sides of the BGP session don't match:

%TCP-6-BADAUTH: Invalid MD5 digest from 203.0.113.84(45219) to 203.0.113.82(179)

With routing software like FRRouting, the routing software won't
generate log entries as the issue is handled in the kernel TCP code.

Now let's add the password to the session to AS 65084:

Router# conf t
Router(config)# router bgp 65082
Router(config-router)# neighbor 203.0.113.84 password secretpwd
Router(config-router)# ^Z

And then the session with AS 65084 immediately establishes:


Neighbor AS MsgRcvd MsgSent TblVer Up/Down State
203.0.113.83 65083 10 11 0 00:06:14 1
203.0.113.84 65084 4 5 0 00:00:02 1

In practice, not everyone uses TCP MD5 passwords on all of their BGP
sessions. Doing this is most important for BGP sessions carrying many
prefixes, such as eBGP sessions with transit ISPs or transit customers
as well as iBGP sessions. Having to remove so many prefixes after a
reset, rerouting those destinations over other paths, and then
re-establishing the BGP session and restoring the previous state is
rather disruptive.

However, with peering setting up a password is a (slight) extra hassle
and something that can easily go wrong due to copy&paste mistakes,
and losing a dozen prefixes because a peering BGP session goes down
is usually not a big deal.

Another reason to be reluctant with using TCP MD5 passwords
everywhere is that, although the MD5 algorithm should be very fast, in
practice some router CPUs may have a hard time keeping up with a
large flood of TCP packets with MD5 hashes to check. So then this
mechanism becomes a denial-of-service attack surface.

The MD5 algorithm has long since been proven vulnerable to collision
attacks. But RFC 2385 doesn't have a mechanism that allows for
upgrading the hashing algorithm. [RFC 5925] specifies a new TCP
authentication option (TCP-AO) that addresses the limitations of the
TCP MD5 option, but so far, TCP MD5 is still more common than
TCP-AO, which is less widely supported.

The “TTL hack”: GTSM


It's not great that setting up TCP MD5 passwords on BGP sessions
creates a potential avenue for attackers to bring your router's CPU to
its knees checking bogus MD5 hashes.

IP level access lists (packet filters) that reject these packets won't work
very well because the addresses will be spoofed. Most networks make
sure their customers can't successfully send packets with spoofed
source addresses, but if an attacker finds a place where this is possible,
these packets will look legitimate to your access lists.

However, there is a clever way to reject those spoofed packets: by
checking the “time to live” (TTL) field in the IPv4 header. IPv6 local
housekeeping messages such as Neighbor Discovery (ND) use the
“hop limit” field in the IPv6 header to check if a packet is indeed
sourced locally.

(The hop limit field is the new name for the otherwise identical IPv4
time to live field. The purpose of the TTL and hop limit is to make sure
packets won't circle around forever when routing loops occur.)

The sender sets this field to 255. If the packet is then delivered to a
receiver on the same subnet, the hop limit will still be 255. However, if
the packet passed through a router, the router will have decremented
the hop limit value. So if the receiver sees a value of 255, it knows the
packet was sent locally.

Interestingly, BGP already did sort of the same thing: it sets the TTL to
1. This way, if something goes wrong and packets between two eBGP
speakers flow through a third router, that router will decrement the
TTL to 0 and discard the packet.

1 or 255 both work equally well to protect against mistakes. However,
an attacker three hops away could set the TTL in her spoofed packets
to 4 and thus the TTL would be 1 when the packet arrives at the
intended victim. But as 255 is the highest value that fits in the field, the
attacker can't set the TTL to 258.
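
The arithmetic behind the TTL hack fits in a few lines of Python. The function names and the “classic” and “gtsm” labels are made up for this sketch, which only models the TTL field and nothing else about the packets.

```python
MAX_TTL = 255  # largest value that fits in the 8-bit TTL / hop limit field

def ttl_on_arrival(initial_ttl, routers_crossed):
    """Each router decrements the TTL by one. Returns None if the packet
    is discarded because the TTL reaches zero before delivery."""
    if not 1 <= initial_ttl <= MAX_TTL:
        raise ValueError("TTL does not fit in the header field")
    remaining = initial_ttl - routers_crossed
    return remaining if remaining > 0 else None

def accepted(arrival_ttl, scheme):
    """With the classic TTL of 1 the receiver doesn't look at the TTL;
    with the TTL hack it only accepts packets that still carry 255."""
    if arrival_ttl is None:
        return False
    if scheme == "gtsm":
        return arrival_ttl == MAX_TTL
    return True

print(accepted(ttl_on_arrival(1, 0), "classic"))   # directly connected peer
print(accepted(ttl_on_arrival(4, 3), "classic"))   # attacker three hops away
print(accepted(ttl_on_arrival(255, 0), "gtsm"))    # directly connected peer
print(accepted(ttl_on_arrival(255, 3), "gtsm"))    # attacker three hops away
```

Only the last case is rejected: the attacker's best possible starting value of 255 still arrives as 252.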

The Generalized TTL Security Mechanism (GTSM) [RFC 5082] makes
BGP use this “TTL hack”. Example 33 sets up GTSM towards our
internet exchange peers 83 and 84.

Example 33. The Generalized TTL Security Mechanism (GTSM)


!
router bgp 65082
neighbor ix-ipv4-peers peer-group
neighbor ix-ipv4-peers description IPv4 IX peers, max 10 prefixes
neighbor ix-ipv4-peers ttl-security hops 1
neighbor ix-ipv4-peers maximum-prefix 10
neighbor ix-ipv4-peers prefix-list in-prefixes in
neighbor ix-ipv4-peers prefix-list out-prefixes out
neighbor ix-ipv4-peers filter-list 2 out
neighbor 203.0.113.83 remote-as 65083
neighbor 203.0.113.83 peer-group ix-ipv4-peers
neighbor 203.0.113.83 description IX peer 83
neighbor 203.0.113.83 password verysecret
neighbor 203.0.113.84 remote-as 65084
neighbor 203.0.113.84 peer-group ix-ipv4-peers
neighbor 203.0.113.84 description IX peer 84
neighbor 203.0.113.84 password secretpwd
!

Note that unlike with the TCP MD5 password, it seems we have to set
ttl-security hops 1 for the peer group, not the individual
neighbors. I expect that this limitation may not apply on all types of
routers. But when it does, this can be addressed by simply making a
duplicate peer group with GTSM disabled on the old peer group and
enabled on the new one.

Peer 83 also has GTSM enabled, but 84 hasn't. So the BGP session
towards AS 65083 comes up without trouble, but the one towards AS
65084 remains stuck in the Connect, OpenSent or OpenConfirm states:
Neighbor AS MsgRcvd MsgSent InQ OutQ Up/Down State
203.0.113.83 65083 7 8 0 0 00:03:28 1
203.0.113.84 65084 0 5 0 0 never OpenSent

GTSM is a good complement to the TCP MD5 option, as it addresses
the MD5 option's “crypto DoS” vulnerability. However, in my
experience GTSM is not widely used when peering over internet
exchanges. The difficulty is that both ends need to explicitly enable it.
It would have been nicer if the routers would automatically negotiate
the use of GTSM. But that would allow for bidding down attacks [W].

Some scary stories


With our BGP TCP sessions at least somewhat protected by MD5
passwords and the “TTL hack”, let's look at some examples of widely
propagated bad BGP advertisements that led to significant
disruptions.

An early notable example of bad things happening to BGP is the AS
7007 incident [W] from 1997. This started as a run-of-the-mill
occurrence of “customer leaks tens of thousands of prefixes to their ISP
which doesn't filter the customer properly and happily propagates
them internet-wide”. But it got worse as AS 7007's router then,
presumably because of a bug, de-aggregated those prefixes into /24s and
then their upstream ISP saw those prefixes miraculously reappear even
after disconnecting AS 7007. Don't forget that BGP-4 was only a few
years old in 1997, so bugs in BGP implementations were relatively
common then.

De-aggregation always makes these incidents much worse, as then the
longest match first rule kicks in and all traffic will flow to the
de-aggregator. Which of course immediately becomes overwhelmed by all
the traffic.

De-aggregation was also part of the Youtube-Pakistan incident from
2008. This all started with an anti-Islam movie made by Dutch
politician Geert Wilders. After the movie appeared on Youtube, the
Pakistani government banned Youtube. So Pakistan Telecom null0-routed
the /24 that held the addresses of the Youtube streaming servers.
Setting up a static route that sends a prefix to the Null0 interface is an
easy way to filter out packets to that destination prefix. So far so good.

However, Pakistan Telecom's router configuration was set up to
redistribute static routes into BGP. And they didn't have a filter in place
to make sure such routes wouldn't be announced to their ISP PCCW in
Hong Kong. Which also didn't have a filter to prevent this customer
from advertising prefixes that didn't belong to them.

Now this would have been problematic regionally if the Youtube
prefix was in fact a /24. In that case, routers elsewhere would have to
choose between the path to the real Youtube and the leaked path
through PCCW and Pakistan Telecom. Outside Asia, most networks
would probably have used the legitimate path towards Youtube as that
would have been shorter. But Youtube actually announced a /22, so
the “more specific” /24 vacuumed up the traffic to Youtube's
streaming servers from all over the world.
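
The effect of the longest match first rule can be shown with a small Python sketch. The prefixes are placeholders from private address space, not the prefixes involved in the actual incident, and the best_route function is a made-up name.

```python
import ipaddress

# Placeholder prefixes: a legitimately announced /22 and a leaked,
# more specific /24 that falls inside it.
TABLE = {
    ipaddress.ip_network("10.20.0.0/22"): "legitimate origin",
    ipaddress.ip_network("10.20.1.0/24"): "leaked more specific",
}

def best_route(address, table=TABLE):
    """Longest match first: the covering prefix with the most bits wins,
    no matter how long or short the AS path behind it is."""
    addr = ipaddress.ip_address(address)
    matches = [net for net in table if addr in net]
    if not matches:
        return None
    return table[max(matches, key=lambda net: net.prefixlen)]

print(best_route("10.20.1.10"))  # inside the /24
print(best_route("10.20.2.10"))  # only covered by the /22
```

Addresses inside the /24 follow the leaked route everywhere it propagates, while the rest of the /22 is unaffected.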

Four years later, something similar happened again with some of the
same players involved.

In 2010 there was an incident where China Telecom advertised 15% of
the internet's prefixes with the AS path stripped off for 18 minutes.
China Telecom peers in a good number of locations, and their peers
would now see a one-hop path towards destinations that would
normally be reachable over at least two hops. So many of these networks
sent their traffic to China. This incident got a lot of attention, all the
way to the US Congress, with some security experts suspecting
nefarious activity.

Then in 2018 there was an incident with definite malicious intent:
someone hijacked the IP prefixes for Amazon AWS' DNS servers. The
attacker then sent back fake DNS replies for myetherwallet.com in
order to redirect visitors of that wallet service for the Ether
cryptocurrency to their own servers. This way, the attackers could obtain
the login credentials of users of the service who tried to log in and
ignored an HTTPS certificate warning. Apparently the attackers were
able to get away with $150,000 worth of Ether.

Internet Routing Registries


Without any other tools, the way to avoid incidents such as the ones
described above is to painstakingly maintain filters for every BGP
session so only legitimately advertised prefixes from a BGP neighbor are
allowed into the router's BGP table.

In general, ISPs can, and definitely should, create and maintain such
filters on BGP sessions with customers. Things get more difficult for a
large ISP that has a smaller ISP as its customer. This means the small
ISP has to inform the big ISP whenever it adds new address prefixes,
or when it adds new customers that advertise one or more prefixes of
their own. And then wait for the big ISP to update their filters
accordingly. It gets worse with peering, because now each peer needs to
inform every other peer of changes to the prefixes they advertise. There
is no way to make this work manually.

(Filtering is not necessary for prefixes being announced from a transit
ISP to a customer, because the point of transit service is to reach the
entire internet and thus the customer wants/needs to receive all
prefixes.)

Internet Routing Registries (IRRs) make it possible to generate filters
automatically rather than having to manage them manually. Owners of
address prefixes register which AS advertises that prefix in an IRR.
Owners of an AS number register their “routing policy” in the IRR,
which specifies the relationships with other ASes. If the information in
the IRRs is complete and trustworthy, this makes it possible to
generate filters automatically.

However, in practice this only works to a limited degree. First of all,
the BGP table is now so large that in networks with extensive peering,
prefix filters would have to be more than a hundred thousand lines
long. The IRR databases are also incomplete. Some networks don't
register anything at all; others are inconsistent in updating their IRR
records, so their information is incomplete. And not all IRRs have
strong authentication, so it's hard to trust them completely.

Still, a practice I like is for ISPs to require their BGP-speaking
customers to register their prefixes and a routing policy in an appropriate
IRR, and then the ISP generates their customer filters from that
information.

On www.irr.net there is a list of IRRs. In Europe, the RIPE database
was in a unique position because that database is both the regular
“whois” database operated by the RIPE NCC to record delegations of
IP address blocks and AS numbers to ISPs and it's also an IRR
database. The other RIRs originally only ran whois databases (named after
the whois command line tool used to query these databases) and later
set up IRRs. The current situation:

• AfriNIC: integrated whois and IRR database

• APNIC: integrated whois and IRR database

• ARIN: separate whois and IRR databases

• LACNIC: separate whois and IRR databases

• RIPE: integrated whois and IRR database

When AfriNIC, APNIC or the RIPE NCC delegates IP addresses to
you, they'll register an inetnum or inet6num object in their whois
database, which looks like this:

% whois -h whois.afrinic.net 196.216.2.6
inetnum: 196.216.2.0 - 196.216.3.255
netname: AFRINIC
descr: AfriNIC - Internal Use
country: ZA
org: ORG-AFNC1-AFRINIC
admin-c: CA15-AFRINIC
tech-c: IT7-AFRINIC
status: ASSIGNED PI
mnt-by: AFRINIC-HM-MNT

The inetnum or inet6num objects refer to organization (org:),
maintainer (mnt-by:) and person (admin-c: and tech-c:) objects. AS
numbers (aut-num) are similar. This is the first part of the query results
for an AS number:

% whois -h whois.apnic.net AS4608


aut-num: AS4608
as-name: APNIC-SERVICES
descr: Asia Pacific Network Information Centre
descr: Regional Internet Registry for the Asia-Pacific Region
descr: Australia
country: AU
org: ORG-APNI1-AP
admin-c: AIC1-AP
tech-c: AIC1-AP

ARIN and LACNIC show largely the same information, but in a
different format. The format that the other three use is actually the IRR
format, RPSL: the Routing Policy Specification Language [RFC 2622].
The heavy lifting in RPSL is done by the aut-num and route / route6
objects. In addition to administrative information listed above, an
aut-num object may list a network's routing policy:
% whois -h whois.ripe.net as1125
aut-num: AS1125
as-name: UNSPECIFIED
descr: SURF Test Network
import: from AS1103 action pref=100; accept ANY
export: to AS1103 announce AS1125

So AS 1125 accepts all prefixes from AS 1103 and only sends its own
prefix(es) to AS 1103. pref=100 suggests that the local preference is
100. However, in RPSL a lower pref value is considered more
preferred, and it's unclear how an unspecified pref value is evaluated.

So it looks like AS 1125 gets transit service from AS 1103. This is a
sampling of AS 1103's routing policy:
% whois -h whois.ripe.net as1103
aut-num: AS1103
as-name: SURFNET-NL
descr: SURFnet, The Netherlands
import: from AS112 accept AS112
import: from AS702 accept AS-UUNETEURO
import: from AS714 accept AS714
import: from AS1104 accept AS1104 AS3333
import: from AS1125 accept AS1125
export: to AS112 announce AS-SURFNET
export: to AS702 announce AS-SURFNET
export: to AS714 announce AS-SURFNET
export: to AS1104 announce ANY
export: to AS1125 announce AS-SURFNET

So from ASes 112, 714 and 1125, AS 1103 “imports” (accepts) only
prefixes that originate in the neighboring AS itself. So those are
“leaf” ASes (if we view the internet as a tree) that do not have any
BGP customers of their own.

AS 1104 may send prefixes it originates itself, as well as those that
originate in AS 3333. So AS 3333 is a customer of AS 1104. However,
should AS 1104 gain another BGP customer, then AS 1103 would have to
update its aut-num object to also list that new customer AS number in
the AS 1104 import line.

The ASes accepted from AS 702 are not listed here one-by-one, but
those are in the as-set object AS-UUNETEURO instead. When AS 702
connects a new customer, they can simply update this as-set object
and AS 1103 will automatically start accepting prefixes from that new
customer after rebuilding their filters from the IRR data.

The export lines show that AS 1103 will only advertise prefixes
originated in its own AS and its customers' ASes (as listed in
AS-SURFNET) to most of these neighboring ASes. This means those ASes
are peers. Only AS 1104 gets all prefixes and is thus a customer.
Interestingly, the AS 1125 import policy doesn't match the AS 1103
export policy. This is not ideal, but doesn't create immediate
problems.

Of course to create prefix filters, we need to know IPv4 and IPv6
prefixes, not just AS numbers. Let's find these with an inverse query:
% whois -h whois.ripe.net -- "-i origin as112"
route: 192.175.48.0/24
descr: Root Server Technical Operations Assn
origin: AS112
mnt-by: NETNOD-MNT
created: 2002-12-17T14:02:55Z

route6: 2620:4f:8000::/48
descr: Root Server Technical Operations Assn
origin: AS112
mnt-by: NETNOD-MNT
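
The route and route6 objects returned by such a query can be turned
into prefix filters mechanically. The following is a minimal sketch in
Python, assuming the whois output has already been captured as text.
The function names and the prefix list name are made up for
illustration, and in practice a tool such as bgpq4 would normally do
this job:

```python
import ipaddress

WHOIS_OUTPUT = """\
route:   192.175.48.0/24
descr:   Root Server Technical Operations Assn
origin:  AS112
mnt-by:  NETNOD-MNT

route6:  2620:4f:8000::/48
descr:   Root Server Technical Operations Assn
origin:  AS112
mnt-by:  NETNOD-MNT
"""

def parse_route_objects(text):
    """Return (prefix, origin) pairs from RPSL route / route6 objects."""
    routes = []
    prefix = None
    for line in text.splitlines():
        # Split on the first colon only, so IPv6 prefixes stay intact.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key in ("route", "route6"):
            prefix = ipaddress.ip_network(value)
        elif key == "origin" and prefix is not None:
            routes.append((prefix, value.upper()))
            prefix = None
    return routes

def prefix_list_lines(routes, name):
    """Format the prefixes as FRR-style prefix-list statements."""
    lines = []
    seq = {4: 0, 6: 0}  # separate sequence numbers for IPv4 and IPv6
    for prefix, _origin in routes:
        seq[prefix.version] += 5
        keyword = "ip" if prefix.version == 4 else "ipv6"
        lines.append(
            f"{keyword} prefix-list {name} seq {seq[prefix.version]} permit {prefix}"
        )
    return lines

for line in prefix_list_lines(parse_route_objects(WHOIS_OUTPUT), "from-as112"):
    print(line)
```

This prints one ip prefix-list line for the IPv4 prefix and one ipv6
prefix-list line for the IPv6 prefix.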

See the MANRS Implementation Guide for some more pointers on setting
up objects in the various RIR IRRs. The RIRs also all have their own
documentation.

ARIN recently switched to a next generation IRR, leaving all the legacy
data in the old IRR behind. The new IRR uses different query
commands. For instance:

% whois -h rr.arin.net \!r199.43.0.0/24


A442
route: 199.43.0.0/24
origin: AS10745
descr: American Registry for Internet Numbers
mnt-by: MNT-ARINOPS

(The \ is not part of the command but is required on the macOS / Linux
command line to keep the shell from giving the ! special treatment.)

You definitely need to read the ARIN Internet Routing Registry (IRR)
page but disregard the “RIPE-style” examples, as those don't work
anymore. And then the IRRd 4.2.5 Whois queries documentation to
really understand the new syntax.

Although not strictly required, it's highly recommended to register at
least a simple routing policy with import and export lines for your
transit providers and route / route6 objects that refer to your
aut-num object. This way, you won't end up on the wrong side of
filters generated from IRR data. If you have BGP customers then of
course those need to be in your routing policy as well, and you'll
want to make an as-set object to list those customers' ASes along with
your own.

I wouldn't necessarily bother listing peers or preferences in an
aut-num object, as this requires a lot more upkeep and peer
relationships are not relevant to filters made by others. The
exception is when you use a tool like bgpq4 and that tool needs to
find the ASes you peer with in the IRR to generate filters.

RPKI
The Resource Public Key Infrastructure (RPKI) is a mechanism that
allows holders of IP addresses and AS numbers to prove they are the
legitimate holder of that resource using a certificate they get from an
RIR. The RPKI architecture [RFC 6480] provides an overview of how
all of this works.

With such a certificate, an address holder can create and sign a Route
Origination Authorization (ROA) which indicates which AS is allowed
to originate the prefix in question. (If multiple ASes are allowed to
originate the prefix, multiple ROAs must be created.)

ROAs, certificates and certificate revocation lists are stored in a
distributed repository. That means that the RIRs as well as many ISPs
have databases with their own RPKI data. Anyone who wants to create
RPKI-based filters will have to synchronize with all of these
databases to get copies of all the ROAs. Then, for each ROA the
certificate chain must be checked to make sure it's valid. Once that
is done by a “relying party” server, the result is a big list of
prefixes and authorized origin ASes. Relying party as in: a network
operator who relies on the RPKI system, or specifically on the five
RIR trust anchors (often referred to as TAL, trust anchor locator).

The RIR TALs can be downloaded from their websites. Until recently,
this was not the case for the ARIN TAL, as they used to require relying
parties to sign an indemnification agreement first.

APNIC and LACNIC also publish an AS 0 TAL, which is used for ROAs
that authorize AS 0 to originate their unused address space. In other
words, using that TAL will make RPKI-generated filters block
advertisements of unused address space. However, APNIC strongly
recommends to only use the AS 0 TAL for “advisory and/or alerting
purposes” and not for actual filtering.

In practice, few people create ROAs themselves and use the
RIR-provided RPKI certificate to sign them. The easier option is to
use the RIR portal to have the RIR generate ROAs. ARIN has a slightly
different procedure because they require the private key of the
certificate that signs the ROA to reside with the address holder.

So now we have that big list of prefixes and valid origin ASes. How do
we use this to filter?

That part is called RPKI route origin validation, although most
people just say “RPKI” when they actually mean RPKI route origin
validation. Unlike traditional filters, the RPKI-derived filter is not
simply copied to a router's configuration file. Instead, a relying
party cache server uses the RPKI-Router protocol [RFC 8210] to
transmit the filter to routers. Routers then apply the filter to the
paths they have in their BGP table, with three possible results:

• Valid: there is a ROA covering this prefix and prefix length, and
the first AS in the AS path matches the origin AS in the ROA.

• Invalid: there is a ROA covering this prefix, but either the prefix
length is longer than the maximum specified in the ROA, or the
first AS in the AS path doesn't match the origin AS in the ROA.

• Not-found: there is no ROA covering this prefix.

In the next few examples we're going to see how that works. Rather
than run an actual relying party server and validate certificates and
ROAs ourselves, we're going to use the GoRTR tool made by Cloudflare
that reads a filter from a JSON file and transmits it to routers using
the RPKI-Router protocol. By default, GoRTR downloads a JSON file from
Cloudflare with the full current RPKI filter. (That file is currently
30 MB.) But we'll use our own with the following data:

{"prefix":"10.0.10.0/23","maxLength":24,"asn":"AS65010","ta":""},
{"prefix":"2001:db8:10::/44","maxLength":44,"asn":"AS65010","ta":""},
{"prefix":"10.0.16.0/21","maxLength":21,"asn":"AS65020","ta":""},
{"prefix":"2001:db8:20::/44","maxLength":44,"asn":"AS65020","ta":""},
{"prefix":"10.0.40.0/21","maxLength":23,"asn":"AS65040","ta":""},
{"prefix":"2001:db8:40::/44","maxLength":47,"asn":"AS65040","ta":""}

Obviously, when using RPKI in production, you wouldn't be editing
such a JSON file to add your own settings. For that, there's the
“Simplified Local Internet Number Resource Management” (SLURM)
addition to RPKI [RFC 8416].
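
To give an idea of what that looks like, the sketch below builds a
small SLURM file in Python that locally adds the first ROA from the lab
data. The field names are believed to follow RFC 8416, but treat them
as an assumption and check them against the RFC and the documentation
of your relying party software before using anything like this:

```python
import json

slurm = {
    "slurmVersion": 1,
    # Filters remove entries from the validated RPKI data; none here.
    "validationOutputFilters": {"prefixFilters": [], "bgpsecFilters": []},
    # Assertions add local entries on top of the validated RPKI data.
    "locallyAddedAssertions": {
        "prefixAssertions": [
            {
                "asn": 65010,
                "prefix": "10.0.10.0/23",
                "maxPrefixLength": 24,
                "comment": "lab ROA for AS 65010",
            }
        ],
        "bgpsecAssertions": [],
    },
}

print(json.dumps(slurm, indent=2))
```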

If you're trying this out yourself with the BGP minilab, first start a
Docker image of the GoRTR tool in a separate terminal / shell window
as follows:

./run-gortr.sh

Or, on Windows:
.\run-gortr.ps1

Stop it when you're done with control-c. We can now run example 34,
which has a routine BGP setup we've seen in earlier examples coupled
with some RPKI settings.

In order for the FRRouting software to be able to use RPKI, it's
necessary to add -M rpki to the bgpd options. So if you're running
your own installation of FRR, in the /etc/frr/daemons file change this
line:

bgpd_options=" -A 127.0.0.1"

to:

bgpd_options=" -A 127.0.0.1 -M rpki"

Example 34. Enabling RPKI


!
rpki
rpki polling_period 10
rpki retry_interval 10
rpki cache 172.17.0.1 8282 preference 1
rpki cache 172.17.0.2 8282 preference 2
rpki cache 172.17.0.3 8282 preference 3
rpki cache 172.17.0.4 8282 preference 4
rpki cache 172.17.0.5 8282 preference 5
rpki cache 172.17.0.6 8282 preference 6
rpki cache 172.17.0.7 8282 preference 7
rpki cache 172.17.0.8 8282 preference 8
rpki cache 172.17.0.9 8282 preference 9
exit
!

The polling and retry intervals of 10 seconds wouldn't be appropriate
in production, but they help to get faster results in a lab
environment. We connect to an RPKI cache (i.e., GoRTR) over TCP port
8282 without authentication, which is also not what you'd want to do
in production. Nine addresses are listed because we're using default
Docker addresses here and those are not entirely predictable. We can
ask the router to show us the RPKI prefix table:

Router# show rpki prefix-table


RPKI/RTR prefix table
Prefix Prefix Length Origin-AS
10.0.10.0 23 - 24 65010
10.0.40.0 21 - 23 65040
10.0.16.0 21 - 21 65020
2001:db8:40:: 44 - 47 65040
2001:db8:20:: 44 - 44 65020
2001:db8:10:: 44 - 44 65010
Number of IPv4 Prefixes: 3
Number of IPv6 Prefixes: 3

Which has the following effect on the BGP table:

Router# show ip bgp
Status codes: s suppressed, d damped, h history, * valid, > best,
= multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop Weight Path


V* 10.0.10.0/23 192.0.2.21 0 65030 65020 65010 i
V*> 192.0.2.41 0 65040 65010 i
I*> 10.0.20.0/22 192.0.2.21 0 65030 65020 i
N*> 10.0.30.0/23 192.0.2.21 0 65030 i
V* 10.0.40.0/21 192.0.2.21 0 65030 65020 65010 65040 i
V*> 192.0.2.41 0 65040 i
I*> 203.0.113.83 0 65083 i
N* 10.0.83.0/24 192.0.2.41 0 65040 65083 i
N* 192.0.2.21 0 65030 65083 i
N*> 203.0.113.83 0 65083 i
N*> 192.0.2.0/24 0.0.0.0 32768 i

So the paths originated by ASes 65010 and 65040 are all valid (V),
while 10.0.20.0/22 from AS 65020 is invalid (I). That's because there
is a ROA for 10.0.16.0/21 with a /21 maximum length and
10.0.20.0/22 matches 10.0.16.0/21 (it's the second half of that
block), but the /22 prefix length is longer than the /21 limit in the
ROA.

AS 65030's prefix is not-found (N), as there is no matching ROA. The
same is true for 65083's own prefix 10.0.83.0/24. AS 65083 also
announces a /24 and a /23 out of AS 65040's address block
10.0.40.0/21. Those are flagged invalid.
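
The three validation states can be reproduced with a few lines of
Python. This is only a sketch of the origin validation logic, fed with
the IPv4 entries from the lab's JSON file; the function name is made up
and this is not how FRRouting implements it:

```python
import ipaddress

# (prefix, maxLength, origin AS) taken from the lab's JSON file, IPv4 only.
ROAS = [
    ("10.0.10.0/23", 24, 65010),
    ("10.0.16.0/21", 21, 65020),
    ("10.0.40.0/21", 23, 65040),
]

def validate(prefix, origin_as, roas=ROAS):
    """Return 'valid', 'invalid' or 'not-found' for a prefix and origin AS."""
    net = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, max_length, roa_as in roas:
        roa_net = ipaddress.ip_network(roa_prefix)
        if net.version != roa_net.version or not net.subnet_of(roa_net):
            continue
        covered = True  # at least one ROA covers this prefix
        if net.prefixlen <= max_length and origin_as == roa_as:
            return "valid"
    # Covered by a ROA but no ROA matched: invalid. No covering ROA: not-found.
    return "invalid" if covered else "not-found"

print(validate("10.0.10.0/23", 65010))  # valid
print(validate("10.0.20.0/22", 65020))  # invalid: /22 is longer than maxLength 21
print(validate("10.0.30.0/23", 65030))  # not-found
print(validate("10.0.42.0/23", 65083))  # invalid: wrong origin AS
```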

However, apart from an extra letter in the show ip bgp output, the
RPKI state doesn't have any consequences. The RPKI invalid paths are
still considered valid for BGP path selection purposes:

Router# show ip bgp 10.0.20.0/22
BGP routing table entry for 10.0.20.0/22, version 7
Paths: (1 available, best #1, table default)
Not advertised to any peer
65030 65020
192.0.2.21 from 192.0.2.21 (198.51.100.223)
Origin IGP, valid, external, best (First path received), rpki
validation-state: invalid

This is different with Cisco's RPKI implementation: there the default
behavior is to ignore paths that are considered invalid by RPKI.

So what actions should we attach to the RPKI validation state of a
path? Early on in the development of RPKI route origin validation, the
suggestion was to give invalid paths a low local preference, not-found
paths a higher local preference and valid paths the highest local
preference. This way, valid paths should “win” over not-found and
invalid paths, and not-found paths over invalid paths.

Unfortunately, that's not good enough. Consider the invalid path for
10.0.42.0/23 from example 34. Even if we give this path a really low
local preference, packets to, for instance, 10.0.42.13 will still flow
according to that invalid /23, while the path for the encompassing
prefix 10.0.40.0/21 that is valid and has a high local preference is
ignored as per the longest match first rule.
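
A small Python sketch makes the problem concrete. The two table entries
mirror the lab, the lookup function is a made-up illustration of
longest match first, and local preference deliberately plays no role in
it because it only matters when comparing paths for the same prefix:

```python
import ipaddress

# Two entries from the lab: the valid /21 and the invalid more specific /23.
ROUTES = {
    "10.0.40.0/21": "valid path via AS 65040, high local preference",
    "10.0.42.0/23": "invalid path via AS 65083, low local preference",
}

def lookup(destination, routes=ROUTES):
    """Return the most specific prefix that contains the destination."""
    addr = ipaddress.ip_address(destination)
    matches = [
        ipaddress.ip_network(p) for p in routes if addr in ipaddress.ip_network(p)
    ]
    return str(max(matches, key=lambda n: n.prefixlen)) if matches else None

print(lookup("10.0.42.13"))  # 10.0.42.0/23
print(lookup("10.0.41.13"))  # 10.0.40.0/21
```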

So in order for RPKI route origin validation to do something useful,
we need to block/filter invalid paths completely rather than just lower
their local preference. This does of course incur the risk that legitimate
paths are filtered because someone made a mistake with their ROAs.
That makes it unattractive to be the first one to start filtering invalids,
as you might be blamed for the lost connectivity as “it works for
everyone else”. However, filtering out paths with RPKI state invalid is
now common enough that people are well-motivated to fix any issues
that make their prefixes fail route origin validation.

It has been suggested that once RPKI uptake reaches a sufficiently
high level, we should also start filtering out paths with status
not-found. That doesn't make much sense. The risk of not having a
valid ROA cover your prefix is that you can become unreachable due to
mistakes elsewhere or because of an attack. Filtering out such
prefixes accomplishes the same thing that RPKI is supposed to protect
against: becoming unreachable.

Let's have a look at example 35, which adds route maps that implement
filtering and adjusting the local preference to the previous
configuration.

Example 35. Filtering RPKI invalid paths and higher local preference for valid
paths
!
router bgp 65082
neighbor 192.0.2.21 route-map apply-rpki-transit in
neighbor 192.0.2.41 route-map apply-rpki-transit in
neighbor 203.0.113.83 route-map apply-rpki-peers in
!
route-map apply-rpki-transit deny 10
match rpki invalid
!
route-map apply-rpki-transit permit 20
match rpki valid
set local-preference 200
!
route-map apply-rpki-transit permit 30
!
route-map apply-rpki-peers deny 10
match rpki invalid
!
route-map apply-rpki-peers permit 20
match rpki valid
set local-preference 210
!
route-map apply-rpki-peers permit 30
set local-preference 110
!

There are two route maps to handle RPKI: one for BGP sessions with
transit providers and one for BGP sessions with peers. The
apply-rpki-transit route map starts with a deny clause and looks for
paths with RPKI state invalid. That means that when a match happens,
the path is filtered out and doesn't get added to the BGP RIB.

The next clause looks for RPKI state valid, and if there is a match the
local preference is set to 200 and the path is (implicitly) added to the
BGP RIB. The last clause has neither a match nor a set and thus
matches all paths that get to this stage and adds them to the BGP RIB.
This would be paths with RPKI state not-found, but also paths without
any RPKI state at all.

Somewhat ironically, match rpki notfound doesn't match paths with
no RPKI state. (At least with FRRouting's RPKI implementation.) It's
important to make sure that paths with no RPKI state are also handled,
because it's always possible that RPKI is unavailable at some point, for
instance because the cache server is unreachable.

The apply-rpki-peers route map that is applied to peers is very
similar, except that valid paths get a local preference of 210, and
not-found and no RPKI state paths a local preference of 110. This
makes sure that paths from peering are preferred over paths from
transit.

The output below is the result of first starting the routers and letting
them establish their BGP sessions, and only then starting the GoRTR
cache server:
Router# show ip bgp
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop LocPrf Path


V* 10.0.10.0/23 192.0.2.21 65030 65020 65010 i
V*> 192.0.2.41 65040 65010 i
I*> 10.0.20.0/22 192.0.2.21 65030 65020 i
N*> 10.0.30.0/23 192.0.2.21 65030 i
V* 10.0.40.0/21 192.0.2.21 65030 65020 65010 65040 i
V*> 192.0.2.41 65040 i
N*> 10.0.83.0/24 203.0.113.83 110 65083 i
N* 192.0.2.41 65040 65083 i
N* 192.0.2.21 65030 65083 i
N* i192.0.2.0/24 192.0.2.154 100 i
N*> 0.0.0.0 i

When the router connected to the RPKI cache server, the RPKI state
was added to each path, but because the paths were already received
from the BGP neighbors and evaluated by the route maps when there
was no RPKI state yet, the invalid path wasn't filtered and the valid
ones didn't get the higher local preference.

This means that an invalid path that was received before the cache
server became available will remain in the BGP table until there is a
change somewhere that triggers a BGP update for such a path. New
invalid paths will be filtered as intended. Or we can ask our BGP
neighbors to send a full set of updates:

Router# clear ip bgp * in


Router# show ip bgp
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop LocPrf Path


V*> 10.0.10.0/23 192.0.2.41 200 65040 65010 i
V* 192.0.2.21 200 65030 65020 65010 i
N*> 10.0.30.0/23 192.0.2.21 65030 i
V*> 10.0.40.0/21 192.0.2.41 200 65040 i
V* 192.0.2.21 200 65030 65020 65010 65040 i
N*> 10.0.83.0/24 203.0.113.83 110 65083 i
N* 192.0.2.41 65040 65083 i
N* 192.0.2.21 65030 65083 i
N* i192.0.2.0/24 192.0.2.154 100 i
N*> 0.0.0.0 i

Now the invalid path is gone and the valid ones have the intended
higher local preference.

Note that giving valid paths a higher local preference than not-found
paths will not do anything under normal circumstances, as it is
impossible to have a valid and a not-found path for the same prefix at
the same time. After all, valid requires the presence of a matching
ROA, while not-found is only possible if there is not a matching ROA.

Still, having these different local preference values could be useful
when the RPKI validation state is not available or not easily visible.
For instance, this is what another router in the same AS sees that has
learned the prefixes listed above over iBGP:
R4# show ip bgp
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop LocPrf Path


*>i10.0.10.0/23 192.0.2.151 200 65040 65010 i
*>i10.0.30.0/23 192.0.2.151 100 65030 i
*>i10.0.40.0/21 192.0.2.151 200 65040 i
i10.0.83.0/24 192.0.2.151 110 65083 i
* i192.0.2.0/24 192.0.2.151 100 i
*> 0.0.0.0 i

Because this router doesn't have any eBGP sessions, it's not necessary
to have it perform RPKI route origin validation itself, as all the eBGP
routers will have already done that. But it's still informative to be able
to see what the RPKI validation state was by looking at the local
preference.

Then again, having a higher local preference for valid paths could
lead to unexpected results when some routers are able to perform RPKI
validation and others aren't. The traffic will then flow over the routers
with working RPKI and thus the higher local preferences, possibly
taking a longer than necessary route in the process.

Looking at published ROAs, it's remarkable how many of them use a
maximum prefix length of /24. This is even true for some very short
prefixes (large address blocks), such as half of the 36 /10 ROAs at the
time of this writing. And you really don't want to de-aggregate a /10
into 16384 /24s.

Worse, having a ROA with a maximum length longer than the prefix
length of prefixes that are actually advertised undermines one of the
most important advantages of RPKI route origin validation: the ability
to filter unwanted more specifics. More specifics originated from a
different AS will still be filtered because the origin validation
fails. But if accidental leaking of more specifics preserves the
original origin AS, or in the case of a deliberate attack where the
attacker simply spoofs the authorized origin AS, these more specifics
will be considered valid and be able to do their damage.
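
The effect of the maximum length is easy to see in a sketch. The
Python below validates against a single ROA; the function name is made
up, and the second ROA with maximum length 24 is a hypothetical variant
of the lab's 10.0.40.0/21 ROA, not something from the examples:

```python
import ipaddress

def rov_state(prefix, origin_as, roa_prefix, roa_max_length, roa_as):
    """Route origin validation against a single ROA."""
    net = ipaddress.ip_network(prefix)
    roa_net = ipaddress.ip_network(roa_prefix)
    if net.version != roa_net.version or not net.subnet_of(roa_net):
        return "not-found"
    if net.prefixlen <= roa_max_length and origin_as == roa_as:
        return "valid"
    return "invalid"

# A leaked or spoofed /24 that carries the authorized origin AS 65040.
print(rov_state("10.0.41.0/24", 65040, "10.0.40.0/21", 21, 65040))  # invalid
print(rov_state("10.0.41.0/24", 65040, "10.0.40.0/21", 24, 65040))  # valid

# Number of /24s that fit in a /10.
print(2 ** (24 - 10))  # 16384
```

With the tight maximum length the more specific is rejected even though
the origin AS looks right; with the loose one it sails through.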

A reason why so many ROAs go down to /24 could be because the
network in question may want to avail itself of external anti-DDoS
filtering services. These usually work by having the service announce
the /24 that covers the address(es) under attack from the anti-DDoS
service provider's AS. The service provider then “washes” the traffic
and passes the cleaned up traffic to the network under attack.

However, this doesn't necessarily require setting a large maximum
prefix length (such as /24 for IPv4) in ROAs. An alternative is to
selectively allow invalid paths. For instance, in example 36, AS 65083
is an anti-DDoS filtering service. AS 65040 is experiencing a DDoS
attack for addresses within 10.0.41.0/24. So AS 65083 advertises that
prefix to make the “dirty” traffic flow to AS 65083 so it can be
filtered and then the “clean” traffic is forwarded to AS 65040. The
configuration in the example allows the /24 advertisement from AS
65083 even though it has RPKI state invalid.

Example 36. RPKI and an external anti-DDoS traffic filtering service


!
router bgp 65082
neighbor 203.0.113.83 remote-as 65083
neighbor 2001:db8:90::6:5083:1 remote-as 65083
!
address-family ipv4 unicast
neighbor 203.0.113.83 soft-reconfiguration inbound
neighbor 203.0.113.83 maximum-prefix 10
neighbor 203.0.113.83 route-map override-rpki-peers in
no neighbor 2001:db8:90::6:5083:1 activate
exit-address-family
!
address-family ipv6 unicast
neighbor 2001:db8:90::6:5083:1 activate
neighbor 2001:db8:90::6:5083:1 maximum-prefix 10
neighbor 2001:db8:90::6:5083:1 route-map override-rpki-peers in
exit-address-family
!
ip prefix-list only24 seq 5 permit 0.0.0.0/0 ge 24 le 24
ipv6 prefix-list only48 seq 5 permit ::/0 ge 48 le 48
!

route-map override-rpki-peers permit 10
match rpki invalid
match ip address prefix-list only24
set local-preference 50
!
route-map override-rpki-peers permit 20
match rpki invalid
match ipv6 address prefix-list only48
set local-preference 50
!
route-map override-rpki-peers deny 30
match rpki invalid
!
route-map override-rpki-peers permit 40
match rpki valid
set local-preference 210
!
route-map override-rpki-peers permit 50
set local-preference 110
!

The maximum-prefix setting makes sure that if AS 65083 starts leaking
large amounts of unauthorized prefixes that are no longer caught by
RPKI, the BGP session will be shut down.

For both the IPv4 and the IPv6 BGP sessions to AS 65083 the route map
override-rpki-peers is applied to incoming updates. The permit 10
clause matches if a path has RPKI state invalid and also matches
prefix list only24. That prefix list only matches IPv4 prefixes with a
prefix length less than or equal to /24 and also greater than or equal
to /24. So only exactly /24. In that case, the local preference is set
to 50, and the router adds the path to the BGP RIB.
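
The ge / le logic of the prefix list can be mimicked in Python as
follows. This is a simplified sketch of a single permit entry with a
made-up function name, assuming the usual semantics where the prefix
must fall within the base prefix and its length within the ge / le
range:

```python
import ipaddress

def prefix_list_match(prefix, base, ge=None, le=None):
    """Mimic 'permit BASE ge X le Y' for a single prefix list entry."""
    net = ipaddress.ip_network(prefix)
    base_net = ipaddress.ip_network(base)
    if net.version != base_net.version or not net.subnet_of(base_net):
        return False
    if ge is None and le is None:
        return net.prefixlen == base_net.prefixlen  # exact match only
    low = ge if ge is not None else base_net.prefixlen
    high = le if le is not None else net.max_prefixlen
    return low <= net.prefixlen <= high

# ip prefix-list only24 seq 5 permit 0.0.0.0/0 ge 24 le 24
print(prefix_list_match("10.0.41.0/24", "0.0.0.0/0", ge=24, le=24))  # True
print(prefix_list_match("10.0.42.0/23", "0.0.0.0/0", ge=24, le=24))  # False
```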

The permit 20 clause does the same thing for IPv6 /48. After that,
RPKI invalid prefixes that didn't match earlier are discarded by the
deny 30 clause and RPKI valid paths get a 210 local preference and
everything else remaining at this stage a 110 local preference. The
results:

Router# show bgp ipv4 unicast
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop LocPrf Path


V*> 10.0.10.0/23 192.0.2.41 200 65040 65010 i
V* 192.0.2.21 200 65030 65020 65010 i
N*> 10.0.30.0/23 192.0.2.21 65030 i
V*> 10.0.40.0/21 192.0.2.41 200 65040 i
V* 192.0.2.21 200 65030 65020 65010 65040 i
I*> 10.0.41.0/24 203.0.113.83 50 65083 i
N*> 10.0.83.0/24 203.0.113.83 110 65083 i
N* 192.0.2.41 65040 65083 i
N* 192.0.2.21 65030 65083 i
N*> 192.0.2.0/24 0.0.0.0 i

As intended, the 10.0.41.0/24 announcement from AS 65083 is
accepted, despite its RPKI invalid status. To test our filters, AS 65083 is
also advertising prefix 10.0.42.0/23, which is missing from the table
listed above as intended. But AS 65083 did in fact announce that prefix:

Router# show ip bgp neighbors 203.0.113.83 filtered-routes


RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop LocPrf Path


*> 10.0.42.0/23 203.0.113.83 65083 i

Total number of prefixes 1

Because we specified soft-reconfiguration inbound for this BGP
neighbor, the router keeps a copy of all the paths this neighbor sent,
which allows us to inspect filtered prefixes with the show ip bgp
neighbors ... filtered-routes command. Unfortunately, this
view doesn't list the RPKI validation state.

Of course an anti-DDoS filtering service can arrange to override RPKI
validation with its peers and transit providers, but not with the whole
world. However, that shouldn't be a problem: the less specific prefix
advertised by the address holder will make sure traffic flows to their
transit ISPs. As long as those, as well as any peers of the address
holder, peer with the anti-DDoS filtering service and apply the RPKI
exemption, the traffic will end up with that service.

Last but not least, let's see how an attacker can defeat RPKI route
origin validation in example 37, where apparently someone nefarious
has taken over AS 65083.

Example 37. Announcing a prefix with a spoofed origin AS to defeat RPKI


!
router bgp 65083
!
address-family ipv4 unicast
network 10.0.41.0/24
network 10.0.42.0/23 route-map fake-origin
neighbor 203.0.113.82 attribute-unchanged as-path
exit-address-family
!
address-family ipv6 unicast
network 2001:db8:41::/48
network 2001:db8:42::/47 route-map fake-origin
neighbor 2001:db8:90::6:5082:1 attribute-unchanged as-path
exit-address-family
!
route-map fake-origin permit 10
set as-path prepend 65040
exit
!

So AS 65083 applies a route map when originating the to-be-spoofed
prefixes, and that route map then prepends the AS path with the AS
number that's authorized in the relevant ROA, AS 65040 in this case.
The result:
Router# show ip bgp
RPKI validation codes: V valid, I invalid, N Not found

Network Next Hop LocPrf Path


V*> 10.0.10.0/23 192.0.2.41 200 65040 65010 i
V* 192.0.2.21 200 65030 65020 65010 i
N*> 10.0.30.0/23 192.0.2.21 65030 i
V*> 10.0.40.0/21 192.0.2.41 200 65040 i
V* 192.0.2.21 200 65030 65020 65010 65040 i
I*> 10.0.41.0/24 203.0.113.83 50 65083 i
V*> 10.0.42.0/23 203.0.113.83 210 65040 i
V* 192.0.2.21 200 65030 65083 65040 i

So 10.0.42.0/23 shows up as valid and as if it's directly learned from
AS 65040, although the next hop address is the same as that from
another prefix learned from AS 65083. Because the AS 65083 router was
configured with attribute-unchanged as-path and “65040” was already
prepended to the path, it simply used “65040” as the AS path. For the
other prefix AS 65083 was sending, the AS path was still empty so it
had to insert its own AS number into the path before sending the BGP
update.

Depending on router settings such as enforce-first-as and other
filters such as customer filters or IRR-based filters, spoofed
announcements may still be filtered, but that is definitely not a
given. So RPKI route origin validation is not an airtight system.
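
The idea behind an enforce-first-as style check can be sketched in a
few lines of Python. This assumes the check simply compares the
leftmost AS in the path with the AS of the eBGP neighbor; the function
name is made up and actual router behavior may differ in the details:

```python
def passes_enforce_first_as(as_path, neighbor_as):
    """True if the leftmost AS in the path is the AS of the eBGP neighbor."""
    return bool(as_path) and as_path[0] == neighbor_as

# The /24 that AS 65083 announces with its own AS number in the path.
print(passes_enforce_first_as([65083], 65083))  # True
# The spoofed path from example 37, received from neighbor AS 65083.
print(passes_enforce_first_as([65040], 65083))  # False
```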

Also, there is a certain risk involved with trusting a system like RPKI
with many moving parts to inject filters into your routers that could
very well filter legitimate paths if something goes wrong. Then again,
RPKI is the best tool we have today to address issues with BGP such as
the incidents mentioned under “scary stories”. All of these incidents
(in unmodified form) would have had no or relatively little impact had
RPKI route origin validation been in place at the time.

BGPsec
As the internet moved from a sheltered academic environment into the
wider world and gained popularity in the 1990s, it became clear that
BGP offers very few protections against someone other than the
legitimate holder of an address block announcing that address block.
Another problem is that path attributes such as the AS path don't have
any protection against manipulation as BGP updates make their way
across the globe.

S-BGP (Secure BGP) was proposed around 2000 to address these
issues. The route origin validation part was removed as that's now
handled by RPKI. And after years of discussion and refinement, the
BGPsec specification was published in 2017 as [RFC 8205] and a number
of additional RFCs.

The main thing that BGPsec does is replace the AS_PATH attribute with
the BGPsec_PATH. Like the regular AS_PATH attribute, the BGPsec_PATH
also lists all the AS numbers from the AS that originated a prefix to
the current AS. In addition, after each hop there is a signature over
the path so far and also the next hop. Upon reception these signatures
are checked to make sure the path wasn't tampered with and each next
AS in the path was indeed authorized by the previous one.
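
The chaining of signatures can be illustrated with a toy model in
Python. Note the assumptions: real BGPsec uses ECDSA key pairs that are
certified in the RPKI and a precisely specified encoding, while this
sketch uses HMAC with made-up shared secrets as a stand-in so that it
runs with only the standard library. It shows the chaining idea, not
the actual protocol:

```python
import hashlib
import hmac

# Hypothetical per-AS keys. Real BGPsec uses ECDSA key pairs certified in the
# RPKI; HMAC with shared secrets is only a stand-in to keep this runnable.
KEYS = {65040: b"key-65040", 65030: b"key-65030", 65083: b"key-65083"}

def sign(prefix, path, signer_as, target_as):
    """Append a hop: signer_as signs the path so far plus the AS it sends to."""
    previous = path[-1][2] if path else b""
    message = f"{prefix}|{signer_as}|{target_as}|".encode() + previous
    signature = hmac.new(KEYS[signer_as], message, hashlib.sha256).digest()
    return path + [(signer_as, target_as, signature)]

def verify(prefix, path, receiver_as):
    """Check every signature and that each hop was sent to the next AS."""
    previous = b""
    for index, (signer_as, target_as, signature) in enumerate(path):
        next_as = path[index + 1][0] if index + 1 < len(path) else receiver_as
        message = f"{prefix}|{signer_as}|{target_as}|".encode() + previous
        expected = hmac.new(KEYS[signer_as], message, hashlib.sha256).digest()
        if target_as != next_as or not hmac.compare_digest(signature, expected):
            return False
        previous = signature
    return True

# Hops are stored origin first, the reverse of how an AS path is displayed.
path = sign("10.0.40.0/21", [], 65040, 65030)    # origin AS 65040 to AS 65030
path = sign("10.0.40.0/21", path, 65030, 65082)  # AS 65030 to AS 65082
print(verify("10.0.40.0/21", path, 65082))       # True

# AS 65083 cannot splice itself behind a signature that targeted AS 65030.
forged = sign("10.0.40.0/21", path[:1], 65083, 65082)
print(verify("10.0.40.0/21", forged, 65082))     # False
```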

The immediate benefit that BGPsec brings is that it makes it impossible
to successfully fake/spoof the origin AS for the purpose of bypassing
RPKI route origin validation, like in example 37 above.

However, apart from closing that RPKI loophole, it's not clear how
cryptographically protecting the AS path is all that helpful in practice.
Deliberate attacks mainly consist of announcing more specifics. With
RPKI blocking that, an attacker can no longer “shoplift” a targeted
sub-prefix of the victim's larger address block, but they have to steal
the whole shop, so to say.

In other words: the attacker has to go head-to-head announcing the
intended victim's entire prefix and hope enough other networks will
send their traffic towards the fake origin rather than the legitimate
origin.

Also, BGPsec is very resource intensive. First of all, prefixes can no
longer be grouped into a single update message if their path attributes
are the same. Because of the explicit next hop AS authorization, it's no
longer possible to generate one update and send it to multiple peers.

And for every hop in the AS path of every prefix, a signature must be
checked. Even a highly optimized implementation of the ECDSA P-256
algorithm running on a modern desktop CPU makes verifying a million
prefixes times an average of four hops take several minutes. Double
that for two full BGP feeds, and add more for peers that also send
significant numbers of prefixes.

As such, it's not surprising that there are no production BGPsec
implementations, and it's questionable if the internet at large will ever
adopt the protocol.

So how secure is BGP?
At the beginning of the chapter, I posited that BGP can be considered
secure if we can be confident that we're able to answer the following
questions with “yes”:

1. Are BGP messages exchanged between the right BGP speakers
   unmodified?

2. Are the BGP speakers saying the right things? Meaning:

   a. Is the AS that originates a prefix authorized to do so by the
      legitimate holder of the prefix?

   b. Is the prefix further propagated in BGP in accordance with
      the wishes of the legitimate holder of the prefix?

It's definitely possible to answer question 1 with “yes”. Even if TCP
MD5 and GTSM aren't good enough, there are other options: TCP-AO
and IPsec.

With just RPKI, we get close to a “yes” for question 2a, but we don't
quite get there. RPKI + BGPsec would do the trick, though.

With question 2b, BGPsec gets us closer to a “yes”, but again, not quite
there. The problem is that it's not the address holder who gets to
decide which path updates follow, but that each hop gets to decide on
the next hop.

Making BGP faster

An important job of BGP is to route around failures. So if for whatever
reason, a BGP router can't successfully talk to a neighbor anymore, it
must stop sending packets to that neighbor and send them over
another path.

This is implemented through the “hold time”. [RFC 4271] suggests a
default value of 90 seconds for the hold time, and a third of the hold
time (so 30 seconds) for the keepalive time. So both BGP neighbors
send a keepalive message every 30 seconds. When a BGP router no
longer receives keepalive messages, at some point the amount of time
since the router last read any data from the neighbor exceeds the hold
time. This triggers BGP to send a “hold timer expired” notification to
the neighbor and tear down the BGP session. All paths learned from
the neighbor are removed from the BGP RIB and new best paths are
selected to reach the affected prefixes.
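
The timer arithmetic is simple enough to sketch in Python. The
function names are made up for illustration and time is just a number
of seconds; a real implementation would of course use actual timers:

```python
def keepalive_interval(hold_time):
    """Keepalive interval as one third of the hold time, as suggested above."""
    return hold_time / 3

def hold_timer_expired(last_received, now, hold_time):
    """True once nothing has been read from the neighbor for hold_time seconds."""
    return now - last_received > hold_time

print(keepalive_interval(90))         # 30.0
print(hold_timer_expired(0, 89, 90))  # False: still within the hold time
print(hold_timer_expired(0, 91, 90))  # True: tear down the session
```

In the worst case the failure happens right after a keepalive was
received, so rerouting only starts a full hold time later.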

90 seconds is already quite a long time to wait for rerouting to restore
connectivity after a failure, but Cisco actually doubled the 90 and 30
seconds to 180 and 60 seconds. Example 38 shows how that works out
in practice. What we do here is ping an address inside AS 65040, then
simulate an outage by shutting down the interface that connects ASes
65082 and 65040 and see how long the gap in the ping sequence is.

There are no special configurations, so no need to show configuration
snippets. However, running the example is a bit complex because we
need to interact with three routers:

• Our standard router 82 / R1 to observe the BGP state and BGP table

• iBGP-router 824 / R4 to run the ping on

• Router 40 in AS 65040 to shut down the interface from the remote
end

(We can't run the ping on router 82 because that router would use the
address from the interface to AS 65040 that we're going to bring down,
so the pings won't resume after rerouting. In general, doing pings and
traceroutes to external destinations from an eBGP router can have
unexpected results because the source addresses are atypical.)

Before we do anything, the AS 65040 prefix is available over two paths:

R1# show ip bgp
   Network          Next Hop        LocPrf Path
*  10.0.40.0/21     192.0.2.21             65030 65020 65010 65040 i
*>                  192.0.2.41             65040 i

Now we start the ping on R4:

R4# ping 10.0.47.255


PING 10.0.47.255 (10.0.47.255): 56 data bytes
64 bytes from 10.0.47.255: seq=0 ttl=63 time=0.062 ms
64 bytes from 10.0.47.255: seq=1 ttl=63 time=0.143 ms
64 bytes from 10.0.47.255: seq=2 ttl=63 time=0.102 ms
64 bytes from 10.0.47.255: seq=3 ttl=63 time=0.100 ms
...

And while the ping is running, we tell the AS 65040 router to shut
down the interface towards AS 65082:
ISP40# conf t
ISP40(config)# interface eth0.1401
ISP40(config-if)# shutdown

Over at R4, the ping is still running but produces no output. After
nearly three minutes, the output resumes:

...
64 bytes from 10.0.47.255: seq=167 ttl=60 time=0.173 ms
64 bytes from 10.0.47.255: seq=168 ttl=60 time=0.157 ms
64 bytes from 10.0.47.255: seq=169 ttl=60 time=0.135 ms
^C

With the last ping before the pause being number 3 and the first one
after 167, we missed 163 packets at a rate of one packet per second. We
can also see that the path now takes more hops as the TTL for the
incoming ping packets is now 60 while it was 63 before. This is what
show ip bgp summary and show ip bgp looked like during the
simulated outage:

R1# show ip bgp sum
Neighbor V AS MsgRcvd MsgSent Up/Down State/PfxRcd
192.0.2.21 4 65030 21 14 00:07:45 4
192.0.2.41 4 65040 24 23 00:00:25 Active
192.0.2.154 4 65082 12 27 00:07:45 1

R1# show ip bgp


Network Next Hop LocPrf Path
*> 10.0.40.0/21 192.0.2.21 65030 65020 65010 65040 i
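
The 163 missed packets counted earlier in this section can also be found mechanically. The following is a minimal Python sketch, assuming ping output in the seq=N format shown in this section. The function name missed_packets is made up for illustration and is not part of any tool used in the minilab.

```python
import re

def missed_packets(ping_output: str) -> list[tuple[int, int, int]]:
    """Return (last_seen, first_after, missed) for every gap in the seq numbers."""
    seqs = [int(m) for m in re.findall(r"seq=(\d+)", ping_output)]
    gaps = []
    for prev, cur in zip(seqs, seqs[1:]):
        if cur - prev > 1:
            # Every sequence number strictly between prev and cur was lost.
            gaps.append((prev, cur, cur - prev - 1))
    return gaps

sample = """\
64 bytes from 10.0.47.255: seq=2 ttl=63 time=0.102 ms
64 bytes from 10.0.47.255: seq=3 ttl=63 time=0.100 ms
64 bytes from 10.0.47.255: seq=167 ttl=60 time=0.173 ms
64 bytes from 10.0.47.255: seq=168 ttl=60 time=0.157 ms
"""
print(missed_packets(sample))  # [(3, 167, 163)]
```

At one ping per second, the number of missed packets is a rough measure of the outage in seconds.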

Adjusting the BGP timers


To make BGP reroute faster after an outage, let's adjust the keepalive
time and the hold time. We can set a default for all BGP sessions with
timers bgp followed by the keepalive time and then the hold time.
We can also set a keepalive time and a hold time for individual
neighbors. Example 39 sets a default of 6 and 20 seconds, respectively,
and then 30 and 90 seconds for the iBGP session. iBGP sessions are
normally between loopback interfaces so there is virtually no risk of
these BGP sessions going down.

Example 39. Reducing the hold time


!
router bgp 65082
timers bgp 6 20
neighbor 192.0.2.154 remote-as 65082
neighbor 192.0.2.154 description iBGP to R4
neighbor 192.0.2.154 timers 30 90
!

In the BGP open message, both BGP speakers announce the hold time
they'd like to use. The lower of the two will be used by both sides.
(Unless a higher bgp minimum-hold time is in effect. But then the BGP
session may not come up at all.) So with the Example 39 configuration
on R1, but no such settings on the AS 65040 router, we see this:
ISP40# show ip bgp neighbors 192.0.2.42
BGP neighbor is 192.0.2.42, remote AS 65082, local AS 65040,
external link
Hostname: R1
BGP version 4, remote router ID 192.0.2.251, local router ID
10.0.47.255

BGP state = Established, up for 00:07:28
Last read 00:00:04, Last write 00:00:04
Hold time is 20, keepalive interval is 6 seconds

(The show ip bgp neighbors ... output has lots of additional
information.)

After learning the 20 second hold time from R1, the AS 65040 router
divided that number by three to arrive at the 6 second keepalive time.
If we now again run a ping from R4 and after a few seconds shut down
the eth0.1401 interface on the AS 65040 router, we get the following
results:
R4# ping 10.0.47.255
PING 10.0.47.255 (10.0.47.255): 56 data bytes
64 bytes from 10.0.47.255: seq=0 ttl=63 time=0.091 ms
64 bytes from 10.0.47.255: seq=1 ttl=63 time=0.104 ms
64 bytes from 10.0.47.255: seq=2 ttl=63 time=0.103 ms
64 bytes from 10.0.47.255: seq=18 ttl=60 time=0.108 ms
64 bytes from 10.0.47.255: seq=19 ttl=60 time=0.138 ms
64 bytes from 10.0.47.255: seq=20 ttl=60 time=0.136 ms
^C

That's a significant improvement: connectivity was restored after 15
seconds. So maybe we should set the hold time even lower? The lowest
the hold time can be set and still work is 3 seconds.
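
The way the two routers arrived at a 20 second hold time and a 6 second keepalive time can be sketched in a few lines of Python. This is a simplified model, not the FRRouting implementation: it assumes the keepalive interval is always derived as a third of the negotiated hold time (rounded down), while a real router may use a lower configured keepalive time. The function name negotiate_timers is hypothetical.

```python
def negotiate_timers(local_hold: int, remote_hold: int) -> tuple[int, int]:
    """Return (hold_time, keepalive_interval) in seconds for a session."""
    for value in (local_hold, remote_hold):
        # A hold time must be 0 (no keepalives) or at least 3 seconds.
        if value < 0 or value in (1, 2):
            raise ValueError(f"invalid hold time: {value}")
    hold = min(local_hold, remote_hold)  # the lower of the two wins
    keepalive = hold // 3                # a third of the hold time
    return hold, keepalive

print(negotiate_timers(20, 180))   # (20, 6)
print(negotiate_timers(180, 180))  # (180, 60)
print(negotiate_timers(90, 180))   # (90, 30)
```

The first call matches the situation above: R1 proposes 20 seconds, the AS 65040 router proposes its default of 180 seconds, and both end up using 20 and 6 seconds.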

It's possible to set the hold time to 0. That means no keepalives
are sent. Don't do this. Not only will outages no longer
be detected, but it's even possible that when connectivity
between the two BGP neighbors is restored and one tries to
reconnect, the other will not accept the connection because it
already has a BGP session in the established state.

In the early 2000s I once got bitten by too aggressive hold times. Back
in those days a router that was just big enough to handle several full
BGP feeds and also do peering over an internet exchange didn't have a
very fast CPU. One day there was some instability in BGP somewhere,
with the result that the router got a lot of updates over several BGP
sessions at the same time. All this update churn meant that the router
was so busy installing new paths in the BGP RIB, recalculating the best
path, and then generating BGP updates of its own, that it didn't get
around to sending keepalive messages at the appropriate intervals.
The result was that the other end would tear down the BGP session as
the hold time expired. Which led to more work for the poor router.
Which in turn made it fail to send keepalives on time, and so on.

It's very likely that today's routers won't fall into this trap. To some
degree because the router CPUs are much faster, although that's largely
negated by the BGP table getting so much bigger. But mostly because
it's unlikely that BGP implementations still use such a monolithic
BGP update processing system that keepalives are delayed if there
are incoming update messages to process.

Still, I wouldn't recommend extremely aggressive hold times. That 20
seconds in the example seems like a reasonable number. Or maybe 15
seconds. At 10 seconds I would probably start getting a bit
uncomfortable.

Also, BGP failure detection doesn't solely depend on the hold time. A
standard feature on most routers is “fast external fallover”. This means
the router tracks the up/down state of the physical interface an eBGP
session runs over. When the interface goes down, the BGP session is
then immediately taken down. For this reason, it's always good to
have a direct cable between two eBGP routers.

Having any switches in the middle may foil the fast external fallover
feature, as it's the switch that sees the interface link signal go away, but
the router doesn't know that because it only sees the state of its
connection to the switch, not the end-to-end state of the connection to
the remote router.

Today, most hardware interfaces provide a very stable link up
indication. But in the past, it would sometimes happen that the link
state of an interface would be a bit jittery. Immediately reacting to
that by bringing the BGP session over that link down is not helpful. In
those situations, configure no bgp fast-external-failover under the
router bgp ... heading. This turns off the feature router-wide. It's
on by default.

BFD: bidirectional forwarding detection
Bidirectional forwarding detection (BFD) [RFC 5880] is a protocol for
quickly detecting link failures. It can be used with different routing
protocols, including BGP. BFD is designed to, when possible, test
whether the neighbor is still forwarding packets, not whether a routing
protocol such as BGP is still running. With BFD, it's possible to detect
failures within a few tens of milliseconds. However, this will easily
detect failures that aren't really there, so such aggressive timing should
only be used if important applications really need it. Example 40
shows a BFD configuration.

Example 40. BFD


!
router bgp 65082
neighbor 192.0.2.41 remote-as 65040
neighbor 192.0.2.41 description ISP 40
neighbor 192.0.2.41 bfd
neighbor 2001:db8:30:8201::1 remote-as 65030
neighbor 2001:db8:30:8201::1 description ISP 30
neighbor 2001:db8:30:8201::1 bfd profile 2seconds
!
bfd
profile 2seconds
detect-multiplier 4
transmit-interval 500
receive-interval 500
!
peer 192.0.2.41
peer 2001:db8:30:8201::1 local-address 2001:db8:30:8201::2
!

We define two BFD peers using the addresses we also use for the BGP
sessions. In the case of the IPv6 peer, it's necessary to explicitly specify
the address on our side. For the BGP sessions, all that's needed is
neighbor ... bfd. For the IPv6 BGP session, we use the profile
2seconds which modifies the timers and detect multiplier.

BFD has two main timers: one that indicates how fast we're prepared
to receive, and one that indicates how fast we want to send. During
BFD session establishment, this information is exchanged by both
sides so each will limit how often it sends test packets to stay within
the receive interval of the other side. So both sides can send test
packets at different rates. The FRRouting default is 300 ms for transmit
and receive. The multiplier is how many packets in a row must be lost
before BFD declares the neighbor down. The default is 3. This means that
with the FRR default settings, a failure will be detected after between
600 and 900 milliseconds. Let's test as before:
R4# ping 10.0.47.255
PING 10.0.47.255 (10.0.47.255): 56 data bytes
64 bytes from 10.0.47.255: seq=0 ttl=63 time=0.086 ms
64 bytes from 10.0.47.255: seq=1 ttl=63 time=0.106 ms
64 bytes from 10.0.47.255: seq=2 ttl=63 time=0.104 ms
64 bytes from 10.0.47.255: seq=4 ttl=60 time=0.148 ms
64 bytes from 10.0.47.255: seq=5 ttl=60 time=0.111 ms
64 bytes from 10.0.47.255: seq=6 ttl=60 time=0.102 ms
^C

This is with the default 180-second hold time, but thanks to BFD the
BGP session was declared down and rerouting happened fast enough
that we only lost a single ping packet.
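
How the timers and the detect multiplier combine into a detection time can be sketched as follows. This is a simplified Python model of BFD asynchronous mode as I read [RFC 5880], not FRRouting code: each side sends no faster than the other side is prepared to receive, and the receiving side applies the sender's multiplier. The class and function names are made up for illustration, and the result is the upper bound of the detection time.

```python
from dataclasses import dataclass

@dataclass
class BfdTimers:
    detect_multiplier: int
    transmit_ms: int   # how fast this side wants to send
    receive_ms: int    # how fast this side is prepared to receive

def send_interval_ms(sender: BfdTimers, receiver: BfdTimers) -> int:
    """Interval the sender actually uses: never faster than the receiver accepts."""
    return max(sender.transmit_ms, receiver.receive_ms)

def detection_time_ms(local: BfdTimers, remote: BfdTimers) -> int:
    """Time after which `local` declares `remote` down when packets stop arriving."""
    # The remote side's multiplier applies to the packets the remote side sends.
    return remote.detect_multiplier * send_interval_ms(remote, local)

frr_default = BfdTimers(detect_multiplier=3, transmit_ms=300, receive_ms=300)
profile_2seconds = BfdTimers(detect_multiplier=4, transmit_ms=500, receive_ms=500)

print(detection_time_ms(frr_default, frr_default))        # 900
print(detection_time_ms(profile_2seconds, frr_default))   # 1500
print(detection_time_ms(frr_default, profile_2seconds))   # 2000
```

Under these assumptions, the two ends of a session with different settings don't necessarily detect a failure at the same moment.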

The BFD state can be inspected as follows:


R1# show bfd peers
peer 2001:db8:30:8201::1 vrf default
ID: 2190318232
Remote ID: 1088911902
Active mode
Status: up
Uptime: 9 minute(s), 39 second(s)
Diagnostics: ok
Remote diagnostics: ok
Peer Type: configured
Local timers:
Detect-multiplier: 4
Receive interval: 500ms
Transmission interval: 500ms
Echo receive interval: 50ms
Echo transmission interval: disabled
Remote timers:
Detect-multiplier: 3
Receive interval: 300ms
Transmission interval: 300ms
Echo receive interval: 50ms

If BFD is enabled on a BGP session, this is also visible in the output of
the show ip bgp neighbors ... command:

R1# show ip bgp neighbors 192.0.2.41


BGP neighbor is 192.0.2.41, remote AS 65040, local AS 65082,
external link
...
Connections established 6; dropped 5
Last reset 00:18:21, Notification sent (Cease/BFD Down/Hard
Reset)
...
BFD: Type: single hop
Detect Multiplier: 3, Min Rx interval: 300, Min Tx interval: 300
Status: Up, Last update: 0:00:18:21

Graceful restart
What happened when we created link failures earlier in this chapter is
that the router on the AS 65082 end didn't notice the lost connectivity
for a while, as it was still waiting for the hold time to expire. Should
the link failure be fixed before the existing BGP session times out, then
AS 65040 will initiate a new BGP session before the old one has
disappeared on the AS 65082 side.

And what does the AS 65082 router do when it hears the good news
that AS 65040 is reachable again? It removes all the paths that it had
learned from the AS 65040 router from the BGP RIB and the FIB and
then quickly proceeds to process BGP updates from AS 65040 in order
to restore those same paths. That doesn't seem like the most optimal
approach.

This is the issue that the graceful restart mechanism [RFC 4724]
addresses. With graceful restart enabled, if a reconnecting neighbor can
tell a router that it has retained its forwarding state (the contents of the
FIB) during the BGP session disconnect, the router will not flush its
routes. Instead, it marks the routes that would normally be removed at
this stage as “stale”. It then proceeds to process the newly reconnected
neighbor's update messages. When the neighbor is finished sending all
the prefixes eligible to be sent, it sends the “end-of-RIB marker”. This
is the receiving router's cue to remove any stale entries that weren't
“refreshed” by an update message from the RIB and FIB.
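
The stale marking described above can be sketched as a small Python class. This is a toy model of the receiving side only, assuming one neighbor, ignoring withdrawals and the timers from [RFC 4724]. The class name, method names and the prefix 10.0.48.0/21 are made up for illustration.

```python
class HelperRib:
    """Paths learned from one neighbor, as kept by a graceful restart helper."""

    def __init__(self) -> None:
        self.paths: dict[str, str] = {}   # prefix -> AS path
        self.stale: set[str] = set()

    def neighbor_reconnected(self, forwarding_state_preserved: bool) -> None:
        if forwarding_state_preserved:
            self.stale = set(self.paths)   # keep forwarding, mark all as stale
        else:
            self.paths.clear()             # normal behavior: flush everything
            self.stale.clear()

    def update(self, prefix: str, as_path: str) -> None:
        self.paths[prefix] = as_path
        self.stale.discard(prefix)         # refreshed, so no longer stale

    def end_of_rib(self) -> None:
        for prefix in self.stale:          # never refreshed: remove now
            del self.paths[prefix]
        self.stale.clear()

rib = HelperRib()
rib.update("10.0.40.0/21", "65040")
rib.update("10.0.48.0/21", "65040")
rib.neighbor_reconnected(forwarding_state_preserved=True)
rib.update("10.0.40.0/21", "65040")       # only one prefix is re-announced
rib.end_of_rib()
print(sorted(rib.paths))                   # ['10.0.40.0/21']
```

Between the reconnect and the end-of-RIB marker, both prefixes stay in the table, so forwarding is never interrupted for the prefix that is refreshed.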

Where BFD helps to switch to alternative paths as fast as possible
when there is a failure, graceful restart does the opposite: it tries to
hold on to existing paths as long as possible. So in situations where
there are alternative paths and switching to them is easy, BFD shines.
In situations where there are no alternative paths, graceful restart fits
the bill.

Things get interesting when you use BFD and graceful restart together.
The rationale would be that BFD was created to detect forwarding
plane failures. If the BFD implementation can indeed detect that the
forwarding plane of a neighbor still works even though the control
plane (the CPU that runs the routing protocols and housekeeping) has
problems, then BFD has no reason to bring down a BGP session.

For instance, the BGP implementation may have crashed while the BGP
prefixes are still in the FIB. When BGP is restarted, it will reconnect
and graceful restart makes sure there is no unnecessary FIB
remove/reinstall cycle for BGP prefixes.

However, most BFD implementations aren't quite capable of pulling
this off. In those cases, BFD and graceful restart try to do opposite
things, so it's best to not use both at the same time.

In the FRRouting BGP implementation, graceful restart is enabled by
default, but not activated for any address family. This means the router
will only function as a “receiving speaker” in [RFC 4724] parlance. A
router in this mode is sometimes called a “graceful restart helper”.

The helper is the router that marks previous paths as stale and
removes remaining stale paths after the end-of-RIB marker. But that only
happens if the other router has graceful restart enabled for one or more
address families and tells the receiving router it was able to maintain
its forwarding plane state.

Some high availability routers use graceful restart to be able to fail
over from one route processor (that handles the control plane) to
another without interrupting packet forwarding. As such, it's good to
have graceful restart enabled in helper mode to facilitate the high
availability functionality on other routers.

Best practices

It's one thing to run BGP, but it's another thing to do it well. In this
chapter I'll cover some topics that haven't been addressed so far that
will make BGP run better. We'll also look at BGP best practices
suggested elsewhere.

“Black starts”
Electrical power stations can and do go down, either for planned
maintenance or because of some kind of issue that makes them “trip”.
But guess what: to start up again, they need electricity. Normally they
can simply get that over the grid from other power stations that are
still running. But what if all power stations in a large area are down?
Starting up after such a wide scale power outage is called a “black
start”. To make a black start possible, some power plants have
additional facilities so they can start up without relying on power from
the grid. The black start stations can then supply power to let the other
stations also start.

Something similar applies in networks. Not the actual startup of the
routers. They will automatically restart after a power failure when
power is restored. And after most crashes, they'll reboot. BGP and
other routing protocols will initialize and within a few minutes, packets
can start flowing again.

But what if the network becomes unreachable because of an issue with
RPKI ROAs, IRR records or because of errors in another network? At
this point, it's impossible to log into the RIR portal to update ROAs or
IRR objects. Or open a ticket with a service provider or look up
information about a third party network that may be leaking your
prefixes so you can contact them and ask them to stop.

One thing to think about: if the network depends on the distributed
RPKI repository, but at the same time getting at the distributed RPKI
repository and/or the RIR portal to make changes to ROAs depends on
the network, you have a circular dependency that makes fixing
problems a big challenge. So it's extremely important to always have a
way to connect to the resources you need to diagnose and fix problems
without having to use your own infrastructure.

This means that you have to be able to connect to relevant services
without having to depend on your own network for connectivity.
Which includes being able to use the right credentials when you're
using that alternative connectivity.

For instance, if you host mail servers in your own network, you will
probably not be able to send or receive email using your normal email
address during an outage. Or maybe you normally use a desktop
computer in the network operations center, but now you're using a
laptop with a 4G connection. But that laptop doesn't have access to
your normal password manager.

It's important to think through these scenarios and make sure a “black
start” can happen as quickly as possible.

Shutdown for maintenance


When doing maintenance, it's rather disruptive to just turn off a router,
unplug network connections, reboot the router or shut down network
links.

What you want to do before you perform any of these potentially
disruptive actions is shut down the relevant eBGP sessions. If you're
going to bring a link down, this would be the BGP session or sessions
that run over that link. If you're going to bring down or reboot the
router, shut down all eBGP sessions first. This is with the example 11
configurations running:

Router# conf t
Router(config)# router bgp 65082
Router(config-router)# neighbor 192.0.2.41 shutdown
Router(config-router)# neighbor ix-ipv4-peers shutdown
Router(config-router)# ^Z

The first line shuts down the BGP session with AS 65040 and the
second line shuts down the BGP sessions that are members of the peer
group ix-ipv4-peers, which are ASes 65083, 65084 and 4206508500.
In the show ip bgp summary overview it's made explicit that these
BGP sessions are administratively shut down:

Router# show ip bgp summary


Neighbor AS Up/Down State/PfxRcd PfxSnt
203.0.113.83 65083 00:00:06 Idle (Admin) 0
203.0.113.84 65084 00:00:06 Idle (Admin) 0
203.0.113.85 4206508500 00:00:06 Idle (Admin) 0
192.0.2.21 65030 00:10:45 5 1
192.0.2.41 65040 00:00:12 Idle (Admin) 0

On some routers it may be possible to add a message after shutdown
and that message will be sent to the neighbor(s) in question which will
probably add it to their log [RFC 8203]. The BGP sessions can be
reactivated with no neighbor ... shutdown.

The advantage of shutting down BGP sessions half a minute or so
before performing disruptive actions is that rerouting happens before
packet forwarding is disrupted. Without shutting down the BGP
sessions, packet forwarding is disrupted immediately, but rerouting will
take seconds to several minutes as BGP sessions time out and
destinations are rerouted over alternative paths. Skipping the shutdown
guarantees at least some packet loss, and users running interactive
network applications will see hiccups at a minimum and possibly stalls
or lost connections. With shutting down BGP sessions first, this impact
is minimized and users may not notice anything at all.

However, depending on the network topology it's possible that
shutting down BGP sessions still leads to disruption. To further minimize
this risk some routers implement a “graceful shutdown” mechanism
[RFC 8326]. After initiating a graceful shutdown of a BGP session, the
router will send updates with a special community to the neighbor in
question. The neighbor can then give those paths a lower local
preference, but still keep using them until it receives a path with a
better local preference. So to work as intended, graceful shutdown must
be supported on both ends. Regular administrative shutdown doesn't
require support from the neighbor. See Cisco's documentation of
graceful shutdown for more information.

Setting a maximum prefix limit


There are two approaches to limiting prefixes from a peer: with a very
exact limit or with a broad limit. Suppose this peer advertises 100
prefixes to us. We could then set the limit to something like 125. As this
peer starts to advertise more prefixes, we increase the number, always
leaving some margin. However, the issue with this is that this requires
very regular attention to increase the limits for different peers, and
even then there will be times when a peer increases the number of
prefixes they announce by more than the margin so the limit gets tripped.

In practice, when a peer “leaks” prefixes they shouldn’t be advertising,
they leak a lot of prefixes, often all of them, or at least thousands
(IPv6) or tens of thousands (IPv4). In that case, it doesn’t matter all that
much if our maximum gets tripped at prefix 126 or at prefix 10,001 two
seconds later. So just having a few broad ranges of prefix limits is
almost as good as having very specific limits, but the amount of work
required to manage the limits is much lower.

For instance, we can use a limit of 10,000 for all peers that advertise
fewer than 3000 prefixes, a limit of 100,000 for all peers that advertise
between 3000 and 30,000 prefixes, and no limit for peers that advertise
more. The IPv6 BGP table is about a tenth of the size of the IPv4 BGP
table, so there all these numbers could be a factor 10 lower.
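
The broad ranges above are easy to turn into a lookup. This is a minimal Python sketch, assuming the thresholds from the text are used as-is and that the IPv6 values are simply the IPv4 values divided by ten. The function name max_prefix_limit is made up, and the returned number would still have to be configured on the router by other means.

```python
from typing import Optional

def max_prefix_limit(advertised: int, ipv6: bool = False) -> Optional[int]:
    """Pick a broad maximum prefix limit; None means no limit."""
    scale = 10 if ipv6 else 1          # IPv6 thresholds are a factor 10 lower
    if advertised * scale < 3000:
        return 10_000 // scale
    if advertised * scale <= 30_000:
        return 100_000 // scale
    return None

print(max_prefix_limit(100))              # 10000
print(max_prefix_limit(5000))             # 100000
print(max_prefix_limit(50_000))           # None
print(max_prefix_limit(100, ipv6=True))   # 1000
print(max_prefix_limit(400, ipv6=True))   # 10000
```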

Flap damping and MRAI


Back in the 1990s, it sometimes happened that networks kept
“flapping”: first announce one or more prefixes and then quickly
withdraw those prefixes. And then announce them again, only to withdraw
them again. And so on. The reason for this was probably buggy BGP
implementations and/or unstable, flapping links. All this churn was
problematic for router CPUs.

The solution was a mechanism called flap damping [RFC 2439]. With
this system enabled, prefixes accrue a penalty when they flap. When
the penalty reaches a threshold, the prefix is suppressed and not
advertised to neighbors. The penalty decreases using an exponential
decay, and when it falls below a second, lower threshold, the prefix is
no longer suppressed and is advertised to neighbors again.
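
The penalty mechanism can be sketched in Python as follows. This is a toy model for a single prefix. The parameter values (1000 penalty points per flap, suppress at 2000, reuse below 750, a half-life of 900 seconds) are assumptions chosen for illustration and the class name is made up, so check your router's documentation for the values it really uses.

```python
class FlapDamping:
    """Penalty bookkeeping for one prefix. All parameter values are assumptions."""

    def __init__(self, penalty_per_flap=1000.0, suppress_limit=2000.0,
                 reuse_limit=750.0, half_life=900.0):
        self.penalty_per_flap = penalty_per_flap
        self.suppress_limit = suppress_limit
        self.reuse_limit = reuse_limit
        self.half_life = half_life        # seconds
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def _decay(self, now: float) -> None:
        # Exponential decay: the penalty halves every half_life seconds.
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / self.half_life)
        self.last_update = now
        if self.suppressed and self.penalty < self.reuse_limit:
            self.suppressed = False

    def flap(self, now: float) -> None:
        self._decay(now)
        self.penalty += self.penalty_per_flap
        if self.penalty >= self.suppress_limit:
            self.suppressed = True

    def is_suppressed(self, now: float) -> bool:
        self._decay(now)
        return self.suppressed

d = FlapDamping()
for t in (0, 10, 20):          # three flaps within 20 seconds
    d.flap(t)
print(d.is_suppressed(30))     # True
print(d.is_suppressed(3600))   # False
```

With these assumed values, three flaps in quick succession are enough to suppress the prefix, and it takes about half an hour of quiet before the penalty has decayed below the reuse threshold.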

To avoid the situation where flap damping by one network would be
seen as extra flapping by a network further downstream, RIPE
recommended a set of coordinated flap damping parameters culminating in
RIPE document 229 in 2001. However, RIPE document 378 from 2006
recommends against deploying flap damping:

“As the power of routers has increased, the original needs for
BGP Flap Damping is no longer a major concern for operators
or router equipment vendors as it was in the mid-1990s when
route flapping consumed a significant percentage of the CPU
of early routers. In fact, the negative effects of RFD, as
described above, have become the major concern, the cure has
become worse than the disease!”

However, it's likely that some networks still deploy flap
damping, so avoid withdrawing and then re-advertising
your prefixes multiple times in a short time to avoid your
prefixes becoming unreachable in those networks for several
minutes to half an hour.

Note that the BGP standard specifies that there is a minimum delay
between updates for the same prefix. This is the minimum route
advertisement interval (MRAI). The default value for the MRAI is 30
seconds for eBGP sessions and 5 seconds for iBGP sessions. So if a router
sends a withdraw for a certain prefix to an eBGP neighbor, but then a
few seconds later it has a new path for that prefix, it will sit out the 30
seconds before sending that next update. This has the advantage that if
more updates arrive during that 30 second period, those are coalesced
into a single update, rather than a stream of updates in quick
succession.

So the MRAI makes BGP more stable. But it also delays BGP
convergence. You can adjust the MRAI for a neighbor with neighbor ...
advertisement-interval <seconds> where routers will typically
accept a value between 0 and 600 seconds.
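
The coalescing effect of the MRAI can be illustrated with a small simulation. This Python sketch is a simplification that tracks a single prefix towards a single neighbor and treats withdrawals (shown as None) the same as other updates. The function name apply_mrai and the event times are made up for illustration.

```python
def apply_mrai(events, mrai=30):
    """events: list of (time, path) changes for one prefix, sorted by time.
    Returns the (time, path) updates actually sent to the neighbor."""
    sent = []
    next_allowed = 0      # earliest time the next update may go out
    pending = None        # latest change still waiting for the timer
    waiting = False
    for time, path in events:
        # Flush a waiting update if the timer ran out before this event.
        if waiting and next_allowed <= time:
            sent.append((next_allowed, pending))
            next_allowed += mrai
            waiting = False
        if time >= next_allowed:
            sent.append((time, path))      # timer not running: send at once
            next_allowed = time + mrai
        else:
            pending = path                 # replaces any earlier waiting change
            waiting = True
    if waiting:
        sent.append((next_allowed, pending))
    return sent

changes = [(0, None), (4, "65030 65040"), (9, "65040"), (50, "65030 65040")]
print(apply_mrai(changes))
# [(0, None), (30, '65040'), (60, '65030 65040')]
```

Four changes in the space of 50 seconds result in three updates, and the short-lived path that appeared at 4 seconds is never advertised at all.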

Limiting AS path length


There are some recommendations to filter out prefixes with excessively
long AS paths. I did a little research in 2019 and found that while
99.95% of AS paths have no more than 10 hops with prepends
removed and no more than 20 hops including prepends, there was a
single path of 5 hops plus 40 prepends, for a total of 45 hops. That
clearly serves no purpose.

Some routers allow filtering on the number of AS path hops. In that
case, allowing a maximum of, say, 50 hops seems reasonable.
Unfortunately, even though some people recommend filtering excessively
long AS paths and some actually do it, there is no agreement on where
the cutoff should be.

So my recommendation is to keep the number of prepends preferably
to at most three, and definitely at most five. If that number of prepends
doesn't give you what you need, use another way to achieve your
intended traffic engineering results. Try to keep total AS path length to
10 if possible, and definitely don't go over 20 if you can help it.
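
Counting hops with and without prepends is simple enough to sketch in Python. This is a minimal example that assumes the AS path is a plain space-separated list of AS numbers without AS sets. The function names are made up, and the 50 hop cutoff is just the value suggested above.

```python
def path_lengths(as_path: str) -> tuple[int, int]:
    """Return (hops including prepends, hops with prepends removed)."""
    hops = as_path.split()
    # A prepend is an AS number that repeats the one directly before it.
    without_prepends = [asn for i, asn in enumerate(hops)
                        if i == 0 or asn != hops[i - 1]]
    return len(hops), len(without_prepends)

def acceptable(as_path: str, max_hops: int = 50) -> bool:
    total, _ = path_lengths(as_path)
    return total <= max_hops

print(path_lengths("65030 65020 65010 65040"))          # (4, 4)
print(path_lengths("65030 65040 65040 65040 65040"))    # (5, 2)
print(acceptable("65030 " + "65040 " * 60))              # False
```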

Best practices documents


Several organizations and individuals have published BGP best
practices over the years. Interestingly, those don't all cover the same
recommendations. Most results when searching the web for “BGP best
practices” are limited in some way, for instance they only cover one
router vendor or they only cover some small aspect of BGP. However,
there are four documents that I think deserve qualified
recommendations, and one honorable mention:

ANSSI's BGP configuration best practices is a relatively long
document, but it gets right to the point while also explaining why each
practice is necessary or useful, and how to implement it.

Philip Smith's BGP Best Current Practices. Philip Smith has given this
presentation around the world since at least 2005. The slides have a ton
of suggested practices, but very little in the way of explanation why
they're necessary or useful. I'm sure he covers that in the presentation,
it's just not on these slides.

The Mutually Agreed Norms for Routing Security (MANRS) Network
Operator Actions. This MANRS guide is not exactly a best practices
guide. Rather, it goes into the four MANRS principles of network
operation:

1. Prevent propagation of incorrect routing information (filters)

2. Prevent traffic with spoofed source addresses

3. Facilitate communication between network operators

4. Facilitate validation of routing information (RPKI)

It does have extensive guidance and examples on how to set
everything up to accomplish these goals.

The NSA's A Guide to Border Gateway Protocol (BGP) Best Practices.
Unsurprisingly, this covers just BGP security practices.

Honorable mention: NIST's Border Gateway Protocol Security
recommendations. This has some good background information, but as the
document was published in 2007 and not updated, it's not a good
source of current best practices.

Martian and bogon filters


Some IPv4 and IPv6 address space set aside for special purposes
should not appear in the global BGP table. As such, it can't hurt, and
perhaps it will help, to filter out those prefixes from all updates
coming from all transit providers and peers. (Obviously you only allow
exactly the customer's prefix(es) from customers, right?)

For a short discussion on special addresses, see the section Special
addresses in the appendix on IP address notes. Packets with these
addresses in them are often called “martians”, as they seem to be coming
from Mars. Nobody on Earth could be sending such packets, after all?
The prefixes may also be called martians, and the filters that filter them
out martian filters.
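
What a martian filter does to incoming prefixes can be sketched with Python's ipaddress module. The list below only holds a handful of well-known IPv4 special-purpose blocks and is deliberately incomplete, and the function name is_martian is made up. A router would express this as a prefix list or filter in its own configuration language.

```python
import ipaddress

# A few well-known IPv4 special-purpose blocks; not a complete list.
MARTIANS = [ipaddress.ip_network(p) for p in (
    "0.0.0.0/8", "10.0.0.0/8", "127.0.0.0/8", "169.254.0.0/16",
    "172.16.0.0/12", "192.168.0.0/16", "224.0.0.0/4", "240.0.0.0/4",
)]

def is_martian(prefix: str) -> bool:
    """True if the prefix falls inside, or covers, a special-purpose block."""
    net = ipaddress.ip_network(prefix)
    return any(net.overlaps(m) for m in MARTIANS)

print(is_martian("10.0.40.0/21"))    # True
print(is_martian("192.168.1.0/24"))  # True
print(is_martian("193.0.0.0/21"))    # False
```

Note that the 10.0.40.0/21 prefix used in the minilab examples is itself private address space, so it would be caught by a filter like this.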

Personally, I'd be more worried about filtering martian packets rather
than martian prefixes. Some martian addresses may actually be used for
some valid purposes in your own network, and having packets with
martian source addresses enter your network may lead to undesirable
backscatter [W].

Another class of undesired packets and prefixes is bogons. These are
addresses from valid global unicast space that haven't been allocated
(or assigned) to an address holder by any of the RIRs. In an ideal
world, bogon prefixes wouldn't be able to end up in the global BGP
table. However, there are several issues with filtering bogons.

Although we're “out of IPv4 addresses”, surprisingly, adding up all
the IPv4 address space given out by the RIRs still leaves 25 million
addresses, or 0.7%, unaccounted for. So those could be subject to bogon
filtering. The IPv4 bogons list is relatively modest at some 1500
prefixes.

At the time of this writing, only 0.07% of IPv6 global unicast space has
been allocated by the RIRs. That's in about 60,000 prefixes, with
usually empty space between those prefixes so if an address holder needs
more space, their prefix can grow into that reserved space. This means
that the list of prefixes that describes the IPv6 unallocated space has no
fewer than 135,000 entries.

Whether such large filters can be stored in a router configuration and
applied to incoming updates fast enough depends a lot on the router
hardware and architecture. Routers can do a longest-match-first
lookup in a routing table very quickly using binary search [W] or more
advanced algorithms. Filters that use these algorithms will be fast even
when they're huge, but if filters are simply evaluated line by line from
start to finish, then such large filters are too slow.

Because the RIRs allocate address space every day, it's extremely
important to update bogon filters very regularly.

If despite these limitations, you want to apply bogon filters, have a look
at the Team Cymru bogon reference.

Tools and resources

There are many online tools and resources that make running BGP for
internet routing easier. First and foremost: BGP looking glasses. These
let anyone look inside the BGP tables of remote networks in order to
see what path is used to reach a certain prefix from the vantage point
of that network. They also often support ping and traceroute. Some
useful looking glasses:

• Telia, with locations in Sweden, Norway, Finland, Denmark and
Estonia

• British Telecom, with locations in Europe, a few in North America as
well as São Paulo, Johannesburg, Hong Kong, Tokyo, Singapore and
Sydney

• Seabone, with locations throughout most of the world

• Tata Communications, with many locations in North America, also
Europe and a few more around the world

A bit more typing, but a very good way to see what's happening with
the global BGP table is Route Views. This project runs a big router with
dozens of BGP feeds from other networks. You can connect to the
Route Views router with telnet:
% telnet route-views.routeviews.org
[...]
Username: rviews
route-views>

You can then execute show ip bgp (and show bgp ipv6) commands
on that router. This includes commands like show ip bgp regexp
_112$ to see all the prefixes originated by AS 112, for example. These
commands tend to run slow but can be very illuminating.
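
To get a feel for what a regular expression like _112$ matches, it can be translated to Python. This sketch assumes that the underscore stands for the start or end of the AS path, a space, or one of the delimiters , { } ( ), which is my understanding of how Cisco-style AS path regular expressions work. The function name cisco_to_python is made up and AS 3356 is just an arbitrary upstream AS for the test paths.

```python
import re

def cisco_to_python(pattern: str) -> str:
    """Translate the `_` wildcard of a Cisco-style AS path regexp."""
    return pattern.replace("_", r"(?:^|$|[ ,{}()])")

origin_112 = re.compile(cisco_to_python("_112$"))

print(bool(origin_112.search("3356 112")))    # True
print(bool(origin_112.search("3356 1112")))   # False
print(bool(origin_112.search("112 3356")))    # False
print(bool(origin_112.search("112")))         # True
```

The underscore is what keeps AS 1112 from matching, and the $ makes sure AS 112 is the last AS in the path, so the origin of the prefix.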

Another useful tool is RIPEstat, which shows all kinds of different
information about prefixes and ASes.

There are also services that will monitor your prefixes and warn you if
their status changes, like when a new AS starts originating (parts of)
them. These are typically paid-for services. An example is BGPmon.
There is also a BGPalerter tool that you can run yourself and that will
use RIPE RIS data from 600 ASes.

PeeringDB
PeeringDB is what the name suggests: a database with information
relevant for peering. The main interface to PeeringDB is the website,
but there's also an API and a whois interface. You'll find it hard to
set up much peering with other networks without registering
information about your AS in PeeringDB. A PeeringDB record looks like
this:
% whois -h whois.peeringdb.com as112
Network Information
===================

Name : DNS-OARC-112
Primary ASN : 112
Also Known As :
Website : https://www.as112.net/
IRR AS-SET : AS112
Network Type : Non-Profit
Approx IPv6 Prefixes : 2
Approx IPv4 Prefixes : 2
Looking Glass :
Route Server :
Created at : 2016-07-01T12:40:44Z
Updated at : 2022-07-27T05:33:16Z

Peering Policy Information
==========================

URL :
General Policy : Open
Location Requirement : Not Required
Ratio Requirement : False
Contract Requirement : Not Required

Public Peering Points (81)
==========================

Exchange Point    ASN  IP Address              Speed
--------------    ---  ----------              -----
AMS-IX            112  80.249.208.39           1G
                       2001:7f8:1::a500:112:1
Denver IX         112  149.112.18.9            10G
                       2001:504:109::9
THINX Warsaw      112  212.91.1.112            1G
                       2001:7f8:60::112:1
DE-CIX Barcelona  112  185.1.119.112           10G
                       2001:7f8:10a::70:0:1

This way, it's easy to see if a certain network peers in any common
locations, or which potential peers are available at an internet exchange
or a private peering facility. And when setting up the BGP sessions for
peerings over an IX, a peer's AS number and neighbor addresses are
easily found on PeeringDB.

Meetings and Network Operator Groups


Once a few people in an organization spend a good part of their time
on internet connectivity and routing related tasks, it makes sense to get
together with others doing the same thing.

RIPE meetings in Europe go all the way back to 1989 and the first
meeting of the North American Network Operators Group (NANOG)
was in 1994. NANOG is a bit more focussed on the technical aspects of
running an internet-connected network, while at RIPE meetings there
is also focus on the activities of the RIPE NCC. These days, there are
operator groups and/or meetings in most regions of the world:

• APNOG: Asia-Pacific, with two APRICOT meetings every year

• AfNOG: Africa

• CaribNOG: Caribbean

• MENOG: Middle East

• PacNOG: Pacific region

• SANOG: South Asia

Many countries [W] also have Network Operator Groups. The Global
Peering Forum (GPF) is less about network operations and more about
peering.

Other resources
The NANOG mailing list is an important resource for network
operators: it's often the first place to hear about significant network
incidents.

The IPv4 and IPv6 versions of the CIDR Report publish new statistics
about the global BGP table every day. Depressingly, the exact same
reachability information from the current 936,780 prefixes in the IPv4
BGP table could be expressed in 520,873 prefixes if networks didn't
needlessly de-aggregate their prefixes. And for IPv6 it's no better,
with 165,631 current prefixes vs 88,506 needed with aggregation.
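
As a small illustration of what aggregation means here, the following Python sketch collapses a handful of de-aggregated prefixes into the smallest equivalent set. The prefixes are made up, and the sketch only looks at the address ranges: a real analysis would also have to take AS paths and routing policy into account before deciding that two prefixes can be merged.

```python
import ipaddress

# Hypothetical de-aggregated announcements (private and documentation space).
announced = [
    "10.86.0.0/18", "10.86.64.0/18", "10.86.128.0/17",
    "198.51.100.0/25", "198.51.100.128/25",
    "203.0.113.0/24",
]

networks = [ipaddress.ip_network(p) for p in announced]

# collapse_addresses() merges adjacent and overlapping networks.
aggregated = list(ipaddress.collapse_addresses(networks))

print(len(announced), "prefixes announced")
print(len(aggregated), "prefixes needed with aggregation:")
for net in aggregated:
    print(" ", net)
```

Here, six announcements carry the same reachability information as three aggregates.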

A good place to see what's going on with global (and per RIR region)
RPKI deployment is the NIST RPKI Monitor. Globally, just over 40% of
IPv4 prefixes have RPKI state valid, with about a percent invalid. For
IPv6 it's 38% valid, but almost 5% invalid.
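
To show where the valid and invalid states come from, here is a minimal Python sketch of route origin validation in the spirit of RFC 6811. The ROA class, the function name and the sample data are assumptions made for this illustration; a real validator works on cryptographically verified ROAs rather than a hand-written list.

```python
import ipaddress
from typing import NamedTuple

class ROA(NamedTuple):
    prefix: str      # address block covered by the ROA
    max_length: int  # longest prefix length that may be announced
    asn: int         # AS that is authorized to originate the block

def validation_state(route, origin_as, roas):
    """Return 'valid', 'invalid' or 'not-found' for an announced route."""
    net = ipaddress.ip_network(route)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        # Skip ROAs for the other IP version or for unrelated address space.
        if net.version != roa_net.version or not net.subnet_of(roa_net):
            continue
        covered = True
        if roa.asn == origin_as and net.prefixlen <= roa.max_length:
            return "valid"
    # Covered by at least one ROA, but none of them matched: invalid.
    return "invalid" if covered else "not-found"

# Hypothetical ROA: AS 64496 may originate 192.0.2.0/24, but nothing longer.
roas = [ROA("192.0.2.0/24", 24, 64496)]

for route, origin in [("192.0.2.0/24", 64496), ("192.0.2.0/25", 64496),
                      ("192.0.2.0/24", 64511), ("198.51.100.0/24", 64496)]:
    print(route, origin, validation_state(route, origin, roas))
```

A prefix without any covering ROA ends up as not-found, which is why the valid and invalid percentages don't add up to 100.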

You can tell if there is any filtering/blocking of RPKI invalid prefixes
by trying to access invalid.rpki.cloudflare.com. If the page won't open
(or you can't ping the DNS name) then RPKI invalid filtering is going
on somewhere between your system and Cloudflare. A traceroute
invalid.rpki.cloudflare.com can help you find out where.

Appendix: the router command line

For more information on configuring Cisco routers using the
command line interface (CLI), see Cisco's documentation here. A good
number of different vendors use a similar or even very similar
interface. Others, most notably Juniper, use a quite different system.

This appendix is the really short version on how to use the CLI. I'll
assume you can log in to the router's CLI with Telnet, SSH or a console
cable. You then get a prompt like:

Router>

Before you can enter configuration commands, you need to enter
privileged or “enable” mode:

Router> enable
Password:
Router#

Note that the prompt now has a # instead of a >, indicating that the
router will now accept all commands, including configuration
commands.

With FRRouting or Quagga, you can use the vtysh tool and you're
immediately in “enable mode”.

To change the configuration, first enter configuration mode:

Router# configure terminal
Enter configuration commands, one per line. End with CNTL/Z.
Router(config)#

At this point, you can enter the examples from this book that start and
end with an exclamation mark. After entering configuration
commands, type exit (possibly multiple times) or control-Z.

At first, it may be confusing to have the separate command mode and
configuration mode, so check the prompt to see which mode you're in
when unexpected things happen. Also, there are multiple configuration
modes. For instance, we'll find ourselves in BGP configuration mode
pretty regularly:

Router# conf t
Router(config)# router bgp 65082
Router(config-router)#

If you need to get back to BGP configuration mode, you'll have to first
be in configuration mode (which we traditionally do with conf t
rather than configure terminal) and then use router bgp <AS>.

At any point you can type a question mark to see what options are
available. You only have to type enough characters for the router to
know which command you have in mind, for instance, sh ip bgp
sum rather than show ip bgp summary.
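
The idea behind these abbreviations can be sketched in a few lines of Python: an abbreviation is accepted as long as exactly one command matches it word by word. This is a simplified model with a made-up command list and function name, not how any particular router implements its parser.

```python
def expand(abbrev, commands):
    """Expand an abbreviated command if it matches exactly one full command."""
    words = abbrev.split()
    matches = []
    for command in commands:
        full = command.split()
        # Every typed word must be the start of the corresponding full word.
        if len(full) == len(words) and all(
            f.startswith(w) for f, w in zip(full, words)
        ):
            matches.append(command)
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError("unknown command: " + abbrev)
    raise ValueError("ambiguous command: " + abbrev)

# Hypothetical (and very incomplete) list of commands the router knows.
COMMANDS = [
    "show ip bgp summary",
    "show ip bgp neighbors",
    "show ip route",
    "configure terminal",
]

print(expand("sh ip bgp sum", COMMANDS))
print(expand("conf t", COMMANDS))
```

If two commands match the same abbreviation, the sketch raises an error instead of guessing, which is the point where a router asks you to type a few more characters.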

If you enter an incorrect command, and the command is accepted by
the router, remove it by entering the same command again with no in
front of it:
Router(config-router)# neighbor 192.0.2.21 description ISP 20
Router(config-router)# no neighbor 192.0.2.21 description ISP 20
Router(config-router)# neighbor 192.0.2.21 description ISP 30

In some situations it's enough to simply repeat the command with the
right parameter:
Router(config-router)# neighbor 192.0.2.21 description ISP 20
Router(config-router)# neighbor 192.0.2.21 description ISP 30

Cisco, Quagga, FRR configuration differences


If you're using a (really) old Cisco router and you want to try out some
of the BGP configuration examples, you'll have to provide some
configuration boilerplate:
!
ip subnet-zero
!
router bgp 65086
no synchronization
network 10.86.0.0 mask 255.255.0.0
neighbor 203.0.113.84 send-community
no auto-summary
!
ip bgp-community new-format
!
no ip http server
ip classless
!

Note especially the network ... mask ... notation with a netmask
rather than a prefix length. To a Quagga router, the extra settings above
are meaningless, because they're either already the default or not even
supported.
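
If you need to go back and forth between the two notations, Python's ipaddress module can do the conversion. This is just a small sketch; the two function names are made up for the occasion.

```python
import ipaddress

def to_mask_notation(prefix):
    """Turn 10.86.0.0/16 into the network ... mask ... form."""
    net = ipaddress.ip_network(prefix)
    return f"network {net.network_address} mask {net.netmask}"

def to_prefix_notation(address, mask):
    """Turn an address and a netmask into address/length notation."""
    return str(ipaddress.ip_network(f"{address}/{mask}"))

print(to_mask_notation("10.86.0.0/16"))
print(to_prefix_notation("10.86.0.0", "255.255.0.0"))
```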

FRRouting moves away from Quagga in a number of ways. See the full
FRRouting documentation here. If you're going to spend any time with
the FRR command line through the vtysh utility, you definitely want
to add this to your vtysh.conf files:
!
terminal paginate
!

This way, long output will pause after each page. An interesting
feature of FRR is [RFC 8212] support. This RFC mandates that an eBGP
session must have an incoming and an outgoing filter configured before
incoming updates are added to the BGP table and outgoing updates are
generated, respectively. In production environments this will
generally not be a problem. But it may lead to surprises when migrating
from something older to FRR. Turn this off with
no bgp ebgp-requires-policy under the router bgp heading.

FRR does break compatibility with other routers and routing software
by changing the following:

ip as-path access-list ...
ip community-list ...
ip large-community-list ...

to:

bgp as-path access-list ...
bgp community-list ...
bgp large-community-list ...

This makes it impossible to create a non-trivial BGP configuration that
works on both FRR and other routers. Also note that unlike Quagga,
FRR will load a configuration file that has configuration commands
that it doesn't recognize and simply ignore the unrecognized
commands without an error message. So a Cisco/Quagga config file with
ip as-path access-list ... in it will seemingly work, except that
the AS path access list will be missing.
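
One way to avoid that surprise is to rewrite the three affected commands before loading an existing configuration into FRR. The following Python sketch does only that; the function name and the sample configuration are made up, and it deliberately leaves every other line (including no ip ... lines) alone.

```python
import re

# The three commands that FRR expects under "bgp" rather than "ip".
RENAMED = ("as-path access-list", "community-list", "large-community-list")

PATTERN = re.compile(
    r"^([ \t]*)ip (" + "|".join(re.escape(k) for k in RENAMED) + r")\b",
    re.MULTILINE,
)

def convert_for_frr(config):
    """Rewrite 'ip <list type> ...' lines to their 'bgp <list type> ...' form."""
    return PATTERN.sub(r"\g<1>bgp \g<2>", config)

# Hypothetical Cisco/Quagga style fragment.
sample = (
    "ip as-path access-list 2 permit ^[0-9]+$\n"
    "ip community-list standard blackhole permit 65030:666\n"
    "ip prefix-list in-filter permit 192.0.2.0/24\n"
)

print(convert_for_frr(sample))
```

Note that ip prefix-list is not in the list above, so the sketch leaves it unchanged.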

The handling of address families is also different between Cisco,
Quagga and FRRouting. When Cisco routers started supporting IPv6,
the IPv6 specific part of a BGP configuration was placed in an
address-family ipv6 unicast section under the router bgp ...
heading. The IPv4 specific part of a BGP configuration, such as
neighbor ... prefix-list ... (that refers to an IPv4 prefix list), could
still be mixed with the settings that aren't IPv4 or IPv6 specific, such as
neighbor ... remote-as ....

However, Cisco routers would often decide it was time to cluster the
IPv4 BGP settings in an address-family ipv4 unicast section. And
then you'd be stuck with that. Quagga, on the other hand, seems to
stick with the mixing, either much longer or always. FRR, in turn,
always uses an address-family ipv4 unicast section.

However, all routers accept commands entered in the mixed style. But
with FRR you then don't get command completion with tab or command
listing with ?.

Appendix: BGP minilab

You can run most of the examples in a “BGP minilab” so you can see
how they work and perform your own experiments. The minilab uses
virtual FRRouting routers that run in Docker containers. The practice
network is set up as follows:

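[Figure: network diagram showing ISP 10 (AS 65010), ISP 20 (AS 65020),
ISP 30 (AS 65030) and ISP 40 (AS 65040), the internet exchange (IX,
AS 65090) with its route server (Rserv), and networks 82 (AS 65082),
83 (AS 65083), 84 (AS 65084) and 85 (AS 4206508500)]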

Figure 8: BGP mini lab practice network

The components of the practice network are:

• “Network 82”, our own network. The main router is R1 or simply
Router, with three additional routers (R2, R3 and R4) that are used
in later examples. Network 82 gets transit service from ISPs 30
and 40, and can peer with networks 83, 84 and 85 through the
internet exchange.

• ISPs 10, 20, 30 and 40, where ISPs 10 and 40 sit at the top of the
hierarchy and peer with each other. ISP 30 is a transit customer of
ISP 20, and ISP 20 is a transit customer of ISP 10.

• An internet exchange with a route server.

• Three peers: networks 83, 84 and 85. These are also all customers
of ISPs 30 and 40 and connect to the internet exchange.

Installing the minilab and running examples


Install the minilab on your own computer as follows:

• Install Docker

• Under Windows: make sure it's possible to run PowerShell scripts

• Download the example configurations and the supporting
scripts from my website and unzip them

• Start Docker

• Start the command line: terminal (Mac), shell or xterm (Linux) or
PowerShell (Windows) and make the folder/directory with the
downloaded examples your current directory

With Docker running you can use the following scripts:

• start.sh / start.ps1: starts the virtual routers and loads the
configurations to run an example

• connectrouter.sh / connectrouter.ps1: connects to an already
running virtual router

• stoprouters.sh / stoprouters.ps1: stops all running virtual
routers

To run an example, use the start script with the example keyword
followed by the example number (or name). So on Mac/Linux:
./start.sh example 1

On Windows:

.\start.ps1 example 1

This will start up the required virtual routers and connect you to the
main router “Router” a.k.a. Router82. When you log out, all the virtual
routers are shut down. If you want to run several examples, add the
keeprunning argument when starting an example, like:

.\start.ps1 keeprunning 1

This way, when you disconnect from the virtual router, some of the
“support” virtual routers are kept running so they don't have to be
restarted when starting another example. You can use the additional
keyword detach to run router82 in the background, making it easier to
connect to different routers. Some examples do this automatically.

The connectrouter scripts take a router number as the first argument
and will connect you to that router. When you log out, the virtual
router keeps running. You can also add a command, and then the script
will run that command and return while the virtual router keeps
running:

% ./connectrouter.sh 82 show ip bgp summary

Router# show ip bgp summary
BGP router identifier 192.0.2.251, local AS number 65082
RIB entries 5, using 560 bytes of memory
Peers 2, using 18 KiB of memory

Neighbor    AS     MsgRcvd  MsgSent  TblVer  InQ  OutQ  Up/Down   State
192.0.2.21  65030  3        6        0       0    1     00:00:02  1
192.0.2.41  65040  4        5        0       0    1     00:00:02  1

Total number of neighbors 2

Total num. Established sessions 2
Total num. of routes received 2
%

Use the stoprouters script to stop all running virtual routers.

Saving your configuration overwrites the existing example
configuration. So use the write command with care.

Appendix: a non-converging BGP configuration

This is an example of a network with four autonomous systems (A - D)
with one router each that are all interconnected. A announces a prefix
to the three others. Normally, each router would prefer its direct path
to A, but the following policies are in effect:

• B prefers the path through C as long as that path is less than three hops

• C prefers the path through D as long as that path is less than three hops

• D prefers the path through B as long as that path is less than three hops

So when B establishes a connection to A, it has no other choice and
uses the path A. When C establishes connections to A and B, it also
uses the direct path to A. But when D establishes connections to the
other three networks, it prefers the path B A.

B now sees the path C A so it switches over to that path. This means
that D now sees the path B C A, which is three hops so it switches to
the direct path to A.

C now sees the path D A so it switches to that path. B now sees C D A,
which is three hops, so it switches to its direct path to A.

We’re now back to our initial situation so the cycle starts from scratch.
This cycling will go on forever.
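
The reasoning above can be checked with a small simulation. The Python sketch below models only the three policies and the "fewer than three hops" rule; it is not a BGP implementation, and all names in it are made up for this illustration. It first looks for a situation in which every router is happy with its path, and then lets the routers re-run their path selection one after the other, in an arbitrarily chosen order, to see whether they ever settle down.

```python
from itertools import product

# Each router prefers the path through this neighbor when it is short enough.
PREFERRED = {"B": "C", "C": "D", "D": "B"}
DIRECT = ("A",)

def best_path(router, chosen):
    """Path selected by router, given the paths its neighbors currently use."""
    neighbor = PREFERRED[router]
    via = (neighbor,) + chosen[neighbor]
    if router not in via and len(via) < 3:
        return via      # preferred neighbor, fewer than three hops
    return DIRECT       # otherwise the shortest path wins: directly to A

def stable_states():
    """All situations in which no router wants to change its path."""
    options = [[DIRECT, (n, "A")] for n in PREFERRED.values()]
    stable = []
    for combo in product(*options):
        chosen = dict(zip(PREFERRED, combo))
        if all(best_path(r, chosen) == chosen[r] for r in PREFERRED):
            stable.append(chosen)
    return stable

def converges(order=("D", "B", "C"), max_rounds=20):
    """Let the routers select paths in turn; report whether they settle."""
    chosen = {r: DIRECT for r in PREFERRED}
    seen = set()
    for _ in range(max_rounds):
        changed = False
        for router in order:
            new = best_path(router, chosen)
            changed = changed or new != chosen[router]
            chosen[router] = new
        if not changed:
            return True     # a full round without changes: converged
        state = tuple(sorted(chosen.items()))
        if state in seen:
            return False    # we have been here before: a loop
        seen.add(state)
    return False

print("stable states:", stable_states())
print("converges:", converges())
```

The sketch finds no stable situation at all, so the oscillation doesn't depend on the order in which the routers happen to process updates.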

Example A, router A:
!
router bgp 65065
network 192.0.2.0/24
neighbor 203.0.113.66 remote-as 65066
neighbor 203.0.113.66 advertisement-interval 1
neighbor 203.0.113.67 remote-as 65067
neighbor 203.0.113.67 advertisement-interval 1
neighbor 203.0.113.68 remote-as 65068
neighbor 203.0.113.68 advertisement-interval 1
!

Example A, router B:
!
router bgp 65066
neighbor 203.0.113.65 remote-as 65065
neighbor 203.0.113.65 advertisement-interval 1
neighbor 203.0.113.67 remote-as 65067
neighbor 203.0.113.67 route-map preferred in
neighbor 203.0.113.67 advertisement-interval 1
neighbor 203.0.113.68 remote-as 65068
neighbor 203.0.113.68 advertisement-interval 1
!
ip as-path access-list 2 permit ^[0-9]+$
ip as-path access-list 2 permit ^[0-9]+_[0-9]+$
!
route-map preferred permit 10
match as-path 2
set local-preference 200
!
route-map preferred permit 20
!

Example A, router C:
!
router bgp 65067
neighbor 203.0.113.65 remote-as 65065
neighbor 203.0.113.65 advertisement-interval 1
neighbor 203.0.113.66 remote-as 65066
neighbor 203.0.113.66 advertisement-interval 1
neighbor 203.0.113.68 remote-as 65068
neighbor 203.0.113.68 advertisement-interval 1
neighbor 203.0.113.68 route-map preferred in
!
ip as-path access-list 2 permit ^[0-9]+$
ip as-path access-list 2 permit ^[0-9]+_[0-9]+$
!
route-map preferred permit 10
match as-path 2
set local-preference 200
!
route-map preferred permit 20
!

Example A, router D:
!
router bgp 65068
neighbor 203.0.113.65 remote-as 65065
neighbor 203.0.113.65 advertisement-interval 1
neighbor 203.0.113.66 remote-as 65066
neighbor 203.0.113.66 advertisement-interval 1
neighbor 203.0.113.66 route-map preferred in
neighbor 203.0.113.67 remote-as 65067
neighbor 203.0.113.67 advertisement-interval 1
!
ip as-path access-list 12 permit ^[0-9]+$
ip as-path access-list 12 permit ^[0-9]+_[0-9]+$
!
route-map preferred permit 10
match as-path 12
set local-preference 200
!
route-map preferred permit 20
!

The relevant differences between the configurations are that router A
(AS65065) announces our test prefix 192.0.2.0/24, and routers B - D
all have the route map preferred applied to incoming updates from
one of the others. This route map uses an AS path access list (2 on
routers B and C, 12 on router D) to match AS paths with one or two AS
numbers in them. In that case, the local preference is set to 200.
Routes that have more than two ASes in the AS path and routes from the
other two neighbors have the default local preference of 100.

The advertisement-interval 1 setting that's applied to all BGP
sessions reduces the minimum route advertisement interval (MRAI) from
the default 30 seconds to one second so updates go out faster.

Appendix: IP address notes

IPv4 subnetting cheat sheet

Prefix  Subnet mask      From     To               Increment
length
/0      0.0.0.0          0.0.0.0  255.255.255.255          -
/1      128.0.0.0        0.0.0.0  127.255.255.255  128.0.0.0
/2      192.0.0.0        0.0.0.0  63.255.255.255    64.0.0.0
/3      224.0.0.0        0.0.0.0  31.255.255.255    32.0.0.0
/4      240.0.0.0        0.0.0.0  15.255.255.255    16.0.0.0
/5      248.0.0.0        0.0.0.0  7.255.255.255      8.0.0.0
/6      252.0.0.0        0.0.0.0  3.255.255.255      4.0.0.0
/7      254.0.0.0        0.0.0.0  1.255.255.255      2.0.0.0
/8      255.0.0.0        0.0.0.0  0.255.255.255      1.0.0.0
/9      255.128.0.0      0.0.0.0  0.127.255.255      128.0.0
/10     255.192.0.0      0.0.0.0  0.63.255.255        64.0.0
/11     255.224.0.0      0.0.0.0  0.31.255.255        32.0.0
/12     255.240.0.0      0.0.0.0  0.15.255.255        16.0.0
/13     255.248.0.0      0.0.0.0  0.7.255.255          8.0.0
/14     255.252.0.0      0.0.0.0  0.3.255.255          4.0.0
/15     255.254.0.0      0.0.0.0  0.1.255.255          2.0.0
/16     255.255.0.0      0.0.0.0  0.0.255.255          1.0.0
/17     255.255.128.0    0.0.0.0  0.0.127.255          128.0
/18     255.255.192.0    0.0.0.0  0.0.63.255            64.0
/19     255.255.224.0    0.0.0.0  0.0.31.255            32.0
/20     255.255.240.0    0.0.0.0  0.0.15.255            16.0
/21     255.255.248.0    0.0.0.0  0.0.7.255              8.0
/22     255.255.252.0    0.0.0.0  0.0.3.255              4.0
/23     255.255.254.0    0.0.0.0  0.0.1.255              2.0
/24     255.255.255.0    0.0.0.0  0.0.0.255              1.0
/25     255.255.255.128  0.0.0.0  0.0.0.127              128
/26     255.255.255.192  0.0.0.0  0.0.0.63                64
/27     255.255.255.224  0.0.0.0  0.0.0.31                32
/28     255.255.255.240  0.0.0.0  0.0.0.15                16
/29     255.255.255.248  0.0.0.0  0.0.0.7                  8
/30     255.255.255.252  0.0.0.0  0.0.0.3                  4
/31     255.255.255.254  0.0.0.0  0.0.0.1                  2
/32     255.255.255.255  0.0.0.0  0.0.0.0                  1
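
The rows of the cheat sheet can also be computed rather than looked up. The Python sketch below derives the same columns for any prefix length using the ipaddress module; the function name is made up, and unlike the table it prints the increment as a full dotted quad (so 0.0.4.0 rather than 4.0).

```python
import ipaddress

def cheat_sheet_row(prefix_length):
    """Return (prefix length, subnet mask, from, to, increment) as strings."""
    net = ipaddress.ip_network(f"0.0.0.0/{prefix_length}")
    # The increment is the number of addresses in a block of this size,
    # which is also the distance to the start of the next block.
    increment = net.num_addresses
    # For /0 the increment is 2**32, which doesn't fit in an IPv4 address.
    step = "-" if prefix_length == 0 else str(ipaddress.ip_address(increment))
    return (
        f"/{prefix_length}",
        str(net.netmask),
        str(net.network_address),
        str(net.broadcast_address),
        step,
    )

for length in (8, 22, 24, 30):
    print(*cheat_sheet_row(length))
```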

Special addresses
IANA, the Internet Assigned Numbers Authority, has registries for
special IPv4 and IPv6 addresses. For all the details, consult those. But
I'll repeat the most notable special address ranges here. These are all
the IPv4 address blocks that are used by more than a single holder.
Because they're not globally unique, they have no business appearing in
the global BGP table, and packets with source addresses in those
ranges are invalid on the internet. (In some cases there is a valid local
use.)

Prefix              Purpose
0.0.0.0/8           Default route
10.0.0.0/8          Private use
100.64.0.0/10       Service provider NAT
127.0.0.0/8         Loopback interface
169.254.0.0/16      Link local (self-assigned in absence of a DHCP server)
172.16.0.0/12       Private use
192.0.2.0/24        Documentation
192.168.0.0/16      Private use
198.51.100.0/24     Documentation
203.0.113.0/24      Documentation
224.0.0.0/4         Class D: multicast
255.255.255.255/32  Local broadcast address
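
A list like this is easy to turn into a quick check. The Python sketch below tells you whether an IPv4 prefix falls inside one of the blocks from the table above; the dictionary and function names are made up, the purposes are copied from the table, and for real filtering you'd want to work from the IANA registries instead.

```python
import ipaddress

# The special IPv4 blocks from the table above.
SPECIAL_IPV4 = {
    "0.0.0.0/8": "Default route",
    "10.0.0.0/8": "Private use",
    "100.64.0.0/10": "Service provider NAT",
    "127.0.0.0/8": "Loopback interface",
    "169.254.0.0/16": "Link local",
    "172.16.0.0/12": "Private use",
    "192.0.2.0/24": "Documentation",
    "192.168.0.0/16": "Private use",
    "198.51.100.0/24": "Documentation",
    "203.0.113.0/24": "Documentation",
    "224.0.0.0/4": "Class D: multicast",
    "255.255.255.255/32": "Local broadcast address",
}

def special_purpose(prefix):
    """Return the purpose if the IPv4 prefix lies within a special block."""
    net = ipaddress.ip_network(prefix)
    for block, purpose in SPECIAL_IPV4.items():
        if net.subnet_of(ipaddress.ip_network(block)):
            return purpose
    return None

for prefix in ("10.86.0.0/16", "203.0.113.84/32", "193.0.0.0/21"):
    print(prefix, "->", special_purpose(prefix))
```

The sketch only flags prefixes that sit entirely inside a special block. A less specific prefix that merely contains one of them is a separate question.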

An interesting case is the class E block: 240.0.0.0/4. These addresses
were set aside for “future use” when IPv4 was created. But when we
ran out of IPv4 addresses and being able to use those 268 million
addresses would have been helpful, it turned out that many
implementations wouldn't allow class E addresses to be configured.
Those were set aside for future use, after all. As a result, these
addresses remain unused.

These are the IPv6 address blocks that should not appear in the global
BGP table and shouldn't be used in source addresses for packets that
flow across the internet:

Prefix         Purpose
::/3           Various special purposes
2001:db8::/32  Documentation
fc00::/7       Unique-local
fe80::/10      Link-Local unicast
fec0::/10      Previously: site-local addresses
ff00::/8       Multicast

In practice, so far all global unicast IPv6 address space
allocated/assigned by the RIRs falls within 2000::/3. However,
[RFC 3177] warns against making assumptions about the IPv6 address
space not currently set aside for global unicast use, so the remaining
85% of the IPv6 address space doesn't end up unusable like the IPv4
class E space mentioned above. In theory, this also applies to ::/3.
This block is somewhat special, and so far, parts of this address block
are only used for special purposes. If you want to be conservative in
your filtering, you may want to filter these special purpose blocks as
listed by IANA rather than the entire ::/3.

About the author

Iljitsch van Beijnum got his start in the Dutch Internet Service Provider
business in 1995. He soon realized that in order to maintain more than
one connection to the internet, you need something called “BGP”. In
1997, he co-founded Pine Internet (later Pine Digital Security). In 1999,
he worked for UUNET Netherlands on designing and implementing a
new Dutch high speed backbone.

In 2000, Iljitsch started his own business now called inet⁶ consult.
Between 2000 and 2007, he mostly did work for web hosting companies,
among other things helping them connect to internet exchanges.

In 2002, he authored “BGP, Building Reliable Networks with the
Border Gateway Protocol”, published by O'Reilly, and in 2005 “Running
IPv6”, published by Apress. He also started attending IETF meetings
in 2002.

In 2007, Iljitsch started writing for Ars Technica. Later that year he
moved to Spain to become a research assistant at UC3M, where he did
more IETF work, most notably on NAT64 and DNS64 as well as a
suggested improvement to BGP. Iljitsch holds a bachelor's degree in
Information and Communication Technology from the Haagse
Hogeschool in The Hague and a master's degree in telematics from
UC3M Madrid.

After returning to the Netherlands, in 2016 Iljitsch joined Logius, an
agency of the Dutch Ministry of the Interior and Kingdom Relations.
As a network architect, he was responsible for the Dutch
government-wide IPv6 numbering plan. In 2019 he left Logius to return
to being independent.

Follow Iljitsch on Twitter or connect on LinkedIn.

Copyright and acknowledgments

Copyright © 2022 Iljitsch van Beijnum

Edition: 1.0, 2022-11

The information in this book is provided as-is. Although believed to be
correct at the time of publication, it is possible that some information
in this book is incorrect or out of date.

Cover photo: Chantal de Bruijne / Shutterstock

About 140 words from Cisco's description of the BGP path selection
algorithm were used in the description of the algorithm in this book.

Router and switch icons from Cisco were used in the figures in this
book.

AS 65030 communities for example 21 were inspired by Level3’s
community information published in the RIPE database, with a few
sentences copied over.
