Internet Routing With BGP
Introduction
BGP is a routing protocol: its main job is to allow each network to learn
which ranges of IP addresses are used where, so packets can flow
along the correct route.
However, BGP has a more difficult job to do than other routing protocols.
Yes, it has to make the packets reach their destination, but BGP
also has to pay attention to the business side: those packets only get to
flow over a network link if either the sender or the receiver pays for
the privilege.
This book covers the fundamentals of the technical side of BGP, and
also looks at the intersection between the technical and business aspects
of internet routing.
The book contains 40 configuration examples that readers can try out
on their own computer in a “BGP minilab”.
Table of contents
Introduction
About this book
Intended audience
Conventions used in this book
Internet routing
IP addresses
Classes
Subnet masks
Classless Inter-Domain Routing (CIDR)
IPv6
The BGP protocol
The IETF
Distance vector vs link state
BGP versions
Autonomous Systems
BGP neighbor relationships
BGP messages
Path attributes
Multiprotocol BGP
BGP states and finite-state machine
BGP operation
BGP prerequisites
Connectivity
Router hardware
IP addresses and AS numbers
BGP configuration 101
Filtering BGP
AS path filters
Prefix filters
Community-based filters
Consistency between filters
Transit and peering
Internet exchanges
The business of peering: peering policies
Hot potato routing
Valley-freeness
BGP peering configuration
Peer groups
Internet exchange route servers
Traffic engineering
The BGP path selection algorithm
Route maps
Setting the local preference
AS path prepending
Setting and adjusting the MED
Influencing neighboring networks with communities
Announcing more specific prefixes
Multipath BGP
ECMP load balancing strategies
iBGP
iBGP and internal routing protocols
Loopback addresses for iBGP
Route reflectors
BGP security
MD5 passwords
The “TTL hack”: GTSM
Some scary stories
Internet Routing Registries
RPKI
BGPsec
So how secure is BGP?
Making BGP faster
Adjusting the BGP timers
BFD: bidirectional forwarding detection
Graceful restart
Best practices
“Black starts”
Shutdown for maintenance
Setting a maximum prefix limit
Flap damping and MRAI
Limiting AS path length
Best practices documents
Martian and bogon filters
Tools and resources
PeeringDB
Meetings and Network Operator Groups
Other resources
Appendix: the router command line
Cisco, Quagga, FRR configuration differences
Appendix: BGP minilab
Installing the minilab and running examples
Appendix: a non-converging BGP configuration
Appendix: IP address notes
IPv4 subnetting cheat sheet
Special addresses
About the author
Copyright and acknowledgments
About this book
I already wrote a book about BGP back in 2002. So why another one?
What I’ve learned over the years is that at its core, BGP is quite simple.
However, there are many hidden nuances and caveats that people
usually only begin to understand when they run into them in practice.
But learning those things the hard way on a live network is less than
ideal.
To keep both the writing (for me) and reading (for you) of this book
manageable, the book only covers the BGP protocol and BGP configuration
for connecting a network to the internet. There is a lot more to
running a network; please find that information in other books and
online resources. BGP is also extensively used in data centers and
enterprise networks. This is also not covered in this book.
You’ll get the most out of this book by running the virtual example
network yourself and trying out the examples. With today’s technology,
it's possible to use Docker to run a bunch of virtual routers on a regular
Windows, macOS or Linux system. The examples are based on Free
Range Routing, open source routing software that is configured very
similarly to “classic IOS” Cisco routers. However, the exact configuration
language isn’t the point; once you understand the concepts, looking up
the right keywords in the vendor documentation is the easy part.
That said, if you’d like to see configuration examples for other types of
routers, please let me know and I may be able to add those to a future
version of this book.
Intended audience
This book is intended to be useful for anyone who wants or needs to
know more about BGP, and how BGP is used for internet routing. A
large part of the book discusses configuration examples, but even if
you skip those, you should still get a good feel for the problems BGP
solves. (And sometimes creates!)
If you're reading the (reflowable) e-book version of this book, try to make
your font size and/or window size settings such that this text will fit
on a single line without wrapping:
This is how long some lines in the example router output may get...
This way, the example output will be formatted as intended and easier
to read. If you find that setting the font size so the line fits makes the
text too small, try turning the reader into landscape mode, possibly
only for looking at the examples.
Internet routing
However, the real complication with internet routing is that the job is
not simply finding the shortest path between any two locations, but
also taking into consideration the business aspects of running a network.
What if networks A and B both connect to Microsoft? After all,
users of both networks A and B want to be able to download their
Windows updates and work on their Office365 documents with the
highest possible performance.
network isn’t used for traffic that falls outside the scope of the services
they provide.
IP addresses
Especially the first part of this chapter is pretty basic, but in order to
make sure that everyone is on the same page later in the book, I’m going
over this material anyway. Feel free to skip this chapter; you can
always come back to it later if necessary.
For larger networks, this doesn’t work very well, as the overhead of
keeping track of where in the network a given MAC address is used
quickly becomes problematic. So protocols intended for larger internetworks
[W] use addresses that consist of a network part and a local
part (or “host” part). This way, routing tables remain manageable:
routers only need to know which network is used where, rather than
keep track of individual addresses.
Classes
Of course we want to have both a large number of networks, as well as
a large number of local addresses per network. However, that way,
addresses get rather large. So the designers of the Internet Protocol (IP)
used a trick and allowed for a small number of very large networks
(class A), a medium-sized number of medium-sized networks (class B)
and a large number of small networks (class C).
• Class B networks are numbered from 128.0 to 191.255, with local
addresses numbered from 0.0 to 255.255
See the subnet cheat sheet at the end of the book for a full table of subnet
sizes. Computers and routers can now easily use binary operations
to derive the network and host parts of the address.
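The binary operations mentioned here can be sketched in a few lines of Python. This is a minimal illustration, not taken from the book: the function name `split_address` and the example address are assumptions chosen for the demonstration.

```python
import ipaddress

def split_address(address: str, mask: str) -> tuple[str, str]:
    """Split an IPv4 address into network and host parts using a subnet mask."""
    addr = int(ipaddress.IPv4Address(address))
    netmask = int(ipaddress.IPv4Address(mask))
    network = addr & netmask               # keep only the bits covered by the mask
    host = addr & ~netmask & 0xFFFFFFFF    # keep only the bits outside the mask
    return str(ipaddress.IPv4Address(network)), str(ipaddress.IPv4Address(host))

# Hypothetical host address inside the documentation network 192.0.2.0/24.
print(split_address("192.0.2.130", "255.255.255.0"))
```

A bitwise AND with the mask yields the network part; a bitwise AND with the inverted mask yields the host part.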
Back in the late 1980s and the early 1990s, many universities started to
connect to the internet. So they needed address space. For a university,
a class A network is way too big; a university is not going to connect
millions of devices. (Well, UCL, MIT and Stanford at some point held
networks 11, 18 and 36, respectively [RFC 790].)
On the other hand, a class C network with only 256 addresses is generally
not enough for a university. So they tended to get class B address
blocks. But as there are only 16384 of those, they started to run out
pretty quickly. A university would perhaps need 4000 addresses, wasting
more than 60,000 when using a class B address block. The alternative
was to use a number of class C blocks instead. For instance, 16
class Cs add up to just over 4000 addresses, so that would be a good
fit.
But when this new policy was adopted, the routing tables started to
grow much faster than before: a class B network takes up one entry in
the routing table, but 16 class C networks take 16 entries in the routing
table. So the number of entries in the BGP table started to outgrow the
capacity of early 1990s routers very rapidly.
Rather than having IP addresses fall into separate classes that implicitly
specify how many address bits are used for the network part, CIDR
explicitly indicates the number of address bits.
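To see why an explicit prefix length helps with the routing table growth described above, the following sketch uses Python's standard `ipaddress` module to combine sixteen adjacent class C sized blocks into a single prefix. The 192.168.x.0 addresses are an assumption picked for illustration; the blocks have to be contiguous and start on a multiple of 16 in the third octet for this to work.

```python
import ipaddress

# Sixteen consecutive /24 networks (hypothetical addresses, starting on a
# multiple of 16 in the third octet so that they can be aggregated).
class_c_blocks = [ipaddress.ip_network(f"192.168.{third}.0/24")
                  for third in range(16, 32)]

total_addresses = sum(block.num_addresses for block in class_c_blocks)
aggregates = list(ipaddress.collapse_addresses(class_c_blocks))

print(total_addresses)   # 16 blocks of 256 addresses each
print(aggregates)        # the same address space expressed as one prefix
```

With classful routing these would be 16 separate routing table entries; with a prefix length of 20 bits they fit in one.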
dress. So 192.0.2/24 or 169.254/16 rather than 192.0.2.0/24 or
169.254.0.0/16. Routers typically require the full version.
Prefix notation can get somewhat unintuitive when the number of bits
isn’t an even 8, 16 or 24. For instance, how many addresses are in the
range 172.16.0.0/12? After all, 172.16 specifies 16 bits, not 12. This
gets clearer in binary:
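10101100.00010000.00000000.00000000 (172.16.0.0)
to
10101100.00011111.11111111.11111111 (172.31.255.255)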
Again, have a look at the subnet cheat sheet at the end of the book.
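If you'd rather let a computer do the counting, Python's standard `ipaddress` module can answer the question for 172.16.0.0/12. This is only a quick check, not part of the book's minilab.

```python
import ipaddress

prefix = ipaddress.ip_network("172.16.0.0/12")

print(prefix.netmask)                # the 12 network bits written as a dotted mask
print(prefix.num_addresses)          # 2 ** (32 - 12) addresses
print(prefix[0], "to", prefix[-1])   # first and last address in the range
print(ipaddress.ip_address("172.22.1.1") in prefix)
```

The range runs from 172.16.0.0 to 172.31.255.255, which is why an address such as 172.22.1.1 falls inside it even though its second octet is not 16.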
A crucial concept with CIDR is that of longest match first. Unlike with
classful routing, with CIDR it’s possible to have overlapping prefixes.
For each address, there’s a prefix at every possible prefix length that
matches that address. So that’s 0-32. Three examples of prefixes that
match the address 172.22.1.1 are:
• 172.16.0.0/12
• 172.22.0.0/16
• 172.22.1.0/24
Same thing for overlapping prefixes: we use the most specific match, the
one with the largest number following the slash. That’s the longest
match first rule. Note that the longest match first rule supersedes the BGP
path selection algorithm: the path selection algorithm only applies to
routes towards the exact same prefix, while longest match first decides
between different but overlapping prefixes.
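The longest match first rule can be sketched as follows. This is a simplified illustration that scans a small list of prefixes; real routers use specialized lookup structures, and the function name `longest_match` and the example table are assumptions made for this sketch.

```python
import ipaddress

def longest_match(address, prefixes):
    """Return the most specific prefix that contains the address, or None."""
    addr = ipaddress.ip_address(address)
    matches = [p for p in map(ipaddress.ip_network, prefixes) if addr in p]
    # The "longest" match is the one with the largest number after the slash.
    return max(matches, key=lambda p: p.prefixlen, default=None)

# Default route plus the three overlapping prefixes from the text.
table = ["0.0.0.0/0", "172.16.0.0/12", "172.22.0.0/16", "172.22.1.0/24"]

print(longest_match("172.22.1.1", table))
print(longest_match("172.22.2.1", table))
print(longest_match("192.0.2.1", table))
```

The first lookup matches all four entries and picks the /24; the second no longer matches the /24 and falls back to the /16; the third only matches the default route.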
IPv6
IPv6 was introduced around 1995, a few years after the deployment of
CIDR. The point of IPv6 is to allow for more IP addresses, hence the
much larger size: while IPv4 addresses are 32 bits, IPv6 addresses are
four times as long at 128 bits.
plicable, the longest consecutive series of 0: sequences is replaced by a
single :: sequence.
IPv6 prefixes work the same way as IPv4 prefixes, for instance
2001:db8::/32. The IPv6 default route is ::/0. IPv6 is always classless,
and subnet sizes are thus written in prefix notation. However,
there is a strong convention that IPv6 subnets are /64 in size. And /48
is a very common network size, leaving 16 bits for subnet numbering,
providing room for 65,536 subnets.
There are many special purpose IPv6 addresses and address types
(see [RFC 4291] and the appendix IP address notes), but “global unicast”
IPv6 addresses are the most relevant to BGP routing. Currently,
2000::/3 is set aside for global unicast use. That’s all IPv6 addresses
that start with 2xxx: or 3xxx:. Global unicast means regular addresses
for one-to-one communication, as opposed to multicast (one-to-many)
and anycast (one-to-any) addresses.
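A few of the IPv6 facts above can be checked with Python's standard `ipaddress` module. The /48 used here is carved out of the 2001:db8::/32 documentation prefix and is purely an assumption for illustration, as is the helper name `in_global_unicast_range`.

```python
import ipaddress

site = ipaddress.ip_network("2001:db8:1234::/48")   # hypothetical site prefix
subnets_per_site = 2 ** (64 - site.prefixlen)        # number of /64s in a /48

global_unicast = ipaddress.ip_network("2000::/3")

def in_global_unicast_range(address: str) -> bool:
    """True if the address starts with 2xxx: or 3xxx:, i.e. falls in 2000::/3."""
    return ipaddress.ip_address(address) in global_unicast

print(subnets_per_site)
print(in_global_unicast_range("2001:db8::1"), in_global_unicast_range("fe80::1"))
# Printing an address shows it in its compressed notation.
print(ipaddress.ip_address("2001:0db8:0000:0000:0000:0000:0000:0001"))
```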
The BGP protocol
“BGP” stands for “border gateway protocol”. Back in 1989, when the
first BGP specification was published, the word “gateway” was used
for what we now call a router. So BGP really means “border router protocol”.
A border router is, of course, the last router in your network,
which connects to the first router in the next network. BGP is the protocol
these two border routers in neighboring networks use to exchange
routing information.
The IETF
Internet protocols such as BGP are developed and maintained by the
Internet Engineering Task Force (IETF). The IETF is an unusual standards
organization, as it doesn’t have members: everyone can participate
simply by joining the mailing lists for the different working
groups. Three times a year, there are IETF meetings. The meeting fee
(currently $875) is the main source of revenue for the IETF. As there is
no formal participation, IETF decision making is done by “rough consensus”.
This means a decision must be supported by a large majority
of those who express an opinion, but it doesn’t have to be completely
unanimous.
ment is published under a new RFC number. RFCs start their life as a
working group “internet-draft”, which is iterated until the document is
ready for publication as an RFC. Individuals may also write drafts,
which may or may not be adopted by a working group and progress to
an RFC.
The best way to read RFCs online is as the HTML version at the RFC
Editor website www.rfc-editor.org. Originally, RFCs were published in
a very simple text-only format. The HTML versions add information
about a document’s status at the top, as well as links to related RFCs.
With link state protocols, a router doesn’t tell its neighbors about the
conclusions of its path calculations, but rather, the data it used to reach
those conclusions. So each router independently calculates the best path
to reach each destination.
Link state protocols have the advantage that they’re faster than distance
vector protocols. With a link state protocol, whenever a router
detects that it has lost the connection to a neighboring router, it will
send out an update to its remaining neighbors, which is quickly
“flooded” throughout the network. Then each router recalculates the
best paths. With a distance vector protocol, a router first has to recompute
all paths, and only then can it inform its remaining neighbors of
the change.
A limitation of link state protocols is that all routers must use the same
algorithm and the same parameters to calculate paths. If they didn’t,
routing loops would be possible.
The main example of a distance vector protocol is RIP [W]. RIP is a very
simple protocol that uses a hop count as a way to determine which
path is best. That can mean that one 1 Gbps hop is preferred over two
10 Gbps hops, which is usually not what you’d want. A big downside
of RIP is that it's very slow to react to lost connectivity due to the
count-to-infinity problem [W]. The current IPv4 version of RIP is
RIPv2, the IPv6 version is RIPng.
Cisco built its own more advanced distance vector routing protocols:
IGRP and EIGRP [W].
IS-IS [W] is a link state protocol created for the OSI CLNP protocol. It
was later extended to also support routing IPv4 and IPv6. IS-IS is
mainly used in very large IP networks.
BGP versions
BGP version 1 was published in 1989 [RFC 1105]. Versions 2
[RFC 1163] and 3 [RFC 1267] quickly followed over the next two
years. With version 3, BGP looked a lot like the BGP we know today,
except that it still only supported classful addressing. BGP-4 added
support for classless inter-domain routing. BGP-4 was first published
in 1994 [RFC 1654]. There have been two revisions of the specification
(not of the protocol), with the most recent one published in 2006
[RFC 4271].
Amazingly, we still use BGP version 4 today, 28 years after the protocol
specification was first published. There are two reasons for this:
1. It's really hard to change the routing protocol that's used
internet-wide.
Autonomous Systems
Networks that run BGP are called autonomous systems (ASes). The idea
is that each AS presents a consistent view of itself to the outside world,
and what happens inside an AS is irrelevant to other ASes, as far as
BGP is concerned.
BGP routers communicate with their neighbors over TCP port 179.
Both neighbors try to connect to the other on port 179. This means that
sometimes router A is the “client” and router B is the “server”, and
sometimes the other way around. After the TCP session has been established,
the two routers start to exchange BGP messages. The TCP
session stays connected indefinitely. So it’s not unusual to see BGP
TCP sessions that have been up for weeks or even months.
When the TCP session goes away, the BGP routers on both sides throw
out all the routing information they’ve learned over that BGP session
and then try to set up a new TCP session.
BGP messages
When a BGP TCP session connects, the two routers will start to exchange
BGP messages. The following is a brief description of each
message type; for detailed information see section 4 of RFC 4271.
All BGP messages start with a “marker” for compatibility with older
BGP versions, with the rest of the message following the type-length-value
model [W]. There are five BGP message types:
1. Open
2. Update
3. Keepalive
4. Notification
5. Route-refresh
The Open message contains a version field, which was useful during
the transition from BGP-3 to BGP-4. The router also puts its AS number,
its router ID and its configured hold time in the Open message.
The router ID is a 32-bit value that’s unique for a router (usually one of
its IPv4 addresses) and the hold time is how long the router will wait
before declaring the BGP session dead when it doesn’t see any incoming
BGP messages.
Last but not least, there’s room for optional parameters. These are typically
used to negotiate the use of BGP extensions.
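To make the Open message fields more concrete, here is a minimal Python sketch that builds and parses an Open message without optional parameters, following the field layout in section 4 of RFC 4271. The function names, the 180-second hold time and the router ID 192.0.2.1 are assumptions for illustration only; a real BGP implementation would also include optional parameters such as capabilities.

```python
import ipaddress
import struct

BGP_MARKER = b"\xff" * 16   # 16 bytes of all ones at the start of every message
OPEN = 1                    # message type code for Open
HEADER_LEN = 19             # marker (16) + length (2) + type (1)

def build_open(my_as: int, hold_time: int, router_id: str) -> bytes:
    """Build a BGP-4 Open message without optional parameters."""
    body = struct.pack(
        "!BHH4sB",
        4,                                        # BGP version
        my_as,                                    # 2-byte AS number field
        hold_time,                                # hold time in seconds
        ipaddress.IPv4Address(router_id).packed,  # 32-bit router ID
        0,                                        # optional parameters length
    )
    header = BGP_MARKER + struct.pack("!HB", HEADER_LEN + len(body), OPEN)
    return header + body

def parse_open(message: bytes) -> dict:
    """Decode the fixed fields of an Open message built as above."""
    length, msg_type = struct.unpack("!HB", message[16:19])
    version, my_as, hold_time, router_id, opt_len = struct.unpack(
        "!BHH4sB", message[19:29])
    return {
        "length": length,
        "type": msg_type,
        "version": version,
        "as": my_as,
        "hold_time": hold_time,
        "router_id": str(ipaddress.IPv4Address(router_id)),
        "optional_parameters_length": opt_len,
    }

message = build_open(my_as=65082, hold_time=180, router_id="192.0.2.1")
print(len(message), parse_open(message))
```

Without optional parameters the message is 29 bytes long: a 19-byte header followed by 10 bytes of Open fields.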
if they're included in the update, use two fields: path attributes and
NLRI.
The withdrawn routes are routes (prefixes) that the neighbor had previously
told us we could reach through them, but now this is no longer
the case. So the local router removes those paths from its BGP table.
See the section BGP operation later in this chapter for how this works.
Routers send a Notification message when they need to tear down the
BGP session. This is usually because an error has occurred, but also
when the session needs to be terminated because of maintenance, or as
part of capabilities negotiation. The Notification message has an error
code and an error subcode as well as room for optional additional
data.
Path attributes
There are four types of path attributes:
4. MULTI_EXIT_DISC (optional non-transitive): the multi exit discriminator
(MED) is also often called “metric”. It is used to choose
between paths learned from the same neighboring AS.
The following are path attributes that were added later to BGP, and are
thus optional.
PATH, but the next 32-bit router will add any AS hops missing
from the AS4_PATH using the AS_PATH.
Multiprotocol BGP
The routing protocols we still use today were all initially created in the
1980s, long before IPv6 saw the light of day. Of course, once IPv6 arrived,
it also needed routing protocols. For RIP and OSPF, new versions
of those protocols were built from the ground up. This is the
“ships in the night” concept: RIPv2 and OSPFv2 handle IPv4 routing
while RIPng and OSPFv3 handle IPv6 routing. Other than their basic
design, the IPv4 and IPv6 versions of these routing protocols are completely
separate and they don’t interact at all.
IS-IS uses the opposite approach: the one IS-IS protocol handles IPv4
and/or IPv6 routing alongside the OSI CLNP for which it was created.
Like IS-IS and unlike RIP and OSPF, there’s just one BGP that handles
both IPv4 and IPv6. This is made possible by the BGP multiprotocol
extensions [RFC 4760].
Rather than just add support for IPv6, multiprotocol BGP adopts the
“address family” concept, with IPv4 and IPv6 being different address
families, along with other address families such as Ethernet VPN
(EVPN).
In BGP, address families are identified with the address family identifier
(AFI). There’s also a subsequent address family identifier (SAFI)
that’s used to differentiate between (for instance) prefixes used for unicast
(one-to-one) and multicast (one-to-many) communication. IANA
maintains AFI and SAFI registries.
So for instance a /20 prefix would be three bytes in length that hold
the 20-bit prefix padded to 24 bits to make the prefix value three bytes
long. Interestingly, the RFC that describes the use of the multiprotocol
extensions for IPv6 [RFC 2545] doesn’t even bother specifying the
same for IPv6. Obviously the only difference is that the prefix length
can now be up to 128 rather than 32.
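The prefix encoding described here (a length in bits, followed by just enough bytes to hold that many bits) can be sketched in Python as follows. The function name `encode_prefix` and the example prefixes are assumptions for illustration; the sketch covers only the prefix itself, not the surrounding Update message.

```python
import ipaddress

def encode_prefix(prefix: str) -> bytes:
    """Encode a prefix as one length byte plus the minimum number of prefix bytes."""
    network = ipaddress.ip_network(prefix)
    nbytes = (network.prefixlen + 7) // 8   # round the bit length up to whole bytes
    return bytes([network.prefixlen]) + network.network_address.packed[:nbytes]

# A /20 needs three prefix bytes; the last four bits are padding.
print(encode_prefix("10.16.0.0/20").hex())
# The same function handles IPv6, where the length can go up to 128.
print(encode_prefix("2001:db8::/32").hex())
```

Because the network address already has all host bits set to zero, the padding bits in the last byte come out as zeros automatically.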
All interfaces that have IPv6 enabled must have a link local address in
addition to any regular global unicast addresses. Link local addresses
are addresses that are only used locally on a subnet, and thus don’t
have to be globally unique. They fall within the prefix fe80::/64.
Routes in routing tables typically point to the link-local addresses of
routers.
16 (128 bits), when there’s also a link-local next hop address, it’s 32 (2 ×
128 bits).
With multiprotocol BGP, it’s possible to run the BGP TCP session either
over IPv4 or over IPv6, and the session can carry IPv4 and/or IPv6
prefixes. Routers will announce the AFIs/SAFIs they want to enable on
a new session in the Open message. To avoid problems with next hop
address processing, it’s best to use an IPv4 BGP session to exchange
IPv4 prefixes with neighboring ASes and an IPv6 BGP session for IPv6
prefixes. For iBGP this is slightly different, as we’ll see in the iBGP
chapter.
[Figure: BGP finite-state machine diagram; recovered labels: Idle, Active, Established]
BGP sessions start in the Idle state. In the Idle state, the router doesn’t
try to connect to the neighbor in question, and incoming connection
attempts are rejected. It is possible to move directly from the Idle to the
Connect state, but usually, when the router is ready to start a BGP session,
the session first moves to Active.
In the Active state, there is no active connection yet, but the router actively
tries to connect to its neighbor. From Active, the connection can
move to Connect, OpenSent or OpenConfirm.
In the Established state, the two routers on opposite sides of the BGP
session are ready to exchange routing information in the form of BGP
Update messages. It may take some time for the initial set of updates
to be exchanged after a session enters the Established state. If an error
occurs, the session returns to the Idle state.
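The state transitions can be summarized in a small lookup table. The following Python sketch is a deliberately simplified subset: the transitions out of Idle and Active are the ones described in the text, while the ones out of Connect, OpenSent and OpenConfirm are filled in from RFC 4271 as an assumption about what a minimal model needs. The real state machine is driven by events and timers that are not modeled here, and the names `FORWARD`, `allowed` and `walk` are made up for this example.

```python
# Simplified forward transitions between BGP session states.
FORWARD = {
    "Idle": {"Connect", "Active"},
    "Connect": {"Active", "OpenSent"},
    "Active": {"Connect", "OpenSent", "OpenConfirm"},
    "OpenSent": {"Active", "OpenConfirm"},
    "OpenConfirm": {"Established"},
    "Established": set(),
}

def allowed(current: str, new: str) -> bool:
    """Return True if this simplified model permits moving from current to new."""
    if new == "Idle":
        return current != "Idle"   # an error drops any session back to Idle
    return new in FORWARD[current]

def walk(path: list) -> bool:
    """Check that every step in a sequence of states is an allowed transition."""
    return all(allowed(a, b) for a, b in zip(path, path[1:]))

print(walk(["Idle", "Active", "OpenSent", "OpenConfirm", "Established"]))
print(allowed("Idle", "Established"))
```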
BGP operation
In this section, we'll have a look at how BGP routers exchange prefixes.
An important rule is that a router may only propagate (announce) to
its neighbors paths that it actually uses itself. So if a router has a choice
of multiple paths towards a given destination prefix, it must first select
the best one out of these paths.
We’ll look at the flow of BGP updates between two autonomous systems,
AS 10 to the left and AS 40 to the right. At this point, AS 10 and
AS 40 don’t have a BGP session established between them yet:
AS 10                                AS 40
   Network       Path                   Network       Path
>  192.0.2.0     20 30 82            >  192.0.2.0     82
>  198.51.100.0  4206                >  198.51.100.0  4206
Both ASes have two prefixes in their BGP table: the 192.0.2.0/24 and
the 198.51.100.0/24 prefixes. (The /24 prefix length is implied for
these class C networks.) AS 10 can reach the 192-prefix through two
intermediate hops and is directly connected to AS 4206, the origin of
the 198-prefix. For AS 40, both prefixes are reachable directly over one
hop paths.
AS 10                                AS 40
   Network       Path                   Network       Path
>  192.0.2.0     40 82       <=      >  192.0.2.0     82
                 20 30 82    =>                       10 20 30 82
   198.51.100.0  40 4206     <=      >  198.51.100.0  4206
>                4206        =>                       10 4206
So at this point, both ASes now have two paths towards each prefix:
the one they already had, and the new one from the other AS. By sending
each other copies of these prefixes, the routers in both ASes invite
the other to send traffic to these destinations through them.
AS 10                                AS 40
   Network       Path                   Network       Path
>  192.0.2.0     40 82               >  192.0.2.0     82
                 20 30 82    =>                       x
   198.51.100.0  40 4206             >  198.51.100.0  4206
>                4206                                 10 4206
Note that for the 198-network, each router keeps the new path in its
BGP table. For both, their original path is shorter and therefore preferred
over the new path learned from the other AS. So in this case,
there is no conflict. Should one of the routers decide to start using the
path through the other, it will send a withdrawal at that point.
In this stable state, no further updates are sent, just periodic Keepalive
messages to make sure the BGP session is still operational. When a
router stops receiving Keepalive messages or loses the BGP TCP session,
it removes all paths learned from the neighbor from its BGP table,
selecting new best paths as necessary, and starts trying to re-establish
the BGP session. When it does, prefixes are exchanged again as described
above.
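The exchange between AS 10 and AS 40 can be replayed with a short Python sketch. It is a strong simplification: paths are compared on AS path length only, whereas the real BGP path selection algorithm has many more steps (covered in the Traffic engineering chapter). The function names `advertise` and `select_best` are assumptions made for this example.

```python
def advertise(sender_as, best_paths):
    """Paths as the neighbor receives them: the sender prepends its own AS number."""
    return {prefix: [sender_as] + path for prefix, path in best_paths.items()}

def select_best(local_as, own_paths, received):
    """Pick the shortest loop-free AS path per prefix; ties keep the existing path."""
    best = dict(own_paths)
    for prefix, path in received.items():
        if local_as in path:
            continue   # our own AS number in the path would mean a routing loop
        if prefix not in best or len(path) < len(best[prefix]):
            best[prefix] = path
    return best

# The BGP tables of both ASes before the session comes up.
as10 = {"192.0.2.0/24": [20, 30, 82], "198.51.100.0/24": [4206]}
as40 = {"192.0.2.0/24": [82], "198.51.100.0/24": [4206]}

as10_best = select_best(10, as10, advertise(40, as40))
as40_best = select_best(40, as40, advertise(10, as10))

# AS 10 withdraws what it announced to AS 40 for prefixes it now reaches via AS 40.
withdrawn_by_as10 = [p for p, path in as10_best.items() if path[0] == 40]

print(as10_best)
print(as40_best)
print(withdrawn_by_as10)
```

AS 10 switches to the shorter path through AS 40 for the 192-prefix and therefore withdraws its own announcement of that prefix, while both ASes keep their original one-hop path for the 198-prefix.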
BGP prerequisites
In this chapter we’ll have a look at what you need to have in place before
you can get started setting up BGP. Those prerequisites are:
1. Connectivity
2. Routers
3. IP addresses
Connectivity
It’s a good idea to first select one or more internet service providers
(ISPs) that will provide you with connectivity to the internet with BGP.
They may want to provide input on which routers you should get,
they may be able to help you with obtaining IP addresses, and you
need to list two ASes that you’re going to connect to over BGP in order
to get an AS number.
Using BGP with one connection to one ISP doesn’t make much sense,
as the added complexity of running BGP doesn’t provide any direct
benefits. (Unless you’re starting with one ISP and will be adding another
later.)
However, you may want to use multiple connections to the same ISP.
There are several ways to do this, such as using static routes, VRRP,
RIP or OSPF. It is of course also possible to use a full BGP setup, with
your own IP addresses and a “real” AS number. But it may prove difficult
to obtain an AS number in this situation.
A good solution for multiple connections to the same ISP is using BGP
with IP addresses provided by the ISP and a private AS number. BGP
can then handle distributing traffic over the two (or more) connections
and rerouting when failures occur. However, the ISP will not propagate
your BGP updates to the rest of the world. Talk to your ISP about
how to set this up.
In most situations, you’ll want to have two ISPs. Starting with more
than two would be unnecessarily complex, but adding more ISPs later
is certainly possible. The idea is that both ISPs send you a full copy of
the global BGP table, so you’ll have two paths towards each prefix.
Your routers will then select the best path of the two for each prefix.
If you don’t mind that all outgoing traffic goes through just one ISP,
you may be tempted to simply accept a default route through BGP
rather than the full BGP table. However, this has the downside that if a
certain destination is not reachable through ISP A (which sends you
the higher priority default route) but that destination is reachable
through ISP B (which sends you the backup default route), then your
router will send the traffic to ISP A and it won’t reach the destination.
With full tables from both ISP A and B, the prefix in question will simply
be rerouted through ISP B and it will still be reachable.
We’ll talk much more about peering in the chapter Transit and peering,
but for now, it’s important to know that there is a small group of tier 1
networks [W] which solely rely on direct peering links between them.
Sometimes they disagree about the conditions for interconnecting directly
and may temporarily suspend the connectivity between their
networks (depeer). When this happens, customers of one network can't
reach customers of the other network.
All ISPs other than the 15 or so tier 1 ISPs depend on one or more of
the tier 1 ISPs for at least some of their connectivity. So when selecting
ISPs, make sure it’s not two ISPs that both depend solely on one of the
tier 1s. Most smaller ISPs buy service from multiple larger ones, so this
is unlikely to be an issue, but it's definitely a good idea to ask both
about this.
Router hardware
First of all, you need at least two routers, so if one fails or needs maintenance,
you have a second one that keeps you connected to the internet.
An important question is if your two BGP routers should be the
same, or different from each other.
Two routers that are the same or similar models from the same
vendor are much easier to work with than two very different models
or especially routers from different vendors. However, routers running
the same software are susceptible to the same bugs. So a complete
monoculture is less than ideal. Then again, being hit by show-stopping
bugs is relatively rare, so diversifying your router portfolio is probably
something that can wait until your network grows beyond two or
three routers. Another thing to keep in mind is that if you buy two
identical routers, they’re likely to reach the end of their useful lifetime
at the same time, so you may need to buy two new ones at the same
time again at some point in the future.
There are basically two types of routers: devices that are sold as
“routers” or perhaps “switches”, and general purpose computer
hardware running router software. The advantage of special purpose
devices is that you don't have to worry about the parts working well
together. The advantage of software routers is the added flexibility.
As IBM itself mentions in the history section of its website, back in the
day there was a catch phrase in the industry: “Nobody ever got fired
for buying IBM”. In other words: there’s safety in going with the market
leader. When it comes to routers, that would be Cisco and Juniper.
Which is of course not to say that other vendors, such as Extreme,
Arista, Nokia or even maker of budget BGP-capable routers Mikrotik,
don’t build quality products.
By the end of 2021, the IPv4 BGP table reached 900k (900,000) prefixes.
The growth rate of the IPv4 BGP table has been declining from an average
10% per year in the 2010s to 6% in 2021. The IPv6 BGP table was
about 145k by the end of 2021, but has been growing much faster at
31% per year the past few years and 37% in 2021. Based on the 2021
growth rates, we can predict the following table sizes over the next
few years:
1. The BGP RIB (routing information base): this table holds all BGP
information received from all neighbors. So with two ISPs sending
full IPv4 and IPv6 tables, that's 2 × (900k + 145k) ≈ 2.1 million
prefixes, along with their path attributes.
2. The main IPv4 and IPv6 RIBs or just “routing table”: these tables
hold the best route from each routing protocol. So if your network
has 5000 IPv4 and 5000 IPv6 OSPF routes in addition to the
full IPv4 and IPv6 BGP tables, the main routing tables hold 900k
+ 5k = 905k IPv4 and 145k + 5k = 150k IPv6 routes.
The BGP and main RIBs are stored in RAM. These days, RAM is usually
not a constrained resource in routers, and a few gigabytes will hold
a lot of RIB entries. But check anyway, especially if you plan to have
more than a couple of IPv4 full table BGP feeds.
The gating factor for the number of prefixes a router can handle is usually
the maximum FIB size. Currently, a router that can handle a million
prefixes can hold just the IPv4 table, with no room for IPv6 or
much growth. A router that can handle 1.5 million FIB entries will be
usable for a few more years. 2 million is the minimum for new routers
with an intended economic lifespan of five years.
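The reasoning behind these numbers can be reproduced with a rough projection. The sketch below starts from the end-of-2021 table sizes and growth rates quoted earlier in this section and assumes, purely for illustration, that those rates stay constant and that every IPv4 or IPv6 prefix takes up exactly one FIB entry. Neither assumption will hold precisely in practice (on some hardware, IPv6 entries use more FIB space), so treat the result as an estimate. The function names are made up for this example.

```python
def projected_size(start: float, growth: float, years: int) -> float:
    """Table size after a number of years of constant compound growth."""
    return start * (1 + growth) ** years

def years_until_full(fib_capacity: int,
                     ipv4: int = 900_000, ipv6: int = 145_000,
                     ipv4_growth: float = 0.06, ipv6_growth: float = 0.37) -> int:
    """Years after 2021 until the combined tables no longer fit in the FIB."""
    years = 0
    while (projected_size(ipv4, ipv4_growth, years)
           + projected_size(ipv6, ipv6_growth, years)) <= fib_capacity:
        years += 1
    return years

for capacity in (1_000_000, 1_500_000, 2_000_000):
    print(capacity, years_until_full(capacity))
```

Under these assumptions, the combined tables already exceed one million entries at the end of 2021, and the larger capacities buy a few more years each.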
IP addresses and AS numbers
Five regional internet registries (RIRs) are responsible for giving out IP
addresses and AS numbers. These are the RIRs and their service regions:
• AfriNIC, Africa
• APNIC, Asia-Pacific
However, the supply of IPv4 address space has run out in all five regions.
AfriNIC and APNIC operate under final /11 and final /8 policies,
respectively. Each LIR in the respective regions will be able to
obtain one last block of at most /22 (1024 addresses) from those RIRs.
The RIPE NCC and LACNIC even used up their respective final /8
and final /10. So along with ARIN, the RIPE NCC and LACNIC now
have a waiting list.
When LIRs request IPv4 addresses, they go on the waiting list, and
they’re given addresses as they become available after address space is
returned to the RIRs and kept in quarantine for some time. The RIPE
NCC allows for /24s, while ARIN and LACNIC allow for larger requests,
but these may of course lead to longer wait times.
IPv4 address space can also be transferred from its existing holder to a
new one, usually through a broker or the RIR. In other words: you can
buy IPv4 addresses on the open market. The going rate is about 50 to
60 US$ per address at the time of this writing. Be careful when buying IPv4
addresses: it has happened that a network sold some IPv4 addresses
but then kept using those addresses themselves.
There is of course more than enough IPv6 address space available. ISPs
usually get at least a /32, or more if they can document that they’ll
need the additional address space in the foreseeable future. These are
provider aggregatable (PA) address blocks, which can be used to provide
address space to customers. Networks that aren't service providers can
get provider independent (PI) address blocks; those are usually /48s.
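To get a feel for the block sizes mentioned in this section, the following sketch uses Python's standard ipaddress module. The prefixes 10.0.0.0/22 and 2001:db8::/32 are stand-ins chosen for illustration, and the helper names are hypothetical.

```python
import ipaddress


def addresses_in(prefix: str) -> int:
    """Number of addresses covered by a prefix."""
    return ipaddress.ip_network(prefix).num_addresses


def subnets_of_length(prefix: str, new_length: int) -> int:
    """How many blocks of new_length fit in the given prefix."""
    net = ipaddress.ip_network(prefix)
    if new_length < net.prefixlen:
        raise ValueError("new_length must not be shorter than the prefix")
    return 2 ** (new_length - net.prefixlen)


if __name__ == "__main__":
    print(addresses_in("10.0.0.0/22"))            # a final /22 block
    print(subnets_of_length("2001:db8::/32", 48))  # /48s in an ISP's /32
```

So a single /32 PA block leaves an ISP room for tens of thousands of customer /48s.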
BGP configuration 101
With all the preparations out of the way, we’re now ready to start
configuring a router to speak BGP!
The assumption is that the router is already set up, has connectivity to
two ISPs, and that the router interfaces towards those ISPs are
configured with the right IP addresses. Example 1 shows the simplest
possible BGP configuration with two ISPs.
If you want to try this example and the other examples for yourself,
have a look at Installing the minilab and running examples at the end
of the book. If you've never configured a router using a Cisco-like
command line interface (CLI), have a look at Appendix: the router CLI
for a short introduction.
The router bgp 65082 line tells the router that we want to configure
the BGP protocol, and that this router belongs to AS 65082. The next
line tells the router that we want to originate the prefix 192.0.2.0/24.
Originate means that this router injects this prefix into BGP and tells
the rest of the world that these addresses are used in our AS.
On Cisco routers, we can't specify our prefix or prefixes using
CIDR notation. Instead, we'll have to use a mask. In this
case that would be network 192.0.2.0 mask
255.255.255.0. But when displaying the configuration, the
mask part will be left out, as the mask that corresponds
to /24 is implied for class C networks. With the FRRouting
software for Linux, either prefix notation or a mask is
accepted.
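Converting between the two notations is mechanical. As a small sketch, Python's standard ipaddress module accepts both forms; the function names below are made up for this illustration.

```python
import ipaddress


def mask_to_cidr(address: str, mask: str) -> str:
    """Turn 'network 192.0.2.0 mask 255.255.255.0' style input into CIDR."""
    return str(ipaddress.ip_network(f"{address}/{mask}"))


def cidr_to_mask(prefix: str) -> tuple[str, str]:
    """Split a CIDR prefix into the (address, mask) pair a Cisco CLI expects."""
    net = ipaddress.ip_network(prefix)
    return str(net.network_address), str(net.netmask)


if __name__ == "__main__":
    print(mask_to_cidr("192.0.2.0", "255.255.255.0"))
    print(cidr_to_mask("192.0.2.0/24"))
```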
We can monitor the progress of the BGP session establishment with the
show ip bgp summary command. This is what an older router would
show if we asked it what's going on with BGP:
This will look a little different when you try the example yourself, as
the output of the router commands sometimes has to be edited so the
lines don't get too long and some less relevant information is left out.
Also, different routers will show slightly different output, but they will
largely show the same information.
For the first neighbor, the state is a number. This means the BGP
session is in the Established state, and the number is the number of
prefixes received and accepted from the neighbor. (I.e., prefixes filtered
out don’t count.)
Should the InQ or OutQ numbers be higher than zero, this means the
routers are still busy exchanging prefixes. However, a zero here
doesn’t necessarily mean they’re not exchanging prefixes currently.
The second neighbor is in the Active state, and has never been up (in
the Established state). If this persists or if the state goes to Idle, there’s
likely a problem that warrants talking to someone who can check the
other end of the BGP session. But in the case above we were just a bit
impatient and the second BGP session came up a few moments later.
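The rule for reading the last column of the summary can be written down as a tiny helper. This is only a sketch: it assumes the field has already been extracted from the command output and contains either a number or a state name, and the function name is hypothetical.

```python
def describe_neighbor(state_or_prefixes: str) -> str:
    """Interpret the State/PfxRcd field of a BGP summary line."""
    field = state_or_prefixes.strip()
    if field.isdigit():
        # A number means the session is Established; the value is the
        # number of prefixes received and accepted from the neighbor.
        return f"Established, {int(field)} prefixes accepted"
    # Anything else is the name of a non-Established state.
    return f"Not established (state: {field})"


if __name__ == "__main__":
    print(describe_neighbor("3"))
    print(describe_neighbor("Active"))
```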
That is not good. So some newer routers will not send any outgoing
updates until an outgoing filter or policy is configured and not accept
incoming updates until an incoming filter or policy is configured, as
per [RFC 8212]. So with FRRouting version 8 (which is used if you want
to run the examples yourself using the Docker BGP minilab), you'll get
the following results with the example 1 configuration in effect:
Router# show ip bgp summary
The fact that the router sends four prefixes to each neighbor is a bit
unexpected. So let's see which prefixes it's sending to neighbor
192.0.2.21:
It's a bit odd that FRRouting sends prefixes it just learned from AS
65030 back to AS 65030, but that shouldn't cause problems. And indeed
it sends the AS 65040 prefixes to AS 65030 (as well as the other way
around), so in the next chapter we're going to add some filters to keep
that from happening.
So our own prefix is not in our router's IP routing table. In that
situation, the logic is that if the router itself doesn't know where to send
packets for this prefix, how can it advertise this prefix to the rest of the
world? We can fix this using a static route:
The 250 is the priority (administrative distance) of the static route.
Any other routes for that same prefix with a lower priority value will
override the Null0 route.
With this route in effect, the router advertises the prefix to its
neighbors:
Example 3: The IPv6 version of examples 1 and 2
!
router bgp 65082
neighbor 2001:db8:30:8201::1 remote-as 65030
neighbor 2001:db8:30:8201::1 description ISP 30
no neighbor 2001:db8:30:8201::1 activate
!
address-family ipv6
network 2001:db8:82::/48
neighbor 2001:db8:30:8201::1 activate
exit-address-family
!
ipv6 route 2001:db8:82::/48 Null0 250
!
And:
Router# show bgp ipv6 unicast
BGP table version is 2, local router ID is 192.0.2.251, vrf id 0
Default local pref 100, local AS 65082
Status codes: s suppressed, d damped, h history, * valid, > best,
= multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found
The fe80:: next hop address is an IPv6 link local address. All IPv6
routing protocols are required to use link local next hop addresses, as
link local addresses are required so routers can send ICMPv6 redirect
[RFC 4861] messages when necessary. However, for iBGP to work
properly, regular global unicast next hop addresses are required.
Which is also present if we further inspect the prefix in question:
Filtering BGP
So the router is advertising a prefix received from ISP 30 to ISP 40. This
means that ISP 40 may start sending traffic towards ISP 30 through our
AS. That's certainly not something we want—after all, we pay our ISPs
so they handle our traffic, not the other way around!
In larger and/or more complex networks, AS path filters and prefix
filters aren’t always appropriate. In those cases, it's better to filter using
BGP communities. This type of filtering is more complex to set up, but
after that, it’s easier to manage when there are changes.
AS path filters
AS path filters use regular expressions (regex or regexp [W]) to allow or
deny routes based on what's in the AS path. A regex is a pattern that
will match or not match a line of text. For our purposes, that line of
text is the textual representation of the AS path. Regular expressions
used by Cisco routers (and other routers with a similar command line
interface, such as Quagga and FRRouting) are more limited than
regular expressions found elsewhere. This is the syntax that is supported:
. any character
[] enclose a choice/range of characters, such as [2345] or [2-5]
() enclose a string of characters
+ the preceding character or ()-enclosed string must occur one or more
times
* the preceding character or ()-enclosed string may occur zero or more
times
^ start of the line
$ end of the line
_ a comma, left brace ({), right brace (}), a space or the start or end of the
line
650 Any AS path with the sequence 650 in it. That includes 1000
650 2000 but also 1000 65082 2000.
_650_ Any AS path with AS 650 in it. That includes 1000 650 2000
and just 650 but not 1000 65082 2000.
^650_ Any AS path that starts with AS 650, including just 650.
_650$ Any AS path that ends with AS 650, including just 650.
^(650_)+$ Any AS path that only contains AS 650 one or more times.
^(650_)*$ An empty AS path or AS paths that only contain AS 650 one or
more times.
^$ An empty AS path.
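To experiment with these patterns without a router, the special "_" character can be translated into an ordinary regular expression. The sketch below does this with Python's re module; it is an approximation for trying out simple patterns, not a reimplementation of any router's regex engine, and it assumes "_" does not appear inside a [] range. The name as_path_matches is hypothetical.

```python
import re

# Router-style "_" matches a comma, a brace, a space, or the start or end
# of the line. This is the equivalent in ordinary regex syntax.
UNDERSCORE = r"(?:^|$|[ ,{}])"


def as_path_matches(router_regex: str, as_path: str) -> bool:
    """Return True if the AS path (as text, e.g. '1000 650 2000') matches."""
    python_regex = router_regex.replace("_", UNDERSCORE)
    return re.search(python_regex, as_path) is not None


if __name__ == "__main__":
    for pattern in ("650", "_650_", "^650_", "_650$", "^$"):
        for path in ("1000 650 2000", "1000 65082 2000", "650", ""):
            print(f"{pattern!r:10} {path!r:18} {as_path_matches(pattern, path)}")
```

Note that the patterns that are meant to match a whole AS path are anchored with both ^ and $ in the tests that accompany this sketch.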
Traditionally, AS path access lists are numbered, but today, in nearly all
cases a name can be used, too. AS path access list 2 permits
^(_65082)*$, the regular expression that allows our AS path through,
even if there are prepends present. After this single line, the implicit
deny comes into play. In router filters, the rule is that if something isn't
allowed, it’s denied. So all AS paths that don't match ^(_65082)*$ are
implicitly denied. We apply the AS-path access-list to both BGP
neighbors with the neighbor ... filter-list 2 out configuration
command. We could of course have made two different filters for the
two neighbors, but that wasn't necessary. We can also filter incoming
BGP updates with an AS-path access-list by configuring
filter-list ... in for a neighbor.
The reason for this is that the new filter only applies to advertisements
that happen after the filter is created or changed. So in order to see the
effect of the new filter, we need to make the router send all of its
prefixes to its neighbor. Normally, this only happens when a BGP session
is established. So disconnecting the BGP TCP session and then waiting
for it to be reestablished will make sure the filter takes effect. The
command to do this is clear ip bgp <neighbor address> to reset
the BGP session towards the neighbor with that address, clear ip
bgp <neighbor AS> to reset the BGP sessions towards all neighbors
with that AS number, or clear ip bgp * to reset all BGP sessions.
But such a hard reset is a very crude way to apply new or modified
filters. Depending on many factors, this may lead to a noticeable
interruption of your network's reachability. If there are multiple resets in a
short time and remote networks implement BGP flap damping (see the
section on flap damping later in the book), then the unreachability
towards those networks may persist for 30 minutes. However, flap
damping isn't widely used anymore.
In any event, it's much better to do a soft clear using the route refresh
mechanism. With this, the router simply goes through its entire BGP
table and sends updates as allowed by the outgoing filters that are
currently in effect. The neighbor will apply its current incoming filters. We
can do this with the clear ip bgp ... out command.
Here we have the desired result: only our own prefix 192.0.2.0/24 is
advertised to the neighboring AS. The same AS path filters can be
applied to the IPv6 as well as the IPv4 address family.
Prefix filters
Another way to make sure only our own prefix(es) are advertised is
with a prefix filter. It’s highly recommended to have both AS path and
prefix filters, as this way an issue with one of the filters doesn’t
immediately let incorrect prefixes escape your network. This is especially
relevant on Cisco and Cisco-like platforms, where configuration
changes apply immediately after each line is entered. This means that
while modifying a filter, there is a short time when just part of the filter
has been entered. And yes, prefixes have been known to escape in the
fraction of a second it took for the whole filter to be pasted from a
pre-prepared text file to the router's command line.
Prefix filters are also helpful in making sure that if another AS
advertises (part of) our own address space to us, we don't listen to that.
Otherwise, we may end up sending traffic to our own addresses out of the
network where other people can take a look at it or impersonate our
servers. So we want a prefix list that allows our own prefix out:
!
ip prefix-list out-prefixes permit 192.0.2.0/24
!
And another prefix list that blocks our own prefix in the incoming
direction, but lets everything else through:
!
ip prefix-list in-prefixes deny 192.0.2.0/24
ip prefix-list in-prefixes permit any
!
Example 5. Someone announces part of our address space to us
!
router bgp 65082
network 192.0.2.0/24
neighbor 192.0.2.21 remote-as 65030
neighbor 192.0.2.21 description ISP 30
neighbor 192.0.2.21 prefix-list in-prefixes in
neighbor 192.0.2.21 prefix-list out-prefixes out
neighbor 192.0.2.41 remote-as 65040
neighbor 192.0.2.41 description ISP 40
!
ip prefix-list out-prefixes seq 5 permit 192.0.2.0/24
ip prefix-list in-prefixes seq 5 deny 192.0.2.0/24
ip prefix-list in-prefixes seq 10 permit any
!
The reason the /32 gets through is that the prefix list only blocks the
exact 192.0.2.0/24 block, but not any more specifics (sub-prefixes).
Fortunately, we don't have to list all possible longer prefixes, but we
can filter sub-prefixes based on prefix length instead. We can do that
with the le (less or equal) and/or ge (greater or equal) arguments. For
instance:
!
ip prefix-list test permit 192.0.2.0/24 ge 26 le 28
!
ip prefix-list in-prefixes deny 192.0.2.0/24 le 32
ip prefix-list in-prefixes permit 0.0.0.0/0 le 24
!
Should you prefer doing the same with 192.0.2.0/24 ge 24, you'll
find that the router doesn't like that:
The first filter line denies what we don't want. If we end the filter here,
the “implicit deny” will kick in and the filter will not allow any
prefixes through. We could finish the filter with a permit any. (Any is the
same as 0.0.0.0/0 le 32.)
However, in this case we'll finish the filter with permit 0.0.0.0/0 le
24. This allows all prefixes as long as the prefix length is no longer
than /24. This is fairly common practice on the internet, with the result
that /24 is the longest prefix that we can expect to be accepted by all
ASes throughout the internet.
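The way a prefix list is evaluated (first matching line wins, ge and le bound the prefix length, implicit deny at the end) can be modeled in a few lines of Python. This is a sketch of the matching rules as described in this chapter, not code taken from any router; the Entry type and function names are hypothetical.

```python
import ipaddress
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    action: str                 # "permit" or "deny"
    prefix: str                 # e.g. "192.0.2.0/24"
    ge: Optional[int] = None    # minimum prefix length, if given
    le: Optional[int] = None    # maximum prefix length, if given


def entry_matches(entry: Entry, route: str) -> bool:
    net = ipaddress.ip_network(entry.prefix)
    candidate = ipaddress.ip_network(route)
    # The route must fall inside the configured prefix.
    if candidate.version != net.version or not candidate.subnet_of(net):
        return False
    if entry.ge is None and entry.le is None:
        low = high = net.prefixlen          # no ge/le: exact match only
    else:
        low = entry.ge if entry.ge is not None else net.prefixlen
        high = entry.le if entry.le is not None else net.max_prefixlen
    return low <= candidate.prefixlen <= high


def evaluate(prefix_list: list[Entry], route: str) -> str:
    for entry in prefix_list:
        if entry_matches(entry, route):
            return entry.action
    return "deny"                           # the implicit deny


IN_PREFIXES = [
    Entry("deny", "192.0.2.0/24", le=32),
    Entry("permit", "0.0.0.0/0", le=24),
]

if __name__ == "__main__":
    for route in ("192.0.2.0/24", "192.0.2.1/32", "10.0.0.0/8", "10.0.0.0/25"):
        print(route, evaluate(IN_PREFIXES, route))
```

With the in-prefixes list above, our own /24 and everything inside it is denied, other prefixes up to /24 are permitted, and anything longer falls through to the implicit deny.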
You may have noticed seq 5 and seq 10 in front of the filter rules. If
we display the configuration with show running-config
we’ll also see those sequence numbers; the router adds these
automatically. We can then later add new filter rules between existing ones.
Example 7 has the IPv6 versions of the IPv4 prefix lists in example 6.
Example 7. IPv6 prefix filters
!
router bgp 65082
neighbor 2001:db8:30:8201::1 remote-as 65030
neighbor 2001:db8:30:8201::1 description ISP 30
no neighbor 2001:db8:30:8201::1 activate
!
address-family ipv6
network 2001:db8:82::/48
neighbor 2001:db8:30:8201::1 activate
neighbor 2001:db8:30:8201::1 prefix-list in-filter in
neighbor 2001:db8:30:8201::1 prefix-list out-filter out
exit-address-family
!
ipv6 prefix-list in-filter seq 5 deny 2001:db8:82::/48 le 128
ipv6 prefix-list in-filter seq 10 permit ::/0 le 48
ipv6 prefix-list out-filter seq 5 permit 2001:db8:82::/48
!
The main difference between IPv4 and IPv6 prefix-lists is that for IPv6,
the neighbor ... prefix-list ... commands go under the
address-family ipv6 section of the BGP configuration. Note that the
names of the IPv4 and IPv6 prefix lists are the same; that’s not a
problem for the router, but it could be somewhat confusing.
In the out-filter we again permit just our own prefix, and in the
in-filter we deny our own prefix up to the maximum /128 prefix
length. The second line of the in-filter prefix-list allows all IPv6
prefixes with a prefix length of /48 or less. That’s similar to the IPv4
convention that prefixes up to /24 are accepted, although with IPv6, the /48
practice is less universal; some networks accept longer prefixes. The
outgoing filter performs as expected:
Router# show bgp ipv6 unicast neighbors 2001:db8:30:8201::1
advertised-routes
BGP table version is 2, local router ID is 192.0.2.251, vrf id 0
Community-based filters
The combination of an AS path filter and prefix filters works well.
However, AS path filters become hard to manage in networks that
have their own BGP customers, especially as the number of routers
increases. The reason for this is that when you add a BGP customer,
you'll have to update your AS path filter to allow the customer's AS
number, and then configure the new filter on all your BGP routers. As
such, AS path filters are not recommended for networks that have BGP
customers.
The same is true for prefix filters. Those also become harder to manage
if you’re a fast growing network that adds new IP address ranges fairly
regularly. So prefix filters aren't recommended for networks that have
BGP customers or networks that deploy new IP address blocks
regularly.
Communities are not part of the core BGP specification; they were
added in 1996 [RFC 1997]. Communities are 32-bit values that can be
attached to prefixes. The IANA keeps a list of well-known
communities. The most relevant well-known communities are:
In addition to using well-known communities, networks can define
their own and attach desired behaviors to them. The convention is that
in this case, the first 16 bits of the 32-bit community value are the AS
number of the AS that defines the behavior. The two 16-bit parts are
displayed as numbers with a colon in between. For instance, 64499:13
is a community defined by AS 64499. Communities can be attached to
prefixes and acted upon in the same AS, but it’s also possible for
communities to trigger actions in external ASes. We’ll do the former here
and discuss the latter in the chapter Traffic engineering.
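The split of a 32-bit community into two 16-bit halves can be shown with a pair of small helper functions. This is a sketch; the names encode_community and decode_community are made up for this illustration.

```python
def encode_community(asn: int, value: int) -> int:
    """Pack 'ASN:value' into the 32-bit number carried in BGP."""
    if not (0 <= asn <= 0xFFFF and 0 <= value <= 0xFFFF):
        raise ValueError("both halves must fit in 16 bits")
    return (asn << 16) | value


def decode_community(community: int) -> str:
    """Display a 32-bit community in the usual ASN:value notation."""
    return f"{community >> 16}:{community & 0xFFFF}"


if __name__ == "__main__":
    raw = encode_community(64499, 13)
    print(raw, decode_community(raw))
```

Because each half is only 16 bits, this classic format has no room for a 32-bit AS number in the first half.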
route-map in-rmap permit 10
set comm-list localprefixes delete
!
route-map out-rmap permit 10
match community localprefixes
set comm-list localprefixes delete
!
Right after router bgp 65082 there are two network statements, our
usual suspect 192.0.2.0/24 is now followed by route-map
originate. There is also a second prefix 203.0.113.0/24 without the
route map, so we can easily see how the two prefixes are handled
differently in a moment. The originate route map shows up later in the
configuration, where it adds the community 65082:1 to any prefixes
that have this route map applied to them.
The IPv4 BGP session and the IPv6 BGP session both have two other
route maps applied: in-rmap for incoming BGP updates and out-rmap
for outgoing BGP updates. The out-rmap matches prefixes that have
the community 65082:1 through the community list localprefixes.
If that community is present, we use the localprefixes community
list to remove the 65082:1 community, to avoid flooding the internet
with irrelevant communities.
Because the route map had a match, the prefix is allowed through.
Prefixes without the 65082:1 community reach the end of the route map,
where they’re subject to the implicit deny rule, so these prefixes are not
allowed through. These route maps work the same on IPv4 and IPv6
prefixes.
The reason we also have the in-rmap route map is to make sure that if
a BGP neighbor sends us prefixes with community 65082:1 on them,
that community is stripped off: the set comm-list localprefixes
delete line removes all communities from a prefix that match the
localprefixes community list. Communities that don’t match the
community list are left in place. Without this, any prefixes that we
receive from external networks would be announced to the rest of the
world if that community happens to be present. It is unlikely that this
would happen by accident, but someone could attach the community
out of malicious intent.
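The interplay between the two route maps can be modeled with sets of community strings. The sketch below mirrors the behavior described above: out_rmap permits only prefixes tagged with 65082:1 and strips the tag, and in_rmap strips the tag from anything received. The function names follow the route map names; everything else is an assumption for illustration.

```python
from typing import Optional

# The communities matched by the localprefixes community list.
LOCAL_PREFIXES = {"65082:1"}


def out_rmap(communities: set[str]) -> Optional[set[str]]:
    """Outgoing policy: return the communities to send, or None if denied."""
    if communities & LOCAL_PREFIXES:
        # Match: permit the prefix, but remove our internal marker first.
        return communities - LOCAL_PREFIXES
    return None     # no match: implicit deny at the end of the route map


def in_rmap(communities: set[str]) -> set[str]:
    """Incoming policy: strip our marker, leave other communities alone."""
    return communities - LOCAL_PREFIXES


if __name__ == "__main__":
    print(out_rmap({"65082:1"}))                  # our own prefix: sent
    print(out_rmap({"65030:100"}))                # learned prefix: denied
    print(out_rmap(in_rmap({"65082:1"})))         # spoofed marker: denied
```

The last call shows why in_rmap matters: a prefix that arrives with the marker already attached loses it on the way in, so it can never pass out_rmap.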
With this way of filtering, connecting a new customer with their own
AS and/or IP prefix(es) to the network doesn't require updating
the outgoing AS path and prefix filters on all routers.
In networks with more than a handful of routers, updating the
configurations on all of them manually is too time consuming and quickly
leads to out-of-sync filters, which invariably leads to time consuming
troubleshooting later. In a network that can automatically deploy
configuration changes, this is less of an issue, but even then, updating all
router configurations is not something you'd want to do unless you
really have to.
We can then use the show ip bgp community-list localprefixes
command (or the show ip bgp community 65082:1 command) to
see which prefixes have this community attached:
Router# show ip bgp community-list localprefixes
BGP table version is 6, local router ID is 192.0.2.251, vrf id 0
This quickly reveals that we didn't put the route map originate on
the 203.0.113.0/24 prefix and didn't add the 203 prefix to the prefix
list, so the community filter and the prefix list lead to the same result.
The AS path filter can't differentiate between 192.0.2.0/24 and
203.0.113.0/24, as they are both originated locally and thus have the
same AS path.
In the real world, it wouldn't make sense to use three different filters
together. A useful approach would be:
1. When first deploying BGP, use prefix filters + AS path filters.
And it may prove useful to have one or more filters that aren't actually
applied to BGP sessions, but can be used for consistency checking with
the commands discussed above.
Transit and peering
So far, we’ve mostly assumed that you buy service from an ISP and
that smaller ISPs buy service from bigger ISPs. This service, where the
ISP provides connectivity to all destinations, is called “transit”.
With regional networks A and B peering with each other, traffic from A
to any other destination than B and traffic from B to any other
destination than A would still go over the NSFNET Backbone.
Internet exchanges
The obvious way to peer with another network is for the two peers to
install a direct connection between them. This is known as a private
network interconnect (PNI) or simply “private peering”. To keep costs
to a minimum, the preferred way to do this is with an in-house
connection within the same facility (data center) where the two networks are
present.
The business of peering: peering policies
Smaller networks are generally happy to peer with anyone, as any
traffic that is handled by peering doesn’t use more costly transit capacity.
They thus have an “open” peering policy.
For larger networks, there are several reasons they may not want to
peer with smaller networks. For instance, for a network that operates
in a single country, it’s a very good deal to peer with a large
international network, as they basically get to use the large network’s
long distance network for free. For the large network, this is not a very
good deal, and they would rather see the small network buy transit
service from them. Peering with very many small networks in many
locations is also simply a lot of work for relatively little benefit,
especially if the large network already peers with the transit providers of
the small networks.
that, all else being equal, BGP will apply “early exit” routing, also
called “hot potato” routing. In other words: BGP will use the external
connection to the AS in question that can be reached over the shortest
distance through the internal network.
Both networks have to carry traffic long distance, but as web servers
generate much more traffic than web users, ISP A needs a lot more
transatlantic capacity than network B. This means that without further
measures, content providers basically get a free ride because the
networks that mostly connect consumers end up handling far more long
distance traffic.
For this reason networks that have mostly “eyeballs” require that
incoming and outgoing traffic must be balanced. (Eyeballs as in users
that look at websites and videos and thus generate incoming traffic but
much less outgoing traffic.)
Valley-freeness
In the 1990s, it became clear that in some situations, BGP never
converges to a stable state. In this context, stable means that no AS wants
to make any changes. See Appendix: non-converging configurations
for an example where BGP never converges at all. A more realistic
example is that of “BGP wedgies” [RFC 4264], where a backup
configuration may get stuck in the backup state even when the primary path
becomes available again. So BGP does converge to a stable state, just
not to the intended one.
In 2001, the seminal paper Stable Internet routing without global
coordination showed that if an AS observes a set of guidelines, that AS will
see BGP converge to a stable state.
The main takeaway from these guidelines is that prefixes learned from
a customer must have a higher local preference than prefixes learned
from non-customers. So under normal circumstances, an ISP always
sends traffic to a customer over the link to that customer. Exceptions
are possible for backup links, but those require extra attention.
[Figure: ten numbered example paths through a hierarchy of ASes 1, 2, 30, 40, 50, 600 and 700]
In a hierarchical network diagram, such invalid paths are easily
identified by a “valley” along the way. In order for a path to be valley-free, a
path may go up the hierarchy (from customer to provider) until it
reaches a peering link or starts to go down the hierarchy (from
provider to customer). After that peering link or provider to customer
link, the path may only go down the hierarchy through additional
provider to customer links.
It’s important to understand that neither the part of the path left nor
the part right of the valley is invalid in and of itself; it’s the
combination of those two halves that makes the path invalid.
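The valley-free rule lends itself to a short check. In the sketch below, a path is given as the list of link types it traverses, in order; the three labels and the function name are assumptions made for this illustration.

```python
UP = "customer-to-provider"
PEER = "peer-to-peer"
DOWN = "provider-to-customer"


def is_valley_free(links: list[str]) -> bool:
    """Check a path: zero or more UP links, at most one PEER, then only DOWN."""
    going_down = False
    for link in links:
        if link == UP:
            if going_down:
                return False    # climbing again after the top: a valley
        elif link == PEER:
            if going_down:
                return False    # second peering link, or peering after DOWN
            going_down = True   # after a peering link, only DOWN is allowed
        elif link == DOWN:
            going_down = True
        else:
            raise ValueError(f"unknown link type: {link}")
    return True


if __name__ == "__main__":
    print(is_valley_free([UP, UP, PEER, DOWN]))
    print(is_valley_free([UP, DOWN, UP]))
```

The second path is rejected even though its first two links and its last link are each fine on their own, which is exactly the point made above.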
BGP peering configuration
As per the table above, the outgoing BGP filters on BGP sessions
towards transit providers and customers are the same. However,
incoming prefixes are treated differently, as is shown in example 10.
We haven’t seen the maximum-prefix configuration command yet.
With this command, we can limit the number of prefixes we’ll accept
from a neighbor. This setting is useful for peers, but generally not for
transit providers and customers: transit providers send us all global
prefixes, so there’s no point in setting a limit on BGP sessions
towards transit providers. For customers, unless those are very large
networks in their own right, we’ll want to explicitly allow their
individual prefixes, so there’s no need to additionally limit the number of
prefixes we’ll accept.
The session stays down until restarted manually with the clear ip
bgp ... command. Additional arguments to the maximum-prefix
command can be one or more of the following: a percentage at which a
warning is generated; restart and a number of minutes after which
the session is restarted; and warning-only.
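The decision the router makes for each neighbor can be summarized in a small function. This is a sketch under stated assumptions: the 75 percent default warning threshold and the exact comparison operators are illustrative choices and may differ per platform, and the function name and return values are made up here.

```python
def check_prefix_limit(received: int,
                       limit: int,
                       warn_percent: int = 75,      # assumed default
                       warning_only: bool = False) -> str:
    """Return 'ok', 'warning' or 'teardown' for a neighbor's prefix count."""
    if received > limit:
        # Over the limit: log only, or tear the session down.
        return "warning" if warning_only else "teardown"
    if received * 100 > limit * warn_percent:
        # Under the limit but past the warning threshold.
        return "warning"
    return "ok"


if __name__ == "__main__":
    for count in (5, 8, 11):
        print(count, check_prefix_limit(count, limit=10))
```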
The other difference between the peer configuration and earlier transit
provider configurations, in addition to the maximum-prefix setting, is
the change to the in-prefixes and in-ipv6-prefixes filters. These
now filter out the address blocks 203.0.113.0/24 and
2001:db8:90::/64 and any sub-prefixes in incoming BGP updates.
Those are the prefixes used by the internet exchange, which contain all
the neighbor addresses for our IX peers:
The next hop addresses for the routes learned from these IX peers also
fall within the IX “peering LAN” prefix. A BGP route for an IX prefix
may reroute these addresses, which may disrupt the BGP session
and/or disrupt the traffic flow towards peers. So it’s important to filter out
the IPv4 and IPv6 peering LAN prefixes of all internet exchanges that
your network connects to on all BGP sessions, not just the ones towards
the respective internet exchange peers.
The main effect was that BGP packets between other peers would now
flow through an extra router hop, which BGP doesn't allow for eBGP
sessions. So these sessions started to go down in large numbers. In
2014 the AMS-IX had to renumber again, this time from a /22 to a /21.
The same thing happened again.
00:0006:5083, or just :6:5083 as per the IPv6 notation rules
regarding leading zeros. The /64 prefix of the peering LAN
(2001:db8:90::/64) goes in front of the colon-separated AS
number, and :1 is added to the end for the AS’s first router connected to
the exchange, :2 for the second and so on. The results are
2001:db8:90::42:650:8500:1 and 2001:db8:90::6:5083:1,
respectively.
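The numbering scheme can be reproduced with Python's ipaddress module. This sketch assumes the AS number is padded with zeros to twelve decimal digits, split into three groups of four, and followed by the router index written in decimal digits; the function name ix_address and the limit on the router index are choices made for this illustration.

```python
import ipaddress


def ix_address(peering_lan: str, asn: int,
               router_index: int = 1) -> ipaddress.IPv6Address:
    """Derive an IX peering LAN address from an AS number."""
    lan = ipaddress.ip_network(peering_lan)
    if lan.version != 6 or lan.prefixlen != 64:
        raise ValueError("expected an IPv6 /64 peering LAN")
    if not 0 <= asn <= 0xFFFFFFFF:
        raise ValueError("AS number out of range")
    if not 1 <= router_index <= 9999:
        raise ValueError("router index must be 1..9999 in this sketch")
    # The decimal digits of the AS number are reused as hexadecimal digits.
    digits = f"{asn:012d}"
    groups = [digits[0:4], digits[4:8], digits[8:12], f"{router_index:04d}"]
    interface_id = int("".join(groups), 16)
    return lan.network_address + interface_id


if __name__ == "__main__":
    print(ix_address("2001:db8:90::/64", 65083))
    print(ix_address("2001:db8:90::/64", 4206508500))
```

The printed form may show a single zero group as :0: rather than ::, which is the same address written differently.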
Peer groups
When peering at a large internet exchange, it’s not uncommon to have
BGP sessions with more than a hundred peers. That creates two
problems: when the router has to send a BGP update message, it has to do a
lot of work, and the configuration gets very long. Peer groups are
intended to solve both problems.
Although peer groups are especially useful for internet exchange
peering configurations, the word “peer” applies to the concept of a BGP
peer (neighbor); peer groups can be used for all types of BGP
neighbors.
When BGP neighbors are part of the same peer group, the router only
has to create one BGP update message for the group, and it can then
send copies of that message to all the members. Without peer groups,
the router has to go through the process of applying filters and policies
for each neighbor separately whenever it needs to send an update
message.
Example 11. An IPv4 peer group configuration
!
router bgp 65082
neighbor ix-ipv4-peers peer-group
neighbor ix-ipv4-peers description IPv4 IX peers, max 10 prefixes
neighbor ix-ipv4-peers maximum-prefix 10
neighbor ix-ipv4-peers prefix-list in-prefixes in
neighbor ix-ipv4-peers prefix-list out-prefixes out
neighbor ix-ipv4-peers filter-list 2 out
neighbor 203.0.113.83 remote-as 65083
neighbor 203.0.113.83 peer-group ix-ipv4-peers
neighbor 203.0.113.83 description IX peer 83
neighbor 203.0.113.84 remote-as 65084
neighbor 203.0.113.84 peer-group ix-ipv4-peers
neighbor 203.0.113.84 description IX peer 84
neighbor 203.0.113.85 remote-as 4206508500
neighbor 203.0.113.85 peer-group ix-ipv4-peers
neighbor 203.0.113.85 description IX peer 85
neighbor 203.0.113.85 maximum-prefix 100
!
Example 11 shows that peer group members don’t have to share all
settings: the remote AS is different for each peer. They do have to share
the settings that may influence outgoing updates, i.e., outbound filters
and route maps. Other settings, including inbound filters and route
maps, may differ. In the example, the last line specifies a maximum
prefix limit of 100 for IX peer 85, which overrules the limit of 10 that
would otherwise be inherited from the peer group.
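The inheritance rule (a neighbor takes the peer group's settings unless it sets its own) behaves like a dictionary merge. The sketch below models the settings from example 11 as plain dictionaries; the keys are informal labels chosen for this illustration, not router syntax.

```python
# Settings configured on the ix-ipv4-peers peer group in example 11.
PEER_GROUP = {
    "maximum-prefix": 10,
    "prefix-list in": "in-prefixes",
    "prefix-list out": "out-prefixes",
    "filter-list out": 2,
}


def effective_settings(group: dict, neighbor: dict) -> dict:
    """Neighbor-specific settings overrule what the peer group provides."""
    return {**group, **neighbor}


if __name__ == "__main__":
    peer_83 = {"remote-as": 65083}
    peer_85 = {"remote-as": 4206508500, "maximum-prefix": 100}
    print(effective_settings(PEER_GROUP, peer_83)["maximum-prefix"])
    print(effective_settings(PEER_GROUP, peer_85)["maximum-prefix"])
```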
For IPv6, peer groups get more complex, as we can see in example 12,
which adds a peer group version of the IPv6 part of example 10.
neighbor 2001:db8:90::6:5084:1 remote-as 65084
neighbor 2001:db8:90::6:5084:1 description IX peer 84
no neighbor 2001:db8:90::6:5084:1 activate
neighbor 2001:db8:90:0:42:650:8500:1 remote-as 4206508500
neighbor 2001:db8:90:0:42:650:8500:1 description IX peer 85
no neighbor 2001:db8:90:0:42:650:8500:1 activate
!
address-family ipv6
network 2001:db8:82::/48
neighbor ix-ipv6-peers activate
neighbor ix-ipv6-peers maximum-prefix 10
neighbor ix-ipv6-peers prefix-list in-ipv6-prefixes in
neighbor ix-ipv6-peers prefix-list out-ipv6-prefixes out
neighbor ix-ipv6-peers filter-list 2 out
neighbor 2001:db8:90::6:5083:1 peer-group ix-ipv6-peers
neighbor 2001:db8:90::6:5084:1 peer-group ix-ipv6-peers
neighbor 2001:db8:90:0:42:650:8500:1 peer-group ix-ipv6-peers
exit-address-family
!
If a peer group is applied under the router bgp heading, the peer
group applies to both session related settings, such as the remote AS,
the description and neighbor ... shutdown / no neighbor ...
shutdown, as well as settings specific to the IPv4 address family, such
as filters. The same peer group or another one can be applied under
the address-family ipv6 unicast heading and will then govern
IPv6 specific settings.
Internet exchange route servers
The advantage of internet exchanges is that they make it possible to
peer with a large number of other networks in one place. The
downside is that it still takes a lot of work to contact all these other networks
and set up BGP sessions with them. For this reason, internet exchanges
have route servers. Unlike normal peers, a route server propagates
paths learned from one peer to all its other peers. Example 13 adds
IPv4 and IPv6 BGP sessions with the internet exchange route server.
Route servers take advantage of the fact that in this situation, where
both of the route server’s neighbors are connected to the same layer 2
network, BGP is smart enough to keep the next hop address the same.
This means that paths learned through the route server have the same
next hop address as paths learned directly, as we can see for the
prefixes 10.0.83.0/24 and 10.0.84.0/24 that are learned both directly
and from the route server:
Router# show ip bgp
BGP table version is 4, local router ID is 192.0.2.251, vrf id 0
Default local pref 100, local AS 65082
Status codes: s suppressed, d damped, h history, * valid, > best,
= multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found
Note that the next hop address is the same for the path learned
directly and for the path learned from the route server. When a BGP
router re-advertises a path from one router on a subnet to another
router on the same subnet, it doesn't update the next hop address, so
packets can flow directly between the two other routers without going
through the one in the middle.
Paths through the route server are an extra AS hop longer, as the route
server adds its own AS to the AS path as required by the BGP
specification. This means the bilateral peering sessions (directly with
the peer in question) have a shorter AS path, so these paths are
preferred over multilateral peering through a route server.
Many routers check if the first AS in AS paths of incoming updates is
indeed the neighbor AS, and if not, log one or more errors and tear
down the BGP session:
2020/05/14 14:01:21 BGP: 203.0.113.90 incorrect first AS (must be
65090)
2020/05/14 14:01:21 BGP: %NOTIFICATION: sent to neighbor
203.0.113.90 3/11 (UPDATE Message Error/Malformed AS_PATH) 0 bytes
FRRouting has this check turned off by default. If you turn it on
with neighbor ... enforce-first-as, FRRouting doesn't tear down the
BGP session the way a Cisco router does, as shown above. Rather,
prefixes with a different first AS than the neighbor AS in the AS path
are filtered out.
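The two reactions to a wrong first AS can be sketched in a few lines of Python. This is only an illustration of the check itself, not of how any router implements it: the function names are made up for this sketch, and AS 65090 is the neighbor AS from the log messages above.

```python
def check_first_as(as_path, neighbor_as):
    """Return True when the leftmost AS in the path is the neighbor's AS."""
    return len(as_path) > 0 and as_path[0] == neighbor_as


def filter_update(as_paths, neighbor_as):
    """Filtering behavior: keep the paths that pass, silently drop the rest."""
    return [path for path in as_paths if check_first_as(path, neighbor_as)]


# Hypothetical update from a neighbor in AS 65090: the first path lacks the
# neighbor AS at the front, the second one has it.
update = [[65083], [65090, 65084]]
print(filter_update(update, 65090))
```

A router that tears down the session instead would stop at the first path for which check_first_as returns False.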
When peering with route servers that don't add their AS to the AS path
from routers that enforce this check by default, it’s necessary to use
the no bgp enforce-first-as configuration command on Cisco routers:
With the route server now leaving out its AS number from the AS path,
the prefixes learned directly from peers and those same prefixes
learned from the route server look the same:
Router# show ip bgp
Network Next Hop Metric LocPrf Weight Path
* 10.0.83.0/24 203.0.113.83 0 0 65083 i
*> 203.0.113.83 0 0 65083 i
* 10.0.84.0/24 203.0.113.84 0 0 65084 i
*> 203.0.113.84 0 0 65084 i
* 10.0.85.0/24 203.0.113.85 0 0 4206508500 i
*> 203.0.113.85 0 0 4206508500 i
It is of course still possible to list all paths learned from the route
server using the show ip bgp neighbors ... routes command:
Router# show ip bgp neighbors 203.0.113.90 routes
Network Next Hop Metric LocPrf Weight Path
*> 10.0.83.0/24 203.0.113.83 0 0 65083 i
*> 10.0.84.0/24 203.0.113.84 0 0 65084 i
* 10.0.85.0/24 203.0.113.85 0 0 4206508500 i
See the section on the multi-exit discriminator in the next chapter for
an example where route server paths are given lower priority than
direct paths.
Traffic engineering
Cisco  RFC 4271
1               Prefer the path with the highest WEIGHT.
2      *        Prefer the path with the highest LOCAL_PREF.
3               Prefer the path that was locally originated.
4      a        Prefer the path with the shortest AS_PATH.
5      b        Prefer the path with the lowest origin type.
6      c        Prefer the path with the lowest multi-exit discriminator (MED).
7      d        Prefer eBGP over iBGP paths.
8      e        Prefer the path with the lowest IGP metric to the BGP next hop.
9               Determine if multiple paths require installation in the routing table for BGP multipath.
10              When both paths are external, prefer the path that was received first (the oldest one).
11     f        Prefer the route that comes from the BGP router with the lowest router ID.
12              If the originator or router ID is the same for multiple paths, prefer the path with the minimum cluster list length.
13     g        Prefer the path that comes from the lowest neighbor address.
Most of the text in the table above comes from Cisco’s description,
which contains additional notes for most steps. Steps 1, 3 and 5 usually
don’t come into play. Step 9 is only relevant with BGP multipath,
which we’ll discuss later this chapter. Steps 10 and above are pure
tie breakers that make sure the router will eventually be able to select
a path even though there is no real reason to prefer one over the other.
This leaves steps 2, 4, 6, 7 and 8.
When a BGP router receives an update, it runs the path selection
algorithm for each prefix contained in the update. Obviously, if after
the update there is only one valid path towards a prefix, it selects that
path as the best one. If after the update there are multiple routes or
paths towards a prefix, the router evaluates the steps in the algorithm
until a single best path is left.
RFC 4271 really only specifies one rule to select the best path: Cisco’s
step 2, prefer the path with the highest LOCAL_PREF. RFC 4271
steps a - g are all considered tie breakers. The local preference attribute
is always present for paths learned from other routers within our own
AS over iBGP. When a router learns a path from an external AS, or
originates a path itself, there may not be a local preference value
available. In that case, 100, the default local preference value, will be
used.
The router now identifies the maximum local preference value for all
the paths under consideration. Then all paths that have a lower local
preference than this maximum are removed from consideration. (They
remain in the BGP table.) So if there are three paths with local
preferences of 90, 100 and 110, respectively, the maximum is 110. The
paths with the local preference of 90 and 100 are removed from
consideration. This leaves the path with a local preference of 110,
which is now selected as the best path and the rest of the steps are
skipped.
However, suppose the router learns a new path that also has a local
preference of 110. In this case, the maximum local preference is still
110, and the paths with 90 and 100 are still removed from
consideration. This leaves the two paths that have a local preference
of 110, which then both move on to the next step in the algorithm.
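The elimination described above can be sketched in Python. This is a simplified model rather than router code: the dictionary layout and the names p1 to p4 are assumptions made for the illustration, and only the default value of 100 comes from the text.

```python
DEFAULT_LOCAL_PREF = 100  # used when a path carries no local preference


def effective_local_pref(path):
    """Return the path's local preference, or the default when absent."""
    value = path.get("local_pref")
    return DEFAULT_LOCAL_PREF if value is None else value


def keep_highest_local_pref(paths):
    """Keep only the paths that share the maximum local preference."""
    best = max(effective_local_pref(p) for p in paths)
    return [p for p in paths if effective_local_pref(p) == best]


paths = [
    {"name": "p1", "local_pref": 90},
    {"name": "p2"},                      # no value present: treated as 100
    {"name": "p3", "local_pref": 110},
]
print([p["name"] for p in keep_highest_local_pref(paths)])  # one survivor

# A new path with the same maximum means two paths go on to the next step.
paths.append({"name": "p4", "local_pref": 110})
print([p["name"] for p in keep_highest_local_pref(paths)])
```

The function returns a list rather than a single path because a tie on the maximum means the remaining steps still have work to do.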
Cisco’s step 4 and the BGP specification’s tie breaker a is prefer the
path with the shortest AS_PATH. That is the AS path with the fewest
AS numbers in it. So the AS path 64999 65000 is shorter than the AS
path 12 34 56. Like with the local preference, if only one path/route
has the shortest AS path, that one is declared best and the remaining
steps in the algorithm are skipped. If multiple paths/routes share the
shortest AS path length, those move on to the next step.
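As a minimal sketch, the AS path step only counts AS numbers; their values play no role. The function name is made up for this illustration, and the sketch treats an AS path as a flat list, ignoring special cases such as AS sets.

```python
def keep_shortest_as_path(as_paths):
    """Keep only the AS paths with the fewest AS numbers in them."""
    shortest = min(len(path) for path in as_paths)
    return [path for path in as_paths if len(path) == shortest]


# Two AS numbers beat three, no matter how large the numbers are.
print(keep_shortest_as_path([[64999, 65000], [12, 34, 56]]))

# Two paths tie at length 2, so both move on to the next step.
print(keep_shortest_as_path([[65030, 65083], [65040, 65083], [65010, 65020, 65083]]))
```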
Step 8/e is prefer the path with the lowest IGP metric to the BGP
next hop. If there are still multiple paths under consideration at this
point, those paths are equally preferred purely from a BGP viewpoint,
so at this step, the algorithm looks at information from the interior
gateway protocol such as OSPF. This step compares the IGP metrics for
the next hop addresses of the paths still under consideration,
effectively preferring the path that requires traveling the shortest
distance through the internal network.
This means that tuning of the cost values in the internal network will
influence BGP, but only if all the BGP attributes are the same or very
similar. That will usually be the case for paths learned from the same
neighboring AS in multiple locations. This step is the one that makes
BGP use hot potato / early exit routing.
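To illustrate how tuning IGP costs changes the outcome of this step, here is a small Python sketch. The exit names, next hop addresses and cost values are all hypothetical, and the lookup table stands in for whatever the IGP would report as the metric towards each next hop.

```python
def keep_lowest_igp_metric(paths, igp_cost):
    """Keep the paths whose BGP next hop is closest according to the IGP."""
    best = min(igp_cost[p["next_hop"]] for p in paths)
    return [p for p in paths if igp_cost[p["next_hop"]] == best]


# Two otherwise equal paths learned from the same neighboring AS in two places.
paths = [
    {"exit": "east", "next_hop": "198.51.100.1"},
    {"exit": "west", "next_hop": "198.51.100.2"},
]

# Assumed IGP costs from this router to each next hop.
igp_cost = {"198.51.100.1": 10, "198.51.100.2": 30}
print([p["exit"] for p in keep_lowest_igp_metric(paths, igp_cost)])

# Raising the internal cost towards the east exit moves the traffic west.
igp_cost["198.51.100.1"] = 50
print([p["exit"] for p in keep_lowest_igp_metric(paths, igp_cost)])
```

Each router in the AS runs this comparison with its own IGP costs, which is why different routers can pick different exits for the same prefix.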
In many cases, we’ll want to overrule or adjust the results of the path
selection algorithm. This is done by using a route map to set the local
preference to a certain value, make the AS path longer (prepending) or
change the MED.
Route maps
Changing the path attributes that are used by the BGP best path
selection algorithm is done with route maps. We’ve used route maps
earlier in the Filtering BGP chapter, but before we continue, it’s a
good idea to cover the workings of route maps in a bit more detail.
However, if a route map clause is a deny clause and there is a match,
the prefix is not allowed into the BGP RIB or propagated to the
neighbor. If there is no match, the next route map clause is applied.
When applied to BGP sessions, route maps have the usual “implicit
deny” behavior. So when a prefix progresses through the route map
without matching a permit clause, that prefix is not added to the BGP
RIB or sent to the BGP neighbor. (A clause without a match condition
matches everything.)
!
ip prefix-list more-than-24 seq 5 permit 0.0.0.0/0 ge 24
!
ipv6 prefix-list more-than-48 seq 5 permit ::/0 ge 48
!
ip as-path access-list as64999 permit ^64999$
!
route-map long-prefixes-from-64999 permit 10
match as-path as64999
match ip address prefix-list more-than-24
!
route-map long-prefixes-from-64999 permit 20
match as-path as64999
match ipv6 address prefix-list more-than-48
!
route-map long-prefixes-from-64999 deny 30
match ip address prefix-list more-than-24
!
route-map long-prefixes-from-64999 deny 40
match ipv6 address prefix-list more-than-48
!
route-map long-prefixes-from-64999 permit 50
!
both must match. When that’s the case, no set actions are performed,
but the prefix is permitted in.
IPv4 prefixes shorter than /24 and IPv6 prefixes shorter than /48, as
well as any prefixes with an AS path that isn’t exactly 64999, didn’t
match, so they go on to the deny 30 and deny 40 clauses. Those match
all IPv4 prefixes of /24 or longer and all IPv6 prefixes of /48 or
longer, respectively, because ge 24 and ge 48 include the /24 and /48
lengths themselves. A match in a deny route map clause means the
prefix is denied.
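The way a prefix walks through the long-prefixes-from-64999 route map can be modeled in Python. This is a sketch of the evaluation order only, under the assumption that a route is a dictionary with a prefix and an AS path; the helper names are made up, and the match functions imitate the prefix lists and the AS path access list from the configuration above.

```python
import ipaddress


def is_long_ipv4(route):
    """Imitates prefix list more-than-24 (0.0.0.0/0 ge 24)."""
    net = ipaddress.ip_network(route["prefix"])
    return net.version == 4 and net.prefixlen >= 24


def is_long_ipv6(route):
    """Imitates prefix list more-than-48 (::/0 ge 48)."""
    net = ipaddress.ip_network(route["prefix"])
    return net.version == 6 and net.prefixlen >= 48


def only_as64999(route):
    """Imitates the AS path regular expression ^64999$."""
    return route["as_path"] == [64999]


# (sequence number, action, match conditions); all conditions must match.
ROUTE_MAP = [
    (10, "permit", [only_as64999, is_long_ipv4]),
    (20, "permit", [only_as64999, is_long_ipv6]),
    (30, "deny", [is_long_ipv4]),
    (40, "deny", [is_long_ipv6]),
    (50, "permit", []),  # no match condition: matches everything
]


def evaluate(route_map, route):
    """Return True if the route is permitted, False if it is denied."""
    for _seq, action, conditions in route_map:
        if all(condition(route) for condition in conditions):
            return action == "permit"
    return False  # implicit deny when no clause matched


print(evaluate(ROUTE_MAP, {"prefix": "10.0.0.0/25", "as_path": [64999]}))
print(evaluate(ROUTE_MAP, {"prefix": "10.0.0.0/25", "as_path": [65000]}))
print(evaluate(ROUTE_MAP, {"prefix": "10.0.0.0/16", "as_path": [65000]}))
```

The first matching clause decides the outcome, so the order of the sequence numbers matters: moving the catch-all permit 50 clause to the front would permit everything.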
Example 15. Increasing the local preference
!
router bgp 65082
neighbor ix-ipv4-peers peer-group
neighbor ix-ipv4-peers description IPv4 IX peers, max 10 prefixes
neighbor ix-ipv4-peers maximum-prefix 10
neighbor ix-ipv4-peers prefix-list in-prefixes in
neighbor ix-ipv4-peers prefix-list out-prefixes out
neighbor ix-ipv4-peers route-map peers-in in
neighbor ix-ipv4-peers filter-list 2 out
neighbor ix-ipv6-peers peer-group
neighbor ix-ipv6-peers description IPv6 IX peers, max 10 prefixes
no neighbor ix-ipv6-peers activate
!
address-family ipv6
network 2001:db8:82::/48
neighbor ix-ipv6-peers activate
neighbor ix-ipv6-peers maximum-prefix 10
neighbor ix-ipv6-peers prefix-list in-ipv6-prefixes in
neighbor ix-ipv6-peers prefix-list out-ipv6-prefixes out
neighbor ix-ipv6-peers route-map peers-in in
neighbor ix-ipv6-peers filter-list 2 out
!
route-map peers-in permit 10
set local-preference 110
!
*> 10.0.85.0/24 203.0.113.85 110 0 4206508500 i
* 192.0.2.21 0 65030 4206508500 i
* 192.0.2.41 0 65040 4206508500 i
*> 192.0.2.0/24 0.0.0.0 32768 i
AS path prepending
Changing the local preference is very useful when we want all traffic to
prefer paths learned from certain neighboring ASes over other
neighboring ASes, but it’s a rather crude tool when there are multiple
network paths, and we just want to move some traffic from one to
another. In this case, we don’t want to completely sidestep the entire
BGP path selection algorithm, we just want to nudge it a bit. We can do
this by making the AS path a bit longer. In example 16, the peering with
networks 83, 84 and 85 has been shut down, so the prefixes those
networks originate come in through ISPs 30 and 40. 83 prepends the AS
path for the prefix it announces towards ISP 30; 84 prepends towards
ISP 40 and 85 doesn’t prepend.
neighbor 203.0.113.82 remote-as 65082
neighbor 203.0.113.82 description peer R
neighbor 203.0.113.82 shutdown
!
route-map prepend1 permit 10
set as-path prepend 65083
!
Also, the BGP session with peer 203.0.113.82 is shut down. (Use no
neighbor ... shutdown to bring the session back up.) The result is
the following BGP table entries in network 82:
is chosen. There is also no MED (metric), but that wouldn't have
changed the results. For the 10.0.85.0/24 prefix, the AS path doesn’t
provide any guidance, so let’s look at this prefix in more detail:
Router# show ip bgp 10.0.85.0/24
BGP routing table entry for 10.0.85.0/24, version 20
Paths: (2 available, best #2, table default)
Not advertised to any peer
65030 4206508500
192.0.2.21 from 192.0.2.21 (198.51.100.223)
Origin IGP, valid, external
Community: 65030:1
Last update: Tue Nov 1 15:29:56 2022
65040 4206508500
192.0.2.41 from 192.0.2.41 (198.51.100.255)
Origin IGP, valid, external, best (Older Path)
Community: 65040:2
Last update: Tue Nov 1 15:30:06 2022
This doesn’t seem to make much sense, as the path through AS 65030
should be preferred by step 10 (prefer the oldest, last update 15:29:56
is older than 15:30:06), step 11 (prefer the lowest router ID, with
198.51.100.223 being lower than 198.51.100.255) as well as step 13
(prefer the lowest neighbor address, with 192.0.2.21 being lower
than 192.0.2.41). However, the update at 15:30:06 was due to a soft
reset (clear ip bgp 192.0.2.41 in), so although the path was
updated at 15:30:06, that update didn’t change anything. The path
through AS 65040 therefore still counts as the oldest and is preferred
by step 10, as indicated by (Older Path).
fer one of these paths over the other or others. In example 17, we first
add a second BGP session towards ISP 30, and configure the ISP 30
router to set an MED of 10 on the first session and an MED of 20 on the
second session.
Example 17-2. Setting different MEDs on two parallel BGP sessions on ISP
router 30
!
router bgp 65030
network 10.0.30.0/23
neighbor 192.0.2.22 route-map customer-in in
neighbor 192.0.2.22 route-map med10 out
neighbor 192.0.2.26 remote-as 65082
neighbor 192.0.2.26 route-map customer-in in
neighbor 192.0.2.26 route-map med20 out
!
route-map med10 permit 10
set metric 10
!
route-map med20 permit 10
set metric 20
!
We can see the MED values of 10 and 20 show up for prefixes learned
over the two BGP sessions with AS 65030:
Router# show ip bgp
Network Next Hop Metric LocPrf Weight Path
*> 10.0.40.0/21 192.0.2.41 0 0 65040 i
* 10.0.83.0/24 192.0.2.41 0 65040 65083 i
* 192.0.2.25 20 0 65030 65083 i
*> 192.0.2.21 10 0 65030 65083 i
* 10.0.84.0/24 192.0.2.41 0 65040 65084 i
* 192.0.2.25 20 0 65030 65084 i
*> 192.0.2.21 10 0 65030 65084 i
* 10.0.85.0/24 192.0.2.41 0 65040 4206508500 i
* 192.0.2.25 20 0 65030 4206508500 i
*> 192.0.2.21 10 0 65030 4206508500 i
In the output above, for the prefix 10.0.83.0/24 the path with MED
10 and AS path 65030 65083 is selected as best. This path obviously
wins over the one with MED 20 and the same AS path (learned
through the other BGP session with AS 65030), but it’s not immediately
clear why the paths through AS 65030 are preferred over the path
through AS 65040. This must come down to the last few tie breaker
steps in the BGP path selection algorithm. (As one of those is how
recent the last update was, running the example yourself may not
produce the same result.)
This result does clearly illustrate that the MED is only compared
between paths learned from the same neighboring AS. However, it may
be useful to be able to influence path selection when multiple paths
with the same AS path length are learned from different neighboring
ASes. In practice, AS path prepending tends to shift too much traffic
from one connection to another. Paths towards a given prefix through
different ISPs often have the same AS path length, so a prepend
towards one ISP may push traffic to and/or from as much as half of the
internet to another connection. Example 18 sets bgp
always-compare-med.
The result is that now the path through AS 65040 that has no MED
wins over the paths through AS 65030 with MEDs of 10 and 20:
Router# show ip bgp
BGP table version is 7, local router ID is 192.0.2.251, vrf id 0
Default local pref 100, local AS 65082
Status codes: s suppressed, d damped, h history, * valid, > best,
= multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found
However, even though [RFC 4271] specifies that a path with no MED
should be considered to have the lowest possible MED, some
implementations may deviate from this, either by default or because
they’re configured to do so. Example 19 shows a bgp bestpath med
missing-as-worst configuration:
Router# show ip bgp
BGP table version is 9, local router ID is 192.0.2.251, vrf id 0
Default local pref 100, local AS 65082
Status codes: s suppressed, d damped, h history, * valid, > best,
= multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found
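The effect of the two settings from examples 18 and 19 on the MED step can be sketched in Python. This is a simplified model, not router code: the function and parameter names are made up, the paths are the three 10.0.83.0/24 paths from the earlier output, and treating a missing MED as infinity under missing-as-worst is an assumption that stands in for "worse than any real value".

```python
def med_value(path, missing_as_worst=False):
    """A missing MED counts as 0 by default, or as worst when configured."""
    if path["med"] is None:
        return float("inf") if missing_as_worst else 0
    return path["med"]


def keep_lowest_med(paths, always_compare_med=False, missing_as_worst=False):
    """Drop every path that loses an MED comparison against another path."""
    survivors = []
    for p in paths:
        beaten = any(
            # By default, only paths from the same neighboring AS are compared.
            (always_compare_med or q["as_path"][0] == p["as_path"][0])
            and med_value(q, missing_as_worst) < med_value(p, missing_as_worst)
            for q in paths
        )
        if not beaten:
            survivors.append(p)
    return survivors


paths = [
    {"next_hop": "192.0.2.41", "med": None, "as_path": [65040, 65083]},
    {"next_hop": "192.0.2.25", "med": 20, "as_path": [65030, 65083]},
    {"next_hop": "192.0.2.21", "med": 10, "as_path": [65030, 65083]},
]

for options in ({}, {"always_compare_med": True},
                {"always_compare_med": True, "missing_as_worst": True}):
    print(options, [p["next_hop"] for p in keep_lowest_med(paths, **options)])
```

With the defaults, the path through AS 65040 and the MED 10 path through AS 65030 both survive and later tie breakers decide. With always-compare-med alone, the path without an MED wins; adding missing-as-worst hands the win to the MED 10 path.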
If we look at the paths for prefixes 2001:db8:83::/48 and
2001:db8:84::/48, we can now tell which ones are learned from the
peer directly and which ones are learned through the route server: for
the first prefix, we see a path with a MED of 2, and another one with a
MED of 0. So the first one was learned from the route server, the other
one directly.
Router# show bgp ipv6 unicast
BGP table version is 5, local router ID is 192.0.2.251, vrf id 0
Default local pref 100, local AS 65082
Status codes: s suppressed, d damped, h history, * valid, > best,
= multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found
For the second prefix, the MEDs are 102 and 200, respectively. What’s
happening here is that AS 65084 has the opposite preference from ours:
they are attaching a higher MED to their prefix when announced
directly to peers, and a lower MED when announcing their prefix to the
route server, which the route server subsequently propagates
unmodified.
We can also configure the router to subtract a value from the MED
attached to incoming prefixes with set metric -2, for instance.
However, many prefixes won’t have an MED, or will have an MED of 0.
Lowering such an MED will result in an MED that’s still 0. So lowering
the MED will often not have the desired effect.
For a network with many upstream ISPs and/or many peers, this is
less of a problem: they can prepend towards some ISPs or peers and
not others. Often, ISPs let their customers take advantage of this
ability. The usual way to do this is by attaching certain communities to
a prefix. Each network has their own system for this. For example, Telia
Carrier (AS 1299) allows setting communities for regions such as
Europe (1299:200x), North America (1299:500x) and Asia (1299:700x).
There are also communities for individual peers, such as 1299:566x
for Comcast and 1299:264x for Deutsche Telekom. The x denotes 0 - 3
for a number of prepends or 9 to not announce the prefix in question to
that network at all.
Router83# show ip bgp
Network Next Hop LocPrf Weight Path
*> 192.0.2.0/24 198.51.100.5 0 65030 65082 i
* 198.51.100.13 0 65040 65082 i
Network 84:
Network 85:
Router85# show ip bgp
Network Next Hop LocPrf Weight Path
*> 192.0.2.0/24 198.51.100.101 0 65030 65082 i
* 198.51.100.109 0 65040 65082 i
So they all prefer the path through ISP 30 (AS 65030), possibly
overburdening the connection from the test network to ISP 30, while the
connection to ISP 40 (AS 65040) is left underutilized.
Prepending towards AS 65030 would make them all prefer the path
through AS 65040, so then that connection would be overburdened
and the connection to AS 65030 would be underutilized. By using
community mechanisms provided by ISPs, we can selectively prepend
and get a better balance.
For the purposes of our example, AS 65030 uses the following
communities, which are often stored in the AUT-NUM object of an internet
routing registry (IRR) in a format like this:
aut-num: AS65030
as-name: ISP30
descr: Internet Service Provider Three Zero
admin-c: ABCD
tech-c: EFGH-RIPE
remarks:
remarks: -----------------------------------------------------
remarks: Don't announce community
remarks: -----------------------------------------------------
remarks: 0:XXX - don't announce to AS XXX
remarks: -----------------------------------------------------
remarks: Customer traffic engineering communities - prepending
remarks: -----------------------------------------------------
remarks: 65001:0 - prepend once to all peers
remarks: 65001:XXX - prepend once to AS XXX
remarks: 65002:0 - prepend twice to all peers
remarks: 65002:XXX - prepend twice to AS XXX
remarks: 65003:0 - prepend 3 x to all peers
remarks: 65003:XXX - prepend 3 x to AS XXX
remarks: -----------------------------------------------------
remarks: Large communities for 32-bit (or 16-bit) AS numbers
remarks: -----------------------------------------------------
remarks: 65030:1:0 - prepend once to all peers
remarks: 65030:1:XXX - prepend once to AS XXX
remarks: 65030:2:0 - prepend twice to all peers
remarks: 65030:2:XXX - prepend twice to AS XXX
remarks: 65030:3:0 - prepend 3 x to all peers
remarks: 65030:3:XXX - prepend 3 x to AS XXX
remarks: -----------------------------------------------------
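To make the scheme in these remarks concrete, here is a Python sketch that translates a community value into the action AS 65030 would take. The function name and the tuple it returns are inventions for this illustration; only the numbering scheme itself comes from the AUT-NUM object above.

```python
def decode_community(text):
    """Translate a community string into (action, prepend count, target AS).

    A target AS of 0 in a prepend action means "all peers".
    Returns None for values that are not part of the scheme.
    """
    fields = [int(field) for field in text.split(":")]
    if len(fields) == 2:
        selector, target = fields
        if selector == 0:
            return ("no-announce", 0, target)
        if 65001 <= selector <= 65003:
            return ("prepend", selector - 65000, target)
    if len(fields) == 3:
        global_admin, count, target = fields
        if global_admin == 65030 and 1 <= count <= 3:
            return ("prepend", count, target)
    return None


for value in ("0:65083", "65002:65084", "65003:0", "65030:1:4206508500"):
    print(value, "->", decode_community(value))
```

The three-field form exists because a 32-bit AS number such as 4206508500 doesn't fit in the 16-bit second half of a regular community.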
address-family ipv6
neighbor 2001:db8:30:8201::1 activate
neighbor 2001:db8:30:8201::1 prefix-list in-prefixes in
neighbor 2001:db8:30:8201::1 prefix-list out-prefixes out
neighbor 2001:db8:30:8201::1 route-map isp30-out out
neighbor 2001:db8:30:8201::1 filter-list 2 out
exit-address-family
!
route-map isp30-out permit 10
set community 65002:65084
set large-community 65030:1:4206508500
!
route-map rserv90-out permit 10
set community 0:65083 0:65084
!
A slight advantage of using regular communities when possible is that
this way, no large community attribute needs to be added to the path,
and the large communities themselves take 12 bytes rather than 4 for a
regular community.
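The size difference follows from the layout of the two community types: a regular community is two 16-bit fields, a large community is three 32-bit fields. A short Python sketch using the standard struct module shows both sizes; the function names are made up for the illustration, and the sketch encodes only the community values, not the surrounding path attribute header.

```python
import struct


def encode_community(high, low):
    """Regular community: two 16-bit fields in network byte order."""
    return struct.pack("!HH", high, low)


def encode_large_community(global_admin, data1, data2):
    """Large community: three 32-bit fields in network byte order."""
    return struct.pack("!III", global_admin, data1, data2)


print(len(encode_community(65002, 65084)))                 # size in bytes
print(len(encode_large_community(65030, 1, 4206508500)))   # size in bytes

# A 32-bit AS number does not fit in a 16-bit field of a regular community.
try:
    encode_community(65001, 4206508500)
except struct.error as error:
    print("cannot encode:", error)
```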
There are also extended communities [RFC 4360], but these are
complex and although many implementations support them to some
extent, they are not used in ways similar to regular communities or large
communities.
Let’s have a look at the BGP tables in networks 83, 84 and 85 to see
what the effect of example 21 has been. First, network 83. No
change:
Router83# show ip bgp
Network Next Hop LocPrf Weight Path
*> 10.0.84.0/24 203.0.113.84 110 0 65084 i
* 198.51.100.13 0 65040 65084 i
* 198.51.100.5 0 65030 65084 i
*> 192.0.2.0/24 198.51.100.13 0 65040 65082 i
* 198.51.100.5 0 65030 65082 i
Network 84. The path through AS 65030 now has two prepends: AS
65082 appears three times. AS 65030 uses set as-path prepend
last-as 2 to prepend the last AS in the path rather than its own AS.
The unprepended path through AS 65040 is now preferred:
Router84# show ip bgp
Network Next Hop LocPrf Weight Path
*> 192.0.2.0/24 198.51.100.45 0 65040 65082 i
* 198.51.100.37 0 65030 65082 65082 65082 i
Network 85. The path through AS 65030 now has one prepend (so AS
65082 appears twice) and the path through AS 65040 would now be
preferred, except that the direct path through peering has an even
shorter AS path and also a higher local preference:
Network Next Hop LocPrf Weight Path
*> 192.0.2.0/24 203.0.113.82 110 0 65082 i
* 198.51.100.109 0 65040 65082 i
* 198.51.100.101 0 65030 65082 65082 i
other networks have no other choice and can only send the traffic to us
over the path we prefer.
With this configuration in effect, ISP 40 now sees the following paths to
the two halves of 192.0.2.0/24:
Router40# show ip bgp
Network Next Hop LocPrf Path
*> 192.0.2.0/25 198.51.100.142 110 65010 65020 65030 65082 i
*> 192.0.2.128/25 192.0.2.42 200 65082 i
cess. (I’m using 192.0.2.0/24 as the example prefix here to maintain
consistency with other examples.)
But even when using “safe” prefix lengths of /24 or shorter for the
more specifics, these more specifics are now only available over one
path so there is a significant risk that they’ll become unavailable at
some point. For instance, if the connection between the example
network and ISP 30 goes down, 192.0.2.0/25 will be completely
unreachable. We can solve this issue by announcing the aggregate (the
full prefix) as well as the more specifics. This is what example 23 does.
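Why the aggregate works as a safety net comes down to longest prefix matching: a router always forwards on the most specific route that covers the destination, and falls back to a shorter prefix when the more specific one disappears. The Python sketch below models this with the standard ipaddress module. The routing table contents and the "ISP 30" and "ISP 40" labels are assumptions for the illustration, seen from some remote network.

```python
import ipaddress


def best_route(routes, destination):
    """Return (prefix, via) for the longest matching prefix, or None."""
    address = ipaddress.ip_address(destination)
    candidates = [
        (ipaddress.ip_network(prefix), via)
        for prefix, via in routes.items()
        if address in ipaddress.ip_network(prefix)
    ]
    if not candidates:
        return None
    network, via = max(candidates, key=lambda item: item[0].prefixlen)
    return str(network), via


routes = {
    "192.0.2.0/24": "ISP 40",     # the aggregate
    "192.0.2.0/25": "ISP 30",     # more specific, first half
    "192.0.2.128/25": "ISP 40",   # more specific, second half
}
print(best_route(routes, "192.0.2.10"))

# The connection to ISP 30 fails and the first /25 is withdrawn.
del routes["192.0.2.0/25"]
print(best_route(routes, "192.0.2.10"))

# Without the aggregate, the same address would have no route at all.
del routes["192.0.2.0/24"]
print(best_route(routes, "192.0.2.10"))
```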
!
router bgp 65082
neighbor 192.0.2.21 shutdown
!
The update containing the withdrawal first comes in over the shortest
path. Routers that receive the withdrawal will thus select the next
shortest path. However, in the meantime, the update containing the
withdrawal is traveling down that next shortest path, which is also
withdrawn.
So routers select the third shortest path, which is also quickly
withdrawn. This goes on until finally, the longest path is withdrawn.
hard and fast rule, this could be different depending on the levels of
interconnectivity between the autonomous systems involved.
Multipath BGP
Over the past two decades, Ethernet has almost completely taken over
as the layer-2 technology that underpins the internet and other
networks. And traditionally, increases in Ethernet speeds were by a
factor of ten. So when a Gigabit Ethernet link fills up, the next step
is a 10 Gigabit Ethernet link. However, such a big increase is often
relatively costly. Usually, it makes more sense to simply deploy a
second port rather than upgrade to hardware that’s ten times faster.
ECMP can also be used at layer 3. In that case, the routing protocol
used over the parallel links must be able to determine that multiple
paths can be used in parallel without risk of routing loops, and install
multiple routes for these multiple paths in the routing table. In
example 24, we add a second BGP session over a second connection
between our test network and ISP 30.
Example 24. Two parallel BGP sessions
!
router bgp 65082
network 192.0.2.0/24
neighbor 192.0.2.21 remote-as 65030
neighbor 192.0.2.21 description ISP 30, first connection
neighbor 192.0.2.21 prefix-list in-prefixes in
neighbor 192.0.2.21 prefix-list out-prefixes out
neighbor 192.0.2.25 remote-as 65030
neighbor 192.0.2.25 description ISP 30, second connection
neighbor 192.0.2.25 prefix-list in-prefixes in
neighbor 192.0.2.25 prefix-list out-prefixes out
!
As a result, we now get two copies of each prefix from ISP 30. For
instance:
For the two paths we get from AS 65030, all BGP attributes are the
same, except for the next hop address (192.0.2.21 or 192.0.2.25).
We can tell that the two BGP sessions are towards the same router at
the other end because the BGP identifier is the same: 198.51.100.223.
Even the last update time is at first glance the same, so selecting the
best path came down to the last tie breaker: prefer the path that comes
from the lowest neighbor address, as indicated by (Neighbor IP).
Router# show ip route 10.0.30.0/23
Routing entry for 10.0.30.0/23
Known via "bgp", distance 20, metric 0, best
Last update 00:14:25 ago
* 192.0.2.21, via eth0.1201, weight 1
65030
192.0.2.25 from 192.0.2.25 (198.51.100.223)
Origin IGP, metric 0, valid, external, multipath
Community: 65030:1
65030
192.0.2.21 from 192.0.2.21 (198.51.100.223)
Origin IGP, metric 0, valid, external, multipath, best (Neighbor IP)
Community: 65030:1
• The paths have the same AS path (the AS path must be identical, not
just the same length)
• The paths have the same IGP metric to the BGP next hop
In other words: only the BGP path selection algorithm steps from 10
and up (f and up) are ignored when determining if paths can be used
for multipath. Some BGP implementations also support unequal cost
multipath where packets are distributed over multiple paths with
unequal costs, but this requires additional settings.
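A rough way to picture the multipath test is: a path qualifies when it ties with the best path on everything the earlier steps of the algorithm look at. The Python sketch below follows that reading. It is an assumption-laden simplification: the attribute names and the dictionary layout are made up, and real implementations differ in the exact conditions they apply.

```python
# Attributes looked at by the steps before the final tie breakers.
COMPARED_ATTRIBUTES = (
    "weight", "local_pref", "as_path", "origin", "med", "external", "igp_metric",
)


def multipath_set(best, candidates):
    """Return the candidates that tie with the best path on all attributes.

    Comparing as_path for equality requires an identical AS path,
    not just one of the same length.
    """
    return [
        path for path in candidates
        if all(path[attr] == best[attr] for attr in COMPARED_ATTRIBUTES)
    ]


def make_path(next_hop, as_path):
    """Build a hypothetical eBGP path with otherwise default attributes."""
    return {
        "next_hop": next_hop, "weight": 0, "local_pref": 100,
        "as_path": as_path, "origin": "IGP", "med": 0,
        "external": True, "igp_metric": 0,
    }


best = make_path("192.0.2.21", [65030])
second = make_path("192.0.2.25", [65030])
other_isp = make_path("192.0.2.41", [65040])   # same length, different AS

print([p["next_hop"] for p in multipath_set(best, [best, second, other_isp])])
```

Router ID, path age and neighbor address are deliberately absent from the comparison, as those belong to the tie breakers that multipath ignores.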
In example 25, the two paths are learned from the same router in AS
65030, but this is not a requirement: multipath can also be done for
paths learned from different routers in the same neighboring AS.
However, if there’s one router on the other side, there’s another option
to perform multipath with BGP: with just one BGP session instead of
two or more. This is done in example 26.
The first two lines set up an address for the lo loopback interface.
Unlike hosts, which all use 127.0.0.1 and ::1 as the addresses for their
loopback interface, routers have “real” addresses on their loopback
interfaces. This way, the router has an address that is always “up” even
if physical interfaces may go down, which is useful for management,
and also for iBGP, as we’ll see in the iBGP chapter.
ebgp-multihop 2. Pointing the update source to the lo interface
makes our router use the address of the lo interface as the source
address in the BGP TCP session towards this neighbor. However, this
makes it seem like there is an extra hop between the two routers,
which BGP normally doesn’t allow. With ebgp-multihop 2 we let the
router know that an extra hop is allowed for this BGP session.
(Note that the interface lo and following line go in the zebra.conf
file and the two ip route lines into the static.conf file, not
the bgpd.conf file. If you’re using vtysh to talk to FRRouting that will
happen automatically.)
The only clue that something is going on under the surface is that we
get the same address as the next hop address, the neighbor address
and the neighbor’s router ID. However, the routing table does show
we’re load balancing traffic over multiple interfaces:
Router# show ip route 10.0.30.0/23
Routing entry for 10.0.30.0/23
Known via "bgp", distance 20, metric 0, best
Last update 00:03:53 ago
198.51.100.223 (recursive), weight 1
* 192.0.2.21, via eth0.1201, weight 1
* 192.0.2.25, via eth0.1202, weight 1
BGP sessions and thus fewer paths in the BGP RIB, preserving memory
and CPU cycles on the router.
Best practice is to take the source and destination IP addresses, the
protocol number (TCP, UDP, ICMP et cetera) and the source and
destination port numbers and hash those. The hashing algorithm is often
a CRC function rather than a cryptographic hash. Then use the hash to
assign the packet to one of a fixed number of “buckets”. For instance,
there may be 16 buckets. And each bucket is assigned to a certain
network link or next hop address. So for instance, buckets 1 - 8 may be
assigned to link A and 9 - 16 to link B. When link C is added to the
group, the buckets may be reassigned, with 1 - 5 to A, 6 - 10 to C and
11 - 16 to B. (So for the TCP sessions in buckets 1 - 5 and 11 - 16 there
is no disruption.)
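The bucket scheme can be sketched in Python with the CRC-32 function from the standard zlib module. This is an illustration under assumptions: the way the five fields are joined into a key, the function names and the sample flow are made up, and actual routers compute the hash in hardware over the packet header fields.

```python
import zlib

NUM_BUCKETS = 16


def bucket_for(src_ip, dst_ip, protocol, src_port, dst_port):
    """Hash the five flow fields with CRC-32 and map to a bucket 1..16."""
    key = f"{src_ip}|{dst_ip}|{protocol}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % NUM_BUCKETS + 1


def assign(ranges):
    """Expand {link: (first bucket, last bucket)} into {bucket: link}."""
    return {
        bucket: link
        for link, (first, last) in ranges.items()
        for bucket in range(first, last + 1)
    }


before = assign({"A": (1, 8), "B": (9, 16)})
after = assign({"A": (1, 5), "C": (6, 10), "B": (11, 16)})

# Only the flows in these buckets move to another link when C is added.
moved = sorted(b for b in before if before[b] != after[b])
print(moved)

# Every packet of one flow hashes to the same bucket, so it stays on one link.
flow = ("192.0.2.10", "10.0.30.1", 6, 50000, 443)
print(after[bucket_for(*flow)] == after[bucket_for(*flow)])
```

Keeping the number of buckets fixed and only remapping some of them is what limits the disruption: flows in buckets that keep their link are never reordered.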
Obviously, a single TCP session will never balance over multiple links
with per-flow load balancing. Small numbers of TCP sessions will also
often saturate one link but not the other or others. As a rule of thumb,
with 1000 TCP sessions or more, per-flow load balancing will use
all links equally.
iBGP
So far, we’ve only looked at external BGP, or eBGP. Any BGP session
towards a router in another autonomous system is an eBGP session.
However, if an AS contains multiple routers with eBGP sessions,
then it’s very helpful if those routers that together handle BGP for the
AS in question coordinate their efforts. This is where internal BGP
(iBGP) comes in.
The idea is that every BGP router within an AS maintains an iBGP
session with every other BGP router in the AS. In service provider
networks, it’s common that all routers run iBGP, even routers that don’t
connect to external ASes. This way, all routers have a full view of all
BGP information so they can make the best routing decisions.
In the early days of BGP the assumption was that only border routers
would run BGP and would then “redistribute” the routing information
learned through eBGP into an internal routing protocol such as OSPF.
But with something like a million prefixes in BGP that practice would
be quite hard on an interior gateway protocol (IGP) such as OSPF
these days. It's easier to just run BGP on every router.
Unlike with eBGP sessions, on iBGP sessions routers don’t add their
own AS number to the AS path and they don’t update the next hop
address. And unlike with eBGP, there is no requirement that there is a
direct connection between two iBGP routers: it’s completely fine to
have additional hops in between. Last but not least, there are normally
no filters or route maps applied to iBGP sessions.
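The difference in how a path is passed along over the two session types can be sketched in Python. This is a simplified model with made-up names: it ignores the case where an eBGP speaker keeps the next hop unchanged on a shared subnet, and it leaves out all other attributes.

```python
def advertise(path, local_as, local_address, remote_as):
    """Return the path as it would be sent to a neighbor in remote_as."""
    if remote_as == local_as:
        # iBGP: AS path and next hop are passed along untouched.
        return dict(path)
    # eBGP: prepend our own AS and set ourselves as the next hop.
    return {
        **path,
        "as_path": [local_as] + path["as_path"],
        "next_hop": local_address,
    }


# A path learned from ISP 30 by a router in AS 65082.
learned = {"prefix": "10.0.20.0/22", "as_path": [65030, 65020],
           "next_hop": "192.0.2.21"}

# 192.0.2.46 is a hypothetical address of our router facing the eBGP neighbor.
print(advertise(learned, 65082, "192.0.2.46", 65082))   # to an iBGP neighbor
print(advertise(learned, 65082, "192.0.2.46", 65040))   # to an eBGP neighbor
```

Because the next hop stays unchanged over iBGP, the receiving router needs its own route towards that next hop address, which in larger networks is the job of the interior routing protocol.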
Example 27 is based on a very basic setup where there are two BGP
routers that each connect to a different ISP and the two routers connect
to each other using iBGP. Because these two routers are always directly
connected to each other, there is no need to run an internal routing
protocol. Presumably, the two routers run VRRP [W] so hosts on the
internal network have a virtual IP address they can use as their default
gateway, so they still have a working default gateway if one of the
routers goes down. However, a VRRP configuration is not part of the
example.
address-family ipv6
network 2001:db8:82::/48
neighbor 192.0.2.121 activate
neighbor 2001:db8:40:8202::1 activate
neighbor 2001:db8:40:8202::1 prefix-list in-prefixes in
neighbor 2001:db8:40:8202::1 prefix-list out-prefixes out
neighbor 2001:db8:40:8202::1 filter-list 2 out
exit-address-family
!
The configuration for the iBGP session between the two routers is
exceedingly simple, because there are no filters, route maps or other
settings on the iBGP session. There is no need to explicitly configure
a BGP session as an iBGP session; a session is an iBGP session when the
remote AS is the same as the router’s own AS (65082 in the example).
In addition to the iBGP session towards R1, R2 also has an eBGP
session with ISP 40. With both R1 and R2 having at least one eBGP
session towards an ISP, each router can maintain connectivity to the
internet on its own when the other router is down. This means that both
R1 and R2 use network 192.0.2.0/24 and network 2001:db8:82::/48
to originate our IPv4 and IPv6 prefixes.
If we now have a look at the BGP table in router R2, we see this:
R2# show ip bgp
BGP table version is 10, local router ID is 192.0.2.254, vrf id 0
Default local pref 100, local AS 65082
Status codes: s suppressed, d damped, h history, * valid, > best, = multipath,
              i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found
   Network          Next Hop         LocPrf Weight Path
*> 10.0.10.0/23     192.0.2.45                   0 65040 65010 i
*>i10.0.20.0/22     192.0.2.21          100      0 65030 65020 i
*  10.0.30.0/23     192.0.2.45                   0 65040 65010 65020 65030 i
*>i                 192.0.2.21          100      0 65030 i
*> 10.0.40.0/21     192.0.2.45                   0 65040 i
*> 10.0.83.0/24     192.0.2.45                   0 65040 65083 i
  i                 203.0.113.83        100      0 65083 i
*> 10.0.84.0/24     192.0.2.45                   0 65040 65084 i
* i                 192.0.2.21          100      0 65030 65084 i
*>i192.0.2.0/24     192.0.2.121         100      0 i
                    0.0.0.0                  32768 i
The routes learned over iBGP look different in two ways. First, the
prefix is preceded by an i, indicating that the path was learned over
iBGP. Second, one of the iBGP paths lacks the * indicating that the
path is valid. We can see this in more detail by looking at a
particular prefix:
The reason this iBGP path is invalid is that on R2, the BGP next hop
address 203.0.113.83 is an address that is directly connected to R1.
R2 has no route to that address:
R2# show ip route 203.0.113.0/24
% Network not in table
There are two ways to solve this. The easy way only works in very
simple networks, and our two-router AS is as simple as they come.
Example 28 tells R1 and R2 to update the next hop address in iBGP
updates, with each router setting that next hop address to its own
address on the interface used for the (i)BGP session in question.
Example 28-1. next-hop-self on R1
!
router bgp 65082
neighbor 192.0.2.122 remote-as 65082
neighbor 192.0.2.122 description iBGP to R2
neighbor 192.0.2.122 next-hop-self
!
address-family ipv6
neighbor 192.0.2.122 activate
neighbor 192.0.2.122 next-hop-self
exit-address-family
!
iBGP paths in the BGP table now all have an asterisk indicating they’re
valid:
R2# show ip bgp
   Network          Next Hop         LocPrf Weight Path
*> 10.0.10.0/23     192.0.2.45                   0 65040 65010 i
*>i10.0.20.0/22     192.0.2.121         100      0 65030 65020 i
*  10.0.30.0/23     192.0.2.45                   0 65040 65010 65020 65030 i
*>i                 192.0.2.121         100      0 65030 i
*> 10.0.40.0/21     192.0.2.45                   0 65040 i
*  10.0.83.0/24     192.0.2.45                   0 65040 65083 i
*>i                 192.0.2.121         100      0 65083 i
*> 10.0.84.0/24     192.0.2.45                   0 65040 65084 i
* i                 192.0.2.121         100      0 65030 65084 i
*>i192.0.2.0/24     192.0.2.121         100      0 i
                    0.0.0.0                  32768 i
R2# show bgp ipv6 unicast
BGP table version is 3, local router ID is 192.0.2.254, vrf id 0
Default local pref 100, local AS 65082
Status codes: s suppressed, d damped, h history, * valid, > best,
= multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found
Because the next hop address isn’t modified when propagating prefixes
over iBGP, there are no complications distributing IPv6 prefixes over
an IPv4 iBGP session, so we activate the iBGP neighbors 192.0.2.121
and 192.0.2.122, respectively, for the IPv6 address family. This way,
there is no need to maintain separate IPv6 iBGP sessions. However,
this does mean that if something goes wrong with internal IPv4 routing
so the iBGP sessions over IPv4 are disrupted, this will impact
external IPv6 routing.
Figure 4. An autonomous system with three routers
This means that none of R1's addresses are reachable from R3, and the
other way around. We fix that by running an internal routing protocol
that makes sure all our internal routers know how to reach all the
address prefixes used in our internal network.
So let's enable OSPF for IPv4 and IPv6 on our three test routers.
Example 29 shows just the configuration for R2, as the R1 and R3
configurations are identical except that R1 only has interface
eth0.821 and R3 only eth0.822.
For IPv4, OSPF is enabled by specifying a prefix, and then all interfaces
with an address that falls within that prefix run OSPF. Area 0.0.0.0,
or simply area 0, is the backbone area. These days, it's rarely necessary
to use additional areas.
O>* 192.0.2.44/30 [110/20] via 192.0.2.232, eth0.822, 00:01:37
O>* 192.0.2.112/28 [110/20] via 192.0.2.232, eth0.822, 00:01:37
O>* 192.0.2.144/28 [110/30] via 192.0.2.232, eth0.822, 00:01:37
O 192.0.2.160/28 [110/10] is directly connected, eth0.823, 00:02:27
O 192.0.2.224/28 [110/10] is directly connected, eth0.822, 00:02:27
O>* 192.0.2.251/32 [110/20] via 192.0.2.232, eth0.822, 00:01:37
O>* 192.0.2.252/32 [110/10] via 192.0.2.232, eth0.822, 00:01:37
O 192.0.2.253/32 [110/0] is directly connected, lo, 00:02:27
O>* 203.0.113.0/24 [110/20] via 192.0.2.232, eth0.822, 00:01:37
(To make the output fit, I removed weight 1 from each line.)
The result is that BGP prefixes with next hop addresses in that
203.0.113.0/24 prefix are reachable without trouble:
This does mean that reaching this prefix and others like it requires a
two-stage process: first look up the BGP next hop address, and then
turn that BGP next hop address into an actual next hop address and
output interface:
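As a rough illustration of this two-stage (recursive) lookup, the sketch below resolves a destination against a BGP table and then resolves the BGP next hop against an IGP table. The two tables are hypothetical and only loosely modelled on the addresses used in the examples; a real router does this in its RIB/FIB code, not with dictionaries.

```python
import ipaddress

# Hypothetical tables for illustration only.
BGP_ROUTES = {
    "10.0.83.0/24": "203.0.113.83",                  # prefix -> BGP next hop
}
IGP_ROUTES = {
    "203.0.113.0/24": ("192.0.2.232", "eth0.822"),   # prefix -> (next hop, interface)
}

def longest_match(table, address):
    """Return the most specific prefix in table that contains address."""
    addr = ipaddress.ip_address(address)
    best = None
    for prefix in table:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best.prefixlen):
            best = net
    return None if best is None else str(best)

def resolve(destination):
    """Stage 1: find the BGP next hop. Stage 2: resolve it through the IGP."""
    bgp_prefix = longest_match(BGP_ROUTES, destination)
    if bgp_prefix is None:
        return None
    bgp_next_hop = BGP_ROUTES[bgp_prefix]
    igp_prefix = longest_match(IGP_ROUTES, bgp_next_hop)
    if igp_prefix is None:
        return None          # BGP next hop unreachable: the path is unusable
    return IGP_ROUTES[igp_prefix]

if __name__ == "__main__":
    print(resolve("10.0.83.7"))
```

If the second stage fails, the BGP path can't be used, which is the same situation as an iBGP path that is not marked valid in the BGP table.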
The [200/0] and [110/20] numbers are the “administrative distance”
and the metric. The administrative distance determines how strongly
prefixes from one routing protocol are preferred over the same
prefixes from another routing protocol: the lowest distance wins.
For OSPF, the distance is 110 by default. For BGP it's 20 if BGP
delegates an eBGP path to the main routing table and 200 if it is an
iBGP path. So if the same prefix is known through OSPF, eBGP and iBGP,
the eBGP path is installed in the forwarding information base (FIB)
and thus used for forwarding packets.
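The selection between protocols can be sketched in a few lines. The distances are the defaults mentioned above; the select_for_fib helper is a made-up name, and real routers let you change these values per protocol.

```python
# Default administrative distances as described in the text.
DEFAULT_DISTANCE = {"ebgp": 20, "ospf": 110, "ibgp": 200}

def select_for_fib(protocols):
    """Given the protocols that all offer the same prefix, pick the one
    with the lowest administrative distance."""
    return min(protocols, key=lambda protocol: DEFAULT_DISTANCE[protocol])

if __name__ == "__main__":
    print(select_for_fib(["ospf", "ibgp", "ebgp"]))
    print(select_for_fib(["ospf", "ibgp"]))
```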
If R4 has packets to send to R1, those normally go over the direct link
between R1 and R4. But if that link fails, the two can still communicate
through R2 and R3.
IP addresses configured on it become unreachable. Instead, we
configure an address for iBGP use on the interface that never goes
down: the loopback interface.
R4 uses the path through interface eth0.824 (its direct link to R1) to
reach R1's loopback address 192.0.2.251. There's also a path through
R3 and R2, but remember, each routing protocol only sends a copy of
its best path to the main routing table (RIB), so we don't see that path
in the show ip route output. This means that the iBGP packets between
R1 and R4 flow as shown in figure 6.
But what if the link between R1 and R4 goes down? Let's simulate that
by shutting down the interface:
R4# conf t
R4(config)# interface eth0.824
R4(config-if)# shutdown
R4(config-if)# ^Z
In other words, the flow of the iBGP session between R1 and R4 is
rerouted as shown in figure 7 without any interruption. This is
especially important when R1 and R4 have exchanged many prefixes, as
having to remove all these prefixes and then retransmit and reinstall
them would create significant load on the router CPU and disrupt
connectivity to some degree.
However, the router still has to update the RIB/FIB because the BGP
next hop addresses are now reachable through a different interface.
Route reflectors
The iBGP full mesh requirement (i.e., having each router learn eBGP
prefixes directly from the router where those prefixes enter the AS) is
nice and simple, and avoids potential routing loops. However, in our
network with four routers, each router already has three iBGP sessions.
That number quickly rises as the number of routers in the network
increases. Not only does the extra overhead start to add up in a larger
network with dozens or even hundreds of routers, but adding a router
becomes a nightmare: every existing router has to be configured with
an iBGP session towards the new router.
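To put numbers on this: in a full mesh of n routers, every router has n - 1 sessions and the network as a whole has n(n - 1)/2. A quick sketch (the function names are just for illustration):

```python
def sessions_per_router(routers):
    """Each router peers with every other router in the AS."""
    return routers - 1

def full_mesh_sessions(routers):
    """Total number of iBGP sessions in a full mesh."""
    return routers * (routers - 1) // 2

if __name__ == "__main__":
    for n in (4, 10, 100):
        print(n, "routers:", sessions_per_router(n), "sessions per router,",
              full_mesh_sessions(n), "sessions in total")
```

So our four routers need 6 sessions in total, but a network with 100 routers would need 4950.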
There are two mechanisms to get around the iBGP full mesh
requirement: confederations and route reflectors.
invisible to external ASes. Confederations are not in wide use, so we
won't discuss the details here.
Route reflectors [RFC 4456], the other system to work around the
iBGP full mesh requirement, on the other hand, are extensively used in
larger networks. What a route reflector does is propagate the prefixes
it learns over iBGP to its clients. So a client gets all the prefixes
known to the different eBGP routers in the AS over a single iBGP
session with a route reflector.
Note that the route reflector client status is set for each address family
separately. There are no configuration changes on R4, the route
reflector client, except to remove the iBGP sessions towards R1 and R2
that are no longer needed.
R1 and R2 are not configured to be route reflector clients; they are
“non-client peers” of the route reflector. This means they still need to
maintain iBGP sessions with all other routers in the AS other than the
route reflector clients.
The path learned from R1 has an OSPF metric of 20 and is thus
preferred over the path learned from R2, which requires an extra hop
through R3 and thus has an OSPF metric of 30. However, now that R4
is a route reflector client in example 31, this is different:
The result is that the path with the higher IGP metric (30 vs 20) is used
by R4, as R4 now receives only the path that R3 considers best. And
R3 has a direct link to R2, while R1 requires an extra hop through R4.
As such, it's important to carefully consider the placement of route
reflectors within the topology of the network. Of course, for
redundancy a route reflector client should always talk to at least two
route reflectors. If those are in different locations in the network,
that will reduce the incidence of less optimal routing that comes with
the deployment of route reflectors.
The originator and cluster list are new attributes that make sure route
reflectors don't cause routing loops. This would be a risk when route
reflectors are clients of other route reflectors. That can happen by
accident, but in very large networks, it's actually common to have a
hierarchy of route reflectors.
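The way these two attributes prevent loops can be sketched as follows. This is a simplified model of the RFC 4456 rules, not router code: a reflector fills in the originator ID if it is missing and prepends its cluster ID, and a router ignores paths that carry its own router ID as originator or its own cluster ID in the cluster list. The Route class, function names and IDs are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Route:
    prefix: str
    originator_id: Optional[str] = None
    cluster_list: List[str] = field(default_factory=list)

def reflect(route, learned_from_id, cluster_id):
    """What a route reflector does before passing a path on."""
    return Route(
        prefix=route.prefix,
        originator_id=route.originator_id or learned_from_id,
        cluster_list=[cluster_id] + route.cluster_list,
    )

def accept(route, router_id, cluster_id=None):
    """Loop check on reception; cluster_id is only set on route reflectors."""
    if route.originator_id == router_id:
        return False      # our own path came back to us
    if cluster_id is not None and cluster_id in route.cluster_list:
        return False      # the path already passed through our cluster
    return True

if __name__ == "__main__":
    original = Route("10.0.10.0/23")
    reflected = reflect(original, learned_from_id="192.0.2.251",
                        cluster_id="192.0.2.253")
    print(reflected)
    print(accept(reflected, router_id="192.0.2.254"))
    print(accept(reflected, router_id="192.0.2.251"))
```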
BGP security
As it was created in the innocent days of the early internet, BGP wasn't
designed with security concerns in mind. Over the years, that changed,
for the most part because it turned out mistakes made by the operators
of one AS could severely impact large parts of the rest of the internet,
and to a lesser degree because the lack of defenses in the BGP protocol
and in BGP operation was exploited for malicious purposes.
MD5 passwords
Because eBGP neighbors connect to each other over a shared layer-2
network (i.e., usually the same Ethernet), it's not really possible for
remote attackers to get between them. More extreme scenarios are
possible, albeit hard to pull off and/or hide. For instance, an
attacker could physically insert themselves between two BGP routers
and become a “man in the middle”. Or perhaps at an internet exchange,
an attacker steals another member's IP address.
But the most likely attack vector is someone sending fake TCP reset
packets. When a computer (or router) receives a TCP packet for a TCP
session that it doesn't recognize, it sends back a TCP RST packet. When
the sender of the original TCP packet receives the RST packet, it knows
that the TCP session is dead and removes it. So for instance, if a system
reboots, this reset mechanism will make sure that TCP sessions that
were active before the reboot don't linger on the other side of that TCP
connection.
How do we protect against these attacks? Ideally, we'd use IPsec [W],
which is designed to protect against exactly these kinds of attacks.
However, protecting BGP sessions with IPsec has never entered common
practice among network operators.
Instead, we use the TCP MD5 signature option [RFC 2385]. This
works by calculating an MD5 hash over the TCP segment that contains
a BGP message plus a password that both sides have agreed upon. The
MD5 hash is then placed in a TCP option and the TCP segment is
transmitted to the other side.
The receiver also calculates the MD5 hash in the same way, and then
checks if the resulting hash is the same as the one in the TCP option. If
so, we can be sure both ends used the same password and the segment
wasn't modified in transit. So the segment is accepted for further
processing. If not, it's discarded without further action.
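The principle is easy to demonstrate. The sketch below is deliberately simplified: the real option defined in RFC 2385 also hashes the TCP pseudo-header and the TCP header (without options), while here we just hash some segment bytes followed by the password. The function names and sample data are made up.

```python
import hashlib
import hmac

def sign_segment(segment: bytes, password: bytes) -> bytes:
    """Sender side: MD5 over the segment followed by the shared password."""
    return hashlib.md5(segment + password).digest()

def verify_segment(segment: bytes, received_digest: bytes, password: bytes) -> bool:
    """Receiver side: recompute the hash and compare it with the one received."""
    expected = sign_segment(segment, password)
    return hmac.compare_digest(expected, received_digest)

if __name__ == "__main__":
    segment = b"pretend this is a TCP segment with a BGP message"
    digest = sign_segment(segment, b"secretpwd")
    print(verify_segment(segment, digest, b"secretpwd"))      # same password
    print(verify_segment(segment, digest, b"wrongpwd"))       # password mismatch
    print(verify_segment(segment + b"!", digest, b"secretpwd"))  # modified segment
```

Note that the password itself is never transmitted: an attacker who doesn't know it can't produce a hash that the receiver will accept.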
The password-protected BGP session towards AS 65083 immediately
came up. But even after giving it some time, the session towards AS
65084 never came up. Also, the message counters are still zero, so no
BGP messages were exchanged at all. On a Cisco router, a message like
this will appear in the log:
Or, if the other side has a password configured, but the passwords on
both sides of the BGP session don't match:
With routing software like FRRouting, the routing software won't
generate log entries as the issue is handled in the kernel TCP code.
Router# conf t
Router(config)# router bgp 65082
Router(config-router)# neighbor 203.0.113.84 password secretpwd
Router(config-router)# ^Z
In practice, not everyone uses TCP MD5 passwords on all of their BGP
sessions. Doing this is most important for BGP sessions carrying many
prefixes, such as eBGP sessions with transit ISPs or transit customers
as well as iBGP sessions. Having to remove so many prefixes after a
reset, rerouting those destinations over other paths, and then
re-establishing the BGP session and restoring the previous state is
rather disruptive.
Another reason to be reluctant with using TCP MD5 passwords
everywhere is that, although the MD5 algorithm should be very fast, in
practice some router CPUs may have a hard time keeping up with a
large flood of TCP packets with MD5 hashes to check. So then this
mechanism becomes a denial-of-service attack surface.
The MD5 algorithm has long since been proven vulnerable to collision
attacks. But RFC 2385 doesn't have a mechanism that allows for
upgrading the hashing algorithm. [RFC 5925] specifies a new TCP
authentication option (TCP-AO) that addresses the limitations of the
TCP MD5 option, but so far, TCP MD5 is still more common than TCP-AO,
which is less widely supported.
IP level access lists (packet filters) that reject these packets won't work
very well because the addresses will be spoofed. Most networks make
sure their customers can't successfully send packets with spoofed
source addresses, but if an attacker finds a place where this is possible,
these packets will look legitimate to your access lists.
(The hop limit field is the new name for the otherwise identical IPv4
time to live field. The purpose of the TTL and hop limit is to make sure
packets won't circle around forever when routing loops occur.)
The sender sets this field to 255. If the packet is then delivered to a
receiver on the same subnet, the hop limit will still be 255. However, if
the packet passed through a router, the router will have decremented
the hop limit value. So if the receiver sees a value of 255, it knows the
packet was sent locally.
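In code, the receiver's check is about as simple as it gets. This sketch only models the case described here, where both routers are on the same subnet; the names are made up, and actual implementations let you configure how many hops away a neighbor may be.

```python
SENDER_HOP_LIMIT = 255   # value the sender puts in the TTL / hop limit field

def routers_traversed(received_hop_limit: int) -> int:
    """Each router on the way decrements the field by one."""
    return SENDER_HOP_LIMIT - received_hop_limit

def accept_packet(received_hop_limit: int) -> bool:
    """Only accept packets that can't have passed through a router."""
    return received_hop_limit == SENDER_HOP_LIMIT

if __name__ == "__main__":
    for value in (255, 254, 64):
        print(value, accept_packet(value), routers_traversed(value))
```

A remote attacker can put any value in the field, but can't stop the routers along the way from decrementing it, so the packet can never arrive with a value of 255.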
Interestingly, BGP already did sort of the same thing: it sets the TTL to
1. This way, if something goes wrong and packets between two eBGP
speakers flow through a third router, that router will decrement the
TTL to 0 and discard the packet.
Note that unlike with the TCP MD5 password, it seems we have to set
ttl-security hops 1 for the peer group, not the individual
neighbors. I expect that this limitation may not apply on all types of
routers. But when it does, this can be addressed by simply making a
duplicate peer group with GTSM disabled on the old peer group and
enabled on the new one.
Peer 83 also has GTSM enabled, but 84 hasn't. So the BGP session
towards AS 65083 comes up without trouble, but the one towards AS
65084 remains stuck in the Connect, OpenSent or OpenConfirm states:
Neighbor        AS    MsgRcvd MsgSent InQ OutQ Up/Down  State
203.0.113.83    65083       7       8   0    0 00:03:28 1
203.0.113.84    65084       0       5   0    0 never    OpenSent
gator. Which of course immediately becomes overwhelmed by all the
traffic.
Now this would only have been problematic regionally if the YouTube
prefix had in fact been a /24. In that case, routers elsewhere would
have had to choose between the path to the real YouTube and the leaked
path through PCCW and Pakistan Telecom. Outside Asia, most networks
would probably have used the legitimate path towards YouTube as that
would have been shorter. But YouTube actually announced a /22, so
the “more specific” /24 vacuumed up the traffic to YouTube's
streaming servers from all over the world.
Four years later, something similar happened again with some of the
same players involved.
Then in 2018 there was an incident with definite malicious intent:
someone hijacked the IP prefixes for Amazon AWS' DNS servers. The
attacker then sent back fake DNS replies for myetherwallet.com in
order to redirect visitors of that wallet service for the Ether
cryptocurrency to their own servers. This way, the attackers could
obtain the login credentials of users of the service who tried to log
in and ignored an HTTPS certificate warning. Apparently the attackers
were able to get away with $150,000 worth of Ether.
In general, ISPs can, and definitely should, create and maintain such
filters on BGP sessions with customers. Things get more difficult for a
large ISP that has a smaller ISP as its customer. This means the small
ISP has to inform the big ISP whenever it adds new address prefixes,
or when it adds new customers that advertise one or more prefixes of
their own. And then wait for the big ISP to update their filters
accordingly. It gets worse with peering, because now each peer needs
to inform every other peer of changes to the prefixes they advertise.
There is no way to make this work manually.
the IRRs is complete and trustworthy, this makes it possible to
generate filters automatically.
% whois -h whois.afrinic.net 196.216.2.6
inetnum: 196.216.2.0 - 196.216.3.255
netname: AFRINIC
descr: AfriNIC - Internal Use
country: ZA
org: ORG-AFNC1-AFRINIC
admin-c: CA15-AFRINIC
tech-c: IT7-AFRINIC
status: ASSIGNED PI
mnt-by: AFRINIC-HM-MNT
ARIN and LACNIC show largely the same information, but in a
different format. The format that the other three use is actually the
IRR format, RPSL: the Routing Policy Specification Language [RFC 2622].
The heavy lifting in RPSL is done by the aut-num and route / route6
objects. In addition to administrative information listed above, an
aut-num object may list a network's routing policy:
% whois -h whois.ripe.net as1125
aut-num: AS1125
as-name: UNSPECIFIED
descr: SURF Test Network
import: from AS1103 action pref=100; accept ANY
export: to AS1103 announce AS1125
So AS 1125 accepts all prefixes from AS 1103 and only sends its own
prefix(es) to AS 1103. pref=100 suggests that the local preference is
100. However, in RPSL a lower pref value is considered more
preferred, and it's unclear how an unspecified pref value is evaluated.
AS 1104 may send prefixes it originates itself, as well as those that
originate in AS 3333. So AS 3333 is a customer of AS 1104. However, should
AS 1104 gain another BGP customer, then AS 1103 would have to
update its aut-num object to also list that new customer AS number in the
AS 1104 import line.
The ASes accepted from AS 702 are not listed here one-by-one, but
those are in the as-set object AS-UUNETEURO instead. When AS 702
connects a new customer, they can simply update this as-set object
and AS 1103 will automatically start accepting prefixes from that new
customer after rebuilding their filters from the IRR data.
The export lines show that AS 1103 will only advertise prefixes
originated in its own AS and its customers' ASes (as listed in AS-SURFNET)
to most of these neighboring ASes. This means those ASes are peers.
Only AS 1104 gets all prefixes and is thus a customer. Interestingly, the
AS 1125 import policy doesn't match the AS 1103 export policy. This is
not ideal, but doesn't create immediate problems.
Of course to create prefix filters, we need to know IPv4 and IPv6
prefixes, not just AS numbers. Let's find these with an inverse query:
% whois -h whois.ripe.net -- "-i origin as112"
route: 192.175.48.0/24
descr: Root Server Technical Operations Assn
origin: AS112
mnt-by: NETNOD-MNT
created: 2002-12-17T14:02:55Z
route6: 2620:4f:8000::/48
descr: Root Server Technical Operations Assn
origin: AS112
mnt-by: NETNOD-MNT
See the MANRS Implementation Guide for some more pointers on
setting up objects in the various RIR IRRs. The RIRs also all have their
own documentation.
ARIN recently switched to a next generation IRR, leaving all the legacy
data in the old IRR behind. The new IRR uses different query
commands. For instance:
(The \ is not part of the command but required on the macOS / Linux
command line to keep the shell from giving the ! special treatment.)
You definitely need to read the ARIN Internet Routing Registry (IRR)
page but disregard the “RIPE-style” examples, as those don't work
anymore. And then the IRRd 4.2.5 Whois queries documentation to
really understand the new syntax.
num object. This way, you won't end up on the wrong side of filters
generated from IRR data. If you have BGP customers then of course
those need to be in your routing policy as well, and you'll want to
make an as-set object to list those customers' ASes along with your
own.
RPKI
The Resource Public Key Infrastructure (RPKI) is a mechanism that
allows holders of IP addresses and AS numbers to prove they are the
legitimate holder of that resource using a certificate they get from an
RIR. The RPKI architecture [RFC 6480] provides an overview of how
all of this works.
With such a certificate, an address holder can create and sign a Route
Origination Authorization (ROA) which indicates which AS is allowed
to originate the prefix in question. (If multiple ASes are allowed to
originate the prefix, multiple ROAs must be created.)
The RIR TALs can be downloaded from their websites. Until recently,
this was not the case for the ARIN TAL, as they used to require relying
parties to sign an indemnification agreement first.
In practice, few people create ROAs themselves and use the
RIR-provided RPKI certificate to sign them. The easier option is to use
the RIR portal to have the RIR generate ROAs. ARIN has a slightly
different procedure because they require the private key of the
certificate that signs the ROA to reside with the address holder.
So now we have that big list of prefixes and valid origin ASes. How do
we use this to filter?
That part is called RPKI route origin validation, although people
mostly just say “RPKI” when they actually mean RPKI route origin
validation. Unlike traditional filters, the RPKI-derived filter is not
simply copied to a router's configuration file. Instead, a relying
party cache server uses the RPKI-Router protocol [RFC 8210] to
transmit the filter to routers. Routers then apply the filter to the
paths they have in their BGP table, with three possible results:
• Valid: there is a ROA covering this prefix and prefix length, and
the origin AS of the path (the AS that originated the prefix, shown
rightmost in the AS path) matches the origin AS in the ROA.
• Invalid: there is a ROA covering this prefix, but either the prefix
length is longer than the maximum specified in the ROA, or the
origin AS of the path doesn't match the origin AS in the ROA.
• Not found: there is no ROA covering this prefix.
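The validation logic itself fits in a few lines. The sketch below follows the rules above; the ROA entries are hypothetical (prefix, maximum length, origin AS) tuples using private addresses and AS numbers, and the validate function is not taken from any router implementation.

```python
import ipaddress

# Hypothetical ROA data: (prefix, maximum prefix length, origin AS)
ROAS = [
    ("10.0.10.0/23", 24, 65010),
    ("10.0.16.0/21", 21, 65020),
    ("10.0.40.0/21", 23, 65040),
]

def validate(prefix, origin_as, roas=ROAS):
    """Return 'valid', 'invalid' or 'not-found' for a prefix and its origin AS."""
    net = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, max_length, roa_as in roas:
        roa_net = ipaddress.ip_network(roa_prefix)
        if net.version != roa_net.version or not net.subnet_of(roa_net):
            continue                      # this ROA doesn't cover the prefix
        covered = True
        if net.prefixlen <= max_length and origin_as == roa_as:
            return "valid"                # one matching ROA is enough
    return "invalid" if covered else "not-found"

if __name__ == "__main__":
    print(validate("10.0.10.0/23", 65010))
    print(validate("10.0.20.0/22", 65020))
    print(validate("10.0.83.0/24", 65083))
```

Note that a single matching ROA makes a path valid, even if other ROAs covering the same prefix don't match.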
In the next few examples we're going to see how that works. Rather
than run an actual relying party server and validate certificates and
ROAs ourselves, we're going to use the GoRTR tool made by Cloudflare
that reads a filter from a JSON file and transmits it to routers using
the RPKI-Router protocol. By default, GoRTR downloads a JSON file from
Cloudflare with the full current RPKI filter. (That file is currently
30 MB.) But we'll use our own with the following data:
{"prefix":"10.0.10.0/23","maxLength":24,"asn":"AS65010","ta":""},
{"prefix":"2001:db8:10::/44","maxLength":44,"asn":"AS65010","ta":""},
{"prefix":"10.0.16.0/21","maxLength":21,"asn":"AS65020","ta":""},
{"prefix":"2001:db8:20::/44","maxLength":44,"asn":"AS65020","ta":""},
{"prefix":"10.0.40.0/21","maxLength":23,"asn":"AS65040","ta":""},
{"prefix":"2001:db8:40::/44","maxLength":47,"asn":"AS65040","ta":""}
If you're trying this out yourself with the BGP minilab, first start a
Docker image of the GoRTR tool in a separate terminal / shell window
as follows:
./run-gortr.sh
Or, on Windows:
.\run-gortr.ps1
Stop it when you're done with control-c. We can now run example 34,
which has a routine BGP setup we've seen in earlier examples coupled
with some RPKI settings.
bgpd_options=" -A 127.0.0.1"
to:
bgpd_options=" -A 127.0.0.1 -M rpki"
Router# show ip bgp
Status codes: s suppressed, d damped, h history, * valid, > best, = multipath,
              i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
RPKI validation codes: V valid, I invalid, N Not found
So the paths originated by ASes 65010 and 65040 are all valid (V),
while 10.0.20.0/22 from AS 65020 is invalid (I). That's because there
is a ROA for 10.0.16.0/21 with a /21 maximum length and
10.0.20.0/22 matches 10.0.16.0/21 (it's the second half of that
block), but the /22 prefix length is longer than the /21 limit in the
ROA.
However, apart from an extra letter in the show ip bgp output, the
RPKI state doesn't have any consequences. The RPKI invalid paths are
still considered valid for BGP path selection purposes:
Router# show ip bgp 10.0.20.0/22
BGP routing table entry for 10.0.20.0/22, version 7
Paths: (1 available, best #1, table default)
Not advertised to any peer
65030 65020
192.0.2.21 from 192.0.2.21 (198.51.100.223)
Origin IGP, valid, external, best (First path received), rpki validation-state: invalid
Unfortunately, that's not good enough. Consider the invalid path for
10.0.42.0/23 from example 34. Even if we give this path a really low
local preference, packets to, for instance, 10.0.42.13 will still flow
according to that invalid /23, while the path for the encompassing
prefix 10.0.40.0/21 that is valid and has a high local preference is
ignored as per the longest match first rule.
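A small sketch makes it clear why the local preference doesn't help here. Local preference only decides between paths for the same prefix, while forwarding picks the most specific prefix that matches the destination. The table and the local preference values below are made up for illustration.

```python
import ipaddress

# Hypothetical routing table: prefix -> attributes of the best path
ROUTES = {
    "10.0.40.0/21": {"rpki": "valid", "local_pref": 200},
    "10.0.42.0/23": {"rpki": "invalid", "local_pref": 10},
}

def longest_match(routes, destination):
    """Return the most specific prefix that contains the destination."""
    addr = ipaddress.ip_address(destination)
    matches = [ipaddress.ip_network(p) for p in routes
               if addr in ipaddress.ip_network(p)]
    if not matches:
        return None
    return str(max(matches, key=lambda net: net.prefixlen))

if __name__ == "__main__":
    for destination in ("10.0.42.13", "10.0.44.1"):
        prefix = longest_match(ROUTES, destination)
        print(destination, "->", prefix, ROUTES[prefix])
```

Packets to 10.0.42.13 follow the /23 no matter how low its local preference is, because the local preference is never compared with that of the /21.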
elsewhere or because of an attack. Filtering out such prefixes
accomplishes the same thing that RPKI is supposed to protect against:
becoming unreachable.
Let's have a look at example 35, which adds route maps that implement
filtering and adjusting the local preference to the previous
configuration.
Example 35. Filtering RPKI invalid paths and higher local preference for valid
paths
!
router bgp 65082
neighbor 192.0.2.21 route-map apply-rpki-transit in
neighbor 192.0.2.41 route-map apply-rpki-transit in
neighbor 203.0.113.83 route-map apply-rpki-peers in
!
route-map apply-rpki-transit deny 10
match rpki invalid
!
route-map apply-rpki-transit permit 20
match rpki valid
set local-preference 200
!
route-map apply-rpki-transit permit 30
!
route-map apply-rpki-peers deny 10
match rpki invalid
!
route-map apply-rpki-peers permit 20
match rpki valid
set local-preference 210
!
route-map apply-rpki-peers permit 30
set local-preference 110
!
There are two route maps to handle RPKI: one for BGP sessions with
transit providers and one for BGP sessions with peers. The
apply-rpki-transit route map starts with a deny clause and looks for
paths with RPKI state invalid. That means that when a match happens,
the path is filtered out and doesn't get added to the BGP RIB.
The next clause looks for RPKI state valid, and if there is a match the
local preference is set to 200 and the path is (implicitly) added to the
BGP RIB. The last clause has neither a match nor a set and thus
matches all paths that get to this stage and adds them to the BGP RIB.
This would be paths with RPKI state not-found, but also paths without
any RPKI state at all.
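The first-match behavior of the apply-rpki-transit route map can be modelled like this. It's a toy model that only understands an RPKI state match and a local preference set, with the clauses written as (action, RPKI state to match, local preference to set) tuples; the function name and the default local preference of 100 are assumptions for the sake of the example.

```python
# (action, RPKI state to match or None for "match everything", local pref to set)
APPLY_RPKI_TRANSIT = [
    ("deny", "invalid", None),     # clause 10
    ("permit", "valid", 200),      # clause 20
    ("permit", None, None),        # clause 30
]

def apply_route_map(route_map, rpki_state, local_pref=100):
    """Return the resulting local preference, or None if the path is filtered."""
    for action, match_state, set_local_pref in route_map:
        if match_state is not None and match_state != rpki_state:
            continue                       # clause doesn't match, try the next one
        if action == "deny":
            return None                    # path is filtered out
        return set_local_pref if set_local_pref is not None else local_pref
    return None                            # no clause matched: implicit deny

if __name__ == "__main__":
    for state in ("invalid", "valid", "not-found", None):
        print(state, "->", apply_route_map(APPLY_RPKI_TRANSIT, state))
```

A path without any RPKI state (None in the sketch) falls through to the last clause and is accepted with the default local preference.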
The output below is the result of first starting the routers and letting
them establish their BGP sessions, and only then starting the GoRTR
cache server:
Router# show ip bgp
RPKI validation codes: V valid, I invalid, N Not found
When the router connected to the RPKI cache server, the RPKI state
was added to each path, but because the paths were already received
from the BGP neighbors and evaluated by the route maps when there
was no RPKI state yet, the invalid path wasn't filtered and the valid
ones didn't get the higher local preference.
This means that an invalid path that was received before the cache
server became available will remain in the BGP table until there is a
change somewhere that triggers a BGP update for such a path. New
invalid paths will be filtered as intended. Or we can ask our BGP
neighbors to send a full set of updates:
Now the invalid path is gone and the valid ones have the intended
higher local preference.
Note that giving valid paths a higher local preference than not-found
paths will not do anything under normal circumstances, as it is
impossible to have a valid and a not-found path for the same prefix at
the same time. After all, valid requires the presence of a matching
ROA, while not-found is only possible if there is not a matching ROA.
i10.0.83.0/24 192.0.2.151 110 65083 i
* i192.0.2.0/24 192.0.2.151 100 i
*> 0.0.0.0 i
Because this router doesn't have any eBGP sessions, it's not necessary
to have it perform RPKI route origin validation itself, as all the eBGP
routers will have already done that. But it's still informative to be able
to see what the RPKI validation state was by looking at the local
preference.
Then again, having a higher local preference for valid paths could
lead to unexpected results when some routers are able to perform RPKI
validation and others aren't. The traffic will then flow over the routers
with working RPKI and thus the higher local preferences, possibly
taking a longer than necessary route in the process.
Worse, having a ROA with a maximum length longer than the prefix
length of prefixes that are actually advertised undermines one of the
most important advantages of RPKI route origin validation: the ability
to filter unwanted more specifics. More specifics originated from a
different AS will still be filtered because the origin validation
fails. But if accidental leaking of more specifics preserves the
original origin AS, or in the case of a deliberate attack where the
attacker simply spoofs the authorized origin AS, these more specifics
will be considered valid and be able to do their damage.
However, this doesn't necessarily require setting a large maximum
prefix length (such as /24 for IPv4) in ROAs. An alternative is to
selectively allow invalid paths. For instance, in example 36, AS 65083
is an anti-DDoS filtering service. AS 65040 is experiencing a DDoS
attack for addresses within 10.0.41.0/24. So AS 65083 advertises that
prefix to make the “dirty” traffic flow to AS 65083 so it can be
filtered and then the “clean” traffic is forwarded to AS 65040. The
configuration in the example allows the /24 advertisement from AS
65083 even though it has RPKI state invalid.
route-map override-rpki-peers permit 10
match rpki invalid
match ip address prefix-list only24
set local-preference 50
!
route-map override-rpki-peers permit 20
match rpki invalid
match ipv6 address prefix-list only48
set local-preference 50
!
route-map override-rpki-peers deny 30
match rpki invalid
!
route-map override-rpki-peers permit 40
match rpki valid
set local-preference 210
!
route-map override-rpki-peers permit 50
set local-preference 110
!
For both the IPv4 and the IPv6 BGP sessions to AS 65083 the route map
override-rpki-peers is applied to incoming updates. The permit
10 clause matches if a path both has RPKI state invalid and matches
prefix list only24. That prefix list only matches IPv4 prefixes with a
prefix length less than or equal to /24 and also greater than or equal
to /24. So only exactly /24. In that case, the local preference is set
to 50, and the router adds the path to the BGP RIB.
The permit 20 clause does the same thing for IPv6 /48. After that,
RPKI invalid prefixes that didn't match earlier are discarded by the
deny 30 clause, RPKI valid paths get a 210 local preference, and
everything else remaining at this stage gets a 110 local preference.
The results:
Router# show bgp ipv4 unicast
RPKI validation codes: V valid, I invalid, N Not found
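As a cross-check on the clause-by-clause description of the override-rpki-peers route map, its decisions can be mimicked in a few lines of Python. This is only a sketch: it assumes that the prefix lists only24 and only48 match any prefix of exactly that length (their definitions are not shown here), and the function name is made up for illustration.

```python
import ipaddress

def override_rpki_peers(prefix, rpki_state):
    """Return the local preference to set, or None if the path is denied."""
    net = ipaddress.ip_network(prefix)
    if rpki_state == "invalid":
        # permit 10 / permit 20: only exactly /24 (IPv4) or /48 (IPv6)
        if net.version == 4 and net.prefixlen == 24:
            return 50
        if net.version == 6 and net.prefixlen == 48:
            return 50
        return None  # deny 30: all other invalid paths
    if rpki_state == "valid":
        return 210  # permit 40
    return 110  # permit 50: everything else, meaning not-found

print(override_rpki_peers("10.0.41.0/24", "invalid"))
print(override_rpki_peers("10.0.41.0/25", "invalid"))
print(override_rpki_peers("10.0.83.0/24", "valid"))
```

Note how the order of the clauses matters: the two narrow permits for invalid paths have to come before the blanket deny.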
Last but not least, let's see how an attacker can defeat RPKI route
origin validation in example 37, where apparently someone nefarious has
taken over AS 65083.
other prefix learned from AS 65083. Because the AS 65083 router was
configured with attribute-unchanged as-path and “65040” was
already prepended to the path, it simply used “65040” as the AS path. For
the other prefix AS 65083 was sending, the AS path was still empty so
it had to insert its own AS number into the path before sending the
BGP update.
Also, there is a certain risk involved with trusting a system like RPKI
with many moving parts to inject filters into your routers that could
very well filter legitimate paths if something goes wrong. Then again,
RPKI is the best tool we have today to address issues with BGP such as
the incidents mentioned under “scary stories”. All of these incidents
(in unmodified form) would have had no or relatively little impact had
RPKI route origin validation been in place at the time.
BGPsec
As the internet moved from a sheltered academic environment into the
wider world and gained popularity in the 1990s, it became clear that
BGP offers very few protections against someone other than the
legitimate holder of an address block announcing that address block.
Another problem is that path attributes such as the AS path don't have
any protection against manipulation as BGP updates make their way
across the globe.
The main thing that BGPsec does is replace the AS_PATH attribute with
the BGPsec_PATH. Like the regular AS_PATH attribute, the BGPsec_PATH
also lists all the AS numbers from the AS that originated a prefix to
the current AS. In addition, after each hop there is a signature over
the path so far and also the next hop. Upon reception these signatures
are checked to make sure the path wasn't tampered with and each next AS
in the path was indeed authorized by the previous one.
However, apart from closing that RPKI loophole, it's not clear how
cryptographically protecting the AS path is all that helpful in practice.
Deliberate attacks mainly consist of announcing more specifics. With
RPKI blocking that, an attacker can no longer “shoplift” a targeted
sub-prefix of the victim's larger address block, but they have to steal
the whole shop, so to speak.
And for every hop in the AS path of every prefix, a signature must be
checked. Even a highly optimized implementation of the ECDSA P-256
algorithm running on a modern desktop CPU makes verifying a million
prefixes times an average four hops take several minutes. Double that
for two full BGP feeds, and add more for peers that also send
significant numbers of prefixes.
So how secure is BGP?
At the beginning of the chapter, I posited that BGP can be considered
secure if we can be confident that we're able to answer the following
questions with “yes”:
With just RPKI, we get close to a “yes” for question 2a, but we don't
quite get there. RPKI + BGPsec would do the trick, though.
With question 2b, BGPsec gets us closer to a “yes”, but again, not quite
there. The problem is that it's not the address holder who gets to
decide which path updates follow, but that each hop gets to decide on
the next hop.
Making BGP faster
• Our standard router 82 / R1 to observe the BGP state and BGP table
(We can't run the ping on router 82 because that router would use the
address from the interface to AS 65040 that we're going to bring down,
so the pings won't resume after rerouting. In general, doing pings and
traceroutes to external destinations from an eBGP router can have
unexpected results because the source addresses are atypical.)
And while the ping is running, we tell the AS 65040 router to shut
down the interface towards AS 65082:
ISP40# conf t
ISP40(config)# interface eth0.1401
ISP40(config-if)# shutdown
Over at R4, the ping is still running but produces no output. After
nearly three minutes, the output resumes:
...
64 bytes from 10.0.47.255: seq=167 ttl=60 time=0.173 ms
64 bytes from 10.0.47.255: seq=168 ttl=60 time=0.157 ms
64 bytes from 10.0.47.255: seq=169 ttl=60 time=0.135 ms
^C
With the last ping before the pause being number 3 and the first one
after 167, we missed 163 packets at a rate of one packet per second. We
can also see that the path now takes more hops as the TTL for the
incoming ping packets is now 60 while it was 63 before. This is what
show ip bgp summary and show ip bgp looked like during the
simulated outage:
R1# show ip bgp sum
Neighbor V AS MsgRcvd MsgSent Up/Down State/PfxRcd
192.0.2.21 4 65030 21 14 00:07:45 4
192.0.2.41 4 65040 24 23 00:00:25 Active
192.0.2.154 4 65082 12 27 00:07:45 1
In the BGP open message, both BGP speakers announce the hold time
they'd like to use. The lower of the two will be used by both sides.
(Unless a higher bgp minimum-hold time is in effect. But then the BGP
session may not come up at all.) So with the example 39 configuration
on R1, but no such settings on the AS 65040 router, we see this:
ISP40# show ip bgp neighbors 192.0.2.42
BGP neighbor is 192.0.2.42, remote AS 65082, local AS 65040,
external link
Hostname: R1
BGP version 4, remote router ID 192.0.2.251, local router ID
10.0.47.255
BGP state = Established, up for 00:07:28
Last read 00:00:04, Last write 00:00:04
Hold time is 20, keepalive interval is 6 seconds
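The hold time and keepalive interval in this output can be reproduced with a few lines of Python. This is a minimal sketch, assuming that the AS 65040 router offers the default 180-second hold time and that the keepalive interval is one third of the hold time, rounded down. It ignores special cases such as a hold time of zero or a configured minimum hold time, and the function name is made up for illustration.

```python
def negotiate_timers(hold_a, hold_b):
    """Return (hold_time, keepalive) in seconds for a BGP session."""
    hold = min(hold_a, hold_b)  # the lower of the two offers wins
    keepalive = hold // 3       # one third of the hold time, rounded down
    return hold, keepalive

print(negotiate_timers(20, 180))
```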
After learning the 20 second hold time from R1, the AS 65040 router
divided that number by three to arrive at the 6 second keepalive time.
If we now again run a ping from R4 and after a few seconds shut down
the eth0.1401 interface on the AS 65040 router, we get the following
results:
R4# ping 10.0.47.255
PING 10.0.47.255 (10.0.47.255): 56 data bytes
64 bytes from 10.0.47.255: seq=0 ttl=63 time=0.091 ms
64 bytes from 10.0.47.255: seq=1 ttl=63 time=0.104 ms
64 bytes from 10.0.47.255: seq=2 ttl=63 time=0.103 ms
64 bytes from 10.0.47.255: seq=18 ttl=60 time=0.108 ms
64 bytes from 10.0.47.255: seq=19 ttl=60 time=0.138 ms
64 bytes from 10.0.47.255: seq=20 ttl=60 time=0.136 ms
^C
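The length of an outage can be read off the gap in the ping sequence numbers. The sketch below does that arithmetic for both test runs; the function name is made up for illustration, and it assumes one ping per second with no losses other than the outage itself.

```python
def lost_pings(last_seq_before, first_seq_after):
    """Number of sequence numbers missing between two received replies."""
    return first_seq_after - last_seq_before - 1

print(lost_pings(3, 167))  # run with the default timers
print(lost_pings(2, 18))   # run with the 20-second hold time
```

At one ping per second, the second result corresponds to an outage of roughly 15 seconds, which fits within the 20-second hold time: the hold timer was last reset by a keepalive that arrived some time before the interface went down.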
In the early 2000s I once got bitten by too aggressive hold times. Back
in those days a router that was just big enough to handle several full
BGP feeds and also do peering over an internet exchange didn't have a
very fast CPU. One day there was some instability in BGP somewhere,
with the result that the router got a lot of updates over several BGP
sessions at the same time. All this update churn meant that the router
was so busy installing new paths in the BGP RIB, recalculating the best
path, and then generating BGP updates of its own, that it didn't get
around to sending keepalive messages at the appropriate intervals.
With the result that the other end would tear down the BGP session as
the hold time expired. Which led to more work for the poor router.
Which in turn made it fail to send keepalives on time, and so on.
It's very likely that today's routers won't fall into this trap. To some
degree because the router CPUs are much faster, although that's
largely negated by the BGP table getting so much bigger. But mostly
because it's unlikely that BGP implementations still use such a
monolithic BGP update processing system that keepalives are delayed if
there are incoming update messages to process.
Also, BGP failure detection doesn't solely depend on the hold time. A
standard feature on most routers is “fast external fallover”. This means
the router tracks the up/down state of the physical interface an eBGP
session runs over. When the interface goes down, the BGP session is
then immediately taken down. For this reason, it's always good to
have a direct cable between two eBGP routers.
Having any switches in the middle may foil the fast external fallover
feature, as it's the switch that sees the interface link signal go away,
but the router doesn't know that because it only sees the state of its
connection to the switch, not the end-to-end state of the connection to
the remote router.
BFD: bidirectional forwarding detection
Bidirectional forwarding detection (BFD) [RFC 5880] is a protocol for
quickly detecting link failures. It can be used with different routing
protocols, including BGP. BFD is designed to, when possible, test
whether the neighbor is still forwarding packets, not whether a routing
protocol such as BGP is still running. With BFD, it's possible to detect
failures within a few tens of milliseconds. However, this will easily
detect failures that aren't really there, so such aggressive timing should
only be used if important applications really need it. Example 40
shows a BFD configuration.
We define two BFD peers using the addresses we also use for the BGP
sessions. In the case of the IPv6 peer, it's necessary to explicitly specify
the address on our side. For the BGP sessions, all that's needed is
neighbor ... bfd. For the IPv6 BGP session, we use the profile
2seconds which modifies the timers and detect multiplier.
BFD has two main timers: one that indicates how fast we're prepared
to receive, and one that indicates how fast we want to send. During
BFD session establishment, this information is exchanged by both
sides so each will limit how often it sends test packets to stay within
the receive interval of the other side. So both sides can send test
packets at different rates. The FRRouting default is 300 ms for transmit
and receive. The multiplier is how many packets in a row must be lost
before BFD declares the neighbor down. The default is 3. This means
that with the FRR default settings, a failure will be detected after
between 600 and 900 milliseconds. Let's test as before:
R4# ping 10.0.47.255
PING 10.0.47.255 (10.0.47.255): 56 data bytes
64 bytes from 10.0.47.255: seq=0 ttl=63 time=0.086 ms
64 bytes from 10.0.47.255: seq=1 ttl=63 time=0.106 ms
64 bytes from 10.0.47.255: seq=2 ttl=63 time=0.104 ms
64 bytes from 10.0.47.255: seq=4 ttl=60 time=0.148 ms
64 bytes from 10.0.47.255: seq=5 ttl=60 time=0.111 ms
64 bytes from 10.0.47.255: seq=6 ttl=60 time=0.102 ms
^C
This is with the default 180-second hold time, but thanks to BFD the
BGP session was declared down and rerouting happened fast enough
that we only lost a single ping packet.
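The 600 to 900 millisecond detection window that follows from the FRR defaults can be worked out as below. This is a simplified sketch: it assumes both sides use the same multiplier and ignores the jitter that BFD applies to its transmit interval. The function and parameter names are made up for illustration.

```python
def bfd_detection_window_ms(local_tx_ms, remote_rx_ms, multiplier):
    """Return (earliest, latest) failure detection time in milliseconds."""
    # We may not send faster than the other side is prepared to receive.
    interval = max(local_tx_ms, remote_rx_ms)
    # When the failure happens, the last packet that made it through
    # arrived somewhere between zero and one full interval ago.
    return (multiplier - 1) * interval, multiplier * interval

print(bfd_detection_window_ms(300, 300, 3))  # FRR defaults
```

Detection within a second is consistent with losing only a single ping at one ping per second.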
If BFD is enabled on a BGP session, this is also visible in the output of
the show ip bgp neighbors ... command:
Graceful restart
What happened when we created link failures earlier in this chapter is
that the router on the AS 65082 end didn't notice the lost connectivity
for a while, as it was waiting for the hold time to expire. Should the
link failure be fixed before the existing BGP session times out, then
AS 65040 will initiate a new BGP session before the old one has
disappeared on the AS 65082 side.
And what does the AS 65082 router do when it hears the good news
that AS 65040 is reachable again? It removes all the paths that it had
learned from the AS 65040 router from the BGP RIB and the FIB and
then quickly proceeds to process BGP updates from AS 65040 in order
to restore those same paths. That doesn't seem like the most optimal
approach.
This is the issue that the graceful restart mechanism [RFC 4724]
addresses. With graceful restart enabled, if a reconnecting neighbor can
tell a router that it has retained its forwarding state (the contents of
the FIB) during the BGP session disconnect, the router will not flush
its routes. Instead, it marks the routes that would normally be removed
at this stage as “stale”. It then proceeds to process the newly
reconnected neighbor's update messages. When the neighbor is finished
sending all the prefixes eligible to be sent, it sends the “end-of-RIB
marker”. This is the receiving router's cue to remove any stale entries
that weren't “refreshed” by an update message from the RIB and FIB.
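The mark-as-stale, refresh, and clean-up sequence can be sketched as follows. This is a toy model and not how any particular BGP implementation stores its RIB: the dictionary layout, the prefixes and the function name are assumptions made for illustration.

```python
def helper_resync(rib, updates):
    """Re-learn a neighbor's paths without flushing them first.

    rib maps prefix -> AS path for one neighbor; updates is the list of
    (prefix, path) pairs received before the end-of-RIB marker.
    """
    stale = set(rib)              # mark everything from this neighbor stale
    for prefix, path in updates:  # process the update messages
        rib[prefix] = path
        stale.discard(prefix)     # refreshed, so no longer stale
    for prefix in stale:          # end-of-RIB: drop what wasn't refreshed
        del rib[prefix]
    return rib

rib = {
    "10.0.41.0/24": "65040",
    "10.0.47.0/24": "65040",
    "10.0.83.0/24": "65040 65083",
}
updates = [("10.0.41.0/24", "65040"), ("10.0.47.0/24", "65040")]
print(helper_resync(rib, updates))
```

The two refreshed prefixes never leave the table, so forwarding for them is not interrupted. Only the prefix that the neighbor no longer advertises is removed, and only after the end-of-RIB marker.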
Things get interesting when you use BFD and graceful restart together.
The rationale would be that BFD was created to detect forwarding
plane failures. If the BFD implementation can indeed detect that the
forwarding plane of a neighbor still works even though the control
plane (the CPU that runs the routing protocols and housekeeping) has
problems, then BFD has no reason to bring down a BGP session.
For instance, when the BGP implementation has crashed but the BGP
prefixes are still in the FIB. When BGP is restarted it will reconnect and
graceful restart makes sure there is no unnecessary FIB
remove/reinstall cycle for BGP prefixes.
The helper is the router that marks previous paths as stale and
removes remaining stale paths after the end-of-RIB marker. But that only
happens if the other router has graceful restart enabled for one or more
address families and tells the receiving router it was able to maintain
its forwarding plane state.
other without interrupting packet forwarding. As such, it's good to
have graceful restart enabled in helper mode to facilitate the high
availability functionality on other routers.
Best practices
It's one thing to run BGP, but it's another thing to do it well. In this
chapter I'll cover some topics that haven't been addressed so far that
will make BGP run better. We'll also look at BGP best practices
suggested elsewhere.
“Black starts”
Electrical power stations can and do go down, either for planned
maintenance or because of some kind of issue that makes them “trip”.
But guess what: to start up again, they need electricity. Normally they
can simply get that over the grid from other power stations that are
still running. But what if all power stations in a large area are down?
Starting up after such a wide scale power outage is called a “black
start”. To make a black start possible, some power plants have
additional facilities so they can start up without relying on power from the
grid. The black start stations can then supply power to let the other
stations also start.
One thing to think about is that if the network depends on the
distributed RPKI repository, but at the same time, being able to get at
the distributed RPKI repository and/or the RIR portal to make changes
to ROAs depends on the network, you now have a circular dependency
that makes fixing problems a big challenge. So it's extremely important
to always have a way to connect to the resources you need to diagnose
and fix problems without having to use your own infrastructure.
For instance, if you host mail servers in your own network, you will
probably not be able to send or receive email using your normal email
address during an outage. Or maybe you normally use a desktop
computer in the network operations center, but now you're using a
laptop with a 4G connection. But that laptop doesn't have access to
your normal password manager.
It's important to think through these scenarios and make sure a “black
start” can happen as quickly as possible.
What you want to do before you perform any of these potentially
disruptive actions is shut down the relevant eBGP sessions. If you're
going to bring a link down, this would be the BGP session or sessions
that run over that link. If you're going to bring down or reboot the
router, shut down all eBGP sessions first. This is with the example 11
configurations running:
Router# conf t
Router(config)# router bgp 65082
Router(config-router)# neighbor 192.0.2.41 shutdown
Router(config-router)# neighbor ix-ipv4-peers shutdown
Router(config-router)# ^Z
The first line shuts down the BGP session with AS 65040 and the
second line shuts down the BGP sessions that are members of the peer
group ix-ipv4-peers, which are ASes 65083, 65084 and 4206508500.
In the show ip bgp summary overview it's made explicit that these
BGP sessions are administratively shut down:
quire support from the neighbor. See Cisco's documentation of
graceful shutdown for more information.
For instance, we can use a limit of 10,000 for all peers that advertise
fewer than 3000 prefixes, a limit of 100,000 for all peers that advertise
between 3000 and 30,000 prefixes, and no limit for peers that advertise
more. The IPv6 BGP table is about a tenth of the size of the IPv4 BGP
table, so for IPv6 all these numbers could be a factor of 10 lower.
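The tiers described above are easy to turn into a helper, for instance in a script that generates per-neighbor settings. This is a sketch under the stated numbers; the function name is made up, and scaling both the thresholds and the limits by ten for IPv6 is one reading of “all these numbers”.

```python
def max_prefix_limit(advertised, ipv6=False):
    """Suggest a maximum prefix limit, or None for 'no limit'."""
    scale = 10 if ipv6 else 1
    if advertised < 3000 // scale:
        return 10_000 // scale
    if advertised <= 30_000 // scale:
        return 100_000 // scale
    return None  # large peers and transit providers: no limit

print(max_prefix_limit(500))
print(max_prefix_limit(12_000))
print(max_prefix_limit(250, ipv6=True))
```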
The solution was a mechanism called flap damping [RFC 2439]. With
this system enabled, prefixes accrue a penalty when they flap. When
the penalty reaches a threshold, the prefix is suppressed and not
advertised to neighbors. The penalty decreases using an exponential
decay, and when it falls below a threshold, the prefix is no longer
suppressed and thus advertised to neighbors.
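To get a feel for the penalty and decay mechanism, here is a small numeric sketch. The parameter values (penalty per flap, suppress and reuse thresholds, half-life) are assumptions picked for illustration, not values taken from this text, and real implementations add details such as a maximum suppress time.

```python
import math

HALF_LIFE_MIN = 15       # assumed half-life of the penalty, in minutes
PENALTY_PER_FLAP = 1000  # assumed penalty added for each flap
SUPPRESS_LIMIT = 2000    # assumed threshold above which a prefix is suppressed
REUSE_LIMIT = 750        # assumed threshold below which it is advertised again

def decayed(penalty, minutes):
    """Penalty left after exponential decay over the given time."""
    return penalty * 0.5 ** (minutes / HALF_LIFE_MIN)

def minutes_until_reuse(penalty):
    """Time until the penalty has decayed to the reuse limit."""
    return max(0.0, HALF_LIFE_MIN * math.log2(penalty / REUSE_LIMIT))

penalty = 3 * PENALTY_PER_FLAP            # three flaps in quick succession
print(penalty >= SUPPRESS_LIMIT)          # is the prefix suppressed?
print(decayed(penalty, 15))               # penalty after one half-life
print(minutes_until_reuse(penalty))       # how long until it's usable again
```

With these assumed values, three quick flaps keep a prefix suppressed for half an hour after it has become stable again.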
“As the power of routers has increased, the original needs for
BGP Flap Damping is no longer a major concern for operators
or router equipment vendors as it was in the mid-1990s when
route flapping consumed a significant percentage of the CPU
of early routers. In fact, the negative effects of RFD, as
described above, have become the major concern, the cure has
become worse than the disease!”
Note that the BGP standard specifies that there is a minimum delay
between updates for the same prefix. This is the minimum route
advertisement interval (MRAI). The default value for the MRAI is 30
seconds for eBGP sessions and 5 seconds for iBGP sessions. So if a router
sends a withdraw for a certain prefix to an eBGP neighbor, but then a
few seconds later it has a new path for that prefix, it will sit out the 30
seconds before sending that next update. This has the advantage that if
more updates arrive during that 30 second period, those are coalesced
into a single update, rather than a stream of updates in quick
succession.
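The coalescing effect of the MRAI can be illustrated with a toy model. It assumes a timer per prefix, treats withdrawals the same as other changes, and lets the first change go out immediately; actual implementations differ in these details. The function name is made up for illustration.

```python
def update_send_times(change_times, mrai=30):
    """When updates for one prefix go out, given when its best path changed.

    All times are in seconds.
    """
    sent = []
    next_allowed = float("-inf")
    for t in sorted(change_times):
        if sent and t <= sent[-1]:
            continue  # folded into an update that is already waiting
        send_at = max(t, next_allowed)
        sent.append(send_at)
        next_allowed = send_at + mrai
    return sent

# Best path changes at 0, 3, 10 and 25 seconds, then once more at 70.
print(update_send_times([0, 3, 10, 25, 70]))
```

The changes at 3, 10 and 25 seconds all end up in the single update sent at 30 seconds.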
So the MRAI makes BGP more stable. But it also delays BGP
convergence. You can adjust the MRAI for a neighbor with neighbor ...
advertisement-interval <seconds> where routers will typically
accept a value between 0 and 600 seconds.
ANSSI's BGP configuration best practices is a relatively long
document, but it gets right to the point while also explaining why each
practice is necessary or useful, and how to implement it.
Philip Smith's BGP Best Current Practices. Philip Smith has given this
presentation around the world since at least 2005. The slides have a ton
of suggested practices, but very little in the way of explanation why
they're necessary or useful. I'm sure he covers that in the presentation,
it's just not on these slides.
ing from all transit providers and peers. (Obviously you only allow
exactly the customer's prefix(es) from customers, right?)
At the time of this writing, only 0.07% of IPv6 global unicast space has
been allocated by the RIRs. That's in about 60,000 prefixes, with
usually empty space between those prefixes so if an address holder needs
more space, their prefix can grow into that reserved space. This means
that the list of prefixes that describes the IPv6 unallocated space has no
fewer than 135,000 entries.
lookup in a routing table very quickly using binary search [W] or more
advanced algorithms. Filters that use these algorithms will be fast even
when they're huge, but if filters are simply evaluated line by line from
start to finish, then such large filters are too slow.
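To illustrate the difference with line-by-line evaluation, here is a sketch of a filter lookup that uses binary search over a sorted list of blocks. It uses a handful of special-use IPv4 blocks as stand-ins for a real bogon list, relies on the blocks not overlapping, and only catches prefixes that fall inside a listed block. The names BOGONS, STARTS and is_bogon are made up for illustration.

```python
import bisect
import ipaddress

# Non-overlapping special-use IPv4 blocks, sorted by start address.
BOGONS = sorted(ipaddress.ip_network(p) for p in [
    "10.0.0.0/8", "100.64.0.0/10", "127.0.0.0/8", "169.254.0.0/16",
    "172.16.0.0/12", "192.0.2.0/24", "192.168.0.0/16",
    "198.51.100.0/24", "203.0.113.0/24",
])
STARTS = [int(n.network_address) for n in BOGONS]

def is_bogon(prefix):
    """True if the prefix falls inside one of the listed blocks."""
    net = ipaddress.ip_network(prefix)
    # Binary search for the last block that starts at or before this prefix.
    i = bisect.bisect_right(STARTS, int(net.network_address)) - 1
    return i >= 0 and net.subnet_of(BOGONS[i])

print(is_bogon("10.1.0.0/16"))
print(is_bogon("172.32.0.0/16"))
```

Each lookup needs a number of comparisons proportional to the logarithm of the list size, so even a list with more than a hundred thousand entries takes fewer than twenty steps per prefix.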
Because the RIRs allocate address space every day, it's extremely
important to update bogon filters very regularly.
If despite these limitations, you want to apply bogon filters, have a look
at the Team Cymru bogon reference.
Tools and resources
There are many online tools and resources that make running BGP for
internet routing easier. First and foremost: BGP looking glasses. These
let anyone look inside the BGP tables of remote networks in order to
see what path is used to reach a certain prefix from the vantage point
of that network. They also often support ping and traceroute. Some
useful looking glasses:
A bit more typing, but a very good way to see what's happening with
the global BGP table is Route Views. This project runs a big router with
dozens of BGP feeds from other networks. You can connect to the
Route Views router with telnet:
% telnet route-views.routeviews.org
[...]
Username: rviews
route-views>
You can then execute show ip bgp (and show bgp ipv6) commands
on that router. This includes commands like show ip bgp regexp
_112$ to see all the prefixes originated by AS 112, for example. These
commands tend to run slow but can be very illuminating.
There are also services that will monitor your prefixes and warn you if
their status changes, like when a new AS starts originating (parts of)
them. These are typically paid-for services. An example is BGPmon.
There is also a BGPalerter tool that you can run yourself and that will
use RIPE RIS data from 600 ASes.
PeeringDB
PeeringDB is what the name suggests: a database with information
relevant for peering. The main interface to PeeringDB is the website, but
there's also an API and a whois interface. You'll find it hard to set up
much peering with other networks without registering information
about your AS in PeeringDB. This looks like:
% whois -h whois.peeringdb.com as112
Network Information
===================
Name : DNS-OARC-112
Primary ASN : 112
Also Known As :
Website : https://www.as112.net/
IRR AS-SET : AS112
Network Type : Non-Profit
Approx IPv6 Prefixes : 2
Approx IPv4 Prefixes : 2
Looking Glass :
Route Server :
Created at : 2016-07-01T12:40:44Z
Updated at : 2022-07-27T05:33:16Z
URL :
General Policy : Open
Location Requirement : Not Required
Ratio Requirement : False
Contract Requirement : Not Required
Public Peering Points (81)
==========================
This way, it's easy to see if a certain network peers in any common
locations, or which potential peers are available at an internet exchange
or a private peering facility. And when setting up the BGP sessions for
peerings over an IX, a peer's AS number and neighbor addresses are
easily found on PeeringDB.
RIPE meetings in Europe go all the way back to 1989 and the first
meeting of the North American Network Operators Group (NANOG)
was in 1994. NANOG is a bit more focussed on the technical aspects of
running an internet-connected network, while at RIPE meetings there
is also focus on the activities of the RIPE NCC. These days, there are
operator groups and/or meetings in most regions of the world:
• AfNOG: Africa
• CaribNOG: Caribbean
• SANOG: South Asia
Many countries [W] also have Network Operator Groups. The Global
Peering Forum (GPF) is less about network operations and more about
peering.
Other resources
The NANOG mailing list is an important resource for network
operators: it's often the first place to hear about significant network
incidents.
The IPv4 and IPv6 versions of the CIDR Report publish new statistics
about the global BGP table daily. Depressingly, the exact same
reachability information from the current 936,780 prefixes in the IPv4
BGP table could be expressed in 520,873 prefixes if networks didn't
needlessly de-aggregate their prefixes. And for IPv6 it's no better with
165,631 current prefixes vs 88,506 needed with aggregation.
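What aggregation means, and how much it would save according to these numbers, can be shown with Python's ipaddress module. This is only a sketch: collapse_addresses merges adjacent prefixes purely on address arithmetic, while a real aggregation report also has to take things like the origin AS and AS path into account. The savings function is made up for illustration.

```python
import ipaddress

# Two adjacent /25s carry the same reachability information as one /24.
deaggregated = [
    ipaddress.ip_network("192.0.2.0/25"),
    ipaddress.ip_network("192.0.2.128/25"),
]
print(list(ipaddress.collapse_addresses(deaggregated)))

def savings(current, aggregated):
    """Fraction of table entries that aggregation would remove."""
    return (current - aggregated) / current

print(f"IPv4: {savings(936_780, 520_873):.1%}")
print(f"IPv6: {savings(165_631, 88_506):.1%}")
```

In both cases, well over 40% of the entries in the table are redundant.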
A good place to see what's going on with global (and per RIR region)
RPKI deployment is the NIST RPKI Monitor. Globally, just over 40% of
IPv4 prefixes have RPKI state valid, with about a percent invalid. For
IPv6 it's 38% valid, but almost 5% invalid.
Appendix: the router command line
This appendix is the really short version on how to use the CLI. I'll
assume you can log in to the router's CLI with Telnet, SSH or a console
cable. You then get a prompt like:
Router>
Before you can enter configuration commands, you need to enter
privileged or “enable” mode:
Router> enable
Password:
Router#
Note that the prompt now has a # instead of a >, indicating that the
router will now accept all commands, including configuration
commands.
With FRRouting or Quagga, you can use the vtysh tool and you're
immediately in “enable mode”.
At this point, you can enter the examples from this book that start and
end with an exclamation mark. After entering configuration
commands, type exit (possibly multiple times) or control-Z.
figuration modes. For instance, we'll find ourselves in BGP
configuration mode pretty regularly:
Router# conf t
Router(config)# router bgp 65082
Router(config-router)#
If you need to get back to BGP configuration mode, you'll have to first
be in configuration mode (which we traditionally do with conf t
rather than configure terminal) and then use router bgp <AS>.
At any point you can type a question mark to see what options are
available. You only have to type enough characters so the router
knows which command you have in mind, for instance, sh ip bgp
sum rather than show ip bgp summary.
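The idea behind abbreviated commands is unique prefix matching, word by word. The sketch below mimics that against a short, hypothetical command list; a real router matches against its full command tree at each word, so this is a simplification, and the names COMMANDS and expand are made up for illustration.

```python
# A small, hypothetical set of commands to match against.
COMMANDS = [
    "show ip bgp summary",
    "show ip bgp neighbors",
    "show bgp ipv4 unicast",
    "configure terminal",
]

def expand(abbrev):
    """Return the single command that the abbreviated words identify."""
    words = abbrev.split()
    matches = [
        cmd for cmd in COMMANDS
        if len(cmd.split()) == len(words)
        and all(full.startswith(w) for w, full in zip(words, cmd.split()))
    ]
    if len(matches) != 1:
        raise ValueError(f"ambiguous or unknown command: {abbrev!r}")
    return matches[0]

print(expand("sh ip bgp sum"))
print(expand("conf t"))
```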
In some situations it's enough to simply repeat the command with the
right parameter:
Router(config-router)# neighbor 192.0.2.21 description ISP 20
Router(config-router)# neighbor 192.0.2.21 description ISP 30
ip bgp-community new-format
!
no ip http server
ip classless
!
Note especially the network ... mask ... notation with a netmask
rather than a prefix length. To a Quagga router, the extra settings above
are meaningless, because it's either already the default or not even
supported.
FRRouting moves away from Quagga in a number of ways. See the full
FRRouting documentation here. If you're going to spend any time with
the FRR command line through the vtysh utility, you definitely want
to add this to your vtysh.conf files:
!
terminal paginate
!
This way, long output will pause after each page. An interesting
feature of FRR is [RFC 8212] support. This RFC mandates that eBGP
sessions must have some incoming and outgoing filter to allow incoming
updates to be added to the BGP table and outgoing updates to be
generated, respectively. In production environments this will generally
not be a problem. But it may lead to surprises when migrating from
something older to FRR. Turn this off with no bgp ebgp-requires-policy
under the router bgp heading.
FRR does break compatibility with other routers and routing software
by changing the following:
to:
This makes it impossible to create a non-trivial BGP configuration that
works on both FRR and other routers. Also note that unlike Quagga,
FRR will load a configuration file that has configuration commands
that it doesn't recognize and simply ignore the unrecognized
commands without an error message. So a Cisco/Quagga config file with
ip as-path access-list ... in it will seemingly work, except that
the AS path access list will be missing.
However, Cisco routers would often decide it was time to cluster the
IPv4 BGP settings in an address-family ipv4 unicast section. And
then you'd be stuck with that. Quagga, on the other hand, seems to
stick with the mixing, either much longer or always. FRR, in turn,
always uses an address-family ipv4 unicast section.
However, all routers accept entering commands in the mixed style. But
with FRR you don't get command completion/listing with tab and ?
respectively.
Appendix: BGP minilab
You can run most of the examples in a “BGP minilab” so you can see
how they work and perform your own experiments. The minilab uses
virtual FRRouting routers that run in Docker containers. The practice
network is set up as follows:
[Figure: minilab network diagram. Recoverable labels: IX (AS 65090) with
route server Rserv, ISP 20 (AS 65020), ISP 30 (AS 65030), and routers R2,
R3 and R4.]
• ISPs 10, 20, 30 and 40, where ISPs 10 and 40 sit at the top of the
hierarchy and peer with each other. ISP 30 is a transit customer of
ISP 20, and ISP 20 is a transit customer of ISP 10.
• An internet exchange with a route server.
• Three peers: networks 83, 84 and 85. These are also all customers
of ISPs 30 and 40 and connect to the internet exchange.
• Install Docker
• Start Docker
On Windows:
.\start.ps1 example 1
This will start up the required virtual routers and connect you to the
main router “Router” a.k.a. Router82. When you log out, all the virtual
routers are shut down. If you want to run several examples, add the
keeprunning argument when starting an example, like:
.\start.ps1 keeprunning 1
This way, when you disconnect from the virtual router, some of the
“support” virtual routers are kept running so they don't have to be
restarted when starting another example. You can use the additional
keyword detach to run router82 in the background, making it easier to
connect to different routers. Some examples do this automatically.
Appendix: a non-converging BGP configuration
• B the path through C as long as that path is less than three hops
• C the path through D as long as that path is less than three hops
• D the path through B as long as that path is less than three hops
B now sees the path C A so it switches over to that path. This means
that D now sees the path B C A, which is three hops so it switches to
the direct path to A.
We’re now back to our initial situation so the cycle starts from scratch.
This cycling will go on forever.
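The endless cycle can be reproduced with a toy simulation of the three preferences. This sketch makes a simplifying assumption: all routers re-decide at the same moment, based on what their preferred neighbor chose in the previous round, whereas real BGP updates arrive one after the other. The names PREFERS_VIA and step are made up for illustration, and paths are written as tuples of the routers a packet passes on its way to A.

```python
# B prefers paths through C, C through D, and D through B.
PREFERS_VIA = {"B": "C", "C": "D", "D": "B"}

def step(state):
    """One round: each router re-decides using its neighbor's previous choice."""
    new_state = {}
    for router, neighbor in PREFERS_VIA.items():
        via_neighbor = (neighbor,) + state[neighbor]
        if router not in via_neighbor and len(via_neighbor) < 3:
            new_state[router] = via_neighbor  # preferred: fewer than three hops
        else:
            new_state[router] = ("A",)        # fall back to the direct path
    return new_state

state = {"B": ("A",), "C": ("A",), "D": ("A",)}
for round_number in range(4):
    print(round_number, state)
    state = step(state)
```

Every other round, all three routers are back on their direct path to A, so the system never settles on a stable set of paths.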
Example A, router A:
!
router bgp 65065
network 192.0.2.0/24
neighbor 203.0.113.66 remote-as 65066
neighbor 203.0.113.66 advertisement-interval 1
neighbor 203.0.113.67 remote-as 65067
neighbor 203.0.113.67 advertisement-interval 1
neighbor 203.0.113.68 remote-as 65068
neighbor 203.0.113.68 advertisement-interval 1
!
Example A, router B:
!
router bgp 65066
neighbor 203.0.113.65 remote-as 65065
neighbor 203.0.113.65 advertisement-interval 1
neighbor 203.0.113.67 remote-as 65067
neighbor 203.0.113.67 route-map preferred in
neighbor 203.0.113.67 advertisement-interval 1
neighbor 203.0.113.68 remote-as 65068
neighbor 203.0.113.68 advertisement-interval 1
!
ip as-path access-list 2 permit ^[0-9]+$
ip as-path access-list 2 permit ^[0-9]+_[0-9]+$
!
route-map preferred permit 10
match as-path 2
set local-preference 200
!
route-map preferred permit 20
!
Example A, router C:
!
router bgp 65067
neighbor 203.0.113.65 remote-as 65065
neighbor 203.0.113.65 advertisement-interval 1
neighbor 203.0.113.66 remote-as 65066
neighbor 203.0.113.66 advertisement-interval 1
neighbor 203.0.113.68 remote-as 65068
neighbor 203.0.113.68 advertisement-interval 1
neighbor 203.0.113.68 route-map preferred in
!
ip as-path access-list 2 permit ^[0-9]+$
ip as-path access-list 2 permit ^[0-9]+_[0-9]+$
!
route-map preferred permit 10
match as-path 2
set local-preference 200
!
route-map preferred permit 20
!
Example A, router D:
!
router bgp 65068
neighbor 203.0.113.65 remote-as 65065
neighbor 203.0.113.65 advertisement-interval 1
neighbor 203.0.113.66 remote-as 65066
neighbor 203.0.113.66 advertisement-interval 1
neighbor 203.0.113.66 route-map preferred in
neighbor 203.0.113.67 remote-as 65067
neighbor 203.0.113.67 advertisement-interval 1
!
ip as-path access-list 12 permit ^[0-9]+$
ip as-path access-list 12 permit ^[0-9]+_[0-9]+$
!
route-map preferred permit 10
match as-path 12
set local-preference 200
!
route-map preferred permit 20
!
Appendix: IP address notes
Prefix length  Subnet mask      From     To       Increment
/29            255.255.255.248  0.0.0.0  0.0.0.7  8
/30            255.255.255.252  0.0.0.0  0.0.0.3  4
/31            255.255.255.254  0.0.0.0  0.0.0.1  2
/32            255.255.255.255  0.0.0.0  0.0.0.0  1
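Rows like these can be recomputed with Python's standard ipaddress module. This is a minimal sketch, and table_row is a hypothetical helper name. It takes the first block of each size, starting at 0.0.0.0, and reports the block's address count as the increment.

```python
import ipaddress

def table_row(prefix_len: int):
    """Return (prefix length, subnet mask, first address, last address, increment)."""
    net = ipaddress.ip_network(f"0.0.0.0/{prefix_len}")
    return (f"/{prefix_len}", str(net.netmask), str(net[0]), str(net[-1]),
            net.num_addresses)

for prefix_len in range(29, 33):
    print(*table_row(prefix_len))
```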
Special addresses
IANA, the Internet Assigned Numbers Authority, has registries for
special IPv4 and IPv6 addresses. For all the details, consult those. But
I'll repeat the most notable special address ranges here. These are all
the IPv4 address blocks that are used by more than a single holder.
Because they're not globally unique, they have no business appearing in
the global BGP table, and packets with source addresses in those
ranges are invalid on the internet. (In some cases there is a valid local
use.)
Prefix Purpose
0.0.0.0/8 "This network" (the default route itself is 0.0.0.0/0)
10.0.0.0/8 Private use
100.64.0.0/10 Service provider NAT
127.0.0.0/8 Loopback interface
169.254.0.0/16 Link local (self-assigned in absence of a DHCP server)
172.16.0.0/12 Private use
192.0.2.0/24 Documentation
192.168.0.0/16 Private use
198.51.100.0/24 Documentation
203.0.113.0/24 Documentation
224.0.0.0/4 Class D: multicast
255.255.255.255/32 Local broadcast address
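A quick way to check whether a prefix falls inside one of these blocks is Python's standard ipaddress module. The sketch below is an illustration only: SPECIAL_V4 simply copies the prefixes from the table above, and is_special is a hypothetical helper name, not a complete bogon filter.

```python
import ipaddress

# The special IPv4 blocks from the table above.
SPECIAL_V4 = [ipaddress.ip_network(p) for p in (
    "0.0.0.0/8", "10.0.0.0/8", "100.64.0.0/10", "127.0.0.0/8",
    "169.254.0.0/16", "172.16.0.0/12", "192.0.2.0/24", "192.168.0.0/16",
    "198.51.100.0/24", "203.0.113.0/24", "224.0.0.0/4", "255.255.255.255/32",
)]

def is_special(prefix: str) -> bool:
    """True when the IPv4 prefix lies entirely inside one of the special blocks."""
    net = ipaddress.ip_network(prefix)
    return any(net.subnet_of(block) for block in SPECIAL_V4)

for prefix in ("192.0.2.0/24", "10.20.0.0/16", "8.8.8.0/24"):
    print(prefix, "special" if is_special(prefix) else "not in the table")
```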
dresses would have been helpful, it turned out that many
implementations wouldn't allow class E addresses to be configured. Those
were set aside for future use, after all. As a result, these addresses
remain unused.
These are the IPv6 address blocks that should not appear in the global
BGP table and shouldn't be used in source addresses for packets that
flow across the internet:
Prefix Purpose
::/3 Various special purposes
2001:db8::/32 Documentation
fc00::/7 Unique-local
fe80::/10 Link-Local unicast
fec0::/10 Previously: site-local addresses
ff00::/8 Multicast
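To see which of these blocks a given IPv6 address belongs to, the table can be turned into a small lookup. This is a sketch using Python's standard ipaddress module. SPECIAL_V6 copies the table above, and purpose_of is a hypothetical helper name.

```python
import ipaddress

# The special IPv6 blocks from the table above, with their purposes.
SPECIAL_V6 = {
    "::/3": "Various special purposes",
    "2001:db8::/32": "Documentation",
    "fc00::/7": "Unique-local",
    "fe80::/10": "Link-local unicast",
    "fec0::/10": "Previously: site-local addresses",
    "ff00::/8": "Multicast",
}

def purpose_of(address: str):
    """Return the purpose of the special block holding the address, or None."""
    addr = ipaddress.ip_address(address)
    for prefix, purpose in SPECIAL_V6.items():
        if addr in ipaddress.ip_network(prefix):
            return purpose
    return None

for address in ("::1", "2001:db8::1", "fe80::1", "2001:4860:4860::8888"):
    print(address, "->", purpose_of(address))
```

The blocks in the table don't overlap, so returning the first match is enough here.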
About the author
Iljitsch van Beijnum got his start in the Dutch Internet Service Provider
business in 1995. He soon realized that in order to maintain more than
one connection to the internet, you need something called “BGP”. In
1997, he co-founded Pine Internet (later Pine Digital Security). In 1999,
he worked for UUNET Netherlands on designing and implementing a
new Dutch high speed backbone.
In 2000, Iljitsch started his own business now called inet⁶ consult.
Between 2000 and 2007, he mostly did work for web hosting companies,
among other things helping them connect to internet exchanges.
In 2007, Iljitsch started writing for Ars Technica. Later that year he
moved to Spain to become a research assistant at UC3M, where he did
more IETF work, most notably on NAT64 and DNS64 as well as a
suggested improvement to BGP. Iljitsch holds a bachelor's degree in
Information and Communication Technology from the Haagse
Hogeschool in The Hague and a master's degree in telematics from
UC3M Madrid.
Copyright and acknowledgments
About 140 words from Cisco's description of the BGP path selection
algorithm were used in the description of the algorithm in this book.
Router and switch icons from Cisco were used in the figures in this
book.