Global MPLS Design Using Carrier Supporting Carrier
Version 1.1
Authored by:
Nicholas Russo
CCDE #20160041
CCIE #42518 (EI/SP)
Change History
Version and Date Change Responsible Person
Copyright 2020 Nicholas Russo – http://njrusmc.net
Contents
1. Overview
1.1. Problem Statement
1.2. Solution Summary
2. Architecture
2.1. Point of Presence (POP) Design
2.1.1. Physical Connectivity
2.1.2. IGP Routing
2.1.3. Multicast Routing
2.1.4. BGP VPN Services Routing
2.1.5. MPLS Label Advertisement
2.1.5.1. Label Distribution Protocol (LDP)
2.1.5.2. Resource Reservation Protocol for MPLS Traffic Engineering (MPLS-TE)
2.1.5.3. Segment Routing (SR)
2.1.6. Customer Services
2.1.6.1. Layer-3 VPN
2.1.6.2. Layer-2 VPN
2.1.6.3. Multicast VPN
2.2. Carrier Supporting Carrier (CSC) Design
2.2.1. BGP Labeled-unicast (BGP-LU) Connectivity
2.2.2. Interaction Between IGP and BGP-LU
2.2.3. Inter-AS BGP VPN Services Routing
2.2.4. Non-CSC Transport Supplementation
2.3. Extranet Integration
2.4. Quality of Service (QoS) Design
2.4.1. Queuing and Shaping
2.4.2. Classification, Marking, and Policing
2.5. Network Management and Automation
2.5.1. Global Management View (GMV) Design
2.5.2. VPN Management View (VMV) Design
Figures
Figure 1 - High-level CSC/Option C Architecture
Figure 2 - Traditional POP Physical Design
Figure 3 - Leaf/Spine POP Physical Design
Figure 4 - Using CSC-CEs as BGP VPN Route Reflectors
Figure 5 - Using Dedicated Out-of-band BGP Route Reflectors
Figure 6 - Using RRs for Transit in a POP with Link Failures
Figure 7 - Preventing Transit RRs in a POP using Areas
Figure 8 - Intra-POP iBGP VPN Sessions and Link Failure Tolerance
Tables
Table 1 - Plausible MVPN Profile Options
Table 2 - Core Queuing Allocations
Table 3 - Ingress PE Classification and Marking
Table 4 - Global and VPN Management Outage Matrix
1. Overview
CSC is seldom used in real life because other options, such as Ethernet LAN (E-LAN) services,
make it easy to connect remote POPs at layer-2. Smaller carriers can run their regular interior
gateway protocols (IGP) and MPLS label distribution protocols without any routing interaction
with the core carrier. However, such technologies require Ethernet last-mile connectivity
(notwithstanding sloppy layer-2 interworking designs) which could not be guaranteed in every
country in which we had a POP. CSC provides last-mile circuit flexibility/independence while
also improving scale as the customer and core carriers exchange routes using Border Gateway
Protocol (BGP). In this context, BGP is extended to include an MPLS label for every prefix and
is known as BGP labeled unicast (BGP-LU).
What makes this design truly unique is not only the rare deployment of a production, global scale
CSC network, but the inclusion of Inter-AS MPLS Option C. This relatively complex integration
allows two different BGP autonomous systems (AS) to exchange BGP VPN routing information
in a highly scalable way. Rather than exchanging such information through the AS boundary
routers (ASBRs) as Options A and B do, Option C peers the BGP VPN route-reflectors (RR)
instead. This allows the ASBRs to be unaware of any VPN routing, serving only as CSC
customer edge (CSC-CE) devices connecting to the core carrier’s CSC provider edge (CSC-PE)
devices. The justification for this design, instead of the more traditional internal BGP (iBGP)
VPN peerings, comes later in this document.
The term “BGP VPN” is a generic statement that represents any BGP address-family used to
carry customer VPN information, whether it is IPv4/v6 routes, MAC addresses, Virtual Private
LAN Service (VPLS) discovery/signaling messages, multicast VPN (MVPN) discovery/signaling
messages, and more. This highly generic combined design leveraging CSC and Option C allows
any service to be extended between any pair of POPs in the world, regardless of their manner of
connectivity. The diagram below illustrates a high-level L3VPN design.
Figure 1 - High-level CSC/Option C Architecture (diagram labels: CSC core; eBGP IPv4/v6, PE to CE; BGP ASN 65003)
2. Architecture
This section describes the solution in greater technical depth. It examines each individual
component in turn, adding new components as it progresses. This document is not a training
tutorial on the technologies, but does explain how they work within the context of the design.
Figure 2 - Traditional POP Physical Design (diagram labels: CSC core, CSC-PEs, CSC-CEs, PEs)
The second design was based on a leaf/spine design, effectively adding another pair of routers
between the customer facing PEs and the CSC-CEs. Both the PEs and CSC-CEs are “leaves” in
this design, with the CSC-CEs being classified as “border leaves” given their integration with an
external network. The middle tier consisted of the “spines” whereby every leaf is connected to
every spine. Leaves never connect to leaves and spines never connect to spines within the same
tier, with one exception. The border leaves can connect together because shuttling ingress/egress
traffic between edge devices is useful to improve availability or implement ingress/egress traffic
engineering in the future. The main technical advantage of leaf/spine over the traditional design
is the ability to improve scale for east/west traffic. Simply add more spines to increase
availability, capacity, or both.
This can also be viewed as a disadvantage, since the only purpose of a spine is to forward traffic.
This incurs additional cost and management burden. In real life, we never deployed leaf/spine
POPs as there was no compelling operational justification, despite their popularity at the time.
This document will discuss the details surrounding its deployment nonetheless. The diagram
below illustrates the conceptual leaf/spine POP physical design.
Figure 3 - Leaf/Spine POP Physical Design (diagram labels: CSC core, CSC-PEs, leaves (PE), border leaves (CSC-CE), spines (P))
We overlaid two different BGP RR strategies atop these POP designs. The first was a low-cost
approach that repurposed the CSC-CEs, whether they were aggregation routers or border leaves,
to serve as BGP RRs for the POP. Because these devices were already quite powerful in terms of
computing capacity, using them to serve as BGP RRs was a low-risk, cost-effective choice. Each
PE in the POP would peer to these RRs using internal BGP (iBGP) which is detailed later in this
document. This is the design we selected in real life, as cost concerns governed many of our
decisions. The diagram below illustrates the intra-POP iBGP VPN sessions overlaid on both the
traditional and leaf/spine physical designs. Note that the precise details of the iBGP topology
are discussed later in the document.
Figure 4 - Using CSC-CEs as BGP VPN Route Reflectors (diagram labels: CSC core, CSC-PEs, iBGP VPN route reflectors, iBGP VPN RR clients (PE), BGP-free P routers)
The second design involved a pair of dedicated RRs outside of the forwarding path of customer
traffic. These routers would look like PEs from a physical connectivity perspective, but would
not service any customers and would never be used for traffic forwarding. This non-transit
behavior can be implemented by manipulating IGP (discussed later). In modern designs, these
BGP RRs are often low-cost virtual routers with large memory allocations, medium CPU
allocations, and low network bandwidth allocations. Additionally, we considered using a
different pair of BGP RRs for all the different VPN services we offered, such as IPv4 VPN, IPv6
VPN, multicast VPN, etc. This incurs even greater cost and management burden, but reduces fate
sharing and slightly improves availability.
Some of the largest carriers manage risk by spreading, to the maximum extent economically
possible, different BGP address-families across different RRs. In our environment, we did not
have a general-purpose computing environment immediately available. When including the
capital investment needed to build and maintain it, this solution was prohibitively expensive and
not at all worth doing. The diagram below illustrates conceptual examples of adding dedicated
RRs to the traditional and leaf/spine POP designs at a high level.
Figure 5 - Using Dedicated Out-of-band BGP Route Reflectors (diagram labels: CSC core, CSC-PEs, CSC-CEs, PEs, BGP-free routers, RRs as iBGP VPN route reflectors)
topological graph complexity. P2P links do not have a designated router (DR) and thus no DR
election. A P2P link is not a multi-access network and therefore does not benefit from a DR,
which is represented as a Link State Advertisement type 2 (LSA2) in the LSDB. As such, no
LSA2 should be present anywhere in the network, reducing the number of total graph vertices by
almost half. It is advisable to retain “stub networks” within the router LSA for OSPFv2 (LSA1)
or the intra-area prefix LSA within OSPFv3 (LSA9) to simplify troubleshooting. This allows
operators to ping transit links, to source pings from transit links, and to see at a glance which
links might be experiencing problems by checking the routing table. Given the small network
and the rarity with which these IP subnets change, there is little operational benefit to
suppressing these prefixes.
Next, consider OSPF security. Modern OSPF implementations allow for SHA-256 authentication
(some platforms offer even stronger hashes) which should be preferred instead of the older MD5
option. In addition to “classic” authentication, OSPFv3 also offers IPsec encryption, which in the
author’s experience, is overly complex, prone to breaking, and not worth deploying. OSPF TTL-
security ensures that neighbors are directly connected, preventing any long-range hijacking
attacks from external networks, such as those accessible over CSC. Protecting the OSPF LSDB
itself can be accomplished by setting maximum LSA limits to prevent accidental LSA injection
at scale, perhaps due to unfiltered BGP to OSPF redistribution. Such concerns were irrelevant in
our environment given that our global Internet connections were placed in a VPN, but some
customers may prefer to transport Internet traffic in the global routing table (more on this later).
In less symmetric networks, some operators deploy loop free alternate (LFA) technologies to
allow OSPF to inspect the LSDB in greater detail to determine if backup paths exist. When they
do, the router can preemptively install these backup paths in hardware for faster failover. In our
case, POPs are perfectly symmetric with the same IGP cost used on all links (10 in our case),
automatically resulting in equal-cost multi-path (ECMP). This feature allows for load sharing
between devices based on various hashing algorithms which are out of scope for this document.
More important than the load sharing is the high availability; because both routes are used for
forwarding, they are both programmed in hardware already. This obviates the need for complex
LFA techniques, and given our early-in-career network operators, ECMP was the best choice.
The first step in the convergence process is failure detection. Because all devices in the POP
were directly connected (i.e. no intermediate Ethernet switches), the Ethernet interface line status
was an accurate indication of a link’s up/down status. This raises the question of “carrier delay”;
how long after a failure is detected should the control-plane mark the interface as “down”? In our
3 years of operation, we observed only 2 “false-negative microflaps” whereby an interface loses
electrical or optical signal for a brief period (a few milliseconds at most) but immediately recovers.
Marking this as a link flap and starting the convergence process is more detrimental than just
waiting, so we used a relatively aggressive carrier-delay of 5 milliseconds. This delay helps the
control-plane ignore rare microflaps rather than starting the convergence process prematurely.
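The carrier-delay behavior described above can be sketched in a few lines of Python. This is a simplified model rather than any vendor's implementation: the function name `interface_goes_down` and the sample outage durations are hypothetical, and only the 5 ms value comes from the text.

```python
CARRIER_DELAY_MS = 5  # carrier-delay value used in this design

def interface_goes_down(signal_loss_ms, carrier_delay_ms=CARRIER_DELAY_MS):
    """Return True if the control plane is told the interface is down.

    Simplified model: a signal loss shorter than the carrier-delay is
    ignored, while anything that lasts at least that long is reported.
    """
    return signal_loss_ms >= carrier_delay_ms

# Hypothetical outage durations in milliseconds: two microflaps, one real failure.
outages_ms = [2, 3, 40]
reported = [interface_goes_down(ms) for ms in outages_ms]
print(reported)
```

Only the 40 ms outage is reported to the control plane, so the two microflaps never start the convergence process.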
Note that Bidirectional Forwarding Detection (BFD) is generally unnecessary in the POP
because of the direct Ethernet connections. If, for example, Ethernet switches (or other transit
devices such as media converters) were present, using BFD with echoes enabled would be a
good design decision. Various protocols, including all IGPs, can register to BFD, which notifies
them when links go down. While BFD is always slower than using link status for failure
detection, they can be used together; if line status stays up after a link fails, BFD will detect it
soon enough. This is a “belt and suspenders” approach that some carriers use to maximally
reduce risk, but we saw it as unnecessary, low-value complexity.
Both OSPF and IS-IS have many tunable convergence timers, but in the author’s experience, two
of them have an outsized impact and should be optimized first when optimization is deemed
necessary. Note that oftentimes such optimization is unnecessary, but our customers had strict
performance requirements that heavily influenced our routed convergence design. These timers
are the OSPF LSA generation and SPF throttle timers.
First, we adjusted the OSPF LSA generation throttle timers. These control how long a router waits
before re-originating the same LSA after observing a change in the network. We selected an initial
delay of 50 ms to better group multiple concurrent link failures. Degrees of interface disjointedness varied
widely based on the hardware, which was not consistent network-wide due to budget limitations.
For example, some PEs had their distribution/core uplinks spread across two linecards, while
other devices did not. Rather than try to “point optimize” individual devices or POPs, we used a
relatively conservative initial LSA delay timer discussed above. Generation of successive LSAs
began at 250 ms after the initial LSA, doubling each time up to a maximum of 1000 ms. This
exponential back-off solution prevents excessive IGP flooding when successive changes keep
occurring.
Next, we adjusted the OSPF SPF throttle timers. SPF is the algorithm run each time a change to
the OSPF topology is detected. In this design, full SPF runs whenever a change to an LSA1 is
detected, while a partial SPF is run for changes in an LSA3 (inter-area routes) or LSA5 (external
routes). These timers are tuned for more rapid SPF calculations to speed convergence within the
POP. Given the relatively small OSPF topology with low prefix count (and modern routers), SPF
runtimes are not a concern. Our testing indicated that all LSA flooding can complete in less than
50 ms within any POP. Therefore, the SPF initial wait timer was set to 50 ms, capturing all of the
LSA changes and running SPF only once. In the unlikely and unobserved event that SPF doesn’t
capture all the LSAs, it will run again after 300 ms, doubling up to a 1000 ms maximum.
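To make the exponential back-off concrete, the following Python sketch computes the wait applied before each successive LSA origination or SPF run, using the start, hold, and maximum values quoted above. The function name `throttle_waits` is hypothetical, and the model assumes changes keep arriving so the back-off never resets; real implementations reset the hold interval after a quiet period.

```python
def throttle_waits(start_ms, hold_ms, max_ms, events):
    """Return the wait (in ms) applied before each of `events` successive runs.

    Assumed model: the first run waits `start_ms`; each later run waits the
    current hold interval, which doubles every time up to `max_ms`.
    """
    waits = []
    hold = hold_ms
    for i in range(events):
        if i == 0:
            waits.append(start_ms)
        else:
            waits.append(min(hold, max_ms))
            hold = min(hold * 2, max_ms)
    return waits

# LSA generation throttle: 50 ms initial, 250 ms hold, 1000 ms maximum.
lsa_waits = throttle_waits(50, 250, 1000, 5)
# SPF throttle: 50 ms initial, 300 ms hold, 1000 ms maximum.
spf_waits = throttle_waits(50, 300, 1000, 5)
print(lsa_waits, spf_waits)
```

Under sustained churn, both timers settle at the 1000 ms ceiling within a few iterations, which is what bounds flooding and SPF load.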
Some readers might be curious about incremental SPF (iSPF). While academically clever, the
author’s operational experience with iSPF is largely negative. It defines various “shortcuts” that
OSPF can take in specific topologies to skip steps in the SPF process. For example, if a singly-
connected router that is the gateway to other routers fails, iSPF can summarily discard everything
behind it. This may have a positive impact in large networks, but simply stated, the technology is
buggy, hard to troubleshoot, and uncommonly deployed. Modern Cisco devices don’t even
support it anymore. We opted not to deploy iSPF.
Lastly, there is one case where using OSPF areas within the POP makes sense. When dedicated
RRs are used, they should never be used for forwarding. It would be better to blackhole traffic
entirely than to risk crashing the RRs with transit traffic, as they might be servicing other satellite
POPs across the network (discussed later). If the RRs are placed in area 0, the diagram below
illustrates what might happen if enough link failures occur within a POP.
Figure 6 - Using RRs for Transit in a POP with Link Failures (diagram labels: CSC core, CSC-PEs, CSC-CEs, PEs, RRs, CEs, area 0, customer traffic)
Although this scenario is exceedingly unlikely, the impact is severe and worth protecting against. Take
advantage of one of OSPF’s many loop prevention mechanisms by putting these RRs into
a different area, perhaps area 1. No special area types or LSA filtering/summarization is
necessary. The area assignment alone will prevent two PEs in area 0 from communicating across
RRs in area 1. Now, when the link failures occur, intra-POP PE traffic simply fails as the MPLS
label switched path (LSP) between the PEs is broken due to BGP next-hop inaccessibility. Put
another way, we can leverage the dreaded “disjoint area 0” as a good thing, preferring to have
broken connectivity rather than causing damage to our BGP VPN infrastructure which is likely
servicing satellite POPs elsewhere in the world (discussed later). The diagram below summarizes
the high-level OSPF design in both traditional and leaf/spine POPs.
Figure 7 - Preventing Transit RRs in a POP using Areas (diagram labels: CSC core, CSC-PEs, CSC-CEs, PEs, RRs, CEs, area 0, area 1, no area 0 transit)
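The effect of the area assignment can be illustrated with a small reachability sketch in Python. It is a deliberately simplified model, not an OSPF implementation: it encodes only the rule that a path between two area 0 routers cannot transit a non-backbone area. The topology, the router names, and the `reachable` and `pop_links` helpers are all hypothetical.

```python
from collections import deque

def reachable(src, dst, links, transit_areas):
    """Breadth-first search using only links whose area is in `transit_areas`."""
    adj = {}
    for a, b, area in links:
        if area in transit_areas:
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def pop_links(rr_area):
    """Links that survive the failures: each PE has one uplink left and the
    only remaining path between the CSC-CEs runs through RR1."""
    return [
        ("PE1", "CSCCE1", 0),
        ("PE2", "CSCCE2", 0),
        ("RR1", "CSCCE1", rr_area),
        ("RR1", "CSCCE2", rr_area),
    ]

# RR links in area 0: the RR becomes a transit node between the PEs.
rr_in_area0 = reachable("PE1", "PE2", pop_links(rr_area=0), transit_areas={0})
# RR links in area 1: no area 0 path exists, so PE-to-PE traffic fails instead.
rr_in_area1 = reachable("PE1", "PE2", pop_links(rr_area=1), transit_areas={0})
print(rr_in_area0, rr_in_area1)
```

With the RR links in area 1 the PEs lose connectivity to one another, which is the intended trade-off: broken intra-POP forwarding rather than transit traffic through the RRs.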
Although we did not deploy IS-IS, it is worth a brief discussion. The same recommendations for
graph optimization, security, and performance tuning apply to IS-IS as to OSPF, with the
exception of TTL-security. Because IS-IS is not based on IP, it is inherently insulated from IP-
based hijacking attacks. Such attacks can never target IS-IS, so TTL-security is unnecessary. As
it relates to dedicated RRs and keeping them out of the transit path, IS-IS has a specific feature
named the overload bit (OL). When set on a router, the OL-bit signals to all other routers that the
device is “overloaded” and should never be used for forwarding, even if no other paths exist.
While some implementations have an OSPF “max-metric” feature, this is just a cost adjuster and
does not prevent transit traffic, but merely discourages it. In contrast, the IS-IS OL-bit is both
authoritative and effective on these dedicated route reflectors. IS-IS routers with the OL-bit set
can never be used for transit, even as a last resort.
purpose of a PIM any-source multicast (ASM) rendezvous point (RP) is to discover multicast
sources, no RPs are needed in this network. Not needing RPs significantly simplifies
multicast design, implementation, and maintenance network-wide.
As discussed earlier, BFD was not enabled in our POPs as it was unnecessary. However, if BFD
is used, PIM should be registered to BFD for fast failover on par with IGP.
diagram below illustrates this design, as well as a common misconception regarding multiple
link failures within a POP and its impact on iBGP sessions.
Figure 8 - Intra-POP iBGP VPN Sessions and Link Failure Tolerance (diagram labels: CSC core, CSC-PEs, PEs, iBGP VPN sessions, failed link, inter-RR iBGP session not needed)
In addition to reducing BGP complexity, the decision to omit this inter-RR iBGP session allowed
us to create two disjoint BGP VPN meshes. One was named “mesh A” and the other was named
“mesh B”. The meshes only converged at the PEs, which were not configured as RRs, and thus
were not able to reflect iBGP routes between meshes. Such a design is conceptually similar to
storage area networks (SAN) where the SAN A/B transport networks are completely independent
for availability purposes. If one SAN becomes corrupted or otherwise fails, it would be contained
only to that SAN, and the same is likewise true for these disjoint iBGP meshes.
This A/B mesh design applies to all 5 BGP address-families, and also note that the RRs did not
have any customer VPNs configured. This improved their memory utilization as there was no
import/copy process from BGP into local routing tables on a per VPN basis. The relevance of the
A/B mesh design, as it relates to inter-POP connectivity, is explained later in the document.
the TCP header, which is a security technique we implemented. Of greater significance to our
customers are two other LDP features. These are designed to speed up convergence and prevent
forwarding black holes in MPLS networks.
First, LDP/IGP synchronization protects against two common cases: where IGP converges
before LDP can exchange label bindings, and where an LDP session is closed but traffic
continues to forward along the original path.
In both cases, it is a synchronization issue between IGP and LDP where the two protocols
converge at different times. In our case, LDP/IGP synchronization was enabled on all IGP-
enabled core interfaces, much like PIM. When OSPF has an adjacency on a link but LDP does
not, this feature raises the OSPF link cost to the maximum value of 65535.
This makes the link highly undesirable and, assuming other paths exist to the same destination,
forces OSPF to re-route around the LDP-incapable link. The diagram below illustrates the
OSPF/LDP re-routing concept just described.
Figure 9 - LDP/IGP Synchronization with LDP Session Failures (diagram labels: CSC core, CSC-PEs, CSC-CEs, PEs, IGP neighbor, LDP TCP session, LDP goes down, OSPF cost 65535)
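The cost adjustment performed by LDP/IGP synchronization can be sketched as follows. This Python model is only an illustration of the rule described above: the function name `advertised_cost` and the link names are hypothetical, while the cost of 10 and the maximum of 65535 come from the text.

```python
OSPF_MAX_LINK_COST = 65535  # maximum OSPF link cost, as stated above

def advertised_cost(configured_cost, ospf_adjacent, ldp_session_up):
    """Return the OSPF cost advertised for a link under LDP/IGP sync.

    If OSPF has an adjacency but LDP does not, the link is advertised at
    the maximum cost so that other paths are preferred.
    """
    if ospf_adjacent and not ldp_session_up:
        return OSPF_MAX_LINK_COST
    return configured_cost

# Hypothetical PE with two uplinks at the POP-wide cost of 10; LDP is down on one.
uplinks = {
    "PE1-CSCCE1": advertised_cost(10, ospf_adjacent=True, ldp_session_up=False),
    "PE1-CSCCE2": advertised_cost(10, ospf_adjacent=True, ldp_session_up=True),
}
preferred = min(uplinks, key=uplinks.get)
print(uplinks, preferred)
```

The link without LDP stays in the topology as a last resort, but normal traffic shifts to the uplink that can still label-switch.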
Next, we enabled LDP session protection. This feature is enabled for all peers with a 10-minute
hold-down timer. It sends targeted hellos to each neighbor’s LDP router ID so that if a link
fails, yet the neighbor’s loopback is reachable via IP, the session stays up.
Although IGP will determine the forwarding path, this cuts down on LDP convergence and
Label Information Base (LIB) refresh times. If the LDP peer is not reachable after 10 minutes,
the label bindings for that peer are flushed and the preserved LDP session is torn down.
Conceptually, LDP session protection is similar to carrier-delay.
Instead of being a small time period designed to tolerate short microflaps, it is a larger time
period designed to tolerate actual link failures. The diagram below illustrates how LDP sessions
remain up even after links fail. You’ll observe that the behavior is very similar to iBGP.
(Diagram: LDP TCP sessions between PEs remaining up after a link failure; labels: CSC core, CSC-PEs, PEs, LDP TCP session, failed link)
requires IGPs to be extended to support label distribution, making support more limited. Other
drawbacks include hardware support, as SR is relatively new and our equipment, at the time, did
not uniformly support it. We intended to migrate to SR from LDP at some unspecified point in
the future when SR supportability was universal.
There are several advantages to deploying SR over LDP. As it relates to convergence, there is
only one protocol, so problems surrounding IGP synchronization and session protection don’t
exist. Simply tune IGP to converge at the desired pace and the label switched paths will be
immediately available. Additionally, SR traffic engineering (SR-TE) is stateless in that transit
routers do not retain information for each TE tunnel as they do with RSVP-TE. In addition to
providing a scale advantage, this makes SR much more flexible. LDP has no mechanism for
point-to-point (P2P) TE-style LSPs and RSVP has no mechanism for multipoint-to-point (MP2P)
IGP-style LSPs. SR can support both without any special configuration.
Given the CSC design, each POP is an independent IGP domain. This implies that the label
distribution method used in each POP is also independent. It is possible to use LDP, RSVP-TE,
and SR at the same time but in different locations. Operationally, there is little advantage to
strategically designing the network in this manner. However, using a migratory example,
transitioning POPs from LDP to SR on a per-POP basis, organized regionally, can be a smart
approach.
1,000 VRFs per PE. For greater scale, use additional digits in the low-order 32 bits of the RD,
but six digits is adequate for most networks.
Second, consider the Route Target (RT) design. Like the RD, each RT is 64 bits, and it is
common for the first 32 bits to represent the BGP ASN. The remaining value should be
determined on a per-customer basis. For a customer that needs only any-to-any connectivity
between all sites in a VPN, a rather common design, the value 000 can be used. An example RT
within AS 65001 and a customer ID of 654 would be 65001:654000. A hub/spoke VPN would
require at least two RTs:
1. 65001:654001: upstream connectivity, exported by the hubs and imported by the spokes
2. 65001:654002: downstream connectivity, exported by the spokes and imported by the hubs
Like the RD allocation, this allows for up to 1,000 customers and 1,000 RTs per customer. As a
brief foreshadow, this document uses inter-AS connectivity extensively, so using the BGP ASN
as part of the RT for all customers may not be suitable. For some environments, this may lead to
unnecessary RT configurations and other administrative burdens (i.e., needing to import N-1 RTs
just to form a basic inter-AS any-to-any VPN). Consider using the customer ID for the first 32
bits, or using a generic BGP ASN, such as 65000, for all RTs in the greater network. The
diagram below shows how RDs and RTs work together to form MPLS L3VPNs within a POP.
Sometimes it’s a good idea to use the customer ID from the RT as the VRF ID in the RD (orange
777 and green 888) as shown here. Assume the POP is in BGP ASN 65001.
Figure 11 - Building MPLS L3VPNs within a POP (diagram labels: CSC core, CSC-PEs, CE and CE-S sites; orange VPN 777 with RD 65001:111777, RT export 65001:777000, RT import 65001:777000; green VPN 888 with RD 65001:444888, RT export 65001:888001, RT import 65001:888002)
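The numbering scheme can be expressed as a short Python sketch. The RD layout used here (three-digit PE ID followed by three-digit VRF ID) is an assumption inferred from the example values in this section, and the helper names `make_rd`, `make_rt`, and `hub_spoke_policy` are hypothetical.

```python
def _check_three_digits(*values):
    """Each field is limited to three decimal digits (0 to 999)."""
    for value in values:
        if not 0 <= value <= 999:
            raise ValueError(f"{value} does not fit in three digits")

def make_rd(asn, pe_id, vrf_id):
    """Build an RD as ASN:PPPVVV (assumed layout: PE ID, then VRF ID)."""
    _check_three_digits(pe_id, vrf_id)
    return f"{asn}:{pe_id * 1000 + vrf_id}"

def make_rt(asn, customer_id, topology_id=0):
    """Build an RT as ASN:CCCTTT; topology 000 means any-to-any."""
    _check_three_digits(customer_id, topology_id)
    return f"{asn}:{customer_id * 1000 + topology_id}"

def hub_spoke_policy(asn, customer_id, role):
    """Return export/import RTs for a hub or spoke site of one customer."""
    upstream = make_rt(asn, customer_id, 1)    # exported by hubs
    downstream = make_rt(asn, customer_id, 2)  # exported by spokes
    if role == "hub":
        return {"export": [upstream], "import": [downstream]}
    if role == "spoke":
        return {"export": [downstream], "import": [upstream]}
    raise ValueError(f"unknown role: {role}")

print(make_rd(65001, 111, 777))               # RD on PE 111 for VRF 777
print(make_rt(65001, 654))                    # any-to-any RT for customer 654
print(hub_spoke_policy(65001, 654, "spoke"))  # spoke-side policy
```

Generating these values from code rather than by hand is one way to keep the allocation consistent across PEs; the range check enforces the 1,000-entry limits discussed above.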
As it relates to availability, using unique RDs on every PE allows the BGP RRs to retain all of
the routes. This is useful for multi-homed sites because all of the egress PEs will receive the
same routes from the customer and advertise them to all BGP RRs in the cluster. Because the
routes are distinguished (different), BGP best-path on the RR does not compare them. All of the
paths will be best in their own RD-indexed tables, so the RR can reflect all of the routes towards
the ingress PE. The ingress PE can install all of them, either for active/active load sharing or
active/standby fast failover. While there are other, more complicated solutions to this problem
(shadow session, shadow RR, and BGP additional-paths capability), years of operational
experience suggest that using unique RDs is a reliable and effective choice for MPLS L3VPNs.
The ingress PE simply needs to import both routes and install them into the routing table using
ECMP. The diagram below illustrates the active/active design.
Figure 12 - Unique L3VPN RD for Active/Active Forwarding (diagram labels: CSC core, CSC-PEs, CSC-CEs, CEs; PE111 and PE222 advertise 192.0.2.0/24 under RD 65001:111888 and RD 65001:222888; both routes are best per RD; PE444 VRF CUST888 installs 192.0.2.0/24 via PE111 and via PE222)
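The reason unique RDs preserve both paths can be shown with a small Python sketch. It models only the indexing behavior described above: the RR keeps one best path per (RD, prefix) key. The function name `rr_reflected_next_hops` is hypothetical, the shared RD value 65001:888 is invented for contrast, and "first path wins" is a stand-in for the real BGP best-path algorithm.

```python
def rr_reflected_next_hops(paths):
    """Return the next hops an RR would reflect for the given VPN paths.

    `paths` is a list of (rd, prefix, next_hop) tuples. One path is kept
    per (rd, prefix) key; the first path seen stands in for best-path.
    """
    best = {}
    for rd, prefix, next_hop in paths:
        best.setdefault((rd, prefix), next_hop)
    return sorted(best.values())

# Unique RD per PE: two distinct VPN routes, so both are reflected.
unique_rd = rr_reflected_next_hops([
    ("65001:111888", "192.0.2.0/24", "PE111"),
    ("65001:222888", "192.0.2.0/24", "PE222"),
])

# Same RD on both PEs (hypothetical value): one VPN route, one best path.
shared_rd = rr_reflected_next_hops([
    ("65001:888", "192.0.2.0/24", "PE111"),
    ("65001:888", "192.0.2.0/24", "PE222"),
])
print(unique_rd, shared_rd)
```

With unique RDs the ingress PE receives both next hops and can use ECMP; with a shared RD it would see only the RR's single best path.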
Perhaps counterintuitively, the active/standby design takes more effort to design and implement.
First, a primary link must be chosen, typically by setting the BGP local-preference inbound on
the primary egress PE to be greater than the BGP local-preference applied on the alternate egress
PEs. The alternate egress PE needs to be configured to advertise its best external route to the
RRs. The eBGP route from the customer won’t be the best path as the high local-preference
iBGP route will win. The BGP RR doesn’t care, because the unique RDs ensure that these routes
are separate and thus are not compared. Both are advertised to the ingress PE, which imports
both into the VPN routing table. BGP best-path runs on the ingress PE, and the device chooses
the route through the primary egress PE with the higher BGP local-preference. The second best-
[Diagram: active/standby variant. PE111 (RD 65001:111888) applies local-preference 100 and
advertises its best external route; PE222 (RD 65001:222888) applies local-preference 200. Both
routes are best per RD, so ingress PE444 installs 192.0.2.0/24 in VRF CUST888 via PE222 with
PE111 as the backup.]
imported, a specific VPN endpoint establishes an LDP-signaled pseudowire to all PEs that
exported that RT.
An additional extended community, known as the L2VPN attachment group identifier (AGI), is
also included. This is based on the BGP ASN and the operator-specified VPN ID, and must
match in order for RTs to be imported. It’s a way of controlling high-level VPN membership
while the RT determines the precise connectivity within a given VPN. Note that the only
difference between VPWS and VPLS is the number of endpoints. VPWS is a point-to-point
connection and would likely be configured as an independent AGI with the same RT imported
and exported by both nodes. Extending this design to 3 or more nodes, using the same AGI/RT
strategy, would create a full mesh of pseudowires between all nodes in the VPN. Adjusting the
RTs to create a hub/spoke VPN or other custom topology is also possible and may provide
improvements in security and scale. The diagram below illustrates the high-level design and
operation of customer L2VPNs. Note that it is common for the AGI to be the same as the RD,
but it doesn’t have to be. More importantly, the AGI must be manually adjusted for inter-AS
VPNs; otherwise, the differing AS numbers (first half of the AGI) cause a mismatch and the VPN
cannot form.
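The AGI/RT logic above can be sketched in a few lines of Python. This is a simplified model, not
a protocol implementation: it assumes a pseudowire forms only when both endpoints share an AGI
and each imports an RT that the other exports. The endpoint names and the AGI value 65001:2 are
hypothetical.

```python
from itertools import combinations

def endpoint(name, agi, rt_export, rt_import):
    return {"name": name, "agi": agi,
            "rt_export": set(rt_export), "rt_import": set(rt_import)}

def pseudowires(endpoints):
    """Return the set of endpoint pairs that would build a pseudowire."""
    result = set()
    for a, b in combinations(endpoints, 2):
        same_vpn = a["agi"] == b["agi"]              # AGI gates VPN membership
        a_imports_b = bool(a["rt_import"] & b["rt_export"])
        b_imports_a = bool(b["rt_import"] & a["rt_export"])
        if same_vpn and a_imports_b and b_imports_a:  # RTs shape the topology
            result.add((a["name"], b["name"]))
    return result

# One RT imported and exported everywhere: full mesh (VPWS when only two nodes).
full_mesh = [endpoint(n, "65001:1", ["65001:777000"], ["65001:777000"])
             for n in ("PE1", "PE2", "PE3")]

# Two RTs, swapped on hub and spokes: spokes never pair with each other.
hub_spoke = [
    endpoint("HUB", "65001:2", ["65001:888001"], ["65001:888002"]),
    endpoint("SPOKE1", "65001:2", ["65001:888002"], ["65001:888001"]),
    endpoint("SPOKE2", "65001:2", ["65001:888002"], ["65001:888001"]),
]

print(sorted(pseudowires(full_mesh)))
print(sorted(pseudowires(hub_spoke)))
```

Changing the AGI on any one endpoint removes all of its pseudowires regardless of its RTs, which
mirrors the inter-AS caveat described above.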
[Diagram: customer L2VPN design across the CSC core, with CE and CE-S devices at each end.
Labels on one side: RD 65001:111777, AGI 65001:1, RT export 65001:777000, RT import
65001:777000. Labels on the other side: RD 65001:444888, AGI 65001:22288, RT export
65001:888002, RT import 65001:888001.]
(VLANs are mapped to service instances). Also, note that VPWS is roughly synonymous with E-
LINE/EV-LINE and VPLS is roughly synonymous with E-LAN/EV-LAN.
Figure 15 - L2VPN Services Offered
Route-target design | One RT, import and export on all nodes | One RT, import and export on all nodes | Two RTs, swapped import and export on hubs and spokes
Some readers may be wondering why we chose LDP-signaled over BGP-signaled VPLS. The
former is operationally simpler to understand and has better OAM capabilities (at least on Cisco
IOS) than the latter. Understanding how label blocks and virtual offsets are computed in BGP-
signaled VPLS requires expert-level networking skills, which were in short supply within our
organization. On a technical level, not all vendors support setting the C-bit, signaling the
inclusion of an L2VPN control word (CW). The lack of a control word has several well-known
drawbacks: no ability to include sequence numbers, packets within a given pseudowire taking
different paths in the network, and more. LDP-signaled VPLS avoids these issues entirely.
Although MTU is always worth considering in any network, it is especially important for
L2VPNs. In L3VPNs, by contrast, the only additional MTU overhead is the MPLS encapsulation,
which is entirely predictable and is typically 2 shim headers (8 bytes). With L2VPNs, there are
many encapsulations for which to account:
a. The MPLS encapsulation within the POP: 8 bytes for 2 MPLS shim headers
b. The MPLS control-word: 4 bytes
c. Any customer VLANs retained (i.e. not popped) over the VPN: 8 bytes for up to 2 VLAN
headers. This may not be relevant if you only offer VLAN-based services whereby all
VLANs are removed at ingress.
d. Customer standard Ethernet header: 14 bytes
The total additional overhead becomes 34 bytes for L2VPN compared to 8 bytes for L3VPN. In
our environment, we provided a full 1500 byte MTU to our customers over both L3VPN and
L2VPN by using jumbo frames both intra-POP and inter-POP over CSC. If jumbo support is not
available in your network, it is imperative that your customers know the precise MTU that is
available. Ignoring the upper-most layer-2 encapsulation (Ethernet in our case) within the POP,
the diagram below illustrates how these two services differ with respect to MTU. The diagram
also assumes the more difficult (and worse) case of only having a 1500 byte MPLS MTU.
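The overhead arithmetic above is easy to check with a short Python sketch. The byte counts come
directly from the list above; the function names are illustrative only.

```python
MPLS_SHIM_BYTES = 4

L3VPN_OVERHEAD = {"mpls_labels": 2 * MPLS_SHIM_BYTES}
L2VPN_OVERHEAD = {
    "mpls_labels": 2 * MPLS_SHIM_BYTES,  # transport + VPN labels within the POP
    "control_word": 4,                   # MPLS control-word
    "customer_vlans": 2 * 4,             # up to 2 retained VLAN headers
    "customer_ethernet": 14,             # customer Ethernet header
}

def total_overhead(parts):
    return sum(parts.values())

def customer_ip_mtu(mpls_mtu, parts):
    """Largest customer IP packet that fits inside a given MPLS MTU."""
    return mpls_mtu - total_overhead(parts)

def required_mpls_mtu(customer_mtu, parts):
    """MPLS MTU needed to deliver a given customer IP MTU unfragmented."""
    return customer_mtu + total_overhead(parts)

print(total_overhead(L3VPN_OVERHEAD), total_overhead(L2VPN_OVERHEAD))  # 8 34
print(customer_ip_mtu(1500, L3VPN_OVERHEAD))    # 1492
print(customer_ip_mtu(1500, L2VPN_OVERHEAD))    # 1466
print(required_mpls_mtu(1500, L2VPN_OVERHEAD))  # 1534
```

In other words, with only a 1500 byte MPLS MTU, an L2VPN customer retaining two VLAN tags is
left with a 1466 byte IP MTU, while a full 1500 byte service requires at least 1534 bytes of
MPLS MTU.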
This document will use the Cisco-specific profile numbers for the sake of brevity. Profile 0 is the
classic “Draft Rosen” technique that uses a dedicated, inflexible BGP address-family to advertise
PE loopbacks between devices in a given VPN. This source discovery process obviates the need
for PIM RPs in the network and enables SSM to be exclusively deployed for customer multicast
transport, even for default MDTs. In real life, this is the option we chose, as it was widely
supported, well-documented, and commonly used in production networks for years. The main
drawback of this approach is that, assuming no other multicast-related BGP sessions are
established, all VPNs must use profile 0 regardless of their connectivity requirements. This can
be limiting for future operations.
Because BGP is only used for remote PE discovery, switching over to more optimal MDTs for
high-bandwidth flows (called “data MDTs” in Cisco parlance) was handled within the PIM
overlay. These selective MDTs are a subset of the larger inclusive, default MDT. The ingress PE
(the one connected to the source) signals this using a special PIM message named “Data MDT
Join”. Each tree describes a different provider multicast service interface (PMSI). The Inclusive
PMSI (I-PMSI) represents the default MDT and the Selective PMSI (S-PMSI) represents the
individual data MDTs. The diagram below represents how these components fit together within
profile 0.
[Diagram: MVPN profile 0 across the CSC core, showing the PIM emulated LAN between PEs.]
Profile 3 operates almost identically to profile 0 except it uses a different BGP address-family.
Its purpose is still limited to source discovery, but BGP is now capable of signaling more than
just IPv4 addresses to serve as SSM sources. By building the network using profile 3 in the first
place, the designer can retain the benefits of a stable, known-good solution while also being
prepared for future requirements. If we were rebuilding the network today, this is the profile we
would have likely chosen for the vast majority of customer VPNs. All of the customer multicast
signaling still uses PIM over the inter-PE emulated LAN, and while this limits scale, it is simple
to understand and operate. In this context, there are two BGP messages. One is used for I-PMSI
endpoint discovery and is used to construct the default MDT. The other is used to signal S-PMSI
switchover events as they occur. The S-PMSI message is comparable to the “Data MDT Join”
message, but contains a bit more contextual data about the tunnel itself. The diagram below
illustrates how these messages work with MVPN profile 3.
[Diagram: MVPN profile 3 across the CSC core, showing the PIM emulated LAN and a default MDT of
232.255.8.8.]
Suppose a customer has hundreds of routers in a multicast VPN, most of which are senders and
receivers. Having all of these neighbors on an emulated LAN exchanging soft-state PIM
messages would scale poorly. Profile 11 addresses this by using the modern BGP MVPN
address-families to signal customer multicast information. This obviates the need for a PIM
overlay, relying on BGP both for PE discovery and customer multicast signaling. This solution is
complicated and not commonly deployed, but the solution is at least technically possible when
BGP IPv4/v6 MVPN is used instead of BGP IPv4 MDT. The technical nuances behind how this
signaling works are beyond the scope of this whitepaper, but in summary, BGP uses different
messages that roughly correspond with PIM (*,G) join, PIM (S,G) join, and PIM register
messages. Withdrawing a BGP MVPN NLRI relating to (*,G) or (S,G) state is comparable to
sending a PIM prune. The diagram below illustrates the high-level operation of profile 11.
[Diagram: MVPN profile 11 across the CSC core, showing a default MDT of 232.255.8.8.]
Some carriers do not support MVPN at all, providing only unicast transport. This could be true
for the customer carrier POPs or the broader CSC transport network. Ingress Replication (IR)
allows MVPNs to use existing unicast LSPs for multicast transport by replicating multicast
traffic at the ingress PE. While this is highly inefficient and defeats the purpose of multicast in
general, it can be useful for low-bandwidth applications. For example, in our environment, we
had an application that dynamically discovered its peers using multicast, not DNS like most
applications would use. This was a very low-bandwidth flow, and if our core carrier did not
support MVPN, IR would have been an appropriate choice for some of our customers. Profiles 19
and 21 differ in that one uses PIM overlay emulation while the other uses BGP MVPN for customer
multicast signaling. The diagram below illustrates the high-level operation of IR MVPN without
differentiating between customer multicast signaling types.
[Diagram: IR MVPN across a multicast-free CSC core. CE-TX sends S=192.0.2.99, G=232.0.0.99
toward two CE-RX receivers; customer multicast signaling uses either BGP MVPN or a PIM overlay.]
It is useful to briefly consider a future where CSC no longer exists. Suppose the entire transport
network is converted to E-LAN because all sites suddenly have access to Ethernet last-mile
uplinks. Now, the possibilities for MVPN are broadened, assuming BGP MVPN IPv4/v6 have
been deployed, providing even more service offerings for customers. This is yet another reason
to deploy the modern BGP control-plane, even if there is no immediate operational benefit.
sites, each of which contributed only a handful of IPv4 subnets. The CSC carrier did not seem to
care, either.
To secure the BGP control-plane, we applied BGP routing filters on these CSC-CE to CSC-PE
sessions. Applied inbound, we denied any local POP networks. For example, if a POP was using
the 192.168.0.0/24 address space for its various transit links, device loopbacks, and global
management networks, we could block that entire range, including longer matches. This
guarantees that even if the BGP AS-path loop control mechanism breaks down due to a core
carrier misconfiguration, the POP will never learn its own prefixes via eBGP. Then, to prevent
any route leakage from other customers that the core carrier may be servicing, we permitted the
remaining POP networks, say 192.168.0.0/16, capturing all the other sites. We also permitted the
CSC-CE to CSC-PE transit links, which were subnets provided by the core carrier, but
aggregated nicely into an easily-matched prefix, such as 198.51.100.0/24. Again, this provided
full connectivity but significantly reduced any remote possibility of a routing problem.
Applied outbound, we permitted only local POP networks. Using the prefix example above, that
would be 192.168.0.0/24. POPs are never meant to be transit sites except in uncommon
situations where a satellite is “tethered” to a regional POP temporarily. We used this strategy
when we needed to bring a POP online that was physically near a regional POP, but for which the CSC
circuit was not yet provisioned. In that rare case, the satellite POP’s transport prefix would be
added to the permitted outbound filter. The diagram below illustrates these simple BGP filters
and their positive impact on BGP’s stability.
Figure 21 - eBGP-LU Inbound and Outbound Filters
[Diagram: a POP's CSC-CEs peer with CSC-PEs over transit links 198.51.0.0/30 and 198.51.0.4/30;
a remote POP (192.168.5.0/24) attaches via 198.51.0.128/30.
OUTBOUND: permit 192.168.4.0/24.
INBOUND: deny 192.168.4.0/24, permit 192.168.0.0/16, permit 198.51.100.0/24.]
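The inbound and outbound filters can be modeled with Python's standard ipaddress module. This
is a sketch of the first-match logic only, using the example prefixes from the text above
(192.168.0.0/24 as the local POP range). It assumes every entry also matches longer prefixes and
that an implicit deny ends each filter; the outbound filter's handling of longer matches is our
assumption for illustration.

```python
import ipaddress

# Ordered (action, covering prefix) entries; first match wins.
INBOUND = [
    ("deny", "192.168.0.0/24"),     # local POP networks, including longer matches
    ("permit", "192.168.0.0/16"),   # all remaining POP networks
    ("permit", "198.51.100.0/24"),  # CSC-CE to CSC-PE transit links
]
OUTBOUND = [
    ("permit", "192.168.0.0/24"),   # only local POP networks
]

def evaluate(entries, prefix):
    """Return 'permit' or 'deny' for a prefix using first-match semantics."""
    network = ipaddress.ip_network(prefix)
    for action, covering in entries:
        if network.subnet_of(ipaddress.ip_network(covering)):
            return action
    return "deny"  # implicit deny at the end of the filter

for candidate in ("192.168.0.1/32", "192.168.5.0/24",
                  "198.51.100.4/30", "203.0.113.0/24"):
    print(candidate, evaluate(INBOUND, candidate))
```

The ordering matters: the local /24 deny must precede the /16 permit, otherwise the POP's own
prefixes would be accepted as part of the aggregate.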
Extending BGP-LU to the PEs using iBGP sessions is a more advanced, complex solution that
may offer benefits for some customers. First, there is no redistribution, so there is no possibility
of a routing loop. Also, it helps separate intra-POP routing from inter-POP routing from the
perspective of each PE, using IGP for the former and iBGP for the latter. This separation may
simplify deploying a non-LDP labeling method, such as Segment Routing or RSVP-TE. Minor
BGP tuning, such as whether to use next-hop-self on the CSC-CEs towards the PEs, has its own
set of trade-offs. Such a design is similar to seamless/unified MPLS. The benefits end here, but
there are many drawbacks.
Because there is an additional level of indirection (i.e., another routing lookup) on each ingress
PE for inter-POP customer traffic, a third label must be imposed in addition to the standard
“transport” and “VPN” labels. When a packet arrives at the ingress PE, the router:
a. Performs a routing lookup on the VPN destination prefix and pushes the BGP VPN label
b. Finds that the next-hop is iBGP-learned; performs a routing lookup and pushes the iBGP-LU transport label
c. Finds that the next-hop is IGP-learned; performs a routing lookup and pushes the IGP transport label
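The recursive lookups above can be sketched as follows. All label values, prefixes, and
next-hop addresses are hypothetical, and the three dictionaries stand in for the VPN, iBGP-LU,
and IGP/LDP tables on the ingress PE.

```python
def impose_labels(prefix, vpn_routes, bgp_lu_routes, igp_routes):
    """Return the label stack (top label first) pushed by the ingress PE."""
    vpn_label, next_hop = vpn_routes[prefix]
    stack = [("VPN", vpn_label)]
    if next_hop in bgp_lu_routes:  # next-hop is iBGP-LU learned (inter-POP)
        lu_label, next_hop = bgp_lu_routes[next_hop]
        stack.append(("BGP-LU", lu_label))
    stack.append(("IGP", igp_routes[next_hop]))  # next-hop is IGP-learned
    return stack[::-1]

# Hypothetical tables: 10.2.2.2 is a remote-POP PE reached via CSC-CE 10.1.1.1,
# while 10.1.1.3 is a PE in the local POP.
vpn_routes = {"192.0.2.0/24": (9001, "10.2.2.2"),
              "203.0.113.0/24": (9002, "10.1.1.3")}
bgp_lu_routes = {"10.2.2.2": (5002, "10.1.1.1")}
igp_routes = {"10.1.1.1": 3001, "10.1.1.3": 3003}

inter_pop = impose_labels("192.0.2.0/24", vpn_routes, bgp_lu_routes, igp_routes)
intra_pop = impose_labels("203.0.113.0/24", vpn_routes, bgp_lu_routes, igp_routes)
print(inter_pop, len(inter_pop) * 4, "bytes")  # three labels, 12 bytes
print(intra_pop, len(intra_pop) * 4, "bytes")  # two labels, 8 bytes
```

Only inter-POP traffic carries the third label, so the extra 4 bytes apply exactly on the PE to
P links where a BGP-free core is in use.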
Consider POPs that have dedicated P routers, such as those arranged in a leaf/spine fashion. The
P routers (spines) only have IGP routes for the local POP and are unaware of CSC’s existence
entirely. It’s basically a BGP-free core within the customer POP, but this raises new problems.
The first and most obvious issue revolves around Maximum Transmission Unit (MTU).
Although an extra MPLS shim header only adds 4 bytes of encapsulation, architects must take
care to account for this between PEs (leaves) and Ps (spines). The diagram below illustrates the
label stacking process for this design.
Figure 23 - Inter-POP Flow with iBGP-LU from CSC-CE to PE
[Diagram: label stacking along the path PE, P, CSC-CE, CSC-PE, CSC core, CSC-PE, CSC-CE, labeled
IBGP-LU 10.1.1.1.]
Lastly, using iBGP-LU is significantly more difficult to design, implement, operate, and
troubleshoot than CSC-CE eBGP-LU/IGP redistribution. It was not a difficult choice for us; we
chose to redistribute, knowing that the likelihood of loops was infinitesimally small.
While we had many regional POPs with dedicated CSC-CEs, Ps, and PEs, some POPs were just
a single router performing the CSC-CE and PE functionality together. These routers did not run
IGP as there was no reason, but had at least one eBGP-LU uplink to at least one CSC-PE for
connectivity to the regional POPs. The precise integration of these remote POPs and their BGP
VPN connectivity is explained later.
[Diagram: iBGP over CSC. Labels: EBGP LU to the CSC core; IBGP VPN (NON RR-CLIENT) between
CSC-CEs and PEs; BGP AS 65001.]
Over time, we observed a number of significant drawbacks of using iBGP over CSC in our
environment as depicted above. First, any additional transport links between POPs, such as E-
LINE/E-LAN services or dark fiber circuits, significantly complicated the design. Running IGP
between the POPs is sloppy as routers now must decide between eBGP-LU and IGP routes for
transport between PEs. This complicates redistribution (if used), filtering, flooding/failure
domain boundaries, and more. Even more complications occur if the different POPs use different
IGPs and different label distribution methods. More complex still is when multiple links exist
between sites at various speeds, whereby some are faster than CSC and some are slower. The
diagram below illustrates what this confusing situation might look like.
[Diagram: iBGP over CSC with additional inter-POP links. Labels: EBGP LU to the CSC core; IBGP
VPN (NON RR-CLIENT); MERGED BGP/OSPF DOMAIN.]
Using eBGP between the POPs is clearly the superior approach for selecting the arbitrary “best”
link when given multiple options. Since all the POPs are in the same AS, this becomes
complicated. While iBGP-LU can technically be transformed to work like eBGP-LU using a
combination of local-AS and next-hop-self adjustments, it’s a sloppy and unscalable
workaround. The diagram below illustrates such an implementation, which we deployed in real
life to overcome urgent, uncommon circumstances. This was the catalyst for considering a new
design.
[Diagram: iBGP-LU made to behave like eBGP-LU. Labels: IBGP LU WITH NEXT-HOP-SELF; LOCAL-AS
65003 / REMOTE-AS 65002 on one end and LOCAL-AS 65002 / REMOTE-AS 65003 on the other; IBGP VPN
(NON RR-CLIENT and RR-CLIENT); BGP AS 65001.]
When iBGP is used, the core carrier will typically replace all instances of the customer carrier’s
BGP ASN with their own BGP ASN when advertising transport routes between POPs. If POPs
have direct connectivity between themselves, perhaps using backdoor links or alternative layer-2
transports, this could cause routing loops. Such loops would be rare, even with a mesh of
backdoor links. Even if the AS-path length prevents the actual routing loop, the looped prefixes
are still available where they should not be unless they are explicitly filtered elsewhere.
Placing each POP (regardless of size) into its own BGP AS alleviates all of these problems.
While it does require some additional AS number management and slightly different CSC-PE
configurations by the core carrier, the operational benefits far outweigh these administrative
inconveniences. Because each POP is in its own AS, the BGP VPN sessions between POPs will
use eBGP. One of the main features of MPLS Inter-AS Option C is that it allows these BGP
VPN speakers, often route-reflectors, to exchange eBGP routes without updating the BGP
next-hop. The assumption is that the different ASes are already exchanging, at a minimum, all of
the prefixes necessary for MPLS transport.
It is important to understand that a BGP “autonomous system” is just a logical construct that
governs BGP behavior. It is not necessarily a different administrative or operational domain; all
of the POPs in the network remained under the control of a single organization regardless of the
BGP ASNs assigned. BGP confederations could also be used in the case where the core carrier
demands that all POPs be in the same BGP AS (for operational simplicity on their part) while
also gaining the advantage of confed-external peers between POPs. This is true for both labeled
transport on backdoor links and VPN services between RRs.
Consider a network that has a mix of large POPs, both traditional and leaf/spine, and small,
single-router POPs. All of these POPs are connected via CSC with full connectivity over
functional LSPs. The small POPs should connect back to the closest regional POP that hosts a
pair of RRs to service the PEs in that regional POP. The small POPs logically act like “satellite
PEs”, accessible over CSC, by connecting via BGP VPN to the RRs using eBGP. Because the
RRs will not change the BGP next-hops when advertising VPN routes to these satellite PEs,
they’ll behave just like any other PE in the regional POP. The design mimics iBGP with respect
to optimal MPLS forwarding as the RRs are not forced into the transit path while also
overcoming the iBGP limitations.
The diagram below illustrates how satellite POPs connect back into their parent regions. For the
sake of brevity, this section uses the quoted phrase “route-reflection” to describe eBGP behavior
with respect to remote sites. This term is explained in greater detail later.
Figure 28 - Satellite POP Connectivity to Regional POP Using eBGP VPN
Inter-regional BGP VPN connectivity can be designed in one of two main ways:
a. Add a second tier of “route-reflection”, except using eBGP instead of iBGP
b. Connect RRs directly between regions using eBGP
The advantage of a second tier of “route-reflection” is improved scale when the number of
regions is very large. In our case, we only had 6 regions, so scale in this context was not a large
concern. Additionally, these second tier “RRs” would have to be hosted somewhere, presumably
in 2 of the 6 regions, creating points of failure. If those 2 regions went offline, the remaining 4
regions would not be able to exchange any BGP VPN routes, which was an unacceptable trade-
off to gain a level of scale we didn’t need. Additionally, these routers are not technically route-
reflectors because all of their peers will be eBGP. As is common in Internet Exchange Points
(IXP), these routers can be route-servers, which operationally act like route-reflectors with some
minor technical differences regarding AS path recording. This design does, in fact, work with
eBGP VPN address-families and is technically valid, although it strains credulity and would not
be advisable to deploy in production without a compelling reason.
Figure 29 - Second-Tier "Route Reflection" with eBGP VPN Sessions
at a steep cost in terms of design conceptualization. This same logic extends from intra-POP to
inter-POP. All of the “A” RRs can be fully-meshed over eBGP to implement Inter-AS Option C.
Likewise, the same design applies to the “B” RRs, effectively creating two separate, parallel
meshes for improved availability and fault domain isolation. Given that these eBGP sessions are
multi-hop, just like iBGP sessions, individual CSC-PE to CSC-CE uplink failures won’t affect
these meshes because each “side” can still route across alternative uplinks. Only when a POP is
completely cut off from the core carrier will these sessions fail. Unlike the second-tier of “route
reflection” (really, route-servers or just multi-hop/next-hop-unchanged eBGP behavior), any pair
of regions, including their satellites, can communicate across CSC provided they have
connectivity as there are no inter-regional dependencies. The diagram below illustrates the BGP
VPN mesh design.
[Diagram: BGP VPN mesh design. Labels: EBGP VPN A-MESH and EBGP VPN B-MESH across the CSC core;
IBGP VPN RR-CLIENT sessions to the PEs within each POP.]
2. Deny all outgoing inter-regional VPN routes with that community; permit all others
Filtering these routes outbound prevents over-reflection. All of the regions are fully meshed
anyway, and reflecting routes between regions creates unnecessary BGP table bloating and
general confusion among network operations. Again, this technique is a simple and scalable way
to simulate regular iBGP non-RR client advertisement rules over eBGP. The diagram below
illustrates how this filtering works in the global network.
Figure 31 - Controlling Inter Region eBGP VPN Advertisements
[Diagram: a VPN route from a satellite POP (BGP AS 65008) arrives over an eBGP VPN "RR-client"
session. The regional RRs send it to local PEs (iBGP VPN RR-clients) and to other regions over
the eBGP VPN A/B mesh. The receiving regions (West Region, BGP AS 65003; South Region, BGP AS
65004) send it to their local PEs but do not "reflect" it onward. Outbound policy to
inter-region peers: deny community 65000:999.]
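The community-based filtering can be sketched in Python as a pair of policy functions. The
outbound rule comes from step 2 above and the community value 65000:999 from Figure 31. The
inbound tagging step is our assumption about what the first step does, since it is not shown in
this excerpt, and the peer-type names are hypothetical.

```python
INTER_REGION_COMMUNITY = "65000:999"

def receive(route, peer_type):
    """Inbound policy (assumed): tag routes learned from inter-regional peers."""
    communities = set(route["communities"])
    if peer_type == "inter-region":
        communities.add(INTER_REGION_COMMUNITY)
    return {**route, "communities": communities}

def may_advertise(route, peer_type):
    """Outbound policy: never send tagged routes back out to other regions."""
    if peer_type == "inter-region" and INTER_REGION_COMMUNITY in route["communities"]:
        return False
    return True  # permit all others

from_other_region = receive({"prefix": "192.0.2.0/24", "communities": []}, "inter-region")
from_local_pe = receive({"prefix": "203.0.113.0/24", "communities": []}, "local-pe")

print(may_advertise(from_other_region, "local-pe"))      # True
print(may_advertise(from_other_region, "inter-region"))  # False
print(may_advertise(from_local_pe, "inter-region"))      # True
```

This reproduces the iBGP non-RR-client rule over eBGP: routes learned from the mesh go only to
local clients, while locally originated routes go everywhere.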
Consider the simpler case of a point-to-point link (say, VPWS or dark fiber) between two POPs.
Using eBGP means that we don’t have to get creative with local AS spoofing and can simply
peer the POPs directly. BGP will prefer the direct link by default as the AS path to the peer
POP will be shorter over the direct link than over CSC. As it relates to IGP redistribution, the
same design concepts apply here as they did earlier when discussing the CSC-CEs. Mutual
redistribution can occur on both ends of the link without any complexities because, assuming
OSPF is used, only internal routes are candidates for redistribution into BGP. Therefore, these
POPs will never act as transit nodes for one another, making this a “peering” link and not a
“transit” link, to use Internet terminology.
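The default preference for the direct link follows from AS path length alone, as this small
sketch shows. It assumes equal local-preference on both paths and reduces best-path to that one
step; the core carrier ASN 64500 is a hypothetical value.

```python
def best_path(paths):
    """Simplified best-path: with equal local-preference, shortest AS path wins."""
    return min(paths, key=lambda p: len(p["as_path"]))

# Routes to a peer POP in AS 65002, as seen from another POP.
paths = [
    {"via": "CSC", "as_path": [64500, 65002]},   # through the core carrier
    {"via": "direct link", "as_path": [65002]},  # point-to-point eBGP peering
]

print(best_path(paths)["via"])  # direct link
```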
In order for the PEs to choose the direct link over CSC, use OSPF external type-2 routes with a
lower seed metric on the ASBR terminating the direct link. This document won’t detail all the
different BGP and IGP configuration changes relating to forwarding policy, but engineering
traffic to flow over either transport is not challenging nor is it the focus of this whitepaper. The
diagram below illustrates this design.
Figure 33 - Non-CSC Direct Links Between POPs
[Diagram: direct links between POPs, with EBGP-LU sessions both to the CSC core and over the
direct link.]
[Diagram: E-LAN variant. Labels: EAST REGION BGP AS 65002; BGP AS 65008; EBGP-LU sessions to the
CSC core and over the E-LAN; EBGP ROUTE SERVER on the E-LAN.]
Admittedly, this second design is quite rare and has likely never been deployed in production.
The design is mostly conceptual and should be thoroughly vetted before actual deployment. In
our environment, we only consumed point-to-point links for tactical, “quick fix” reasons which
were quickly decommissioned when no longer needed. This obviated the need for any creativity
with BGP route-servers combined with labeled-unicast. Be sure to use the same community-
based filtering method described for eBGP VPN sessions with these eBGP-LU sessions between
route-servers and their clients to prevent over-advertisement and potential loops.
Instead of using route-servers, some vendors implement eBGP in such a way that the next-hop
for a given NLRI is in the same subnet as a remote peer. This makes sense for IXP connections
or any other fully-meshed layer-2 network, as is the case here. Beware that the introduction of
labeled-unicast may change this behavior, requiring some configuration workarounds (e.g.
enabling multi-hop eBGP and next-hop-unchanged despite the session being single-hop) to make
MPLS forwarding and label allocation work correctly. Cisco IOS-based devices appear to
require this workaround; be sure to test your specific platforms extensively.
[Diagram labels: EBGP VPN OPTION C across the CSC core; EBGP IP OPTION A via a VRF and an
optional firewall toward another SP's PE.]
In our network, the CSC uplinks were typically 1 Gbps Ethernet links, but the core carrier could
not guarantee this, as many circuits still used SONET/SDH. The carrier typically provisioned
(and policed) circuits at 150 Mbps, roughly the same speed as an OC-3 or STM-1 (155 Mbps),
and applied ingress policers on their CSC-PEs to enforce the contracted rate. As such, traffic
conditioning via shaping on the CSC-CE was necessary to slow down traffic to this rate, despite
the line rate of the interface being much faster.
In most cases, the customer carrier should use the minimum possible time committed (Tc) to
improve the user experience for real-time and transactional applications, such as voice,
teleconferencing, and multi-media services. Given a committed information rate (CIR) of 150
Mbps and a target Tc of 4 ms (the minimum value on our hardware), the burst committed (Bc)
would be 600 kilobits (kb). This means that every 4 ms, the interface can send 600 kb at the
physical rate of 1 Gbps (taking about 0.6 ms), then must wait until the next 4 ms interval to
send more traffic.
Because carriers sometimes do not provide low-level details about the configuration of their
policers, we made moderately conservative assumptions regarding burst excess (Be). We used
600 kb for this value as well, matching Bc. If the shaper does not send any traffic for an entire
Tc, the shaper is allowed to burst up to an additional 600 kb in the next Tc, for a total of 1200 kb
in that interval and a peak information rate (PIR) of 300 Mbps. In effect, this allows the customer carrier to
reclaim up to one lost Tc due to inactivity. In our experience, we did not observe any negative
effects from this assumption, although an even more conservative approach would be using a Be
value of 0 kb. This effectively creates a peak shaper, setting the PIR to 150 Mbps.
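The shaper arithmetic above can be verified with a short Python sketch. It uses integer math
only, and the function and key names are illustrative rather than any platform's terminology.

```python
def shaper_parameters(cir_bps, tc_ms, be_bits, line_rate_bps):
    """Derive Bc, the per-interval maximum, PIR, and the Bc transmit time."""
    bc_bits = cir_bps * tc_ms // 1000
    max_bits_per_tc = bc_bits + be_bits
    return {
        "bc_bits": bc_bits,
        "max_bits_per_tc": max_bits_per_tc,
        "pir_bps": max_bits_per_tc * 1000 // tc_ms,
        "bc_send_time_us": bc_bits * 1_000_000 // line_rate_bps,
    }

# CIR 150 Mbps, Tc 4 ms, Be 600 kb, on a 1 Gbps interface.
with_be = shaper_parameters(150_000_000, 4, 600_000, 1_000_000_000)
# The more conservative option: Be of 0 kb.
without_be = shaper_parameters(150_000_000, 4, 0, 1_000_000_000)

print(with_be)     # Bc 600000 bits, up to 1200000 bits per Tc, PIR 300 Mbps
print(without_be)  # Bc 600000 bits, PIR equals the CIR of 150 Mbps
```

The 600 microsecond transmit time shows why shaping is needed at all: the interface drains its
entire Bc in a small fraction of each 4 ms interval and sits idle for the rest.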
Most customer-facing PE-CE links in our environment tended to be the slowest, least stable
links, and thus deserving of the most precise QoS. At the same time, customers used a variety of
DSCP values within their own networks and their ability to remark was limited. On the CE, they
can implement their own queuing and shaping for outbound traffic. On the ingress PE receiving
that traffic, the classification and marking process preserves those values and maps them to
MPLS EXP or DSCP tunnel values as described earlier. The queuing and shaping on the egress PE is
very similar to that used in the core, except that it exclusively matches DSCP values as the PE-CE links
never run MPLS. To keep things consistent, we opted for a 6-queue policy with similar PHBs.
The diagram below illustrates the complete queuing design from CE to CSC-PE.
Figure 36 - Queuing and PHB Design
Queue     DSCP match     EXP match   Bandwidth (three policies, as drawn left to right)
NC        CS6/7          EXP 6       2% / 2% / 5%
V/SIG     CS5, EF        EXP 5       23% / 23% / 10%
VIDEO     CS/AF 3/4      EXP 4       15% / 15% / 15%
ELASTIC   CS2, AF1/2     EXP 2       25% / 25% / 30%
SCAV      CS1            EXP 1       10% / 10% / 10%
BE        DF             EXP 0       25% / 25% / 15%
DSCP CS6, CS7               Network control           EXP 2 / DSCP CS2   EXP 3 / DSCP CS3
DSCP CS5, EF                Voice bearer traffic      EXP 5 / DSCP CS5   EXP 5 / DSCP CS5
DSCP CS3, CS4, AF3x, AF4x   All video traffic         EXP 4 / DSCP CS4   EXP 4 / DSCP CS4
DSCP CS2                    OAM traffic               EXP 2 / DSCP CS2   EXP 2 / DSCP CS2
DSCP AF1x, AF2x             Transactional/bulk data   EXP 2 / DSCP CS2   EXP 2 / DSCP CS2
DSCP CS1                    Scavenger data            EXP 1 / DSCP CS1   EXP 1 / DSCP CS1
It is worth noting
that both the customer and core carriers protect their own network control traffic by never
allowing customer traffic to compete directly with it. For example, an end customer’s DSCP CS6
traffic has MPLS EXP 2 imposed by the customer carrier. A customer carrier’s MPLS EXP 6
traffic has MPLS EXP 3 imposed by the core carrier (according to them, at least). This treatment
is imperfect as it mixes inelastic customer network control with elastic customer data. This trade-
off allowed us to use a simpler queuing strategy by introducing only a small risk to customer
network stability. The diagram below illustrates how DSCP and EXP are handled for upstream
flows from CE to CSC-PE.
Additionally, customer carriers should consider policing traffic on ingress from customers.
Because packet loss must be minimized, these policies can simply impose different, lower-priority
MPLS EXP or DSCP tunnel values when traffic limits are exceeded. For example, a G.711
phone call, assuming it is encapsulated in Ethernet, consumes about 90 kbps. Suppose each
customer is allowed 100 calls, for a total of 9 Mbps of bandwidth. Customers can manage this in
their telephony control-plane using various call admission control technologies, which are
commonly deployed. Assuming a single-rate, three-color policer is used, it should mark
conforming and exceeding traffic as EXP 5 per the table above. This provides low-latency
treatment to conforming and permissible excess burst traffic. Violating traffic beyond the CIR
for extended periods of time is marked as EXP 0. Put another way, customer voice is never
dropped (unless the aggregate link CIR is overwhelmed using hierarchical policers), but will stop
receiving low-latency treatment beyond 9 Mbps. This allows customers to make their own
risk/reward decisions regarding admission control and voice oversubscription, although this is
strongly discouraged.
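The policer/remarker behavior described above can be sketched in Python. This is a minimal, color-blind single-rate three-color marker under stated assumptions, not a vendor implementation: the burst sizes (CBS/EBS), the class name, and the function names are hypothetical. The per-call arithmetic assumes G.711 with 20 ms packetization (160-byte payload, 50 packets per second) and untagged Ethernet framing without preamble or inter-frame gap, which lands near the 90 kbps figure used in the text.

```python
# Illustrative sketch only; names and burst sizes are assumptions, not from the design.
EXP_BY_COLOR = {"conform": 5, "exceed": 5, "violate": 0}  # remark, never drop


def g711_call_bps(payload_bytes=160, packets_per_second=50):
    """Approximate per-call bandwidth for G.711 over Ethernet."""
    overhead_bytes = 12 + 8 + 20 + 18  # RTP + UDP + IPv4 + Ethernet header/FCS
    return (payload_bytes + overhead_bytes) * 8 * packets_per_second


class SingleRateThreeColorMarker:
    """Two token buckets (committed and excess) filled at a single rate (CIR)."""

    def __init__(self, cir_bps, cbs_bytes, ebs_bytes):
        self.cir_bps = cir_bps
        self.cbs_bytes = cbs_bytes
        self.ebs_bytes = ebs_bytes
        self.tc = float(cbs_bytes)  # committed bucket starts full
        self.te = float(ebs_bytes)  # excess bucket starts full
        self.last_seen = 0.0

    def _refill(self, now):
        earned = (now - self.last_seen) * self.cir_bps / 8  # bytes earned
        self.last_seen = now
        to_committed = min(earned, self.cbs_bytes - self.tc)
        self.tc += to_committed
        # Tokens overflow into the excess bucket only once the committed one is full.
        self.te = min(self.ebs_bytes, self.te + earned - to_committed)

    def color(self, now, size_bytes):
        self._refill(now)
        if self.tc >= size_bytes:
            self.tc -= size_bytes
            return "conform"
        if self.te >= size_bytes:
            self.te -= size_bytes
            return "exceed"
        return "violate"

    def mpls_exp(self, now, size_bytes):
        """EXP value to impose: 5 for conform/exceed, 0 for violate."""
        return EXP_BY_COLOR[self.color(now, size_bytes)]


CIR_BPS = 100 * 90_000  # 100 calls at about 90 kbps each = 9 Mbps
voice_policer = SingleRateThreeColorMarker(CIR_BPS, cbs_bytes=2000, ebs_bytes=2000)
```

A burst that drains both buckets is remarked to EXP 0 rather than dropped, and low-latency marking resumes once tokens are replenished at the CIR.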
At a minimum, policing voice traffic this way is particularly helpful to prevent saturation in the
core, because most LLQ implementations will stop providing LLQ treatment to traffic in excess
of the allocated percentage. Without the policer/remarker, one customer could flood DSCP EF
(and subsequently, EXP 5) into the network beyond what the carriers have forecasted. This
selfish act would harm all customers and likely have a negative business outcome for the carrier.
This same policing strategy makes sense for other queues as well, but in our network, we limited
the policer to customer voice traffic for simplicity. As a final technical point, using hierarchical
policers can be useful as well. A generic CIR policer could encapsulate the entire ingress policy,
with subrate policers on a per-class basis as described above. We opted to skip this approach for
operational simplicity.
[Figure: CSC core, CSC-PEs, CSC-CE, and PEs; labels: eBGP-LU, OSPF/LDP, ping, traceroute, mark CS2/EXP2, PE 10.0.99.0/24 IGP enabled]
[Figure: CSC core, CSC-PEs, CSC-CE, and PEs; labels: iBGP VPN, ping, traceroute, SNMP, SSH, NETCONF, MGMT VRF; PE subnets: USERS 172.16.1.0/24, VOICE 172.16.2.0/24, SERVERS 172.16.3.0/24]
Using a pair of route-targets, this VPN followed a tree design. All of the NOCs (we had three)
would import and export the “root” RT. This allowed all the NOCs to form a full mesh between
one another. Additionally, the NOCs would import the “leaf” RT, allowing them to access all of
the remote POPs. The non-NOC sites would export the “leaf” RT and import the “root” RT. This
ultimately creates a hub/spoke style of network with all of the hubs being fully meshed. The
diagram below illustrates the high-level connectivity between sites using dummy RTs.
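The root/leaf import and export rules can be checked with a small reachability model. This is a sketch for reasoning about the policy, not router configuration: the site names POP-A and POP-B are hypothetical, and the RT values are the dummy values 65000:1 ("root") and 65000:2 ("leaf").

```python
# Illustrative model of the root/leaf route-target policy; site names are assumptions.
ROOT_RT = "65000:1"
LEAF_RT = "65000:2"

SITES = {
    # NOCs import root and leaf, and export root.
    "NOC1": {"import": {ROOT_RT, LEAF_RT}, "export": {ROOT_RT}},
    "NOC2": {"import": {ROOT_RT, LEAF_RT}, "export": {ROOT_RT}},
    "NOC3": {"import": {ROOT_RT, LEAF_RT}, "export": {ROOT_RT}},
    # Non-NOC sites import root and export leaf.
    "POP-A": {"import": {ROOT_RT}, "export": {LEAF_RT}},
    "POP-B": {"import": {ROOT_RT}, "export": {LEAF_RT}},
}


def learns_routes_from(receiver, sender):
    """A site installs another site's routes if it imports any RT the sender exports."""
    return bool(SITES[receiver]["import"] & SITES[sender]["export"])


def can_communicate(a, b):
    """Bidirectional traffic needs each side to hold a route to the other."""
    return a != b and learns_routes_from(a, b) and learns_routes_from(b, a)
```

With these sets, any two NOCs can communicate, every NOC can reach every remote POP, and two remote POPs cannot reach each other directly.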
[Figure: NOC1, NOC2, and NOC3 connect hub/hub across the CSC core with RT import 65000:1, RT export 65000:1, and RT import 65000:2; remote PEs connect hub/spoke with RT import 65000:1 and RT export 65000:2]
A full VMV mesh everywhere was avoided because it adds unnecessary security risk with no
operational benefit. Remote sites did not have firewalls or other security appliances (discussed
later), and many were located on customer premises. If one was compromised, we did not want
attackers to perform leap-frog attacks by traversing laterally through the VMV to attack other
remote sites. Because all of the NOCs had extensive security defenses, any attacks due to
compromise would have to fight through those upstream defenses first. These defenses are
discussed later in this document.
The VMV has the additional advantage of being completely inaccessible via the global routing
table and the CSC carrier. Even if the CSC carrier accidentally adds the wrong site into your
VPN, the worst possible outcome would be a compromise of the GMV, which has no
management access to any device. As mentioned earlier, the GMV exists for visibility and
troubleshooting correlation only. The best security plans often mix control-plane, data-plane, and
management-plane techniques into a unified defense.
Because both the GMV and VMV were tracked concurrently using network visualization
software, operators could form hypotheses about network issues before touching a keyboard.
This helped reduce mean time to repair (MTTR) for common outages. The matrix below
illustrates how the global and VPN management views intersect. For example, if the GMV
reports that a remote POP is reachable, but the VMV does not, the cause is likely related to the
BGP VPN control-plane. Perhaps the BGP sessions are down completely or a route-target has
not been properly imported/exported. The data-plane and label switched path are known-good,
otherwise the GMV would also report an outage.
Table 4 - Global and VPN Management Outage Matrix
GMV down Impossible case; misconfig Transport control or data plane issue
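The correlation logic behind this matrix can be written as a small lookup. This is a sketch under assumptions: the function name is hypothetical, the "no outage" result for the case where both views are healthy is assumed, and the mapping of the two "GMV down" cells to VMV up and VMV down (in that order) is inferred from the reasoning in the text.

```python
def diagnose(gmv_reachable: bool, vmv_reachable: bool) -> str:
    """Map the two management views to a likely fault domain."""
    if gmv_reachable and vmv_reachable:
        return "no outage"  # assumed; not stated in the matrix row shown
    if gmv_reachable and not vmv_reachable:
        # Transport LSP is known-good, so suspect BGP sessions or route-targets.
        return "BGP VPN control-plane issue"
    if not gmv_reachable and vmv_reachable:
        # The VPN rides the transport, so this should not occur.
        return "impossible case; misconfig"
    return "transport control or data plane issue"
```

For example, `diagnose(True, False)` points operators at the BGP VPN control-plane before they touch a keyboard.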
[Figure: NOC1, NOC2, and NOC3 across the CSC core; local NOC phones; long-local via SIP; call control servers; marking rules: SIP signaling CS5/EXP5, voice bearer EF/EXP5, all else CS2/EXP2]
Additionally, it records the DHCP bindings for each client, including the issued IP address, client
MAC, physical interface, and VLAN. Because the DHCP servers are in a different VLAN than
the clients they serve, the PEs act as DHCP relays to facilitate the DHCP messaging.
Given these DHCP snooping bindings, DAI can validate ARP messages between hosts on the
subnet. If a client tries to spoof an ARP message, effectively pretending to be another host, DAI
will block and log the offense. IPSG further reinforces security by using the DHCP snooping
bindings to ensure clients cannot spoof IP packets. The source IP and source MAC must match
the bindings, and if they do not, the packets are discarded and optionally logged. It is commonly
believed that DHCP is a security liability (and that static IP addressing is “more secure”). This is
simply false; DHCP is a security asset when combined with first-hop security techniques like
DAI and IPSG. The diagram below illustrates how these technologies work together to provide
LAN security.
Figure 43 - Layer-2 Defense in Depth Security Design
[Figure: NOC clients (MAC 0000.0000.0001, IP 192.168.1.1 and MAC 0000.0000.0002, IP 192.168.2.2) attach to a PE acting as DHCP relay toward the DHCP server. DHCP scopes: VMV USER 192.168.1.0/24, VMV VOICE 192.168.2.0/24. Trusted ports: DHCP/DAI trusted, IPSG disabled. Untrusted ports: DHCP/DAI untrusted, IPSG enabled. DHCP snooping bindings: Gi1: MAC 0000.0000.0001, VLAN 10, IP 192.168.1.1, expiry 3600 sec; Gi2: MAC 0000.0000.0002, VLAN 20, IP 192.168.2.2, expiry 3543 sec]
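The way DAI and IPSG both lean on the DHCP snooping table can be sketched as follows. This is a simplified model, not switch behavior in full: the binding values come from the figure, while the trusted port name Gi0 and the function names are assumptions made for illustration. Lease expiry and logging are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Binding:
    mac: str
    ip: str
    vlan: int


# DHCP snooping bindings learned on untrusted, client-facing ports.
BINDINGS = {
    "Gi1": Binding("0000.0000.0001", "192.168.1.1", 10),
    "Gi2": Binding("0000.0000.0002", "192.168.2.2", 20),
}
TRUSTED_PORTS = {"Gi0"}  # hypothetical trusted port (DAI trusted, IPSG disabled)


def _matches_binding(port, mac, ip):
    binding = BINDINGS.get(port)
    return binding is not None and (binding.mac, binding.ip) == (mac, ip)


def dai_permits(port, sender_mac, sender_ip):
    """DAI: ARP sender fields on untrusted ports must match the binding."""
    return port in TRUSTED_PORTS or _matches_binding(port, sender_mac, sender_ip)


def ipsg_permits(port, src_mac, src_ip):
    """IPSG: source MAC and IP of data packets must match the binding."""
    return port in TRUSTED_PORTS or _matches_binding(port, src_mac, src_ip)
```

A client on Gi1 that claims 192.168.2.2 in an ARP message, or sources IP packets from it, fails both checks because its binding only covers 192.168.1.1.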
While powerful, these technologies do nothing to measure the authenticity of each client.
Attackers who gain physical access to our facility or NOC personnel who connect unauthorized
devices should not be able to join the network at all, even if they don’t intend on causing
mischief. To solve this problem, we deployed 802.1X for network access control. Because we
did not have a Public Key Infrastructure (PKI) nor any degree of PKI operational experience
within our team, we opted to use Protected Extensible Authentication Protocol (PEAP) with
Microsoft Challenge Handshake Authentication Protocol version 2 (MS-CHAPv2) for authentication. The
PEAP outer method relies on a one-way certificate trust (client trusts server) to establish a secure
TLS connection with the RADIUS authentication server. Once established, the supplicant
provides its credentials (in our case, a per-machine username/password) to the authentication
server using the MS-CHAPv2 inner method. In our view, this approach provided an
operationally sustainable, moderate security posture. With a proper PKI deployed, EAP-TLS
would have been a superior option as each client would have its own client certificate for
authentication, providing a stronger access control solution in general. Some operators use MAC
Authentication Bypass (MAB) instead of 802.1X, which is significantly less secure, but often
better than nothing.
Our VoIP phones were exclusively Cisco and thus supported a variety of 802.1X EAP methods.
We opted to use EAP-TLS using the Manufacturer Installed Certificate (MIC). This is
hard-coded into each device and is signed by a Cisco Certificate Authority (CA). Assuming the
authentication server trusts the Cisco CA, the phones can authenticate using EAP-TLS. Note that
the EAP-TLS + MIC technique only guarantees that the phone is a Cisco IP phone. An attacker
could plug in a compromised Cisco IP phone, which would pass 802.1X authentication, and
launch an attack. We saw this as an unlikely attack vector as physical security was relatively
tight in our NOCs. Combined with all the other security defenses described earlier, EAP-TLS +
MIC was a good design choice because it balanced security and operational simplicity.
Note that 802.1X was not enabled for any servers, physical or virtual, and was limited to VMV
users and VMV IP phones. Because GMV users are already quite limited in what they could
accomplish on the network, 802.1X was not implemented for them.
The diagram below illustrates the high-level 802.1X design within a NOC. Note that the devices
were not able to send any traffic into the network, other than traffic relating to 802.1X, until
authentication was complete.
Figure 44 - 802.1X for NOC Users and IP Phones
[Figure: NOC and POP PEs; VMV VOICE authenticates with EAP-TLS + MIC on 802.1X-enabled ports; VMV SERVER (RADIUS)]
As was true for all critical services, such as DHCP, RADIUS, and TACACS, each NOC had at
least one of each server. This allowed clients to operate correctly at any NOC even if the local
servers at that NOC were offline.
customers. This service is delivered by simply managing L3VPNs as one would normally do.
Our most common applications include:
a. A headquarters location hosting centralized services connecting to remote sites
b. Disparate mobile elements (such as vehicles or field expedient tents) that are part of the
same organization needing to communicate laterally across the world
The diagram below illustrates some of these examples, which due to the multi-tenant design of
MPLS VPNs, can all be supported concurrently. Because this use case has already been
extensively explained, this document will not detail it further. Note that mobile customers used
IPsec VPNs or other secure transport technologies to connect to PEs whereby the IPsec tunnel is
the logical PE-CE link. Other than MTU calculations and some platform-specific QoS
limitations, this minor deviation is not significant to the overall service offering.
Figure 46 - Use Case: Connecting Geographically Dispersed Nodes
[Figure: POPs interconnected across the CSC core; HQ backhaul: HQ, SITE1, SITE2; mobile comms: VEH1, VEH2, VEH3]
At the time of this writing, the scourge of Coronavirus 2019 (COVID-19) was present in global
life. This latter use case regarding mobile elements could be potentially utilized by pop-up
classrooms, office spaces, legal courts, prisons, or any other disaggregated business attempting to
socially distance. This also applies to general-purpose disaster relief, emergency
communications, and various types of mobile units.
[Figure: HQ and VEH1 have direct L2 connectivity via an L2VPN over MPLS between the gateway PE and the SATCOM PE across the CSC core]
When there are many mobile users connecting to the same SATCOM POP to access services in
the gateway POP, creating individual point-to-point layer-2 VPNs for each customer is
burdensome and scales poorly, especially without automation. Instead, a single EV-LINE
QinQ-based L2VPN can transport multiple connections. For example, suppose there are 3 different
mobile users that connect to a given SATCOM site using a point-to-point wireless
technology. Those SATCOM modems are each placed in different access VLANs numbered 11,
12, and 13. Thus, the higher-level IP addressing on each link is a different subnet. If all three
SATCOM connections terminate on the same gateway PE, they can all “ride together”,
somewhat analogous to carpooling. A single EV-LINE circuit could transport all of them, and
from the logical perspective of the headquarters node, it would be a hub/spoke network with
three different Ethernet point-to-point links. The VLAN tags would be retained end-to-end
allowing the CE router to peel off each VLAN using routed subinterfaces. The diagram below
illustrates this scaling technique.
Figure 48 - Using a Single L2VPN to Connect Multiple Sites
[Figure: VEH1, VEH2, and VEH3 reach the SATCOM PE; VLAN tags are retained across the L2VPN to the gateway PE; HQ terminates routed subinterfaces Eth 0/0.11, Eth 0/0.12, and Eth 0/0.13]
In our experience, different POPs did not always have the same VLANs available for transport,
and individually remarking VLANs on a per L2VPN basis does not scale. Sometimes, the
VLANs overlapped between sites, implying that VLAN transparency cannot work. Furthermore,
it was a common occurrence that all 3 SATCOM clients were terminating on the same PE. We
used QinQ to add additional encapsulation to represent the “site selector”. The three SATCOM
VLANs would be trunked to the ingress PE as discussed before with their VLAN encapsulation
retained. The ingress PE adds a new “site selector” VLAN tag to the Ethernet frame before
imposing MPLS encapsulation. The egress PE would preserve (i.e., not remove) this extra QinQ
tag, ensuring that the QinQ VLAN was provisioned across its last-mile switching fabric between
PE and CE. The router terminating all of the SATCOM links could then match both the outer tag
(QinQ site selector VLAN) and the inner tag (SATCOM customer VLAN) using routed
subinterfaces. In our experience, we found this to be a suitable solution for scaling L2VPNs
while also not requiring sites to synchronize their VLAN numbering, allocation, and
consumption schemes. The diagram below illustrates this QinQ “site selector” technique.
Although not depicted for brevity, there were often many layer-2 switches between the PE and
customer headquarters node, making this technique useful.
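The tag handling in the "site selector" technique can be sketched as a simple model of the VLAN stack. This is an illustration only: MPLS encapsulation and the intermediate layer-2 switches are omitted, the function names are hypothetical, and the outer VLAN 500 and subinterface names are example values matching the customer VLANs 11, 12, and 13 used earlier.

```python
SITE_SELECTOR_VLAN = 500  # example outer (QinQ) tag representing the SATCOM site

# Headquarters router: (outer tag, inner tag) to routed subinterface.
HQ_SUBINTERFACES = {
    (500, 11): "Eth0/0.11",
    (500, 12): "Eth0/0.12",
    (500, 13): "Eth0/0.13",
}


def ingress_pe_push(vlan_stack, outer=SITE_SELECTOR_VLAN):
    """Ingress PE keeps the customer VLAN and pushes the site selector on top."""
    return [outer] + list(vlan_stack)


def egress_pe_forward(vlan_stack):
    """Egress PE preserves (does not pop) the site selector tag."""
    return list(vlan_stack)


def hq_classify(vlan_stack):
    """HQ router matches outer and inner tags to pick a routed subinterface."""
    if len(vlan_stack) < 2:
        return None
    return HQ_SUBINTERFACES.get((vlan_stack[0], vlan_stack[1]))
```

Because the outer tag identifies the site, two SATCOM sites can reuse the same inner VLAN numbers without colliding at the headquarters router.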
[Figure: VEH1, VEH2, and VEH3 reach the SATCOM PE, which pushes VLAN 500; the gateway PE preserves VLAN 500; HQ routed subinterfaces: Eth 0/0.11 (outer 500, inner 11), Eth 0/0.12 (outer 500, inner 12), Eth 0/0.13 (outer 500, inner 13)]
The second option has slightly less scale and still exposes the customer carrier directly to the
Internet, but is far more convenient for customers. They’ll use their BGP uplink both for VPN
routes and for Internet routes as the route leaking between tables happens on the PE. If there are
many customers on a single PE that require full Internet tables, the routes will have to be copied
in memory with new RDs, potentially taxing the router’s memory.
The third option is the most secure, easiest to conceptualize/operate, but the least scalable.
Because the Internet connectivity always exists in a VPN, there are no infrastructure Access
Control Lists (ACLs) to manage and very little chance of an Internet attack reaching the
customer carrier’s network. Scale is poor considering each customer that imports the Internet RT
will copy routes from the RD-indexed table into the VRF-specific table on a given PE. Some
platforms may implement memory optimization techniques here, but this generally isn’t a safe
assumption.
This Internet-in-a-VPN approach was our best choice, and we decided to improve the scalability
by only allowing our customers to follow a default route towards the Internet. No customers
received any longer matches for any Internet prefix. The ASBRs touching the Internet
maintained full routing tables, but generated local 0.0.0.0/0 and ::/0 aggregates for advertisement
into BGP VPNv4/v6. This allowed us, in emergency situations, to advertise more specific
Internet routes into BGP if a customer required it (a rare occurrence indeed).
To provide Internet connectivity to a customer, we defined a pair of Internet RTs for import and
export that would grant access in a hub/spoke fashion. In a given customer VRF, the operators
would import the Internet RT to receive the default route and export the Internet RT so that the
Internet VRF would import the customer’s routes. This hub/spoke RT exchange has been
detailed earlier in the document in other contexts, but the same logic applies here.
To provide high availability, the carrier should have multiple Internet uplinks with multiple
default routes originated. The simplest and most automatic way to handle this is to rely on the
BGP AS-path length to select the shortest egress path. If every region has one Internet
connection (which was generally true for us), then the following is true:
1. Customers attached to the regional POP hosting the Internet connection will see an AS-
path length of 2:
a. ISP ASN
b. Locally-connected regional POP ASN
2. Customers attached to satellite POPs within that region will see an AS-path length of 3:
a. ISP ASN
b. Parent regional POP ASN
c. Satellite POP ASN
3. Customers attached to regional POPs that currently have a broken Internet connection
will see an AS-path length of 3:
a. ISP ASN
b. Internet-connected remote regional POP ASN
c. Locally-connected regional POP ASN
4. Customers attached to satellite POPs within a region that currently has a broken Internet
connection will see an AS-path length of 4:
a. ISP ASN
[Figure: the Internet (BGP AS 100) advertises 0.0.0.0/0 with AS path 100 to the Region 1 and Region 2 POPs; eBGP VPN across the CSC core; the Region 1 PE sees 65000:999:0.0.0.0/0 with AS path 65001 100 and the Region 2 PE sees 65000:999:0.0.0.0/0 with AS path 65002 100]
Suppose the Region 1 Internet connection fails. Customers within that region, both within the
regional POP and connected satellite POPs, would be able to consume the Internet connection
via Region 2. Note that the Option C over CSC design allows these satellite POPs to route
directly to Region 2 (not transiting Region 1) due to the eBGP next-hop preservation feature. The
diagram below illustrates this failover.
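The AS-path based failover can be sketched as a comparison of candidate default routes. This is a deliberately reduced model: it considers AS-path length only and ignores every other BGP best-path step, the function names are hypothetical, and the ASNs (100 for the ISP, 65001 and 65002 for the regions) are example values. Paths are written as a customer attached to the Region 1 regional POP would see them.

```python
ISP_ASN, REGION1_ASN, REGION2_ASN = 100, 65001, 65002  # example ASNs


def region1_customer_candidates(region1_uplink_up, region2_uplink_up):
    """Default-route AS paths available to a Region 1 regional POP customer."""
    paths = []
    if region1_uplink_up:
        paths.append((REGION1_ASN, ISP_ASN))               # local Internet exit
    if region2_uplink_up:
        paths.append((REGION1_ASN, REGION2_ASN, ISP_ASN))  # exit via Region 2
    return paths


def best_default(paths):
    """Pick the shortest AS path; None means no default route remains."""
    return min(paths, key=len) if paths else None
```

In steady state the two-hop local path wins. When the Region 1 uplink fails, only the three-hop path through Region 2 remains, so traffic fails over without operator action.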
[Figure: the Region 1 (BGP AS 65001) uplink to the Internet (BGP AS 100) fails; Region 2 (BGP AS 65002) still receives 0.0.0.0/0 with AS path 100 via BGP IPv4/v6 in the Internet VRF and advertises 65000:999:0.0.0.0/0 with AS path 65002 100 over eBGP VPN across the CSC core; the Region 1 PE sees AS path 65001 65002 100 and the Region 2 PE sees AS path 65002 100]
For carriers offering “wires only” service, the expectation is customers have already advertised
Internet-routable prefixes to the carrier via BGP. Internet-destined traffic must be sourced from
one of these Internet-routable prefixes, whether provider aggregate (PA) or provider independent
(PI), in order for return traffic to function correctly. In managed service providers, a perimeter
security stack that inspects all Internet traffic and performs NAT for customers is sometimes
deployed. Both of these topics are out of scope for this document. Note that NAT is important as
stateful firewalls will expect traffic flows to be symmetric. That is to say, if traffic egresses
through region 1, return traffic must ingress through region 1. This becomes more challenging
with IPv6 Internet traffic unless Network Prefix Translation for IPv6 (NPTv6) is used.
security posture while using the carrier for purely transport reasons, which is true in commercial
WAN designs.
To support multiple customers, the PEs servicing the data center would typically use a different
VLAN+VRF combination per tenant with individual BGP sessions connecting the virtual CE
router and the PE. The PE-CE links in the diagram illustrate these logical connections. Although
the vast majority of customers prefer to only advertise their IPsec tunnel source (perhaps a
loopback or just the connected network), other customers may forego IPsec and just advertise the
backend server networks directly. Both designs are supported and given the similarities with
standard L3VPN use cases, there aren’t many special design considerations with respect to
routing. The diagram below illustrates this conceptual design.
Figure 52 - Managed IaaS High-level Design
[Figure: V-HQ CE and customer CE attach to PEs across the CSC core using standard L3VPN; connected networks are advertised into BGP VPNv4/v6; IPsec tunnel to the virtual router]
3. Complexity Assessment
This section objectively addresses the complexity of each solution using the
State/Optimization/Surface (SOS) model. This model was formalized by White and Tantsura
(“Navigating Network Complexity: Next-generation routing with SDN, service virtualization,
and service chaining”, R. White / J. Tantsura Addison-Wesley 2016) and is used as a quantifiable
measurement of network complexity. This section is relevant when comparing this solution to
more traditional MPLS deployments, such as those not using CSC or Option C. It is also relevant
when analyzing the different options within the aforementioned design regarding routing, QoS,
and management design decisions.
3.1. State
State quantifies the amount of control-plane data present and the rate at which state changes in
the network. While generally considered something to be minimized, some network state is
always required. The manner in which a solution scales, typically with respect to time and/or
space complexity, is a good measurement of network state.
This solution has relatively low overall state as the different layers of hierarchy contain
different sets of information:
1. Option C is highly scalable because it does not require ASBRs (CSC-CEs in our case) to
retain all VPN routes. Because our CSC-CEs happened to be RRs, and because the
number of CSC-CEs was typically equal to the number of RRs within a POP, our
particular environment did not benefit from the scaling advantage with respect to state.
2. CSC is highly scalable because it decouples transport networks, such as PE and RR
loopbacks, from customer networks. This is beneficial for both the customer carrier P
routers and the entire core carrier’s network, including the CSC-PEs.
3. In general, using IGP+LDP is a scalable approach when compared to RSVP-TE as a
strategic tool for building and maintaining LSPs. Considering the number of LSPs within
each POP was small, and all inter-POP LSPs were governed by BGP-LU, we did not
benefit much from this scaling advantage, although it did exist.
Various other decisions contributed positively towards reducing state in the network:
1. Not peering the A/B mesh RRs within each POP
2. Aggregating the Internet table to default routes on the Internet-facing PEs
3. Requiring customers to perform their own NAT for Internet-destined traffic
4. Not allowing regions to “double advertise” inter-regional routes between one another
using community-based prefix filtering
5. Using hub/spoke VPNs versus any-to-any VPNs when appropriate (e.g. VMV)
6. Building a network capable of nearly unlimited growth over CSC, compared to a flat
E-LAN/VPLS design where IGP neighbor limits would likely restrict expansion
3.2. Optimization
Unlike state and surface, optimization has a positive connotation and is often the target of any
design. Optimization is a general term that represents the process of meeting a set of design goals
to the maximum extent possible; certain designs will be optimized against certain criteria.
Common optimization designs will revolve around minimizing cost, convergence time, and
network overhead while maximizing utilization, manageability, and user experience.
With respect to traffic forwarding, all LSPs are optimal in that there is no hair-pinning. For
example, intermediate POPs are never in the transit path, thanks to the any-to-any CSC transport
combined with Option C. Furthermore, IP multicast is efficiently transferred without ingress
replication (as would be present for any non-EVPN style E-LAN service) across CSC between
sites.
One case where totally optimal IP forwarding may be jeopardized is the Internet access use case.
If a client connected to the western regional POP needs to access a server in the east, traffic still
must egress through the western regional POP. Assuming stateful firewalls and NAT do not exist
at the Internet edge, return traffic may ingress through the eastern regional POP. Only one of
these paths can objectively be “optimal”, implying that the other is suboptimal, but probably not
by much, assuming both links are operational.
The bigger issue happens when a western customer routes through the eastern regional POP to
reach a western server. This would only happen when the western regional POP Internet
connection is offline, but since this is a failover case and not the “steady state” of the network, it
isn’t a major demerit with respect to optimization analysis.
3.3. Surface
Surface defines how tightly intertwined components of a network interact. Surface is a two-
dimensional attribute that measures both breadth and depth of interactions between said
components. The breadth of interaction is typically measured by the number of places in the
network some interaction occurs, whereas the depth of interaction helps describe how closely
coupled two components operate.
The transport and VPN architectures are highly decoupled. The former is built on IGP+LDP and
BGP-LU to establish connectivity between MPLS routers and the latter is based on a different
BGP VPN topology. This improves scale (discussed earlier) and allows the two topologies to
evolve at different paces and in different ways.
The transport architecture itself is comprised of two tightly integrated components: IGP+LDP
within the POP and eBGP-LU from CSC-CE to CSC-PE for inter-POP connectivity. Route
redistribution occurs on the CSC-CE to connect these two different label switching
environments. This surface interaction is wide as it occurs on every regional POP and is also
deep because large quantities of data (routes and corresponding labels, etc) are
redistributed/readvertised.
Extending iBGP-LU from CSC-CE to PE (discussed earlier) would eliminate this surface
interaction completely, but create new ones. For example, PEs would need to originate PIM
proxy vectors in leaf/spine POPs and label stack depths/MTUs would need to be recomputed.
This creates newer, and in our professional opinion, more complex surface interactions between
components.
Appendix A – Acronyms
Acronym Definition
AC Admission Control
AD Administrative Distance
AF Assured Forwarding
AH Authentication Header
AS Autonomous System
ASN AS Number
Bc Burst Committed
Be Burst Excess
CS Class Selector
DF Default Forwarding
EF Expedited Forwarding
IP Internet Protocol
IPP IP Precedence
IR Ingress Replication
LU Labeled Unicast
MP2P Multipoint-to-point
P2P Point-to-point
PA Provider Aggregate
PI Provider Independent
RD Route Distinguisher
RP Rendezvous Point
RR Route Reflector
RS Route Server
RT Route Target
SR Segment Routing
Tc Time Committed
TE Traffic Engineering
Appendix B – References