Introducing Cisco Programmable Fabric VXLAN EVPN (VXLAN/EVPN)
Introduction to VXLAN/EVPN
Introducing IP Fabric Overlays (VXLAN)
Motivation for an overlay
An overlay is a dynamic tunnel that transports frames between two endpoints. A switch-based overlay
architecture provides two key benefits for spine switches and leaf switches:
• Spine switch table sizes do not increase proportionately when end hosts (physical servers and VMs) are
added to the leaf switches.
• The number of networks/tenants that can be supported in the cluster can be increased by just adding more
leaf switches.
Note For easier reference, some commonly used terms are explained below:
• End host or server refers to a physical or virtual workload that is attached to a ToR switch.
• A ToR switch is also referred to as a leaf switch. Since the VTEP functionality is implemented on the ToRs,
a VTEP refers to a ToR or leaf switch enabled with the VTEP function. Note that the VTEP functionality
is enabled on all leaf switches in the VXLAN fabric and on border leaf/spine switches.
The encapsulation and decapsulation of VXLAN headers is handled by functionality embedded in VXLAN
Tunnel End Points (VTEPs). VTEPs themselves can be implemented in software or in a hardware form factor.
VXLAN natively operates on a flood-n-learn mechanism where BU (Broadcast, Unknown Unicast) traffic in
a given VXLAN network is sent over the IP core to every VTEP that has membership in that network. There
are two ways to send such traffic: (1) using IP multicast, or (2) using ingress replication (also known as head-end replication).
The receiving VTEPs will decapsulate the packet, and based on the inner frame perform layer-2 MAC learning.
The inner SMAC is learnt against the outer Source IP Address (SIP) corresponding to the source VTEP. In
this way, reverse traffic can be unicast toward the previously learnt end host.
Other motivations include:
1. Scalability — VXLAN provides Layer-2 connectivity that allows the infrastructure to scale to 16
million tenant networks. It overcomes the 4094-segment limitation of VLANs. This is necessary to address
today’s multi-tenant cloud requirements.
2. Flexibility— VXLAN allows workloads to be placed anywhere, along with the traffic separation required
in a multi-tenant environment. The traffic separation is done using network segmentation (segment IDs
or virtual network identifiers [VNIs]).
Workloads for a tenant can be distributed across different physical devices (since workloads are added
as the need arises, into available server space) but the workloads are identified by the same layer 2 or
layer 3 VNI as the case may be.
3. Mobility— You can move VMs from one data center location to another without updating spine switch
tables. This is because entities within the same tenant network in a VXLAN/EVPN fabric setup retain the
same segment ID, regardless of their location.
Overlay example:
The example below shows why spine switch table sizes do not increase with the VXLAN fabric overlay,
keeping them lean.
VM A sends a message to VM B (they both belong to the same tenant network and have the same segment
VNI). ToR1 recognizes that the source end host corresponds to segment x, searches and identifies that the
target end host (VM B) belongs to segment x too, and that VM B is attached to ToR2. Note that typically the
communication between VM A and VM B belonging to the same subnet would first entail ARP resolution.
ToR1 encapsulates the frame in a VXLAN packet, and sends it in the direction of ToR2.
The devices in the path between ToR1 and ToR2 are not aware of the original frame and route/switch the packet
toward ToR2.
ToR2 decapsulates the VXLAN packet addressed to it and performs a lookup on the inner frame. Through its end
host database, ToR2 recognizes that VM B is attached to it and belongs to segment x, and forwards the original
frame to VM B.
• VXLAN semantics are in operation from ToR1 to ToR2 through the encapsulation and decapsulation at
source and destination VTEPs, respectively. The overlay operation ensures that the original frame/packet
content is not exposed to the underlying IP network.
• The IP network that sends packets from ToR1 to ToR2 based on the outer packet's source and destination
addresses forms the underlay operation. By design, none of the spine switches need to learn the addresses
of end hosts below the ToRs. So, learning of hundreds of thousands of end host IP addresses by the spine
switches is avoided.
• In order to accurately route/switch packets between end hosts in the data center, each participating ToR
in a VXLAN cluster must be aware of the end hosts attached to it and also the end hosts attached to other
ToRs, in real time.
VXLAN-EVPN fabric— The overlay encapsulation is VXLAN and BGP uses EVPN as the address family for
communicating end host MAC and IP addresses; hence the fabric is referred to as a VXLAN-EVPN fabric.
More details on MP-BGP EVPN are noted in the Fabric Overlay Control-Plane (MP-BGP EVPN) section.
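As a point of reference, a minimal sketch of the NX-OS features that are typically enabled on a leaf switch
before configuring such a fabric is given below (this assumes a Cisco Nexus 9000 Series switch; the exact
feature set can vary by platform and release):
(config) #
nv overlay evpn
feature bgp
feature vn-segment-vlan-based
feature nv overlay
feature interface-vlan
feature fabric forwarding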
Traffic between servers in the same tenant network that is confined to the same subnet is bridged. In this case,
the VTEPs stamp the layer-2 VNI in the VXLAN header when the communication is between servers that
are below different ToRs. The forwarding lookup is based on (L2-VNI, DMAC). For communications between
servers that are part of the same tenant but belong to different networks, routing is employed. In this case, the
layer-3 VNI is carried in the VXLAN header when communication is between servers below different ToRs.
This approach is referred to as the symmetric IRB (Integrated Routing and Bridging) approach; the symmetry
comes from the fact that VXLAN encapsulated routed traffic in the fabric carries the same layer-3 VNI from
source to destination and vice-versa. This is shown in the figure below.
In the above scenario, traffic from a server (with layer-2 VNI x) on VTEP V1 is sent to a server (with layer-2
VNI y) on VTEP V2. Since the VNIs are different, the layer-3 VNI (unique to the VRF) is used for
communication over VXLAN between the servers.
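For illustration, a minimal sketch of how a layer-3 VNI is associated with a VRF on a VTEP is given below
(VRF-A, VLAN 500, and VNI 50000 match the sample values used later in this chapter; the core-facing SVI
with ip forward is an assumption based on typical NX-OS deployments):
(config) #
vrf context VRF-A
  vni 50000

(config) #
vlan 500
  vn-segment 50000

(config) #
interface vlan 500
  no shutdown
  vrf member VRF-A
  ip forward
With this in place, routed traffic between layer-2 VNIs x and y is encapsulated with the layer-3 VNI 50000
in both directions, which is what makes the IRB approach symmetric.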
• VM Mobility Support
• The control plane supports transparent VM mobility within and across VXLAN BGP EVPN fabrics,
and quickly updates reachability information to avoid hair-pinning of east-west traffic.
• The distributed anycast gateway also aids in supporting transparent VM mobility since post VM
move, the ARP cache entry for the default gateway is still valid.
• Equal-cost multipath (ECMP) routing is supported for access to aggregation (leaf switch to spine switch)
connectivity, promoting a highly available fabric.
• Secure VTEPs
In a VXLAN-EVPN fabric, traffic is only accepted from VTEPs whose information is learnt via the
BGP-EVPN control plane. Any VXLAN encapsulated traffic received from a VTEP that is not known
via the control plane will be dropped. In this way, the fabric only forwards traffic between VTEPs that are
validated by the control plane. This closes a major security hole present in data-plane based VXLAN
flood-n-learn environments, where a rogue VTEP has the potential of bringing down the overlay network.
• BGP specific motivations
• Increased flexibility— EVPN address family carries both Layer-2 and Layer-3 reachability
information. So, you can build bridged overlays or routed overlays. While bridged overlays are
simpler to deploy, routed overlays are easier to scale out.
• Increased security— BGP authentication and security constructs provide more secure multi-tenancy.
• Improved convergence time— BGP being a hard-state protocol is inherently non-chatty and only
provides updates when there is a change. This greatly improves convergence time when network
failures occur.
• BGP Policies— Rich BGP policy constructs provide policy-based export and import of reachability
information. It is possible to constrain route updates to where they are needed, thereby realizing a
more scalable fabric.
• Advantages of route reflectors— Route reflectors increase scalability and reduce the need for a full mesh
of BGP sessions.
A route reflector in an MP-BGP EVPN control plane acts as a central point for BGP sessions between
VTEPs. Instead of each VTEP peering with every other VTEP, the VTEPs peer with a spine device
designated as a route reflector. For redundancy purposes, an additional route reflector is designated.
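For illustration, a minimal iBGP EVPN peering sketch is given below (the AS number 65000, loopback0, and
the neighbor addresses are assumed sample values, not values mandated by the fabric):
Spine switch (route reflector):
(config) #
router bgp 65000
  neighbor 10.10.10.1
    remote-as 65000
    update-source loopback0
    address-family l2vpn evpn
      send-community extended
      route-reflector-client
Leaf switch (VTEP):
(config) #
router bgp 65000
  neighbor 10.10.10.201
    remote-as 65000
    update-source loopback0
    address-family l2vpn evpn
      send-community extended
Each leaf switch needs only one such neighbor statement per route reflector, regardless of how many VTEPs
join the fabric.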
• MP-BGP also distributes subnet routes and external reachability information between VTEPs. When
VTEPs obtain end host routes of remote end hosts attached to other VTEPs, they install the routes in
their RIB and FIB.
Note that the end host route distribution is decoupled from the underlay protocol.
One tenant network, one Layer-2 VNI, and one default gateway IP and MAC address
Since end hosts in a tenant network might be attached to different VTEPs, the VTEPs are made to share a
common gateway IP and MAC address for intra-tenant communication.
If an end host moves to a different VTEP, the gateway information remains the same and reachability
information is available in the BGP control plane.
All VTEPs host active default gateways for their respective configured subnets, and first hop redundancy
protocols (FHRPs) such as HSRP and VRRP are not needed.
A sample distributed gateway for a setup, and the associated configurations are given below:
Figure 6: Distributed Gateway
(config) #
vlan 43
  vn-segment 30000

(config) #
fabric forwarding anycast-gateway-mac <anycast-gateway-mac>
(The anycast gateway MAC, inherited by any interface (SVI) using "fabric forwarding")

(config) #
interface vlan 43
  no shutdown
  vrf member VRF-A
  ip address <subnet-gateway-ip>/24 tag 12345
  fabric forwarding mode anycast-gateway

(config) #
vlan 55
  vn-segment 30001

(config) #
interface vlan 55
  no shutdown
  vrf member VRF-A
  ip address 10.98.98.1/24 tag 12345
  fabric forwarding mode anycast-gateway
In the above example, a gateway is created for each of the two tenant networks (Blue – L2 VNI 30000 and Red
– L2 VNI 30001). End host traffic within a VNI (say 30000) is bridged, and traffic between tenant networks
is routed. The routing takes place through a layer-3 VNI (say 50000), which typically has a one-to-one
association with a VRF instance.
The VNI of the source end host, Host A, and the target end host, Host B, is 30000.
1. Host A sends traffic to the directly attached VTEP V1.
2. V1 performs a lookup based on the destination MAC address in the packet header (For communication
that is bridged, the target end host’s MAC address is updated in the DMAC field).
3. VTEP V1 bridges the packets and sends it toward VTEP V2 with a VXLAN header stamped with the
Layer 2 VNI 30000.
4. VTEP V2 receives the packets and, post decapsulation and lookup, bridges them to Host B.
Sample configurations for a setup with VNIs 30000 and 30001 are given below:
Configuration Example for VLAN, VNI, and VRF
VLAN to VNI mapping (MT-Lite)
(config) #
vlan 43
vn-segment 30000
vlan 55
vn-segment 30001
(config) #
vlan 500
  vn-segment 50000

(config) #
vrf context VRF-A
  vni 50000

(config) #
router bgp 65000
  vrf VRF-A
    address-family ipv4 unicast
      advertise l2vpn evpn
(config) #
evpn
  vni 30000 l2
    rd auto
    route-target both auto
(config-evpn) #
  vni 30001 l2
    rd auto
    route-target both auto
1. A VLAN is configured for each segment: the sending segment, the VRF segment, and the receiving segment.
2. BGP and EVPN configurations ensure redistribution of this information across the VXLAN setup.
ARP Suppression
The following section illustrates the ARP suppression functionality at VTEP V1 (refer to the ARP Suppression
image, given below). ARP suppression is an enhanced function configured under the layer-2 VNI (using the
suppress-arp command). Essentially, the IP-MAC bindings learnt locally via ARP, as well as those learnt over
BGP-EVPN, are stored in a local ARP suppression cache at each ToR. An ARP request sent from an end host is
trapped at the source ToR. A lookup is performed in the ARP suppression cache with the destination IP as
the key. If there is a HIT, the ToR responds on behalf of the destination with the destination's MAC address.
This is the case depicted in the image below.
In case the lookup results in a MISS, when the destination is unknown or a silent end host, the ToR re-injects
the ARP request received from the requesting end host and broadcasts it within the layer-2 VNI. This entails
sending the ARP request out locally over the server facing ports as well as sending a VXLAN encapsulated
packet with the layer-2 VNI over the IP core. The VXLAN encapsulated packet will be decapsulated by every
receiving VTEP that has membership within the same layer-2 VNI. These receiving VTEPs will then forward
the inner ARP frame toward the server-facing ports. Assuming that the destination is alive, the ARP request
will reach the destination, which in turn will send out an ARP response toward the sender. Since the ARP
suppression feature is enabled, the ARP response is trapped by the receiving ToR even though it is a unicast
packet directed to the source VM. The ToR will learn the destination IP/MAC and in turn advertise it over
BGP-EVPN to all the other ToRs. In addition, the ToR will re-inject the ARP response packet into the network
(VXLAN-encapsulating it toward the IP core, since the original requestor was remote) so that it reaches the
original requestor.
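A minimal sketch of enabling ARP suppression for a layer-2 VNI is given below (interface nve1 and VNI 30000
are sample values):
(config) #
interface nve1
  member vni 30000
    suppress-arp
The contents of the ARP suppression cache can then be inspected with the show ip arp suppression-cache
detail command.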
When a new end host (Host A) is attached to VTEP V1, the following actions occur:
1. VTEP V1 learns Host A's MAC and IP address (MAC_A and IP_A).
2. V1 advertises MAC_A and IP_A to the other VTEPs V2 and V3 through the route reflector.
3. The choice of encapsulation (VXLAN) is also advertised.
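Assuming the setup above, the advertised reachability information can be checked on a remote VTEP with
standard NX-OS commands such as the following (the exact output format varies by release):
show bgp l2vpn evpn
show l2route evpn mac-ip all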
Note VIP is a common, virtual VTEP IP address that is used for (unicast and multi-destination) communication
to and from the two switches in the vPC setup. The VIP address represents the two switches in the vPC setup,
and is designated as the next-hop address (for end hosts in the vPC domain) for reachability purposes.
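For illustration, the VIP is commonly realized as a secondary IP address on the NVE source-interface loopback
of both vPC peers (a minimal sketch with assumed addresses):
(config) #
interface loopback1
  ip address 10.1.1.1/32
  ip address 10.1.1.100/32 secondary
(The primary address is unique to each vPC peer; the secondary address is the shared VIP and is identical on
both peers.)

(config) #
interface nve1
  source-interface loopback1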
Multi-Destination Traffic
There are two options to transport tenant multi-destination traffic in the Programmable Fabric:
1. Through a shared multicast tree using PIM (ASM, SSM, or BiDir).
2. Through ingress replication (Available for Cisco Nexus 9000 Series switches only).
Refer to the table for the Nexus switch type to BUM traffic support option mapping:

If you are using this Nexus switch:              Use this option for BUM traffic:
Cisco Nexus 7000 and 7700 Series (F3 modules)    PIM ASM/SSM or PIM BiDir
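A minimal sketch of both options on a VTEP is given below (interface nve1, VNI 30000, and the multicast group
are sample values; a given VNI uses one option or the other, not both):
Option 1, shared multicast tree:
(config) #
interface nve1
  member vni 30000
    mcast-group 239.1.1.1
Option 2, ingress replication (Cisco Nexus 9000 Series only):
(config) #
interface nve1
  member vni 30000
    ingress-replication protocol bgp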