
DECEMBER 2023

VMWARE NSX® REFERENCE DESIGN GUIDE
Software Version 4.1


Table of Contents

1 Introduction
How to Use This Document and Provide Feedback
Networking and Security Today
NSX Architecture Value and Scope
Containers and Cloud Native Application Integrations with NSX
Role of NSX in the VMware multi-cloud platform

2 NSX Architecture Components
Management Plane and Control Plane
Management Plane
Control Plane
NSX Manager Appliance
Data Plane
NSX Consumption Model
NSX Policy API Framework
Policy API Usage Example 1 - Templatize and Deploy 3-Tier Application Topology
Policy API Usage Example 2 - Application Security Policy Lifecycle Management
When to use Policy vs Manager UI/API
NSX Logical Object Naming relationship between manager and policy mode
NSX Multi-tenancy Framework

3 NSX Logical Switching
The NSX Virtual Switch
Segments and Transport Zones
Uplink vs. pNIC
Teaming Policy
Uplink Profile
Creating a transport node
Transport Node Profiles
Network I/O Control v3
Virtual Switch Advanced Performance Settings
Logical Switching
Overlay Backed Segments
Flooded Traffic
Unicast Traffic
Data Plane Learning
Tables Maintained by the NSX Controller
Overlay Encapsulation
Tunnel End Point High Availability
Multi-TEP High Availability
Bridging Overlay to VLAN with the Edge Bridge
Overview of the Capabilities

4 NSX Logical Routing
Single Tier Routing
Distributed Router (DR)
Services Router
Two-Tier Routing
Interface Types on Tier-1 and Tier-0 Gateway
Route Types on Tier-0 and Tier-1 Gateways
Fully Distributed Two Tier Routing
Routing Capabilities
Static routing
Border Gateway Protocol (BGP)
Open Shortest Path First version 2 (OSPFv2)
Routing protocol recommendation in the data center
IPv6 Routing Capabilities
Services High Availability
Active/Active
Active/Standby
Stateful active/active gateways
NSX Edge High availability failover triggers
VRF Lite
VRF Lite Generalities
VRF Lite HA
Connecting Tier-1 gateways to VRFs
VRF route leaking
VRFs shared Internet access over a common VLAN
Multi-Protocol BGP EVPN - VXLAN
MP-BGP EVPN for the Control Plane
Multicast routing within NSX
Multicast in a virtual environment
NSX multicast capabilities
Technical overview
NSX multicast design recommendations
Static RP vs. BSR
Edge Node
Types of Edge Nodes
Multi-TEP support on Edge Node
Bare Metal Edge Node
VM Edge Node
Edge Cluster
Failure Domain
Other Network Services
Unicast Reverse Path Forwarding (uRPF)
Network Address Translation
DHCP Services
Metadata Proxy Service
Gateway Firewall Service
Proxy ARP
Topology Consideration
Supported Topologies
Unsupported Topologies

5 NSX Security
NSX Security Use Cases
NSX DFW Architecture and Components
Management Plane
Control Plane
Data Plane
NSX Data Plane Implementation
Data Plane Components
NSX DFW Policy Lookup and Packet Flow
NSX Security Policy - Plan, Design and Implement
Security Policy Methodology
Security Rule Model
Security Policy - Consumption Model
Intrusion Detection
Service Insertion
Additional Security Features
NSX Security Enforcement – Agnostic to Network Isolation
NSX Distributed Firewall for VLAN Backed workloads
NSX Distributed Firewall for Mix of VLAN and Overlay backed workloads
NSX Distributed Firewall for Overlay Backed workloads
Gateway Firewall
Consumption
Implementation
Deployment Scenarios
Endpoint Protection with NSX
Recommendation for Security Deployments
A Practical Approach to Building Micro-Segmentation Policy
NSX Firewall - For All Deployment Scenarios

6 NSX Load Balancer
NSX Load Balancer Roadmap
Load Balancing Overview
NSX Load Balancing Architecture
NSX Load Balancing deployment modes
In-line load balancing
One-arm load balancing
NSX load-balancing technical details
Load-balancer high-availability
Load-balancer monitor
Load-balancer Layer 4 or Layer 7 load balancing
Load-balancer IPv6
Load-balancer traffic flows
Load-balancing combined with SR services (NAT and Firewall)

7 NSX Design Considerations
NSX Deployment Models
Role of overlay networking in the modern datacenter
Distributed security model, dvpgs vs NSX VLAN segments
Physical Infrastructure of the Data Center
Physical Network Requirements
Physical fabric considerations
NSX Infrastructure Component Connectivity
Overview
NSX Manager Node Availability and Hypervisor interaction
Physical placement considerations for the NSX Manager Nodes
Deployment Options for NSX Management Cluster
Compute Cluster Design
ESXi-Based Compute Hypervisor with two pNICs
ESXi-Based Compute Hypervisor with Four (or more) pNICs
Edge Node and Services Design
Bridging Use Case
Edge Connectivity Guidelines for Layer 3 Peering Use Case
Edge Connectivity Guidelines for Services Only Use Case
NSX Edge Resources Design
NSX Platform Design Consideration
NSX domains design guidelines
Network Topologies and Logical Design Guidelines
NSX Components Placement in a vSphere Deployment

8 NSX Performance & Optimization
Typical Data Center Workloads
Next Generation Encapsulation - Geneve
Performance Considerations
pNIC Feature Support
pNIC Queues
Number of pNICs
vNIC Queues
vCPU Count
Jumbo MTU for Higher Throughput
Checking MTU on an ESXi host
Performance Factors for NSX Edges
Data Plane Development Kit (DPDK)
SSL Offload
Key Performance Factors
Compute Node Performance Factors
VM Edge Node Performance Factors
Bare Metal Edge Node Performance Factors
Summary of Performance – NSX Components to NIC Features
Results
Benchmarking Tools
Compute
Edges
VM Edge
Bare Metal Edge
Conclusion

9 DPU-based Acceleration for NSX
DPU Architecture
Server requirements to enable offload to DPU
Offloading NSX to the DPU
Evolution of the NSX Datapath
Standard Datapath
High-performance Datapath Models
VM-modes available with the introduction of DPUs

Intended Audience
This document is targeted toward virtualization and network architects interested in deploying
VMware NSX® network virtualization solutions in a variety of on-premises environments.

Revision History

Version  Updates             Comments
1.0      None                NSX 2.0
2.0      Completely Revised  NSX 2.5
3.0      Completely Revised  NSX 3.0
3.2      Completely Revised  NSX 3.2
4.1      Completely Revised  NSX 4.1


1 Introduction
This document provides guidance and best practices for designing environments that leverage
the capabilities of VMware NSX®. It is targeted at virtualization and network architects
interested in deploying NSX solutions.

How to Use This Document and Provide Feedback


This document is organized into several chapters. Chapters 2 to 6 explain the architectural
building blocks of NSX as a full-stack solution, providing a detailed view of how NSX components
and features function and the scope of each. They also describe the components and functionality
utilized for security use cases. These chapters lay the groundwork to help understand and
implement the design guidance described in the design chapter.
The design chapter (Chapter 7) examines detailed use cases of network virtualization and
recommendations of either best practices or leading practices based on the type of use case or
design form factor. It offers guidance for a variety of factors including physical infrastructure
considerations, compute node requirements, and variably sized environments from small to
enterprise scale.
The performance chapter (Chapter 8) aims to separate myth from fact regarding the performance
of NSX-based SDN.
The DPU chapter (Chapter 9) outlines the recent innovations in the area of Data Processing
Units (DPUs) and how NSX benefits from them.
This document does not cover installation, operational monitoring, or troubleshooting.
For further details, review the complete NSX installation and administration guides.
Finally, starting with this design guide, readers are encouraged to send feedback to
NSXDesignFeedback_AT_groups_vmware_com (convert to email format).
The latest version of the document is always available on the VMware Community site:
https://communities.vmware.com/t5/VMware-NSX-Documents/VMware-NSX-T-Reference-Design/ta-p/2778093
This version has been published in September 2023 and aligns to NSX 4.1.

Networking and Security Today


In the digital transformation era, organizations are increasingly building custom applications to
drive core business and gain competitive advantages. The speed with which development
teams deliver new applications and capabilities directly impacts the organization’s success and
bottom line. This exerts increasing pressure on organizations to innovate quickly and makes
developers central to this critical mission. As a result, the way developers create apps, and the
way IT provides services for those apps, are evolving.
Application Proliferation
With applications quickly emerging as the new business model, developers are under immense
pressure to deliver apps in record time. This increasing need to deliver more apps in less
time can drive developers to use public clouds or open source technologies. These solutions
allow them to write and provision apps in a fraction of the time required with traditional
methods.
Heterogeneity
Application proliferation has given rise to heterogeneous environments, with application
workloads being run inside VMs, containers, clouds, and bare metal servers. IT departments
must maintain governance, security, and visibility for application workloads regardless of
whether they reside on premises, in public clouds, or in clouds managed by third-parties.
Cloud-centric Architectures
Cloud-centric architectures and approaches to building and managing applications are
increasingly common because of their efficient development environments and fast delivery of
applications. These cloud architectures can put pressure on networking and security
infrastructure to integrate with private and public clouds. Logical networking and security must
be highly extensible to adapt and keep pace with ongoing change.
Against this backdrop of increasing application needs, greater heterogeneity, and the
complexity of environments, IT must still protect applications and data while addressing the
reality of an attack surface that is continuously expanding.

NSX Architecture Value and Scope


VMware NSX is designed to address application frameworks and architectures that have
heterogeneous endpoints and technology stacks. In addition to vSphere and VMware public
clouds, these environments may include containers and bare metal operating systems. NSX
allows IT and development teams to choose the technologies best suited for their applications.
NSX is also designed for management, operations, and consumption by development
organizations in addition to IT.


Figure 1-1: NSX Anywhere Architecture

The NSX architecture is designed around four fundamental attributes. Figure 1-1 depicts the
universality of those attributes, which span from any site, to any cloud, and to any endpoint
device. This enables greater decoupling, not just at the infrastructure level (e.g., hardware,
hypervisor), but also at the public cloud and container level, all while maintaining the four key
attributes of the platform implemented across the domains. The architectural value and
characteristics of the NSX architecture include:
• Policy and Consistency: Allows a policy to be defined once and its desired end state to be
realized via RESTful API, addressing the requirements of today’s automated environments. NSX
maintains unique and multiple inventories and controls to enumerate desired outcomes across
diverse domains.
• Networking and Connectivity: Allows consistent logical switching and distributed routing
without being tied to a single compute manager/domain (e.g. vCenter server). The
connectivity is further extended across containers and clouds via domain specific
implementation while still providing connectivity across heterogeneous endpoints.
• Security and Services: Allows a unified security policy model as with networking
connectivity. This enables implementation of services such as load balancer, Edge
(Gateway) Firewall, Distributed Firewall, and Network Address Translation across multiple
compute domains. Providing consistent security between VMs and container workloads
in private and public clouds is essential to assuring the integrity of the overall framework
set forth by security operations.
• Visibility: Allows consistent monitoring, metric collection, and flow tracing via a common
toolset across compute domains and clouds. Visibility is essential for operationalizing
mixed workloads – VM-centric and container-centric – which typically have drastically different
tools for completing similar tasks.
These attributes enable the heterogeneity, app-alignment, and extensibility required to support
diverse requirements. Additionally, NSX supports DPDK libraries that offer line-rate stateful
services.
Heterogeneity
In order to meet the needs of heterogeneous environments, a fundamental requirement of NSX
is to be compute-manager agnostic. As this approach mandates support for multi-workloads, a
single NSX manager’s manageability domain can span multiple vCenters and multiple workload
types. When designing the management plane, control plane, and data plane components of
NSX, special considerations were taken to enable flexibility, scalability, and performance.
The management plane was designed to be independent of any compute manager, including
vSphere. The VMware NSX® Manager™ is fully independent; management of the NSX-based
network functions is accessed directly – either programmatically or through the GUI.
The control plane architecture is separated into two components – a centralized cluster and an
endpoint-specific local component. This separation allows the control plane to scale as the
localized implementation – both data plane implementation and security enforcement – is
more efficient and allows for heterogeneous environments.
The data plane was designed to be normalized across various environments. NSX introduces a
host switch that normalizes connectivity among various compute domains, including multiple
VMware vCenter® instances, containers, bare metal servers, and other off premises or cloud
implementations. This switch is referred to as the N-VDS. The functionality of the N-VDS switch was
fully implemented in the ESXi VDS 7.0 and later, which allows ESXi customers to take advantage
of full NSX functionality without having to change the virtual switch. Regardless of
implementation, data plane connectivity is normalized across all platforms, allowing for a
consistent experience.
App-aligned
NSX was built with the application as the key construct. Regardless of whether the app was
built in a traditional monolithic model or developed in a newer microservices application
framework, NSX treats networking and security consistently. This consistency extends across
containers and multi-hypervisors on premises, then further into the public cloud.


Containers and Cloud Native Application Integrations with NSX


In an age where a new application is directly tied to business gains, delays in application
deployment translate to lost revenue or business opportunity. The current era of digital
transformation challenges IT in addressing directives to normalize applications and data
security, increase the speed of delivery, and improve application availability. IT administrators
realize that a new approach must be taken to support business needs and meet timelines.
Architecturally solving the problem by explicitly defining connectivity, security, and policy as a
part of the application lifecycle is essential. Programmatic and automatic creation of network
and switching segments based on application-driven infrastructure is the only way to meet the
requirements of these newer architectures.
NSX is designed to address the needs of these emerging application frameworks and
architectures with heterogeneous endpoints and technology stacks. NSX allows IT and
development teams to choose the technologies best suited for their particular applications or
use cases without compromising consistent security and operations. NSX provides a common
framework to manage and increase the visibility of environments that contain everything from
physical servers to VMs and containers. As developers embrace newer technologies like
containers and the percentage of workloads running in public clouds increases, network
virtualization must expand to offer a full range of networking and security services (e.g., LB,
NAT, DFW, etc.) native in these environments. NSX solutions provide networking, security,
visibility, and Load Balancing services for container-based application platforms based on
Kubernetes such as Tanzu Kubernetes Grid (TKG) and Openshift, as well as managed
Kubernetes offerings running in the public cloud (Amazon Elastic Kubernetes Service – EKS,
Azure Kubernetes Service – AKS, and Google Kubernetes Engine – GKE).
The NSX portfolio includes multiple products with tight integration to deliver end-to-end
Kubernetes network and security services.
Antrea is an open-source Kubernetes-native networking and security solution that can be
installed in clusters running in private or public clouds and bare metal servers. The Antrea data
plane implementation is based on Open vSwitch. This choice makes it highly portable across
Linux and Windows operating systems and allows hardware offloading. Antrea provides a
comprehensive security policy model that builds upon Kubernetes network policies by
introducing the concepts of policy tiering, rule priorities, and cluster-level policies. Antrea
includes troubleshooting and monitoring tools for visibility and diagnostic capabilities such as
packet tracing, policy analysis, and flow inspection. Antrea instances running on multiple
clusters can be integrated with NSX to provide a consistent policy model and centralized
visibility across clusters, clouds, and workload form factors (containers, VM, bare metal).
Antrea is the default Container Network Interface (CNI) for Tanzu guest clusters and TKG.
VMware NSX Advanced Load Balancer (formerly Avi) dispenses Kubernetes load balancing and
ingress capabilities. The NSX ALB Kubernetes Ingress Services is optimized for North-South
traffic management, local and global server load balancing (GSLB), performance monitoring,
application security (WAF), and DNS/IPAM management. The NSX ALB Kubernetes Ingress
Services provides operational consistency regardless of which on-prem, private-cloud, or
public-cloud environment the Kubernetes cluster is running on.
The NSX Container Plug-in (NCP) provides direct integration with several vSphere based private
cloud environments where containerized applications could reside. The NSX Container Plugin
leverages the Container Network Interface (CNI) to interface with the container and allows NSX
to directly orchestrate networking, policy, and load balancing. NCP is frequently used with
application platforms and enterprise distributions of Kubernetes, notably VCF with Tanzu, Tanzu
Application Services, and Red Hat OpenShift.

Figure 1-2: Programmatic Integration with Various PaaS and CaaS

Role of NSX in the VMware multi-cloud platform


A multi-cloud deployment model relies on the use of more than one public or private cloud
service provider for compute, network, and storage resources. VMware delivers a
software-defined infrastructure, Platform-as-a-Service (PaaS), and management stack that can be layered
on top of any physical hardware layer on any cloud or data center. The software stack is based
on VMware Cloud Foundation (VCF) and includes vSphere, VSAN, and NSX as its core
components. It provides a unified approach to building, running and managing traditional and
modern apps on any cloud. This unique architectural approach provides a single platform that
can function across all application types and multiple cloud environments. NSX is a key strategic
asset for the VMware multi-cloud platform. Limited and fragmented public cloud native
network and security services are augmented by rich and uniform enterprise-grade capabilities
across any cloud.
The cloud operating model cuts across the traditional silos – networking, network security, load
balancing and endpoint protection solutions are designed, deployed and managed by an
increasingly integrated team or a set of integrated processes in the form of automation and
pipelines. In this world network and security become software that is defined in advance and is
integrated into the customer CI/CD pipeline. It is accessed programmatically by high level,
declarative or intent based APIs, and it is oriented to serving the needs of the application. The
NSX APIs allow customers to embrace the cloud operation model, where an entire workload,
including all network and security services, is launched with no human touch, while not
sacrificing key enterprise functionalities.
Virtualized networking separates the logical connectivity policies from the physical transport
layer. In a multi-cloud world, the transport layer is increasingly less available because it is
embedded into another cloud or running on the public Internet. Thus, virtual networking is
essential for making the VMware stack run on any provider’s hardware, for seamlessly
connecting the heterogeneous clouds of a modern enterprise, and presenting a uniform
consumption model across different providers.
Organizations are looking for a multi-cloud platform that delivers best in class workload
security. Thanks to NSX, VMware clouds can transparently insert services, to protect, manage
and operationalize applications at scale, and to have an intimate understanding of the end user
and the application context. This allows for unique features such as “Virtual patching” via the
NSX distributed IPS that protects individual vulnerable workloads before the application of a
security patch.
While all the VMware clouds run the same code base, some go a step further in terms of
simplification and application alignment. The NSX that is presented to VMware Cloud
customers is much simpler than the on-premises version because the underlying topology is
controlled by VMware and all that is required is to specify application-level policies. Since the
same software stack is deployed on any VMware cloud, operations such as the vMotion of a
workload and its attached firewall/security policy between private and public clouds are
available.


Figure 1-3: NSX role in the VMware Multi-cloud platform


2 NSX Architecture Components


NSX reproduces the complete set of networking services (e.g., switching, routing, firewalling,
load balancing, QoS) in software. These services can be programmatically assembled in
arbitrary combinations to produce unique, isolated virtual networks in a matter of seconds. NSX
works by implementing three separate but integrated planes: management, control, and data.
The three planes are implemented as sets of processes, modules, and agents residing on two
types of nodes: manager appliances and transport nodes.

Figure 2-1: NSX Architecture and Components

Management Plane and Control Plane

Management Plane
The management plane provides an entry point to the system for the API as well as the NSX
graphical user interface. It is responsible for maintaining user configuration, handling user queries, and
performing operational tasks on all management, control, and data plane nodes.
The NSX Manager implements the management plane for the NSX ecosystem. It provides an
aggregated system view and is the centralized network management component of NSX. NSX
Manager provides the following functionality:
• Serves as a unique entry point for user configuration via RESTful API (CMP, automation)
or NSX user interface.
• Responsible for storing desired configuration in its database. The NSX Manager stores
the final configuration request by the user for the system. This configuration will be
pushed by the NSX Manager to the control plane to become a realized configuration (i.e.,
a configuration effective in the data plane).
• Retrieves the desired configuration in addition to system information (e.g., statistics).


• Provides ubiquitous connectivity, consistent enforcement of security, and operational
visibility via object management and inventory collection for multiple compute
domains – up to 16 vCenter servers, container orchestrators (TAS/TKGI & OpenShift), and
clouds (AWS and Azure)
Data plane components, or transport nodes, run a management plane agent (MPA) that connects
them to the NSX Manager.

Control Plane
The control plane computes the runtime state of the system based on configuration from the
management plane. It is also responsible for disseminating topology information reported by
the data plane elements and pushing stateless configuration to forwarding engines.
NSX splits the control plane into two parts:
• Central Control Plane (CCP) – The CCP is implemented as a cluster of virtual machines
called CCP nodes. The cluster form factor provides both redundancy and scalability of
resources. The CCP is logically separated from all data plane traffic, meaning any failure
in the control plane does not affect existing data plane operations. User traffic does not
pass through the CCP Cluster.
• Local Control Plane (LCP) – The LCP runs on transport nodes. It is adjacent to the data
plane it controls and is connected to the CCP. The LCP is responsible for programming the
forwarding entries and firewall rules of the data plane.

NSX Manager Appliance


Instances of the NSX Manager and NSX Controller are bundled in a virtual machine called the
NSX Manager Appliance. In releases prior to 2.4, there were separate appliances based on
the roles: one management appliance and three controller appliances, for a total of four appliances
to be deployed and managed for NSX. Starting with 2.4, the NSX manager, NSX policy manager, and NSX
controller elements co-exist within a common VM. Three unique NSX appliance VMs are
required for cluster availability. NSX relies on a cluster of three such NSX Manager Appliances
for scaling out and for redundancy. This three-node cluster relies on a majority of the
appliances in the cluster being available; as such, attention should be paid to the placement of
these appliances for availability. The NSX Manager stores all of its information in an in-memory
database that is immediately synchronized across the cluster and written to disk;
configuration or read operations can be performed on any appliance. The performance
requirements of the NSX Manager Appliance’s disk are significantly reduced as most operations
occur in memory.
The benefits of this converged manager appliance include less management overhead, with
fewer appliances to manage, and a potential reduction in the total amount of resources
(CPU, memory, and disk). With the converged manager appliance, appliance sizing only needs
to be considered once.

Figure 2-2: NSX Manager and Controller Consolidation

Each appliance has a dedicated IP address and its manager process can be accessed directly or
through a load balancer. Optionally, the three appliances can be configured to maintain a
virtual IP address which will be serviced by one appliance selected among the three. The design
considerations for the NSX Manager appliance are further discussed in Chapter 7.

Data Plane
The data plane performs stateless forwarding or transformation of packets based on tables
populated by the control plane. It reports topology information to the control plane and
maintains packet level statistics. The hosts running the local control plane daemons and
forwarding engines implementing the NSX data plane are called transport nodes. Transport
nodes are running an instance of the NSX virtual switch called the NSX Virtual Distributed
Switch, or N-VDS.
On ESXi platforms, the N-VDS is built on top of the vSphere Distributed Switch (VDS). In fact,
the N-VDS is so close to the VDS that NSX 3.0 introduced the capability of installing NSX directly
on top of a VDS on ESXi hosts. For all other kinds of transport node, the N-VDS is based on
the platform independent Open vSwitch (OVS) and serves as the foundation for the
implementation of NSX in other environments (e.g., cloud, containers, etc.).
As represented in Figure 2-1, there are two main types of transport nodes in NSX:


• Hypervisor Transport Nodes: Hypervisor transport nodes are hypervisors prepared and
configured for NSX. NSX provides network services to the virtual machines running on
those hypervisors. NSX 4.1 supports VMware ESXi™ only.
• Edge Nodes: VMware NSX Edge™ nodes are service appliances dedicated to running
centralized network services that cannot be distributed to the hypervisors. They can be
instantiated as a bare metal appliance or in virtual machine form factor. They are
grouped in one or several clusters, representing a pool of capacity. It is important to
remember that an Edge Node does not represent a service itself but just a pool of
capacity that one or more services can consume.

NSX Consumption Model


A user can interact with the NSX platform through the Graphical User Interface or the REST API
framework. Two GUI & REST API options are available to interact with the NSX Manager:
1) Policy Mode
o Default UI mode
o Any new deployment should use Policy Mode
o API accessed via URIs that start with /policy/api
2) Manager Mode
o Manager UI is disabled by default
o Deprecated as of NSX version 3.2
o Temporarily addresses deployments created via Manager mode and upgraded from
previous releases
o NSX version 3.2 introduces the Manager to Policy promotion tool
o API accessed via URIs that start with /api
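As an illustration of the two URI styles listed above, the hedged sketch below authenticates to an NSX Manager with HTTP basic authentication and issues one read against each API family. The manager address and credentials are placeholders, and the objects returned depend on the NSX version and configuration; this is a sketch of the URI convention, not a definitive reference.

```python
import requests

NSX_MANAGER = "nsx-mgr.example.com"   # placeholder NSX Manager FQDN or cluster VIP
AUTH = ("admin", "CHANGE-ME")         # placeholder credentials

session = requests.Session()
session.auth = AUTH
session.verify = False                # lab only; use a trusted CA certificate in production

# Policy mode: URIs start with /policy/api
policy_root = session.get(f"https://{NSX_MANAGER}/policy/api/v1/infra")
print("Policy API:", policy_root.status_code)

# Manager mode (deprecated): URIs start with /api
logical_switches = session.get(f"https://{NSX_MANAGER}/api/v1/logical-switches")
print("Manager API:", logical_switches.status_code)
```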

NSX Policy API Framework


The NSX Policy API framework provides an outcome-driven configuration option. This allows a single API
call to configure multiple NSX networking and security objects for an application deployment
(a minimal example follows the list below). This is most applicable for customers using automation
and for CMP plugins. Some of the main benefits of the declarative API framework are:
• Outcome Driven: Reduces the number of configuration steps by allowing a user to
describe the desired end goal (the “what”) and letting the system figure out “how” to
achieve it. This allows users to utilize user-specified names, not system-generated IDs.
• Order Independent: Create/update/delete in any order and always arrive at the same
consistent result.
• Prescriptive: Reduces the potential for user error with built-in dependency checks.
• Policy Life Cycle Management: Simpler with a single API call. Toggle the “marked_for_delete” flag
in the JSON request body to manage the life cycle of an entire application topology.
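To make the outcome-driven model concrete, the sketch below pushes a small desired-state document to the hierarchical Policy API with a single PATCH of /policy/api/v1/infra, creating a Tier-1 gateway and a segment identified by user-chosen IDs rather than system-generated ones. This is a minimal, hedged example: the manager address, credentials, object names, and the referenced Tier-0 and transport zone paths are placeholders that must be adapted to your environment.

```python
import requests

NSX = "https://nsx-mgr.example.com"   # placeholder manager address
AUTH = ("admin", "CHANGE-ME")        # placeholder credentials

# Desired outcome: one Tier-1 gateway and one segment, created in a single call.
# "app-t1", "web-seg", and the referenced paths are hypothetical examples.
desired_state = {
    "resource_type": "Infra",
    "children": [
        {
            "resource_type": "ChildTier1",
            "Tier1": {
                "resource_type": "Tier1",
                "id": "app-t1",
                "tier0_path": "/infra/tier-0s/corp-t0",
            },
        },
        {
            "resource_type": "ChildSegment",
            "Segment": {
                "resource_type": "Segment",
                "id": "web-seg",
                "connectivity_path": "/infra/tier-1s/app-t1",
                "transport_zone_path": (
                    "/infra/sites/default/enforcement-points/default"
                    "/transport-zones/overlay-tz"
                ),
                "subnets": [{"gateway_address": "172.16.10.1/24"}],
            },
        },
    ],
}

resp = requests.patch(
    f"{NSX}/policy/api/v1/infra",
    json=desired_state,
    auth=AUTH,
    verify=False,  # lab only
)
resp.raise_for_status()  # success means the declared state was accepted
```

Because the call is declarative, re-sending the same document is idempotent; modifying it and re-sending moves the system toward the new desired state.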


The NSX API documentation is accessible directly from the NSX Manager UI, under the Policy
section within the API documentation, or from code.vmware.com.
The following examples walk through the Policy API for two customer scenarios:

Policy API Usage Example 1 - Templatize and Deploy 3-Tier Application Topology

This example shows how the Policy API helps a user create a reusable code template for
deploying the 3-tier application shown in the figure, which includes the networking, security,
and services needed for the application.

Figure 2-3: NSX Policy API - Infra

The desired outcome for deploying the application, as shown in the figure above, can be
defined using JSON. Once the JSON request body is defined to reflect the desired outcome, the
API and JSON request body can be leveraged to automate the following operational workflows:
• Deploy the entire topology with a single API call and JSON request body.
• The same API/JSON can be further templatized and reused to deploy the same
application in different environments (PROD, TEST, and DEV).
• Handle life cycle management of the entire application topology by toggling the
"marked_for_delete" flag in the JSON body to true or false.

See details of this at Example 1.


Policy API Usage Example 2 - Application Security Policy Lifecycle Management

This example demonstrates how a security admin can leverage the declarative API to manage the
life cycle of the security configuration, grouping, and micro-segmentation policy for a given 3-tier
application. The following figure depicts the entire application topology and the desired
outcome to provide a zero-trust security model for the application.

The policy model can define the desired outcome by specifying grouping and micro-
segmentation policies using JSON. It uses a single API call with a JSON request body to automate
the following operational workflows (a short sketch follows the list):
• Deploy a white-list security policy with a single API call and JSON request body.
• The same API/JSON can be further templatized and reused to secure the same
application in different environments (PROD, TEST, and DEV).
• Handle life cycle management of the entire application topology by toggling the
"marked_for_delete" flag in the JSON body to true or false.

When to use Policy vs Manager UI/API


VMware’s recommendation is to use the NSX Policy UI going forward, as all new features are
implemented only in the Policy UI/API and the Manager UI/API has been deprecated.
The following table further highlights the feature transition map to the Policy UI/API model.

Policy Mode
• All new deployments should use Policy Mode.
• New deployments which integrate with cloud automation platforms, automation tools, or the
NSX Container Plug-in (NCP). All the northbound integrations between VMware products and
NSX now support Policy Mode.
• Networking features available in Policy Mode only: DNS Services and DNS Zones, VPN,
Federation.
• Security features available in Policy Mode only: Context Profiles (L7 applications, FQDN),
Identity Firewall, TLS Inspection, IDS/IPS, the new Distributed Firewall and Gateway Firewall
layout (Categories, Auto service rules, Drafts), Endpoint Protection, and Network Introspection
(East-West Service Insertion).

Manager Mode
• Deployments which were created using the advanced interface, for example, upgrades from
versions where Policy Mode was not available. Customers should plan the transition to Policy
Mode, possibly via the Manager to Policy promotion tool.
• Deployments which in previous versions were integrated with cloud automation platforms,
NCP, or custom orchestration tools via the Manager API. Customers should plan the transition
to Policy Mode. Consult the specific guidance for NCP and VMware Integrated OpenStack (VIO).
Migration of Manager API objects managed by vRealize Automation is not yet supported.
• Networking features available in Manager Mode only: N/A.
• Security features available in Manager Mode only: Bridge Firewall.

It is recommended that whichever mode is used to create objects (Policy or Manager) be the
only mode used (if the Manager Mode objects are required, create all objects in Manager
mode). Do not alternate use of the modes or there will be unpredictable results. Note that the

23
VMware NSX Reference Design Guide

default mode for the NSX Manager is Policy mode. When working in an installation where all
objects are new and created in Policy mode, the Manager mode option will not be visible in the
UI. For details on switching between modes, please see the NSX documentation.

NSX Logical Object Naming relationship between manager and policy mode
The names of some of the networking and security logical objects in the Manager API/data
model have changed in the new policy model. The table below provides the before and after
naming side by side for those NSX logical objects. This change only affects the name of the
given NSX object; conceptually and functionally it is the same as before.

Networking
Manager API/UI Object → Policy API Object
• Logical Switch → Segment
• T0 Logical Router → Tier-0 Gateway
• T1 Logical Router → Tier-1 Gateway
• Centralized Service Port → Service Interface

Security
Manager API/UI Object → Policy API Object
• NSGroup → Group
• IP Sets, MAC Sets → Group
• Firewall Section → Security-Policy
• Edge Firewall → Gateway Firewall

NSX Multi-tenancy Framework


The NSX multi-tenancy framework introduced in version 4.1 takes multi-tenancy support in NSX
to the next level by delivering flexible resource allocation and management. Enterprise Admins
can segment the platform into Projects and VPCs, assigning different spaces to different
tenants while maintaining full visibility and control. This extension to the NSX consumption
model allows NSX users to consume their own objects, see alarms related to their own
configurations, and test connectivity between their workloads with Traceflow, without
interfering with the configurations created by other users or having visibility into them. This section
provides an overview of new multi-tenancy features in NSX.
Before discussing the new multi-tenancy features that Projects introduces, let's go over how
multi-tenancy has traditionally been available at the data-plane layer in NSX versions prior to
4.1.

2.3.6.1 Data-plane multi-tenancy


NSX supports a multi-tiered routing model with logical separation between the different
gateways within the NSX infrastructure, giving complete control and flexibility over services and
policies. This model enables simple and stable interconnection in the data center as well as
automation of complex, potentially isolated application environments.
The Tier-0 Gateway provides a gateway service between the logical and physical network. It is
traditionally set up with dynamic routing and/or services. The Tier-1 Gateway provides a tenant
or an application router with a range of services (NAT, GFW, DNS forwarder, etc.). NSX manages
their connection and route distribution to the Tier-0 Gateway.
The distributed firewall can keep tenants isolated in a shared environment where the service
provider manages IP allocation. Dedicated Tier-1 gateways with NAT, or VRFs, can provide layer 3
separation, allowing tenants to dictate their own IP schema.

Figure 2-4: Data-plane multi-tenancy

In this model, NSX allows users to apply the desired data-plane segmentation. However, prior
to the release of NSX 4.1, tenants were not explicitly defined in NSX; the multi-tenancy logic
was implemented by either the NSX Administrator or the Cloud Management Platform.
What if a security team wanted to delegate management of firewall rules, requiring role-based
access to NSX? What if the same users on that team wanted to see only the alarms relevant to
their environment? Or if they wanted to collect only their assigned firewall logs within their
tenant? These are just some theoretical scenarios that highlight the challenges teams face;
from a management and monitoring perspective, there exists a clear need for multi-tenant
constructs in NSX.


Management plane multi-tenancy is made possible by the introduction of Projects in the platform.

Figure 2-5: Management plane multi-tenancy. Configuration and operation

2.3.6.2 Overview of NSX projects


The Enterprise Admin (Provider) can segment the platform into defined Projects. These Projects
delegate different spaces to different users and tenants, each with their own objects,
configurations, VMs, and monitoring (based on alarms and logs).


Figure 2-6: Two projects (Blue and Green) deployed alongside a topology created by the EA in the default space (Purple)

Projects exist alongside the traditional data model, are optional to use, and do not break
compatibility with existing setups in any way. The Enterprise Admin can still access all features
outside of the Project (from system setup to firewall rules) but can use Projects to define
tenants for logical consumption, if desired.
In the Default space, Enterprise Admins can have a consolidated view of all Projects or switch to
view a specific Project. They can also create multiple tenants with different Projects (Project 1,
Project 2, etc.).
To do so, they must allocate:
• At least one Tier-0 or Tier-0 VRF (Multiple supported and can be shared across
projects)
• At least one Edge Cluster (Multiple supported and can be shared across projects)
• The User(s) allocated to the Project and their role within the scope of the project (i.e.,
security admin, network admin, or full project admin)
• A short log ID to be added as a label to the logs pertaining to the Project
It is important to note that Tier-0/Tier-0 VRF and Edge Clusters can be shared across Projects if
desired by the Enterprise Admin.


Once assigned, Project Users for Project 1 can directly access NSX within the scope defined for
them. They can create Tier-1 gateways on the Edge clusters allocated to the project (DR only
Tier-1 gateways are also supported), and connect them to the Tier-0 or Tier-0 VRF allocated to
the project.
The Enterprise Admin can also assign configurations to the Project to simplify consumption, or
limit the number of objects that can be created through Quotas.
The Enterprise Admin can create system-wide firewall rules that will apply to all VMs across all
environments. Those rules are configured from the Default space and cannot be modified
within the tenant view.

Figure 2-7: Projects - Tier-0/VRF and edge clusters allocation

2.3.6.3 Tenant View: Projects Consumption


Once the Projects have been set up, access can be delegated to the tenant. The Enterprise
Admin can assign a generic role of Project Admin or use something more targeted such as
Security Admin of Project 1 or Network Operation of Project-1. The tenant can consume NSX via
the UI or API.
Upon logging in, users will land directly in their assigned Project and see only configurations,
alarms, VMs and so on that are relevant to their Project.


Configuration will be restricted to logical objects. Tenants cannot manage the platform setup
(installation, upgrades, etc.) because these features are kept under Enterprise Admin
management. Other features that remain under Enterprise Admin management and are not
exposed to the tenant include Tier-0 configuration and Exclusion lists.
For exposed features, the consumption under Projects is the same as it would be outside the
Project. Creation of Tier-1s, segments, and other configuration follows the same model, and
they use the allocated Tier-0(s)/Tier-0 VRF(s) and Edge Cluster(s). Information on allocated
resources (Quota) is available in the Project tab.
The project admin can create one or more Tier-1 gateways inside the project and connect them to a
Tier-0 gateway the Enterprise Admin has assigned to the project. The EA can assign one or more
Tier-0 gateways or VRFs (VRFs are treated the same as Tier-0 gateways in the context of the
multi-tenancy framework) to the project. The project admin can create isolated segments or segments
connected to one of the project Tier-1 gateways. Connecting project segments directly to Tier-0
gateways is not supported. Workloads connected to project segments are considered
part of the project and will be visible in the virtual machine inventory for the project.
One of the primary goals of the Project feature is to be able to delegate security policy
management and avoid the risk of rules being applied to the wrong VMs.
When a Project is created, a group representing the Project is also created, alongside some
default rules allowing for communication inside the Project but blocking anything else. This
default project includes all the segments defined as part of the project and, consequently all
the VMs connected to them. NSX will implicitly set this group in the apply-to field of any
distributed firewall rule preventing project users to impact workloads that are not under their
purview.


Figure 2-8: Projects' default group and default DFW policy

The Project Admin can manage their own rules by changing the default project rules, creating
new rules and so on. These rules will only apply to VMs connected to the segments for their
Project. All the other VMs (not connected to Project segments) won’t be visible from the
Project and won’t be impacted by rules configured within the Project. It is now possible to give
access to NSX Distributed Firewall while removing the risk that a user could create a rule
impacting the entire system.
Rules defined by the Enterprise Admin in Default space can apply to those VMs within a Project
and will take precedence. This allows the Enterprise Admin to set up the environment to create
global rules that apply to all workloads, or to specific Projects. These global rules cannot be
modified by Project users.
Logs from Distributed Firewall and Gateway Firewall will be labeled with the Project
information so that they can be identified and separated on a per tenant basis.
The consumption of NSX services within a project is fundamentally identical to what a user
experiences when working with NSX configurations outside of a project. This is reflected in the
Policy API structure depicted in Figure 2-9. The /infra tree is still present and represents all the
configurations performed in the default space. Additionally, a new /orgs/default/ hierarchy is
now present. All the projects reside under this new /orgs/default element and have their own
/infra tree mirroring the structure of the main /infra tree for the default space (i.e.,
/orgs/default/projects/Project-1).
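Because each project mirrors the /infra tree, the same configuration payload can be applied either in the default space or inside a project simply by changing the URI prefix. The hedged sketch below creates an identically defined, isolated segment in both places; the segment ID, project name, and transport zone path are hypothetical placeholders.

```python
import requests

NSX = "https://nsx-mgr.example.com"   # placeholder manager address
AUTH = ("admin", "CHANGE-ME")        # placeholder credentials

# A minimal (isolated) segment definition; the transport zone path is a placeholder.
segment = {
    "display_name": "db-seg",
    "transport_zone_path": (
        "/infra/sites/default/enforcement-points/default/transport-zones/overlay-tz"
    ),
    "subnets": [{"gateway_address": "10.20.30.1/24"}],
}

# Only the URI prefix changes between the default space and a project space.
for base in (
    f"{NSX}/policy/api/v1/infra",                                  # default space
    f"{NSX}/policy/api/v1/orgs/default/projects/Project-1/infra",  # Project-1 space
):
    resp = requests.put(f"{base}/segments/db-seg", json=segment, auth=AUTH, verify=False)
    resp.raise_for_status()
```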
For more information about projects and a detailed list of the supported features, please refer to
the official documentation.


Figure 2-9: NSX Hierarchical Policy API structure with Projects

2.3.6.4 VPCs
VPCs were introduced in NSX 4.1.1. They represent an additional layer of multi-tenancy within a
project and provide a simplified consumption model for networking and security services, aligned
to the experience an end user would have in a public cloud environment. As with projects, VPCs
do not alter the traditional NSX consumption model and can be deployed in parallel to the
classic NSX Policy constructs, sharing the same physical infrastructure. From the API structure in
Figure 2-10, we can see that VPCs are logically part of projects and provide more streamlined
access to common network and security features such as subnets, security policies, and NAT.
The project admin can assign to existing users (local or available via an identity provider) the
VPC admin role or a more specific role such as network admin or security admin, and scope them
to one or more VPCs that are part of the project. User management and lifecycle
(creating/deleting/activating local users and integrating the NSX platform with external identity
providers) is an Enterprise Admin responsibility.
The project admin can share groups with the VPC so that VPC users can consume them in
their policies. The project admin can set quotas to limit the consumption of specific configuration
objects such as firewall rules, subnets, etc.
Each VPC is associated with a Tier-1 gateway deployed in the background and connected to one
of the Tier-0 gateways or VRFs assigned to the project the VPC is part of. A VPC admin will be
able to perform the following tasks in a self-service model, without impacting other VPCs or the
shared infrastructure (a hedged sketch follows the list):
• Create a new public subnet (Workloads on this subnet will be directly accessible on their
own IP from outside the VPC). Public IPs are allocated from an external IP block assigned
by the project admin to the VPC from a set of external IP blocks assigned by the
Enterprise Admin to the project.
• Create a private subnet (Workloads on this subnet will require a NAT configuration to
communicate with any workload outside the VPC. A pre-configured SNAT on a single IP
for all the private networks in the VPC allows outbound connectivity by default). The
SNAT IP is allocated from one of the external IP blocks assigned to the VPC. Private
subnets’ IP ranges are allocated from the private IP blocks assigned to the VPC by the
project admin and created by the project admin within the scope of a project.
• Create an isolated subnet (Workloads on this subnet will not have any communication
with any other subnet, inside or outside the VPC). The IP range of isolated subnets is
arbitrary and defined by the VPC admin at subnet creation time. The range is not
allocated from any IP block.
• Create NAT configurations to allow access to workloads on private subnets.
• Configure security policies and groups for both East-West and North-South traffic.
Security configuration created in a VPC will only affect workloads connected to VPC
subnets.
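The sketch below suggests what self-service subnet creation might look like for a VPC user through the Policy API, following the /orgs/default/projects/<project>/vpcs/<vpc> hierarchy described in this section. The resource path and attribute names (such as access_mode and the subnet sizing field) are assumptions drawn from that hierarchy and should be validated against the API documentation for your NSX release; the project, VPC, and subnet identifiers are hypothetical.

```python
import requests

NSX = "https://nsx-mgr.example.com"       # placeholder manager address
AUTH = ("vpc-user", "CHANGE-ME")         # placeholder VPC user credentials

PROJECT, VPC = "project-blue", "vpc-01"  # hypothetical project and VPC IDs

# Private subnet: the IP range is carved automatically from the VPC's private IP
# blocks, and the default SNAT provides outbound connectivity. "access_mode" and
# "ipv4_subnet_size" are assumed attribute names.
subnet_body = {
    "display_name": "app-private-net",
    "access_mode": "Private",
    "ipv4_subnet_size": 64,
}

resp = requests.put(
    f"{NSX}/policy/api/v1/orgs/default/projects/{PROJECT}/vpcs/{VPC}"
    "/subnets/app-private-net",
    json=subnet_body,
    auth=AUTH,
    verify=False,  # lab only
)
resp.raise_for_status()
```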


Figure 2-10: NSX Hierarchical Policy API structure with Projects and VPCs

Figure 2-11 presents the VPC networking model with the three types of supported networks: public,
private, and isolated. One of the key benefits of the VPC consumption model is the out-of-the-box
IPAM (IP address management) capabilities. The Project Admin allocates IP blocks to each
VPC, and when the VPC admin creates a new subnet, any required IP range is automatically
assigned from the available blocks.
By default, VPC workload IP allocation is managed by DHCP, but it is possible to reserve part of
the subnet range for static IP allocations.
The VPC admin can also allocate external IPs from the available public block to be used in NAT
rules to grant external clients access to workloads on VPC private networks.
The Tier-1 gateways associated with VPCs are provisioned in parallel to other Tier-1 gateways
part of the same project and can be connected to the same Tier-0 gateway or VRF.

Figure 2-11: VPC Networking Conceptual Model


Figure 2-12: VPC Networking realization (Traditional NSX Objects view)

Figure 2-13 presents the VPC security model. VPC users can configure North-South and East-West
security policies based on L4/L7 attributes, as well as define groups for the VPC
workloads.
North-south rules are mapped by the system to gateway firewall rules applied to the uplink
interface of the VPC Gateway. They can inspect any traffic entering or exiting the VPC, but they
cannot enforce any action on traffic between workloads in the VPC. By default, any outbound
connectivity is allowed but any inbound traffic is dropped.
East-West rules are mapped by the system to distributed firewall rules with the apply-to field
set to a group including the VPC workloads exclusively. This choice prevents conflicts between
configuration performed on different VPCs. East-West rules can inspect any traffic originated or
directed to VPC workloads, including traffic to other VPC workloads, but also N-S traffic to and
from external clients. For this reason, North-South connectivity must be allowed on both N-S
and E-W rules. E-W rules allow any traffic by default.


Figure 2-13: VPC Security Model


3 NSX Logical Switching


This chapter details how NSX creates virtual Layer 2 networks, called segments, to provide
connectivity between its services and the different virtual machines in the environment.

The NSX Virtual Switch


A transport node is, by definition, a device implementing the NSX data plane. On ESXi hosts, the
NSX virtual switch is the vSphere® Distributed Switch™ (VDS) prepared with additional software
components. With any other kind of transport node (Edges, physical servers), NSX uses a virtual
switch called the NSX Virtual Distributed Switch, or N-VDS.

Segments and Transport Zones


In NSX, virtual layer 2 broadcast domains are called segments. There are two kinds of segments:
• VLAN backed segments
• Overlay backed segments
A VLAN backed segment is a layer 2 broadcast domain that is implemented as a traditional
VLAN in the physical infrastructure. That means that traffic between two VMs on two different
hosts but attached to the same VLAN backed segment will be carried over a VLAN between the
two hosts in native IEEE encapsulation. The resulting constraint is that an appropriate VLAN
needs to be provisioned in the physical infrastructure for those two VMs to communicate at
layer 2 over a VLAN backed segment.
On the other hand, two VMs on different hosts and attached to the same overlay backed
segment will have their layer 2 traffic carried by a tunnel between their hosts. This IP tunnel is
instantiated and maintained by NSX without the need for any segment specific configuration in
the physical infrastructure, thus decoupling NSX virtual networking from this physical
infrastructure.
Starting with NSX version 3.2, it is also possible to deploy the distributed firewall (DFW)
component of NSX directly on standard vSphere dvportgroups. The security functionalities
available on dvportgroups with this model are equivalent to those offered on NSX segments.
However, it is not possible to use both standard dvportgroups with DFW and NSX dvportgroups
with DFW on the same ESXi host. Please check the release notes of the NSX software release
you are planning to deploy, as this restriction might be lifted in the future.
Note: representation of NSX segments in vCenter
This design document will only use the term “segment” when referring to the NSX virtual Layer
2 broadcast domain. In the vCenter UI, those segments will appear as NSX dvportgroups, and
the same NSX segment might appear as a dvportgroup on multiple VDSs. The following
screenshot shows an NSX segment called “overlay-seg”, appearing as an NSX dvportgroup
on VDS1.

You will also notice an “opaque network” with the same name. This is the representation of the
same segment as it might appear on hosts that are running NSX with an N-VDS. The use of the
N-VDS on ESXi hosts has been deprecated as of NSX 4.0. Check the following KB article for more
information on the impact of this difference in representation:
https://kb.vmware.com/s/article/79872. The rest of this document will assume that NSX is only
deployed using a VDS on ESXi hosts.

Segments are created as part of an NSX object called a transport zone. There are VLAN
transport zones and overlay transport zones. A segment created in a VLAN transport zone will
be a VLAN backed segment, while, as you can guess, a segment created in an overlay transport
zone will be an overlay backed segment. NSX transport nodes attach to one or more transport
zones, and as a result, they gain access to the segments created in those transport zones.
Transport zones can thus be seen as objects defining the scope of the virtual network because
they provide access to groups of segments to the hosts that attach to them, as illustrated in
FIGURE 3-1 below:

Figure 3-1: NSX Transport Zone


In the diagram above, transport node 1 is attached to transport zone “Staging”, while transport
nodes 2-4 are attached to transport zone “Production”. If one creates a segment 1 in transport
zone “Production”, each transport node in the “Production” transport zone immediately gains
access to it. However, this segment 1 does not extend to transport node 1. The span of segment
1 is thus defined by the transport zone “Production” it belongs to.
A few additional points related to transport zones and transport nodes:
• Multiple VDS (with or without NSX) or VSS can coexist on an ESXi transport node;
o A given pNIC can only be associated with a single virtual switch, VDS or
VSS. This behavior is specific to the VMware virtual switching model, not
to NSX.
o A VDS prepared with NSX can attach to a single overlay transport zone
and multiple VLAN transport zones at the same time.
o A transport node with multiple VDS prepared for NSX can attach to
multiple overlay transport zones, one per VDS.
• A transport zone can only be attached to a single VDS on a given transport node. In other
words, two VDS prepared for NSX on the same transport node cannot be attached to the
same transport zone.
• Edge transport node-specific points:
o An edge transport node can only have one N-VDS attached to an overlay
transport zone.
o If multiple VLAN segments are backed by the same VLAN ID (across all the
VLAN transport zones of an edge N-VDS), only one of those segments will
be “realized” (i.e. working effectively).

Please see additional considerations in RUNNING A VDS PREPARED FOR NSX ON ESXI HOSTS.

Uplink vs. pNIC


NSX introduces a clean differentiation between the physical uplinks of the host (aka pNICs, or
vmnics on ESXi hosts) and the uplinks of the NSX virtual switch. The uplinks of the NSX virtual
switch are logical constructs that can be mapped to one pNIC or to multiple pNICs bundled into a
link aggregation group (LAG). FIGURE 3-2 illustrates the difference between an uplink and a pNIC:


• p1, p2 and p3 are pNICs on the ESXi host
• u1 and u2 are the uplinks of the VDS prepared with NSX:
– u1 is defined as a LAG bundling pNICs p1 and p2
– u2 is directly mapped to pNIC p3

Figure 3-2: VDS Uplinks vs. Hypervisor pNICs

In this example, a single virtual switch with two uplinks is defined on the ESXi transport node.
One of the uplinks is a LAG, bundling physical port p1 and p2, while the other uplink is only
backed by a single physical port p3. Both uplinks look the same from the perspective of the
virtual switch; there is no functional difference between the two.
Note that the example represented in this picture is by no means a design recommendation, it’s
just illustrating the difference between the virtual switch uplinks and the host physical uplinks.

Teaming Policy
The teaming policy defines how NSX uses its virtual switch uplinks for redundancy and
traffic load balancing. There are two main options for teaming policy configuration:
• Failover Order – An active uplink is specified along with an optional list of standby uplinks.
Should the active uplink fail, the next available uplink in the standby list takes its place
immediately. This policy results in an active/standby use of the uplinks.
• Load Balance Source/Load Balance Source Mac Address – Traffic is distributed across a
specified list of active/active uplinks.
o The “Load Balance Source” policy maps a VM’s virtual interface to an uplink. Traffic sent
by this virtual interface will leave the host through this uplink only, and traffic destined
to this virtual interface will necessarily enter the virtual switch via this uplink.
o The “Load Balance Source Mac Address” policy goes a little bit further in terms of
granularity for the rare scenarios where a virtual interface can source traffic from different
mac addresses. Here, two frames sent by the same virtual interface on the VM could be
associated to different uplinks based on their source mac address. Note that this
teaming policy is only effective on VLAN segments. When used with an overlay segment,
Load Balance Source Mac Address behaves the same way as Load Balance Source.


The teaming policy only defines how NSX distributes traffic across the uplinks of its virtual
switches. The uplinks can in turn be individual pNICs or LAGs (as seen in the previous section).
Note that a LAG uplink has its own hashing options; however, those hashing options only define
how traffic is distributed across the physical members of the LAG uplink, whereas the teaming
policy defines how traffic is distributed between NSX virtual switch uplinks.
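The behavior of these two policy families can be summarized with a short, purely conceptual Python sketch; the function and variable names, as well as the hashing detail, are illustrative assumptions and not the actual virtual switch implementation:

# Conceptual sketch only; the real VDS/NSX selection logic is internal to the product.
def select_uplink_failover_order(ordered_uplinks, link_up):
    # First entry is the active uplink, the rest are standby, in order of preference.
    for uplink in ordered_uplinks:
        if link_up[uplink]:
            return uplink
    return None  # no operational uplink left

def select_uplink_load_balance_source(active_uplinks, vnic_port_id):
    # Each virtual interface is pinned to a single uplink; hash() is an illustrative stand-in.
    return active_uplinks[hash(vnic_port_id) % len(active_uplinks)]

# Failover order u1 active / u2 standby: all traffic uses u1 while its link is up.
print(select_uplink_failover_order(["u1", "u2"], {"u1": True, "u2": True}))
# Load balance source with u1 and u2 active: each vNIC is mapped to one of the two uplinks.
print(select_uplink_load_balance_source(["u1", "u2"], "VM1-vnic0"))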

VM1 and VM2 are attached to the same NSX segment on the transport node:
• Failover Order with u1 active and u2 standby: VM1 and VM2 both use u1.
• Load Balanced Source Port with u1 and u2 active: VM1 uses u1, VM2 uses u2.

Figure 3-3: NSX Virtual Switch Teaming Policies

FIGURE 3-3 presents an example of the failover order and source teaming policy options,
illustrating how the traffic from two different VMs in the same segment is distributed across
uplinks. The uplinks of the virtual switch could be any combination of single pNICs or LAGs;
whether the uplinks are pNICs or LAGs has no impact on the way traffic is balanced between
uplinks. When an uplink is a LAG, it is only considered down when all the physical members of
the LAG have their link down. When defining a transport node, the user must specify a default
teaming policy that will be applicable by default to all the segments (VLAN or overlay) available
on this transport node.

3.1.3.1 Named Teaming Policies


The administrator can define additional teaming policies on top of the default teaming
policy. Those policies are identified by a name and are thus called “named teaming policies”.
All overlay segments follow the same default teaming policy. By default, a given VLAN segment
also follows the default policy, but it can be individually assigned a named teaming policy
instead. That means that different VLAN segments can use different teaming policies. This
flexibility is illustrated in Figure 3-4 below, where dedicated named teaming policies have
been created to steer traffic to the left and right uplinks.

• Default teaming policy – Load Balance Source, u1 and u2 active: used by the overlay segment
to which VM1 and VM2 are attached.
• “left” named teaming policy – Failover Order, u1 active and u2 standby: used by the VLAN
segment to which VM3 is attached.
• “right” named teaming policy – Failover Order, u2 active and u1 standby: used by the VLAN
segment to which VM4 is attached.

Figure 3-4: Named Teaming Policy

Here, VM3 is attached to a VLAN segment using the “left” teaming policy, so its traffic will get
in/out of the host via VDS uplink u1. Again, overlay traffic necessarily follows the unique default
teaming policy; the load balance source policy still allows VM1 and VM2 to use different uplinks.
The following FIGURE 3-5 illustrates another common use case for named teaming policies.
Suppose that we have a host with four uplinks (u1/2/3/4) and that we want overlay traffic to
only use two specific uplinks, u1 and u2, while VLAN traffic should use u3 and u4.


• Default teaming policy – Load Balance Source, u1 and u2 active: VM1 and VM2, attached to an
overlay segment, follow this policy.
• “VLAN-traffic” named teaming policy – Load Balance Source, u3 and u4 active: VM3 and VM4,
attached to a VLAN segment, follow this policy.

Figure 3-5: Other named teaming policy use case

Overlay traffic only follows the default teaming policy, so here, the default teaming policy must
include VDS uplinks u1 and u2. Without additional teaming policy, VLAN traffic would also flow
on uplink u1 and u2. Mapping VLAN segments to an additional named teaming policy “VLAN-
traffic” that includes u3 and u4 will ensure that VLAN traffic uses the desired uplinks.

Uplink Profile
As mentioned earlier, a transport node includes at least one NSX virtual switch, implementing
the NSX data plane. It is common for multiple transport nodes to share the exact same NSX
virtual switch configuration, and it is very difficult from an operational standpoint to configure
(and maintain) multiple parameters consistently across many devices. For this purpose, NSX
defines a separate object called an uplink profile that acts as a template for the configuration of
a virtual switch. In this way, the administrator can create multiple transport nodes with similar
virtual switches by simply pointing to a common uplink profile. Even better, when the
administrator modifies a parameter in the uplink profile, it is automatically updated in all the
transport nodes following this uplink profile.
The following parameters are defined in an uplink profile:
• The transport VLAN used for overlay traffic. Overlay traffic will be tagged with the VLAN ID
specified in this field.
• The MTU of the uplinks. NSX will assume that it can send overlay traffic with this MTU on
the physical uplinks of the transport node without any fragmentation by the physical
infrastructure. This parameter is not used on ESXi hosts, where the MTU is directly defined on
the VDS.
• The name of the uplinks


• The Link aggregation groups (LAGs) used by the virtual switch. This is not used on ESXi
transport nodes, where LAGs are defined directly on the VDS.
• The teaming policies applied to the uplinks (default and optional named teaming policies)
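To make the template nature of the uplink profile concrete, the Python dictionary below sketches the kind of information it carries; the field names are simplified for readability and are not the exact NSX API schema:

# Illustrative content of an uplink profile; simplified field names, not the NSX API schema.
uplink_profile_up1 = {
    "display_name": "UP1",
    "transport_vlan": 100,      # VLAN ID used to tag overlay (TEP) traffic
    "mtu": 9000,                # ignored on ESXi hosts, where the MTU is set on the VDS
    "lags": [],                 # LAG definitions; unused on ESXi transport nodes
    "default_teaming": {
        "policy": "FAILOVER_ORDER",
        "active_uplinks": ["nsx-u1"],
        "standby_uplinks": ["nsx-u2"],
    },
    # Optional named teaming policies, applicable to VLAN segments only.
    "named_teamings": {
        "left":  {"policy": "FAILOVER_ORDER", "active_uplinks": ["nsx-u1"]},
        "right": {"policy": "FAILOVER_ORDER", "active_uplinks": ["nsx-u2"]},
    },
}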

Creating a transport node


As defined earlier, a device implementing the NSX data plane is called a transport node, and
this means that a transport node implements at least one NSX virtual switch. We have
introduced all the components that play a part in the creation of a transport node. The
following diagram is a high-level representation of the parameters that need to be provided in
order to turn an ESXi host into a transport node.
FIGURE 3-6 Configuring VDS1 on ESXi1 for NSX:

The figure shows ESXi1 with three pNICs (p1, p2, p3). VDS1 has uplinks vds-u1 and vds-u2
(mapped to p1 and p2), and VDS2 has uplink vds-u3 (mapped to p3). To prepare VDS1 for NSX,
the administrator selects VDS1, adds uplink profile “UP1” (teaming policy with nsx-u1 active and
nsx-u2 standby, transport VLAN 100), specifies the uplink mapping (nsx-u1 → vds-u1,
nsx-u2 → vds-u2) and, optionally for overlay attachment, the IP assignment for the tunnels. The
ESXi1 host then becomes the ESXi1 transport node.

Figure 3-6: Transport Node Creation with Uplink Profile

The ESXi host above has three physical uplinks p1, p2 and p3. Two VDS have been deployed on
this host. VDS1, with two uplinks called vds-u1, vds-u2 mapped to p1 and p2, and VDS2 with
one uplink vds-u3, mapped to p3. We are preparing VDS1 for NSX. For this operation, we
provide:
• An uplink profile (called “UP1” here), defining a teaming policy and a transport VLAN
(the VLAN ID that will be used if the transport node is attached to an overlay transport
zone.)
• The mapping of the uplink names defined in the uplink profile (nsx-u1, nsx-u2) with
VDS1 uplinks. Here, we map nsx-u1 to vds-u1 and nsx-u2 to vds-u2. This operation is
very similar to the uplink mapping that was performed when deploying VDS1 on ESXi1.
At that time, the user had to associate vds-u1 to p1, and vds-u2 to p2.


• If the VDS is going to connect to an overlay transport zone, it will need some IP
addresses for creating tunnels between peer transport nodes. A way of defining those IP
addresses must be specified in that case. The next part details how overlay networking
works.
VDS2 was added to show that NSX does not have to be installed on all the virtual switches of an
ESXi host.
Once ESXi1 has been turned into a transport node by installing NSX on VDS1, this VDS1 has a
dual personality; it is both:
- a standard VDS, with dvportgroups using vds-u1 and vds-u2 for their teaming policies,
- and at the same time, it is an NSX virtual switch with NSX segments, using nsx-u1 and
nsx-u2 as its uplinks.
The uplink profile allows centralizing part of the NSX configuration. A change in the uplink
profile is applied to all the transport nodes that were created using it. The uplink profile can
also be used to apply more specific NSX configuration to a set of hosts. Consider FIGURE 3-7
below:

ESXi1 (in rack1) and ESXi2 (in rack2) both use VDS1, with uplinks u1 and u2 mapped to p1 and p2.
• Uplink Profile “Rack1”, applied to ESXi1 – default teaming policy: Load Balance Source; active
uplinks: u1, u2; standby uplinks: none; transport VLAN 10.
• Uplink Profile “Rack2”, applied to ESXi2 – default teaming policy: Failover Order; active uplink:
u1; standby uplink: u2; transport VLAN 20.

Figure 3-7: Leveraging Different Uplink Profiles

In this example, both ESXi transport nodes are using the same VDS1. However, the
administrator applied different uplink profiles Rack1 and Rack2 on ESXi1 and ESXi2. A realistic
scenario would be the need to use different transport VLANs for different racks. Here, both
transport VLAN and teaming policy differ. A dvportgroup defined in VDS1 would use the same
teaming policy on ESXi1 and ESXi2. On the other hand, an NSX segment common to both
transport nodes would use different teaming policies on ESXi1 and ESXi2.

Transport Node Profiles


We have seen in FIGURE 3-6 above that the creation of a transport node requires an uplink
profile along with some host-specific information (name of the uplinks to be mapped to the
NSX teaming policy, transport VLAN etc.) Because of this host-specific information, each
transport node must be created individually.
There are scenarios where multiple hosts have the same characteristics. This is typically the
case of a vSphere cluster. A transport node profile (or TNP) is a template that includes the
combination of an uplink profile along with a set of host-specific parameters (similar to those
shown on FIGURE 3-6.) This template can then be applied directly to a vSphere cluster and will
turn all the hosts in the cluster into transport nodes, using the same uplink profile and
parameters. Even better, when a host is added to the cluster with an associated TNP, it is
automatically turned into a transport node according to the TNP specifications.
The TNP mechanism also allows making exceptions for hosts that deviate from a standard
configuration in the vSphere cluster. The administrator can put specific hosts in a “sub-cluster”
(an NSX object without any counterpart in vSphere), to which a more specific configuration
(called a “sub-TNP”) is applied. This is useful when a cluster spans multiple racks and the
transport nodes in this cluster must use different transport VLANs or acquire the IP addresses
for their tunnels from different IP pools.
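Conceptually, a TNP is just an uplink profile bundled with the host-specific parameters of Figure 3-6 and applied to every host of a cluster. The sketch below is an illustration of that idea only; the object and field names are assumptions, not the NSX API:

# Illustrative transport node profile; names are assumptions, not the NSX API schema.
tnp_production = {
    "uplink_profile": "UP1",
    "uplink_mapping": {"nsx-u1": "vds-u1", "nsx-u2": "vds-u2"},   # NSX uplinks -> VDS uplinks
    "transport_zones": ["Overlay-TZ", "VLAN-TZ"],
    "tep_ip_assignment": {"type": "ip_pool", "pool": "TEP-Pool-Rack1"},
}

def apply_tnp(cluster_hosts, tnp):
    # Every host of the cluster, including hosts added later, receives the same configuration.
    return {host: dict(tnp) for host in cluster_hosts}

transport_nodes = apply_tnp(["esxi1", "esxi2", "esxi3"], tnp_production)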

Network I/O Control v3


This vSphere feature allows managing traffic contention on the uplinks of an ESXi hypervisor.
NIOCv3 allows the creation of shares, limits, and bandwidth reservation for the different kinds
of ESXi infrastructure traffic.
• Shares: Shares, from 1 to 100, reflect the relative priority of a traffic type against the other
traffic types that are active on the same physical adapter.
• Reservation: The minimum bandwidth that must be guaranteed on a single physical
adapter. Reserved bandwidth for system traffic that is unused becomes available to other
types of system traffic. Unused system traffic reservations do NOT become available to VM
traffic.
• Limit: The maximum bandwidth that a traffic type can consume on a single physical adapter.
The pre-determined types of ESXi infrastructure traffic are:
• Management Traffic is for host management
• Fault Tolerance (FT) is for sync and recovery.


• NFS Traffic is traffic related to a file transfer in the network file system.
• vSAN traffic is generated by virtual storage area network.
• vMotion traffic is for computing resource migration.
• vSphere replication traffic is for replication.
• vSphere Data Protection Backup traffic is generated by backup of data.
• Virtual Machine traffic is generated by virtual machine workloads
• iSCSI traffic is for Internet Small Computer System Interface storage
The NIOCv3 configuration takes place directly in vCenter. In addition to system traffic
parameters, NIOCv3 provides an additional level of granularity for the VM traffic category:
share, reservation and limits can also be applied at the Virtual Machine vNIC level. Because
NIOCv3 operates before traffic is sent on the uplinks, the encapsulation of the packets (VLAN or
overlay) has no impact on the feature. NIOCv3 works the exact same way on the VDS whether it
is prepared for NSX or not.
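Shares only come into play under contention: each active traffic type receives a fraction of the physical adapter bandwidth proportional to its share value. The small arithmetic sketch below uses illustrative share values, not recommended settings:

# Conceptual illustration of NIOCv3 shares on a 25 Gbps physical adapter.
def bandwidth_by_shares(link_gbps, shares):
    # Only traffic types actively competing on the adapter are passed in "shares".
    total = sum(shares.values())
    return {traffic: round(link_gbps * value / total, 2) for traffic, value in shares.items()}

print(bandwidth_by_shares(25, {"vm": 100, "vsan": 50, "vmotion": 50}))
# -> {'vm': 12.5, 'vsan': 6.25, 'vmotion': 6.25}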

Virtual Switch Advanced Performance Settings


In NSX 4.1 multiple virtual-switch modes are available to allocate resources on the underlying
host CPU or DPU to the NSX network and security stack. Those modes are covered in chapter 9:
DPU-BASED ACCELERATION FOR NSX. The chapter covers non-DPU based modes too.


Logical Switching
This section on logical switching focuses on overlay backed segments due to their ability to
create isolated logical L2 networks with the same flexibility and agility that exists with virtual
machines. This decoupling of logical switching from the physical network infrastructure is one
of the main benefits of adopting NSX.

Overlay Backed Segments


FIGURE 3-8 presents logical and physical network views of a logical switching deployment.

Figure 3-8: Overlay Networking – Logical and Physical View

In the upper part of the diagram, the logical view consists of five virtual machines that are
attached to the same segment, forming a virtual broadcast domain. The physical
representation, at the bottom, shows that the five virtual machines are running on hypervisors
spread across three racks in a data center. Each hypervisor is an NSX transport node equipped
with a tunnel endpoint (TEP). The TEPs are configured with IP addresses, and the physical
network infrastructure just needs to provide IP connectivity between them. Whether the TEPs
are L2 adjacent in the same subnet or spread in different subnets does not matter. The
VMware® NSX Controller™ (not pictured) distributes the IP addresses of the TEPs across the
transport nodes so they can set up tunnels with their peers. The example shows “VM1” sending
a frame to “VM5”. In the physical representation, this frame is transported via an IP point-to-
point tunnel between transport nodes “HV1” to “HV5”.
The benefit of this NSX overlay model is that it allows direct connectivity between transport
nodes irrespective of the specific underlay inter-rack (or even inter-datacenter) connectivity
(i.e., L2 or L3). Segments can also be created dynamically without any configuration of the
physical network infrastructure.

Flooded Traffic
The NSX segment behaves like a LAN, providing the capability of flooding traffic to all the
devices attached to this segment; this is a cornerstone capability of layer 2. NSX does not
differentiate between the different kinds of frames replicated to multiple destinations.
Broadcast, unknown unicast, or multicast traffic will be flooded in a similar fashion across a
segment. In the overlay model, the replication of a frame to be flooded on a segment is
orchestrated by the different NSX components. NSX provides two different methods for
flooding traffic described in the following sections. They can be selected on a per segment
basis.

3.2.2.1 Head-End Replication Mode


In the head end replication mode, the transport node at the origin of the frame to be flooded
sends a copy to each other transport node that is connected to this segment.
FIGURE 3-9 offers an example of virtual machine “VM1” on hypervisor “HV1” attached to
segment “S1”. “VM1” sends a broadcast frame on “S1”. “HV1” floods the frame to the logical
ports local to “HV1”, then determines that there are remote transport nodes part of “S1”. The
NSX Controller advertised the TEPs of those remote interested transport nodes, so “HV1” will
send a tunneled copy of the frame to each of them.


Figure 3-9: Head-end Replication Mode

The diagram illustrates the flooding process from the hypervisor transport node where “VM1”
is located. “HV1” sends a copy of the frame that needs to be flooded to every peer that is
interested in receiving this traffic. Each green arrow represents the path of a point-to-point
tunnel through which the frame is forwarded. In this example, hypervisor “HV6” does not
receive a copy of the frame. This is because the NSX Controller has determined that there is no
recipient for this frame on that hypervisor.
In this mode, the burden of the replication rests entirely on the source hypervisor. Seven copies
of the tunnel packet carrying the frame are sent over the uplink of “HV1”. This should be
considered when provisioning the bandwidth on this uplink.

3.2.2.2 Two-tier Hierarchical Mode


In the two-tier hierarchical mode, transport nodes are grouped according to the subnet of the
IP address of their TEP. Transport nodes in the same rack typically share the same subnet for
their TEP IPs, though this is not mandatory. Based on this assumption, FIGURE 3-10 shows
hypervisor transport nodes classified in three groups: subnet 10.0.0.0, subnet 20.0.0.0 and
subnet 30.0.0.0. In this example, the IP subnet have been chosen to be easily readable; they are
not public IPs.


Figure 3-10: Two-tier Hierarchical Mode

Assume that “VM1” on “HV1” needs to send the same broadcast on “S1” as in the previous
section on head-end replication. Instead of sending an encapsulated copy of the frame to each
remote transport node attached to “S1”, the following process occurs:
1. “HV1” sends a copy of the frame to all the transport nodes within its group (i.e., with a
TEP in the same subnet as its TEP). In this case, “HV1” sends a copy of the frame to
“HV2” and “HV3”.
2. “HV1” sends a copy to a single transport node in each of the remote groups. For the
two remote groups - subnet 20.0.0.0 and subnet 30.0.0.0 – “HV1” selects an arbitrary
member of those groups and sends a copy of the packet there with a bit set to indicate
the need for local replication. In this example, “HV1” selected “HV5” and “HV7”.
3. Transport nodes in the remote groups perform local replication within their respective
groups. “HV5” relays a copy of the frame to “HV4” while “HV7” sends the frame to
“HV8” and “HV9”. Note that “HV5” does not relay to “HV6” as it is not interested in
traffic from “S1”.
The source hypervisor transport node knows about the groups based on the information it has
received from the NSX Controller. It does not matter which transport node is selected to
perform replication in the remote groups so long as the remote transport node is up and
available. If this were not the case (e.g., “HV7” was down), the NSX Controller would update all
transport nodes attached to “S1”. “HV1” would then choose “HV8” or “HV9” to perform the
replication local to group 30.0.0.0.


In this mode, as with the head-end replication example, seven copies of the flooded frame have
been made in software, though the cost of the replication has been spread across several
transport nodes. It is also interesting to understand the traffic pattern on the physical
infrastructure. The benefit of the two-tier hierarchical mode is that only two tunnel packets
(compared to five packets in the head-end mode) were sent between racks, one for each
remote group. This is a significant improvement in the network inter-rack (or inter-datacenter)
fabric utilization - where available bandwidth is typically less than within a rack. That number
could be higher still if there were more transport nodes interested in flooded traffic for
“S1” on the remote racks. In the case where the TEPs are in another data center, the savings
could be significant. Note also that this benefit in terms of traffic optimization provided by the
two-tier hierarchical mode only applies to environments where TEPs have their IP addresses in
different subnets. In a flat Layer 2 network, where all the TEPs have their IP addresses in the
same subnet, the two-tier hierarchical replication mode would lead to the same traffic pattern
as the source replication mode.
The default two-tier hierarchical flooding mode is recommended as a best practice as it
typically performs better in terms of physical uplink bandwidth utilization.
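The selection of replication targets in the two-tier hierarchical mode can be modeled in a few lines of Python. This is a conceptual model only: the subnet prefix length and the proxy selection are illustrative assumptions, not the NSX implementation.

# Conceptual model of two-tier hierarchical replication target selection.
import ipaddress

def replication_targets(source_tep, interested_teps, prefix_len=24):
    def group_of(ip):
        return ipaddress.ip_interface(f"{ip}/{prefix_len}").network
    local_group = group_of(source_tep)
    targets, remote_proxies = [], {}
    for tep in interested_teps:
        if tep == source_tep:
            continue
        if group_of(tep) == local_group:
            targets.append((tep, False))                   # direct copy, no replication bit
        else:
            remote_proxies.setdefault(group_of(tep), tep)  # one arbitrary proxy per remote group
    # Each remote proxy receives the frame with the "replicate locally" bit set.
    return targets + [(proxy, True) for proxy in remote_proxies.values()]

# HV1 (10.0.0.1) flooding on a segment whose interested TEPs match Figure 3-10:
print(replication_targets("10.0.0.1",
    ["10.0.0.1", "10.0.0.2", "10.0.0.3", "20.0.0.4", "20.0.0.5",
     "30.0.0.7", "30.0.0.8", "30.0.0.9"]))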

Unicast Traffic
When a frame is destined to an unknown MAC address, it is flooded in the network. Switches
typically implement a MAC address table, or filtering database (FDB), that associates MAC
addresses to ports in order to prevent flooding. When a frame is destined to a unicast MAC
address known in the MAC address table, it is only forwarded by the switch to the
corresponding port.
The NSX virtual switch maintains such a table for each segment/logical switch it is attached to.
A MAC address can be associated with either a virtual NIC (vNIC) of a locally attached VM or a
remote TEP (when the MAC address is located on a remote transport node reached via the
tunnel identified by that TEP).
FIGURE 3-11 illustrates virtual machine “Web3” sending a unicast frame to another virtual
machine “Web1” on a remote hypervisor transport node. In this example, the NSX virtual
switch on both the source and destination hypervisor transport nodes are fully populated.


Web1 (mac1) runs on HV1 (TEP1) and Web3 (mac3) runs on HV3 (TEP3); all VMs are attached to
the same segment. The MAC address table on HV1 contains mac1 → vnic1, mac2 → TEP2 and
mac3 → TEP3, while the table on HV3 contains mac1 → TEP1, mac2 → TEP2 and mac3 → vnic3.
The numbered steps 1 to 4 below correspond to the forwarding of a frame from Web3 to mac1.
Figure 3-11: Unicast Traffic between VMs

1. “Web3” sends a frame to “Mac1”, the MAC address of the vNIC of “Web1”.
2. “HV3” receives the frame and performs a lookup for the destination MAC address in its
MAC address table. There is a hit: “Mac1” is associated to “TEP1” on “HV1”.
3. “HV3” encapsulates the frame and sends it to “TEP1”.
4. “HV1” receives the tunnel packet, addressed to itself, and decapsulates it. TEP1 then
performs a lookup for the destination MAC of the original frame. “Mac1” is also a hit
there, pointing to the vNIC of “Web1”. The frame is then delivered to its final destination.
This mechanism is relatively straightforward because at layer 2 in the overlay network, all the
known MAC addresses are either local or directly reachable through a point-to-point tunnel.
In NSX, the MAC address tables can be populated by the NSX Controller or by learning from the
data plane. The benefit of data plane learning, further described in the next section, is that it is
immediate and does not depend on the availability of the control plane.
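The forwarding decision described above is simply a MAC table lookup whose result is either a local vNIC or a remote TEP. The following minimal sketch models the table of “HV3” from Figure 3-11; it is conceptual code, not the NSX data path:

# Conceptual MAC table lookup on transport node HV3 (values from Figure 3-11).
mac_table_hv3 = {
    "mac1": ("tep", "TEP1"),     # Web1, reachable through the tunnel to HV1
    "mac2": ("tep", "TEP2"),     # reachable through the tunnel to HV2
    "mac3": ("vnic", "vnic3"),   # Web3, local to HV3
}

def forward(dst_mac, mac_table):
    entry = mac_table.get(dst_mac)
    if entry is None:
        return "flood on the segment"                    # unknown unicast
    kind, target = entry
    if kind == "vnic":
        return f"deliver locally to {target}"
    return f"encapsulate in Geneve and tunnel to {target}"

print(forward("mac1", mac_table_hv3))    # -> encapsulate in Geneve and tunnel to TEP1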

Data Plane Learning


In a traditional layer 2 switch, MAC address tables are populated by associating the source MAC
addresses of frames received with the ports where they were received. In the overlay model,
instead of a port, MAC addresses reachable through a tunnel are associated with the TEP for
the remote end of this tunnel. Data plane learning is a matter of associating source MAC
addresses with source TEPs. Ideally data plane learning would occur through the NSX virtual
switch associating the source MAC address of received encapsulated frames with the source IP
of the tunnel packet. The mechanism, represented in Figure 3-12 below: in that example, mac
address mac1 has been received on a tunnel with a source IP address of “TEP1”. It is logical to
assume that the device using mac1 is running on the host with TEP1.


web1 (mac1) on HV1 floods a frame; HV2 receives it in a tunnel packet with outer source IP TEP1
and destination IP TEP2, and learns mac1 → TEP1 from the tunnel source.
Figure 3-12 Data Plane Learning Using Tunnel Source IP Address

But this common method used in traditional overlay networking would not work for NSX with
the two-tier replication model. Indeed, as shown in section TWO-TIER HIERARCHICAL MODE , it is
possible that flooded traffic gets replicated by an intermediate transport node. In that case, the
source IP address of the received tunneled traffic represents the intermediate transport node
instead of the transport node that originated the traffic. Figure 3-13 below illustrates this
problem by focusing on the flooding of a frame from VM1 on HV1 using the two-tier replication
model (similar to what was described earlier in FIGURE 3-10: TWO-TIER HIERARCHICAL Mode.) When
intermediate transport node HV5 relays the flooded traffic from HV1 to HV4, it is actually
decapsulating the original tunnel traffic and re-encapsulating it, using its own TEP IP address as
a source.
The problem is thus that, if the NSX virtual switch on “HV4” was using the source tunnel IP
address to identify the origin of the tunneled traffic, it would wrongly associate Mac1 to TEP5.
web1 (mac1) on HV1 (TEP1) floods a frame; intermediate node HV5 (TEP5) re-encapsulates it
toward HV4 (TEP4). Learning from the outer source IP, HV4 would incorrectly associate mac1
with TEP5.

Figure 3-13: Data plane learning from source tunnel IP leads to an incorrect association

To solve this problem, some metadata identifying the source TEP is inserted in the tunnel
header. Metadata is a piece of information that is carried along, unchanged, with the payload of
the tunnel. Figure 3-14 below illustrates the same tunneled frame from “Web1” on “HV1”, this
time carried with a metadata field identifying “TEP1” as the origin.

The same flooded frame from web1 (mac1) now carries the metadata “source TEP = TEP1” on
both tunnel hops (HV1 → HV5 and HV5 → HV4), so HV4 correctly associates mac1 with TEP1.

Figure 3-14: Data Plane Learning Leveraging Metadata

With this additional piece of information, “HV4” can correctly identify the origin of replicated
tunneled traffic and populate its MAC address table with the appropriate source TEP
IP address.
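The difference between learning from the outer source IP and learning from the metadata can be captured in a few lines; the packet representation below is a conceptual sketch, not the actual Geneve parsing code:

# Conceptual sketch: learning the source TEP from the Geneve metadata, not the outer IP.
def learn_source_tep(mac_table, tunnel_packet):
    inner_src_mac = tunnel_packet["inner_src_mac"]
    # The outer source IP may point at an intermediate replicating node (e.g. TEP5).
    outer_src_ip = tunnel_packet["outer_src_ip"]
    # The metadata carries the TEP that originated the frame (e.g. TEP1).
    origin_tep = tunnel_packet["metadata"]["source_tep"]
    mac_table[inner_src_mac] = origin_tep
    return outer_src_ip, origin_tep

table_hv4 = {}
packet_seen_by_hv4 = {"inner_src_mac": "mac1", "outer_src_ip": "TEP5",
                      "metadata": {"source_tep": "TEP1"}}
print(learn_source_tep(table_hv4, packet_seen_by_hv4), table_hv4)
# -> ('TEP5', 'TEP1') {'mac1': 'TEP1'}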


Tables Maintained by the NSX Controller


While NSX can populate the filtering database of a segment/logical switch from the data plane
just like traditional physical networking devices, the NSX Controller is also building a central
repository for some tables that enhances the behavior of the system. These tables include:
• Global MAC address to TEP table
• Global ARP table, associating MAC addresses to IP addresses

3.2.5.1 MAC Address to TEP Tables


When the vNIC of a VM is attached to a segment/logical switch, the NSX Controller is notified of
the MAC address as well as the TEP by which this MAC address is reachable. Unlike individual
transport nodes that only learn MAC addresses corresponding to received traffic, the NSX
Controller has a global view of all MAC addresses declared in the NSX environment.
The global MAC address table can proactively populate the local MAC address table of the
different transport nodes before they receive any traffic. Also, in the rare case when a transport
node receives a frame from a VM destined to an unknown MAC address, it will send a request
to look up this MAC address in the global table of the NSX Controller while simultaneously
flooding the frame.
Not all the MAC addresses present in the data plane tables are reported to the NSX Controller.
If a VM is allowed to send traffic on a segment/logical switch from several source MAC
addresses, those secondary MAC addresses are not pushed to the NSX Controller. Similarly, the
NSX Controller is not notified of MAC addresses learned from an Edge bridge connected to a
physical layer 2 network. This behavior was implemented in order to protect the NSX Controller
from an injection of an arbitrarily large number of MAC addresses into the network.

3.2.5.2 ARP Tables


The NSX Controller also maintains an ARP table in order to help implement an ARP suppression
mechanism. The NSX virtual switch snoops DHCP and ARP traffic to learn MAC address to IP
associations. Those associations are then reported to the NSX Controller. An example of the
process is summarized in FIGURE 3-15: ARP Suppression.


The Central Control Plane cluster holds the bindings IPA → mac vmA (reported by HV1 after
snooping vmA’s DHCP ack) and IPB → mac vmB (reported by HV2 after snooping vmB’s ARP
traffic). When vmB sends an ARP request for IPA, the virtual switch on HV2 queries the
controller and replies locally. The numbered steps 1 to 4 below correspond to this sequence.
Figure 3-15: ARP Suppression

1. Virtual machine “vmA” has just finished a DHCP request sequence and been assigned IP
address “IPA”. The NSX virtual switch on “HV1” reports the association of the MAC
address of virtual machine “vmA” to “IPA” to the NSX Controller.
2. Next, a new virtual machine “vmB” comes up on “HV2” that must communicate with
“vmA”, but its IP address has not been assigned by DHCP and, as a result, there has been
no DHCP snooping. The virtual switch will be able to learn this IP address by snooping
ARP traffic coming from “vmB”. Either “vmB” will send a gratuitous ARP when coming
up or it will send an ARP request for the MAC address of “vmA”. The virtual switch then
can derive the IP address “IPB” associated to “vmB”. The association (vmB -> IPB) is then
pushed to the NSX Controller.
3. The NSX virtual switch also holds the ARP request initiated by “vmB” and queries the
NSX Controller for the MAC address of “vmA”.
4. Because the MAC address of “vmA” has already been reported to the NSX Controller,
the NSX Controller can answer the request coming from the virtual switch, which can
now send an ARP reply directly to “vmB” on behalf of “vmA”. Thanks to this
mechanism, the expensive flooding of an ARP request has been eliminated. Note that if
the NSX Controller did not know about the MAC address of “vmA” or if the NSX
Controller were down, the ARP request from “vmB” would still be flooded by the virtual
switch.
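The decision logic behind ARP suppression can be summarized with the short, hedged sketch below (conceptual only, not the NSX implementation): answer locally when the controller already knows the binding, otherwise fall back to flooding.

# Conceptual sketch of ARP suppression on the NSX virtual switch.
controller_arp_table = {"IPA": "mac-vmA", "IPB": "mac-vmB"}   # learned via DHCP/ARP snooping

def handle_arp_request(target_ip, controller_table, controller_reachable=True):
    if controller_reachable and target_ip in controller_table:
        # Proxy reply on behalf of the target; the expensive flooding is avoided.
        return f"local ARP reply: {target_ip} is at {controller_table[target_ip]}"
    # Unknown binding or controller unavailable: flood the request on the segment.
    return "flood ARP request on the overlay segment"

print(handle_arp_request("IPA", controller_arp_table))            # suppressed
print(handle_arp_request("IPC", controller_arp_table))            # flooded
print(handle_arp_request("IPA", controller_arp_table, False))     # flooded (controller down)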

Overlay Encapsulation
NSX uses Generic Network Virtualization Encapsulation (Geneve) for its overlay model. Geneve,
standardized by the IETF as RFC 8926, builds on top of VXLAN concepts to provide
enhanced flexibility in terms of data plane extensibility.

Figure 3-16: Geneve Encapsulation (from the IETF specification)

VXLAN has static fields while Geneve offers flexible, extensible fields that can be used to adapt
the encapsulation to the needs of the workloads and of the overlay fabric. NSX tunnels are only
set up between NSX transport nodes, so NSX only needs efficient support for the Geneve
encapsulation by the NIC hardware; most NIC vendors support the same hardware offload for
Geneve as they would for VXLAN.
Network virtualization is all about developing a model of deployment that is applicable to a
variety of physical networks and diversity of compute domains. New networking features are
developed in software and implemented without worry of support on the physical
infrastructure. For example, the data plane learning section described how NSX relies on
metadata inserted in the tunnel header to identify the source TEP of a forwarded frame. This
metadata could not have been added to a VXLAN tunnel without either hijacking existing bits in
the VXLAN header or making a revision to the VXLAN specification. Geneve allows any vendor
to add its own metadata in the tunnel header with a simple Type-Length-Value (TLV) model.
NSX defines a single TLV, with fields for:


• Identifying the TEP that sourced a tunnel packet


• A version bit used during the intermediate state of an upgrade
• A bit indicating whether the encapsulated frame is to be traced
• A bit for implementing the two-tier hierarchical flooding mechanism. When a transport
node receives a tunneled frame with this bit set, it knows that it must perform local
replication to its peers
• Two bits identifying the type of the source TEP
These fields are part of a VMware-specific TLV. This TLV can be changed or enlarged by VMware
independently of any other vendors. Similarly, other vendors or partners can insert their own
TLVs. Geneve benefits from the same NIC offloads as VXLAN (the capability is advertised in the
VMware compatibility list for different NIC models.) Because overlay tunnels are only set up
between NSX transport nodes, there is no need for any third-party hardware or software
vendor to decapsulate or look into NSX Geneve overlay packets. Thus, networking feature
adoption can be done in the overlay, isolated from underlay hardware refresh cycles.
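The content of this VMware-specific TLV can be pictured as a small metadata record attached to each tunnel packet. The Python dataclass below only mirrors the fields listed above; the actual bit layout and field sizes are internal to NSX and are not reproduced here:

from dataclasses import dataclass

# Illustrative model of the NSX Geneve TLV content; bit layout and field sizes are assumptions.
@dataclass
class NsxGeneveMetadata:
    source_tep_id: int            # identifies the TEP that sourced the tunnel packet
    version_bit: bool             # used during the intermediate state of an upgrade
    trace_bit: bool               # marks the encapsulated frame for tracing
    local_replication_bit: bool   # receiver must replicate to its peers (two-tier flooding)
    source_tep_type: int          # two bits identifying the type of the source TEP

meta = NsxGeneveMetadata(source_tep_id=1, version_bit=False, trace_bit=False,
                         local_replication_bit=True, source_tep_type=0)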

Tunnel End Point High Availability


TEPs are associated with a physical uplink and transport nodes can have one or several TEPs,
depending on their default teaming policy. With a failover order teaming policy, traffic only
flows through a single active uplink and as a result, the transport node only needs a single TEP.
With source teaming policies (source mac address or source port ID), traffic will be distributed
across multiple active uplinks listed in the teaming policy and one TEP needs to be instantiated
for each of those physical uplinks. TEP high availability works the same, whether there is one or
multiple TEPs. When the link of the interface associated to a TEP goes down, the TEP migrates
to another interface in the teaming policy with a link up. The following diagram illustrates this
model. ESXi1 is using a source teaming policy (vmnic0 and vmnic1 active), resulting in the
creation of a TEP for each of its physical uplinks vmnic0 and vmnic1. ESXi2 is configured with a
failover order teaming policy (vmnic0 active, vmnic1 standby), and only the first physical uplink
listed in this policy (initially vmnic0) is associated to a TEP.


Figure 3-17: TEP high availability

When top of rack switch ToR1 fails, vmnic0’s link goes down on both ESXi1 and ESXi2. For both
ESXi hosts, the impacted TEP is migrated to vmnic1, the remaining operational uplink in the
teaming policy. Note that the TEP itself is never considered as failed; depending on the status of
its “preferred” interface, it might simply move to a secondary interface. This implies that all the
uplinks that are part of a teaming policy are expected to be layer 2 adjacent.
The following diagram goes a little bit further into the architecture of an ESXi host featuring
multiple TEPs. In this example, the virtual machines VM1/2/3/4 are attached to an overlay
segment. The virtual machines can communicate directly with each other within the host
using this segment. The source teaming policy also associates those four VMs to the different
TEPs: here VM1/VM2 are associated to TEP1, while VM3/VM4 are associated to TEP2. What
this VM to TEP association means is that traffic coming from, say, VM1 will be forwarded out of
the ESXi transport node on the overlay, using TEP1 as a source. Overlay traffic from remote
hosts toward VM1 will also be received by TEP1. This association is reflected in the MAC address
table for the segment, represented on the right side of the diagram. This table is synchronized to
all remote transport nodes.


Figure 3-18: VM to TEP association

The diagram in FIGURE 3-18 also shows how the MAC addresses of the TEPs are learnt in the
physical infrastructure. For example, top of rack switch ToR1 has learnt that TEP1’s MAC address
is on the downlink toward vmnic0 and that TEP2’s MAC address is reachable via the inter-switch link.
Now, let’s suppose the link between ToR2 and vmnic1 goes down. The PHY of vmnic1 declares
the link as failed, and TEP2 is immediately moved to vmnic0. To update the mac address tables
in the physical infrastructure, a RARP is sent on vmnic0, using TEP2’s mac address as a source.
This convergence is represented in FIGURE 3-19 below:


Figure 3-19: TEP HA behavior in case of an uplink failure

Following the flooding of the RARP packet, the mac address tables in the physical infrastructure
direct traffic for TEP2 directly toward the downlink of ToR1. This mechanism is extremely
efficient. As soon as the link of vmnic1 is declared down, the transmission of a single packet on
vmnic0 potentially recovers connectivity for all the VMs impacted by vmnic1’s failure. FIGURE
3-19 represents the updated mac addresses in the physical infrastructure in orange. Notice that
the mac address table of the segment, in the overlay, is left completely unaffected by the
reconvergence.

Multi-TEP High Availability


The uplink failure recovery provided by TEP HA is as efficient as can be, assuming that the link
failure is detected by the hardware. Multi-TEP high availability is an optional feature built on top
of TEP HA that covers some additional failure scenarios on ESXi hosts with multiple TEPs. More
specifically, multi-TEP HA will consider a TEP failed:
• when it cannot acquire an IP address via DHCP;
• when the physical interface associated to the TEP cannot send or receive traffic while its link
remains up.


The failed TEP will be replaced by a healthy one. This section details the failure detection, the
recovery mechanism involved and the way multi-TEP HA behaves when the initial failure
condition has cleared.
Multi-TEP HA is only available for ESXi transport nodes. It is not available for Edge Nodes.

3.2.8.1 Failure detection: TEP fails to acquire an IP address via DHCP


TEP IP addresses are used to establish overlay tunnels between transport nodes. As a result, a
TEP cannot function properly without an appropriate IP address. With NSX, TEPs can be
assigned an IP address in three different ways:
• Via explicit configuration, when setting up the transport node.
• Picking one available entry in an IP pool, specified when setting up the transport node.
• Dynamically, via the Dynamic Host Configuration Protocol (DHCP.)
With the latter option, it’s possible that a TEP initially gets a valid IP address and operates
successfully for an arbitrary period before failing when trying to renew its IP address via DHCP.
With multi-TEP HA, this very specific scenario will now lead to the failure of the TEP.

3.2.8.2 Failure detection: physical uplink up but not able to transmit/receive traffic
The TEPs of NSX transport nodes are running BFD (bidirectional forwarding detection) sessions
between each other. Those sessions are used to evaluate the health of the TEPs for the various
NSX dashboards. Those BFD sessions can also give an indication of the health of the physical
interface associated to the TEP. Once a TEP has established a BFD session with a peer, it
indirectly gets the indication that the interface to which it is associated has bidirectional
connectivity. When all previously established BFD sessions go down, the TEP can also safely
infer that its associated interface has failed, even if its physical link is still up.
FIGURE 3-20 below presents the scenario where the link between ToR2 and vmnic1 stops
forwarding traffic while remaining physically up. TEPs are running BFD sessions between them
at a 1 packet per second fixed rate, not configurable by the user. About 3 seconds following the
failure, all BFD sessions to TEP2 are down.


Figure 3-20: multi-TEP HA, failure detection

After all BFD sessions are down, multi-TEP HA adds another configurable delay before declaring
the TEP as failed.
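Put together, the detection logic amounts to: declare a TEP failed only when every previously established BFD session to it has been down for longer than the extra delay. A conceptual sketch, with illustrative timer values:

import time

# Conceptual sketch of multi-TEP HA failure detection; timer values are illustrative.
def tep_failed(bfd_sessions, extra_delay_s=60, now=None):
    # bfd_sessions: {peer_tep: (is_up, epoch_of_last_state_change)}
    now = time.time() if now is None else now
    if not bfd_sessions:
        return False                                  # nothing established yet, no verdict
    if any(is_up for is_up, _ in bfd_sessions.values()):
        return False                                  # at least one peer still reachable
    all_down_since = max(changed for _, changed in bfd_sessions.values())
    return (now - all_down_since) > extra_delay_s

sessions_of_tep2 = {"TEP-HV2": (False, 1000.0), "TEP-HV3": (False, 1003.0)}
print(tep_failed(sessions_of_tep2, extra_delay_s=60, now=1100.0))    # True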

3.2.8.3 Reconvergence
Let’s carry on using the example in FIGURE 3-20 above, where TEP2 has been declared as failed.
The virtual machines that had vNICs associated to this TEP2 need to move to TEP1, the only
remaining TEP available. Here, VM3 and VM4 are now going to switch to TEP1 to communicate
with the outside world.


Figure 3-21: Multi-TEP HA reconvergence

After moving the VMs to TEP1, there is no mac address change in the physical network: TEP1
and TEP2 are still associated to vmnic0 and vmnic1 respectively. However, some VMs that used
to be reachable via TEP2 are now behind TEP1 and the overlay mac address tables of all the
remote transport nodes need an update. This is achieved in two ways: one for a fast update,
the other for a reliable update:
• Using the data path, multiple RARP packets with source MAC addresses matching the
VMs’ vNIC MAC addresses are flooded on the overlay segment.
• The updated MAC address table is also advertised to the central control plane
component, which in turn propagates the changes to the remote transport nodes.
This MAC address table update is an expensive operation compared to the TEP HA equivalent,
which only required moving a single MAC address in the physical infrastructure. Furthermore, the
failure of a top of rack switch, as depicted in FIGURE 3-17 for example, could result in many hosts
simultaneously switching TEP for some of their VMs. But keep in mind that multi-TEP HA is a
secondary failure detection mechanism that covers some corner cases where TEP HA
would not converge at all. Most regular ToR failures are addressed by moving the TEP IP/MAC
to a different pNIC.

3.2.8.4 Failure recovery


Failing TEP2, instead of moving it to vmnic0, is an important design choice for the multi-TEP HA
feature. Because TEP2 remains associated to vmnic1, it can be used to re-establish BFD sessions
to remote peers and validate vmnic1’s availability. However, for TEP2 to even attempt
establishing BFD sessions with the remote transport nodes, our current implementation
requires some vNICs to be associated to TEP2. The recovery mechanism will thus require
moving the VMs that had been evacuated to TEP1 back to TEP2. This is a disruptive operation,
and furthermore, if the failure was due to a physical uplink problem, TEP2 might not be able to
come back to an operational state and VMs would have to be associated back to TEP1. Multi-
TEP HA thus provides different options for attempting to recover a failed TEP:
• Automatic recovery. After a configurable timer, an attempt is made to move the
evacuated VMs back to their original TEP. If the TEP fails to establish at least one BFD
session, the VMs are evacuated again. Additional periodic recovery attempts are made,
following an exponential back-off algorithm. This model has some implications for the
connectivity of the impacted VMs; its benefit is the capability of falling back to a
configuration where all the physical uplinks are used, without user intervention.
• Manual recovery. The user, by CLI or API, notifies that they want to attempt a recovery
of the affected TEP. This is the best approach for administrators who want to minimize
traffic disruption, as they can check the problem in the physical infrastructure is solved
before moving the VMs back. The drawback is that the VM traffic distribution across the
physical uplinks remains affected until this manual step is initiated.

3.2.8.5 Multi-TEP HA Summary


Multi-TEP HA is a new feature introduced in NSX 4.1.
• It is an optional feature that needs to be enabled explicitly.
• It only works on ESXi transport nodes with multiple TEPs.
• It builds on top of TEP HA and provides coverage for some specific failure scenarios:
o the TEP cannot retrieve an IP address via DHCP;
o the physical link associated to the TEP remains up while failing to transmit or receive
traffic.
• Multi-TEP HA convergence time is in the order of a few seconds, and the reconvergence is
relatively resource intensive, but it covers cases where TEP HA would not have converged at all.
• Recovery is disruptive; the administrator can choose between a preemptive mode and a non-
preemptive mode, where the recovery is manually triggered.
• TEP HA protects workloads connected to overlay networks. Workloads connected to
VLANs or the ESXi VMkernel traffic (Management, vMotion, vSAN) are not in scope.


Bridging Overlay to VLAN with the Edge Bridge


Even in highly virtualized environments, customers often have workloads that cannot be
virtualized, because of licensing or application-specific reasons. Even for virtualized workloads,
some applications have embedded IP addresses that cannot be changed, or are legacy
applications that require layer 2 connectivity. Those VLAN backed workloads typically communicate with
overlay backed VMs at layer 3, through gateways (Tier-0 or Tier-1) instantiated on the NSX
Edges. However, there are some scenarios where layer 2 connectivity is required between VMs
and physical devices. For this functionality, NSX introduces the NSX Bridge, a service that can be
instantiated on an Edge for the purpose of connecting an NSX logical segment with a traditional
VLAN at layer 2.
The most common use cases for this feature are:
• Physical to virtual / virtual to virtual migration. This is generally a temporary scenario where
a VLAN backed environment is being migrated to overlay backed NSX segments. The NSX Edge
Bridge is a simple way to maintain connectivity between the different components during
the intermediate stages of the migration process.

An Edge Bridge connects overlay-attached, virtualized workloads to VLAN-attached workloads
at layer 2, supporting both physical-to-virtual and virtual-to-virtual migration scenarios.

Figure 3-22: physical to virtual or virtual to virtual migration use case

• Integration of physical, non-virtualized appliances that require L2 connectivity to the


virtualized environment. The most common example is a database server that requires L2
connectivity, typically because L3 connectivity has not been validated and is not supported
by the vendor. This could also be the case of a service appliance that needs to be inserted
inline, like a physical firewall or load balancer.


An Edge Bridge provides layer 2 connectivity between overlay-attached virtual machines and
non-virtualized appliances sitting on a VLAN.

Figure 3-23: integration of non-virtualized appliances use case

Whether it is for migration purposes or for the integration of non-virtualized appliances, if L2
adjacency is not needed, leveraging a gateway on the Edges (L3 connectivity) is typically more
efficient, as routing allows for Equal Cost Multi-Pathing, which results in higher bandwidth and a
better redundancy model. A common misconception regarding the Edge Bridge is that a modern
SDN-based design must not use bridging. That is not the case: the Edge Bridge can be conceived
as a permanent solution for extending overlay-backed segments into VLANs. Permanent
bridging for a set of workloads remains a valid use case for a variety of reasons: an older
application cannot change its IP address, end-of-life gear does not allow any change, regulation,
third-party connectivity, or limited span of control over those topologies or devices. However,
an architect who wants to enable such a use case must plan for some level of dedicated
resources, such as bandwidth, operational control and protection of the bridged topologies.

Overview of the Capabilities


The following sections present the capabilities of the NSX Edge Bridge.

3.3.1.1 DPDK-based performance


One of the main benefits of running a Bridge on the NSX Edge is the data plane performance.
Indeed, the NSX Edge is leveraging the Data Plane Development Kit (DPDK), providing low
latency, high bandwidth and scalable traffic forwarding performance.

3.3.1.2 Extend an Overlay-backed Segment/Logical Switch to a VLAN


In its most simple representation, the only thing the NSX Edge Bridge achieves is to convert an
Ethernet frame between two different L2 representations: overlay and VLAN. In the overlay
representation, the L2 frame and its payload are encapsulated in an IP-based format (as
described above, NSX currently leverages Geneve, Generic Network Virtualization
Encapsulation). In the VLAN representation, the L2 frame may include an 802.1Q trunk VLAN
tag, depending on the desired connectivity model. In its most simple deployment, the Edge
Bridge is making a one-to-one association between an overlay-backed segment (identified by a
Virtual Network Identifier X in the overlay header) and a specific VLAN ID Y, as represented
below:

The Edge Bridge translates between the overlay encapsulation (VNI X in the Geneve header) and
VLAN ID Y, while the inner frame (source MAC, destination MAC and L2 payload) is preserved.

Figure 3-24: One-to-one association between segment and VLAN ID

The same segment can be attached to several bridges on different Edges. A common use case is
when non virtualized servers in the same broadcast domain exist in different racks. Adding an
edge bridge on each rack allow connecting those servers to the same segment without
requiring the physical infrastructure to extend a VLAN between racks. The Edge Bridge also
supports bridging 802.1Q tagged traffic carried in an overlay backed segment (Guest VLAN
Tagging.) For more information about this feature, see the BRIDGING WHITE PAPER on the
VMware communities website.
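Functionally, the bridge only rewrites the encapsulation while leaving the inner Ethernet frame untouched. The sketch below models the one-to-one VNI-to-VLAN translation with made-up identifiers (VNI 5001, VLAN 100); it is a conceptual illustration, not the Edge data path:

# Conceptual sketch of the Edge Bridge VNI <-> VLAN translation.
BRIDGE_MAP = {5001: 100}                  # illustrative mapping: VNI X -> VLAN Y

def overlay_to_vlan(geneve_packet):
    vlan_id = BRIDGE_MAP[geneve_packet["vni"]]
    # The inner L2 frame (MAC addresses and payload) is preserved as-is.
    return {"vlan_id": vlan_id, "frame": geneve_packet["inner_frame"]}

def vlan_to_overlay(tagged_frame):
    reverse_map = {vlan: vni for vni, vlan in BRIDGE_MAP.items()}
    return {"vni": reverse_map[tagged_frame["vlan_id"]], "inner_frame": tagged_frame["frame"]}

pkt = {"vni": 5001, "inner_frame": {"src": "mac-vm1", "dst": "mac-srv1", "payload": b"..."}}
print(overlay_to_vlan(pkt))               # -> VLAN 100 frame toward the physical workload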

3.3.1.3 High Availability with Bridge Instances


The Edge Bridge operates as an active/standby service. The Edge bridge active in the data path
is backed by a unique, pre-determined standby bridge on a different Edge. NSX Edges are
deployed in a pool called an Edge Cluster. Within an Edge Cluster, the user can create a Bridge
Profile, which essentially designates two Edges as the potential hosts for a pair of redundant
Bridges. The Bridge Profile specifies which Edge would be primary (i.e. the preferred host for
the active Bridge) and backup (the Edge that will host the backup Bridge). At the time of the
creation of the Bridge Profile, no Bridge is instantiated yet. The Bridge Profile is just a template
for the creation of one or several Bridge pairs.

Figure 3-25: Bridge Profile, defining a redundant Edge Bridge (primary and backup)


Once a Bridge Profile is created, the user can attach a segment to it. By doing so, an active
Bridge instance is created on the primary Edge, while a standby Bridge is provisioned on the
backup Edge. NSX creates a Bridge Endpoint object, which represents this pair of Bridges. The
attachment of the segment to the Bridge Endpoint is represented by a dedicated Logical Port,
as shown in the diagram below:

The segment is attached through a dedicated Logical Port to a Bridge Endpoint, which
represents the Bridge pair. Within the Edge Cluster, the Bridge Profile designates a primary Edge
(hosting the active Bridge) and a backup Edge (hosting the standby Bridge); the VLAN uplink
used by the Bridge is derived from a VLAN transport zone.

Figure 3-26: Primary Edge Bridge forwarding traffic between segment and VLAN

When associating a segment to a Bridge Profile, the user can specify the VLAN ID for the VLAN
traffic as well as the physical port that will be used on the Edge for sending/receiving this VLAN
traffic. At the time of the creation of the Bridge Profile, the user can also select the failover
mode. In the preemptive mode, the Bridge on the primary Edge will always become the active
bridge forwarding traffic between overlay and VLAN as soon as it is available, usurping the
function from an active backup. In the non-preemptive mode, the Bridge on the primary Edge
will remain standby should it become available when the Bridge on the backup Edge is already
active.
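The difference between the two failover modes only shows when the primary Edge comes back after a failure. The small sketch below captures that decision; it is conceptual and not the actual NSX state machine:

# Conceptual sketch of the Edge Bridge failover modes.
def active_bridge(primary_up, backup_currently_active, preemptive):
    if not primary_up:
        return "backup"                    # primary unavailable, backup carries the traffic
    if preemptive:
        return "primary"                   # primary reclaims the active role when it returns
    # Non-preemptive: avoid a disruptive switch-back while the backup is already active.
    return "backup" if backup_currently_active else "primary"

print(active_bridge(primary_up=True, backup_currently_active=True, preemptive=True))    # primary
print(active_bridge(primary_up=True, backup_currently_active=True, preemptive=False))   # backup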

3.3.1.4 Edge Bridge Firewall


The traffic leaving and entering a segment via a Bridge is subject to the Bridge Firewall. Rules
are defined on a per-segment basis and are defined for the Bridge as a whole, i.e. they apply to
the active Bridge instance, irrespective of the Edge on which it is running.


The Bridge Firewall filters the traffic crossing the active Bridge instance between the segment
and the VLAN, regardless of which Edge in the Bridge Profile currently hosts the active Bridge.
Figure 3-27: Edge Bridge Firewall

The firewall rules can leverage existing NSX grouping constructs, and there is currently a single
firewall section available for those rules.

3.3.1.5 Edge Bridge Teaming Policies


In a bridge configuration, when the NVDS carrying the VLAN traffic has more than one uplink, it
is possible to associate a named teaming policy to the bridging instance to control which uplink
will be used. The uplink policy directing the bridge VLAN traffic should have a single active
uplink and no standby uplinks. Any other teaming configuration is not supported and may lead
to traffic loss (SEE KB ARTICLE). The default teaming policy will govern the overlay traffic. When
load balancing based on source port ID is selected for the default teaming policy, the NVDS will
load balance the bridge overlay traffic per segment. Each segment will be associated with one
of the TEPs on the NVDS based on a hash value.
The figure below shows an example of a bare metal edge where named teaming policies are
applied to the VLAN traffic while the overlay traffic follows the default teaming policy. Load
balancing of the VLAN traffic over different pNICs is achieved by associating different named
teaming policies to the two bridging instances for the X and Y overlays. More details on the
recommended design for bare metal and edge node VMs are provided in chapter 7 (BRIDGING
USE CASE).


Figure 3-22: Edge Bridge Teaming Policies

3.3.1.6 Seamless integration with NSX gateways


This section assumes an understanding of Tier-0 and Tier-1 gateways; refer to the LOGICAL ROUTING
chapter for more details on Tier-0 and Tier-1 gateways.
Routing and bridging integrate seamlessly. Distributed routing is available to segments
extended to VLAN by a Bridge. The following diagram is a logical representation of a possible
configuration leveraging Tier-0 and Tier-1 gateways along with Edge Bridges.


Figure 3-23: Integration with routing

In the example above, VM1, VM2, and Physical Servers 1 and 2 have IP connectivity. Remarkably,
through the Edge Bridges, Tier-1 or Tier-0 gateways can act as default gateways for physical
devices. Note also that the distributed nature of NSX routing is not affected by the introduction
of an Edge Bridge. ARP requests from a physical workload for the IP address of an NSX router
acting as a default gateway will be answered by the local distributed router on the Edge where
the Bridge is active.


4 NSX Logical Routing


The logical routing capability in the NSX platform provides the ability to interconnect both
virtual and physical workloads deployed in different logical L2 networks. NSX enables the
creation of network elements like segments (Layer 2 broadcast domains) and gateways
(routers) in software as logical constructs and embeds them in the hypervisor layer, abstracted
from the underlying physical hardware. Since these network elements are logical entities,
multiple gateways can be created in an automated and agile fashion.
The previous chapter showed how to create segments; this chapter focuses on how gateways
provide connectivity between different logical L2 networks. FIGURE 4-1 shows both the logical and
physical view of a routed topology connecting segments/logical switches on multiple
hypervisors. Virtual machines "Web1" and "Web2" are connected to "overlay Segment 1" while
"App1" and "App2" are connected to "overlay Segment 2".

Figure 4-1: Logical and Physical View of Routing Services

In a data center, traffic is categorized as East-West (E-W) or North-South (N-S) based on the
origin and destination of the flow. When virtual or physical workloads in a data center
communicate with the devices external to the data center (e.g., WAN, Internet), the traffic is
referred to as North-South traffic. The traffic between workloads confined within the data
center is referred to as East-West traffic. In modern data centers, more than 70% of the traffic
is East-West.


In a multi-tiered application, the web tier needs to talk to the app tier and the app tier needs
to talk to the database tier, and these different tiers sit in different subnets. Every time a
routing decision is made, the packet is sent to the router. Traditionally, a centralized router
would provide routing for these different tiers. With VMs that are hosted on the same ESXi
hypervisor, traffic will leave the hypervisor multiple times to go to the centralized router for a
routing decision, then return to the same hypervisor; this is not optimal.
NSX is uniquely positioned to solve these challenges as it can bring networking closest to the
workload. Configuring a Gateway via NSX Manager instantiates a local distributed gateway on
each hypervisor. For the VMs hosted (e.g., "Web 1", "App 1") on the same hypervisor, the E-W
traffic does not need to leave the hypervisor for routing.

Single Tier Routing


NSX Gateway provides optimized distributed routing as well as centralized routing and services
like NAT, DHCP server etc. A single tier routing topology implies that a Gateway is connected to
segments southbound providing E-W routing and is also connected to physical infrastructure to
provide N-S connectivity. This gateway is referred to as Tier-0 Gateway.
Tier-0 Gateway consists of two components: distributed routing component (DR) and
centralized services routing component (SR).

Distributed Router (DR)


A DR is essentially a router with logical interfaces (LIFs) connected to multiple subnets. It runs
as a kernel module and is distributed in hypervisors across all transport nodes, including Edge
nodes. The traditional data plane functionality of routing and ARP lookups is performed by the
logical interfaces connecting to the different segments. Each LIF has a vMAC address and an IP
address being the default IP gateway for its connected segment. The IP address is unique per
LIF and remains the same everywhere the segment exists. The vMAC associated with each LIF
remains constant in each hypervisor, allowing the default gateway IP and MAC addresses to
remain the same during vMotion.
The left side of FIGURE 4-2 shows a logical topology with two segments attached to a Tier-0
Gateway: "Web Segment" with a default gateway of 172.16.10.1/24 and "App Segment" with a
default gateway of 172.16.20.1/24. In the physical topology view on the right, VMs
are shown on two hypervisors, "HV1" and "HV2". A distributed routing (DR) component for this
Tier-0 Gateway is instantiated as a kernel module and will act as a local gateway or first hop
router for the workloads connected to the segments. Please note that the DR is not a VM and
the DR on both hypervisors has the same IP addresses.


Figure 4-2: E-W Routing with Workloads on the same Hypervisor

East-West Routing - Distributed Routing with Workloads on the Same Hypervisor


In this example, "Web1" VM is connected to "Web Segment", "App1" is connected to "App
Segment", and both VMs are hosted on the same hypervisor. Since "Web1" and "App1" are
both hosted on hypervisor "HV1", routing between them happens on the DR located on that
same hypervisor.
FIGURE 4-3 presents the logical packet flow between two VMs on the same hypervisor.

Figure 4-3: Packet Flow between two VMs on same Hypervisor

1. "Web1" (172.16.10.11) sends a packet to "App1" (172.16.20.11). The packet is sent to the
default gateway interface (172.16.10.1) for "Web1" located on the local DR.
2. The DR on "HV1" performs a routing lookup which determines that the destination
subnet 172.16.20.0/24 is a directly connected subnet on "LIF2". A lookup is performed
in the "LIF2" ARP table to determine the MAC address associated with the IP address for
"App1". If the ARP entry does not exist, the controller is queried. If there is no response
from the controller, an ARP request is flooded to learn the MAC address of "App1".


3. Once the MAC address of “App1” is learned, the L2 lookup is performed in the local
MAC table to determine how to reach “App1” and the packet is delivered to the App1
VM.
4. The return packet from “App1” follows the same process and routing would happen
again on the local DR.
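The routing and ARP lookups performed by the DR in steps 2 and 3 can be pictured with the following conceptual sketch: a longest-prefix match against the DR's connected LIFs, followed by an ARP lookup on the egress LIF. It is illustrative only; the table contents and helper names are hypothetical and do not reflect the actual NSX data path implementation.

    # Conceptual sketch of the DR lookups in the packet walk above; not NSX code.
    from ipaddress import ip_address, ip_network

    # Hypothetical DR state: connected LIFs and a per-LIF ARP table.
    LIFS = {"LIF1": ip_network("172.16.10.0/24"), "LIF2": ip_network("172.16.20.0/24")}
    ARP = {"LIF2": {"172.16.20.11": "MAC-App1"}}

    def dr_route(dst_ip: str) -> str:
        dst = ip_address(dst_ip)
        # Longest-prefix match across the DR's connected subnets.
        matches = [(lif, net) for lif, net in LIFS.items() if dst in net]
        if not matches:
            return "no connected route: follow default route to the SR"
        lif, _ = max(matches, key=lambda m: m[1].prefixlen)
        # ARP resolution on the egress LIF (controller query, then flood, if missing).
        mac = ARP.get(lif, {}).get(dst_ip, "unresolved: query controller, then flood ARP")
        return f"egress {lif}, next-hop MAC {mac}"

    print(dr_route("172.16.20.11"))    # routed locally on the DR, egress LIF2
    print(dr_route("192.168.100.10"))  # no connected route: sent to the SR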
In this example, neither the initial packet from “Web1” to “App1” nor the return packet from
“App1” to “Web1” left the hypervisor.
East-West Routing - Distributed Routing with Workloads on Different Hypervisors
In this example, the target workload "App2" differs as it resides on a hypervisor named "HV2". If
"Web1" needs to communicate with "App2", the traffic would have to leave the hypervisor
"HV1" as these VMs are hosted on two different hypervisors. FIGURE 4-4 shows a logical view of the
topology, highlighting the routing decisions taken by the DR on "HV1" and the DR on "HV2".
When "Web1" sends traffic to "App2", routing is done by the DR on "HV1". The reverse traffic
from "App2" to "Web1" is routed by the DR on "HV2". Routing is performed on the hypervisor
attached to the source VM.

Figure 4-4: E-W Packet Flow between two Hypervisors

FIGURE 4-5 shows the corresponding physical topology and packet walk from "Web1" to "App2".


Figure 4-5: End-to-end E-W Packet Flow

1. "Web1" (172.16.10.11) sends a packet to "App2" (172.16.20.12). The packet is sent to the
default gateway interface (172.16.10.1) for "Web1" located on the local DR. Its L2
header has the source MAC as "MAC1" and destination MAC as the vMAC of the DR.
This vMAC will be the same for all LIFs.
2. The routing lookup happens on the HV1 DR, which determines that the destination
subnet 172.16.20.0/24 is a directly connected subnet on “LIF2”. A lookup is performed
in the “LIF2” ARP table to determine the MAC address associated with the IP address for
“App2”. This destination MAC, “MAC2”, is learned via the remote HV2 TEP 20.20.20.20.
3. HV1 TEP encapsulates the original packet and sends it to the HV2 TEP with a source IP
address of 10.10.10.10 and destination IP address of 20.20.20.20 for the encapsulating
packet. The destination virtual network identifier (VNI) in the Geneve encapsulated
packet belongs to "App Segment".
4. HV2 TEP 20.20.20.20 decapsulates the packet, removing the outer header upon
reception. It performs an L2 lookup in the local MAC table associated with “LIF2”.
5. Packet is delivered to “App2” VM.
The return packet from “App2” destined for “Web1” goes through the same process. For the
return traffic, the routing lookup happens on the HV2 DR. This represents the normal behavior
of the DR, which is to always perform routing on the DR instance running in the kernel of the
hypervisor hosting the workload that initiates the communication. After the routing lookup, the
packet is encapsulated by the HV2 TEP and sent to the remote HV1 TEP. The HV1 TEP decapsulates
the Geneve packet and delivers the inner frame from "App2" to "Web1".
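Conceptually, the encapsulation performed by the TEPs wraps the original frame in an outer header whose IP addresses are the source and destination TEPs and whose VNI identifies the destination segment. The sketch below is a simplified data model for illustration only; a real Geneve packet also carries outer Ethernet/UDP headers and optional metadata, and the VNI value shown is hypothetical.

    # Simplified model of the Geneve encapsulation described above; illustrative only.
    from dataclasses import dataclass

    @dataclass
    class InnerFrame:
        src_mac: str
        dst_mac: str
        payload: str

    @dataclass
    class GenevePacket:
        outer_src_ip: str   # source TEP (e.g., HV1)
        outer_dst_ip: str   # destination TEP (e.g., HV2)
        vni: int            # identifies the destination segment ("App Segment")
        inner: InnerFrame   # the original frame is carried unmodified

    # HV1 TEP encapsulates the routed frame and sends it to the HV2 TEP.
    pkt = GenevePacket(
        outer_src_ip="10.10.10.10",
        outer_dst_ip="20.20.20.20",
        vni=69001,  # hypothetical VNI of "App Segment"
        inner=InnerFrame(src_mac="vMAC", dst_mac="MAC2",
                         payload="IP 172.16.10.11 -> 172.16.20.12"),
    )
    print(pkt)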


Services Router
East-West routing is completely distributed in the hypervisor, with each hypervisor in the
transport zone running a DR in its kernel. However, some NSX services are not distributed,
due to their locality or stateful nature, such as:
• Physical infrastructure connectivity (BGP/OSPF/Static Routing/BFD)
• NAT
• DHCP server
• VPN
• Gateway Firewall
• Bridging
• Service Interface
• Metadata Proxy for OpenStack
A services router (SR) is instantiated on an edge cluster when a service is enabled that cannot
be distributed on a gateway.
A centralized pool of capacity is required to run these services in a highly available and scaled-
out fashion. The appliances where the centralized services or SR instances are hosted are called
Edge nodes. An Edge node is the appliance that provides connectivity to the physical
infrastructure.
Left side of FIGURE 4-6 shows the logical view of a Tier-0 Gateway showing both DR and SR
components when connected to a physical router. Right side of FIGURE 4-6 shows how the
components of Tier-0 Gateway are realized on Compute hypervisor and Edge node. Note that
the compute host (i.e. HV1) has just the DR component and the Edge node shown on the right
has both the SR and DR components. SR/DR forwarding table merge has been done to address
future use-cases. SR and DR functionality remains the same after SR/DR merge in NSX 2.4
release, but with this change SR has direct visibility into the overlay segments. Notice that all
the overlay segments are attached to the SR as well.


Figure 4-6: Logical Router Components and Interconnection

A Tier-0 Gateway can have following interfaces:


• External Interface – Interface connecting to the physical infrastructure/router. Static and
dynamic routing are supported on this interface. This interface was referred to as uplink
interface in previous releases. This interface can also be used to extend a VRF (Virtual
routing and forwarding instance) from the physical networking fabric into the NSX
domain.
• Service Interface: Interface connecting VLAN segments to provide connectivity and
services to VLAN backed physical or virtual workloads. The service interface can also be
connected to overlay segments for Tier-1 standalone load balancer use-cases explained
in the load balancer Chapter 6. This interface was referred to as centralized service port
(CSP) in previous releases. Note that a gateway must have an SR component to realize a
service interface. NSX 3.0 does not support dynamic routing over this interface; static
routing is supported.
• Linked Segments – Interface connecting to an overlay segment. This interface was
referred to as downlink interface in previous releases. Static routing is supported over
this interface.
• Intra-Tier Transit Link (Internal link between the DR and SR). A transit overlay segment is
auto plumbed between DR and SR and each end gets an IP address assigned in
169.254.0.0/24 subnet by default. This address range is configurable only when creating
the Tier-0 gateway.

As mentioned previously, connectivity between DR on the compute host and SR on the Edge
node is auto plumbed by the system. Both the DR and SR get an IP address assigned in
169.254.0.0/24 subnet by default. The management plane also configures a default route on
the DR with the next hop IP address of the SR’s intra-tier transit link IP. This allows the DR to
take care of E-W routing while the SR provides N-S connectivity.


North-South Routing by SR Hosted on Edge Node


From a physical topology perspective, workloads are hosted on hypervisors and N-S
connectivity is provided by Edge nodes. If a device external to the data center needs to
communicate with a virtual workload hosted on one of the hypervisors, the traffic would have
to come to the Edge nodes first. This traffic will then be sent on an overlay network to the
hypervisor hosting the workload. FIGURE 4-7 shows the traffic flow from a VM in the data center
to an external physical infrastructure.

Figure 4-7: N-S Routing Packet Flow

FIGURE 4-8 shows a detailed packet walk from data center VM “Web1” to a device on the L3
physical infrastructure. As discussed in the E-W routing section, routing always happens closest
to the source. In this example, eBGP peering has been established between the physical router
interface with the IP address 192.168.240.1 and the Tier-0 Gateway SR component hosted on
the Edge node with an external interface IP address of 192.168.240.3. Tier-0 Gateway SR has a
BGP route for 192.168.100.0/24 prefix with a next hop of 192.168.240.1 and the physical router
has a BGP route for 172.16.10.0/24 with a next hop of 192.168.240.3.


Figure 4-8: End-to-end Packet Flow – Application “Web1” to External

1. "Web1" (172.16.10.11) sends a packet to 192.168.100.10. The packet is sent to the
"Web1" default gateway interface (172.16.10.1) located on the local DR.
2. The packet is received on the local DR. DR doesn’t have a specific connected route for
192.168.100.0/24 prefix. The DR has a default route with the next hop as its
corresponding SR, which is hosted on the Edge node.
3. The HV1 TEP encapsulates the original packet and sends it to the Edge node TEP with a
source IP address of 10.10.10.10 and destination IP address of 30.30.30.30.
4. The Edge node is also a transport node. It will encapsulate/decapsulate the traffic sent
to or received from compute hypervisors. The Edge node TEP decapsulates the packet,
removing the outer header prior to sending it to the SR.
5. The SR performs a routing lookup and determines that the route 192.168.100.0/24 is
learned via external interface with a next hop IP address 192.168.240.1.
6. The packet is sent on the VLAN segment to the physical router and is finally delivered to
192.168.100.10.
Observe that routing and ARP lookup happens on the DR hosted on the HV1 hypervisor to
determine that the packet must be sent to the SR. On the edge node, the packet is directly sent
to the SR after the tunnel encapsulation has been removed.
FIGURE 4-9 follows the packet walk for the reverse traffic from an external device to “Web1”.


Figure 4-9: End-to-end Packet Flow – External to Application “Web1”

1. An external device (192.168.100.10) sends a packet to "Web1" (172.16.10.11). The
packet is routed by the physical router and sent to the external interface of Tier-0
Gateway hosted on Edge node.
2. A single routing lookup happens on the Tier-0 Gateway SR which determines that
172.16.10.0/24 is a directly connected subnet on “LIF1”. A lookup is performed in the
“LIF1” ARP table to determine the MAC address associated with the IP address for
“Web1”. This destination MAC “MAC1” is learned via the remote TEP (10.10.10.10),
which is the “HV1” host where “Web1” is located.
3. The Edge TEP encapsulates the original packet and sends it to the remote TEP with an
outer packet source IP address of 30.30.30.30 and destination IP address of 10.10.10.10.
The destination VNI in this Geneve encapsulated packet is of “Web Segment”.
4. The HV1 host decapsulates the packet and removes the outer header upon receiving the
packet. An L2 lookup is performed in the local MAC table associated with “LIF1”.
5. The packet is delivered to Web1.
This time, routing and ARP lookup happened on the merged SR/DR hosted on the Edge node. No
such lookup was required on the DR hosted on the HV1 hypervisor, and the packet was sent directly
to the VM after removing the tunnel encapsulation header.
FIGURE 4-9 showed a Tier-0 gateway with one external interface that leverages an Edge node to
connect to the physical infrastructure. If this Edge node goes down, N-S connectivity, along with
other centralized services running on the Edge node, will go down as well.
To provide redundancy for centralized services and N-S connectivity, it is recommended to
deploy a minimum of two Edge nodes. High availability modes are discussed in the section
SERVICES HIGH AVAILABILITY.


Two-Tier Routing
In addition to providing optimized distributed and centralized routing functions, NSX supports a
multi-tiered routing model with logical separation between different gateways within the NSX
infrastructure. The top-tier gateway is referred to as a Tier-0 gateway while the bottom-tier
gateway is a Tier-1 gateway. This structure gives complete control and flexibility over services
and policies. Various stateful services can be hosted on the Tier-1 while the Tier-0 can operate
in an active-active manner.
Configuring two tier routing is not mandatory. It can be single tiered as shown in the previous
section. FIGURE 4-10 presents an NSX two-tier routing architecture.

Figure 4-10: Two Tier Routing and Scope of Provisioning

Northbound, the Tier-0 gateway connects to one or more physical routers/L3 switches and
serves as an on/off ramp to the physical infrastructure. Southbound, the Tier-0 gateway
connects to one or more Tier-1 gateways or directly to one or more segments as shown in
North-South routing section. Northbound, the Tier-1 gateway connects to a Tier-0 gateway
using a RouterLink port. Southbound, it connects to one or more segments using downlink
interfaces.
The concepts of DR/SR discussed in SECTION 4.1 remain the same for multi-tiered routing. Like
the Tier-0 gateway, when a Tier-1 gateway is created, a distributed component (DR) of the Tier-1
gateway is intelligently instantiated on the hypervisors and Edge nodes. Before enabling a
centralized service on a Tier-0 or Tier-1 gateway, an edge cluster must be configured on this
gateway. Configuring an edge cluster on a Tier-1 gateway instantiates a corresponding Tier-1
services component (SR) on two Edge nodes that are part of this edge cluster. Configuring an Edge
cluster on a Tier-0 gateway does not automatically instantiate a Tier-0 service component (SR);
the service component (SR) will only be created on a specific Edge node along with the external
interface creation.

Unlike the Tier-0 gateway, the Tier-1 gateway does not support northbound connectivity to the
physical infrastructure. A Tier-1 gateway can only connect northbound to:
• a Tier-0 gateway
• a service interface, which is used to connect a one-arm load balancer to a segment. More
details are available in CHAPTER 6.
Note that connecting a Tier-1 to a Tier-0 gateway is a one-click or single API call configuration,
regardless of the components instantiated (DR and SR) for that gateway.
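As an illustration of that single API call, the sketch below patches a Tier-1 gateway with a tier0_path using the declarative Policy API. It is a hedged example only: the URI and field names follow the common Policy API pattern but should be verified against the NSX API documentation for the release in use, and the manager address, credentials, and object IDs are placeholders.

    # Hedged sketch: attaching a Tier-1 gateway to a Tier-0 gateway with one Policy API call.
    # Verify the exact URI and payload against the NSX API documentation for your release.
    import requests

    NSX = "https://nsx-mgr.example.com"   # placeholder manager address
    AUTH = ("admin", "password")          # placeholder credentials
    TIER1_ID = "t1-tenant1"               # placeholder Tier-1 gateway ID

    body = {
        "display_name": "t1-tenant1",
        "tier0_path": "/infra/tier-0s/t0-gw",   # placeholder Tier-0 path
    }

    resp = requests.patch(
        f"{NSX}/policy/api/v1/infra/tier-1s/{TIER1_ID}",
        json=body,
        auth=AUTH,
        verify=False,  # lab only; use proper certificate validation in production
    )
    resp.raise_for_status()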

Interface Types on Tier-1 and Tier-0 Gateway


External and Service interfaces were previously introduced in the services router section. Figure
4-11 shows these interface types along with a new "RouterLink" interface in a two-tiered
topology.

Figure 4-11: Anatomy of Components with Logical Routing

• External Interface: Interface connecting to the physical infrastructure/router. Static
routing and BGP are supported on this interface. This interface only exists on Tier-0
gateway. This interface was referred to as Uplink interface in previous releases. This
interface type will also be used to extend a VRF (Virtual Routing and Forwarding) from
the physical networking fabric into the NSX domain.


• Router Link Interface/Linked Port: Interface connecting Tier-0 and Tier-1 gateways. Each
Tier-0-to-Tier-1 peer connection is provided a /31 subnet within the 100.64.0.0/16
reserved address space (RFC6598). This link is created automatically when the Tier-0 and
Tier-1 gateways are connected. This subnet can be changed when the Tier-0 gateway is
being created. It is not possible to change it afterward.
• Service Interface: Interface connecting VLAN segments to provide connectivity to VLAN
backed physical or virtual workloads. The service interface can also be connected to
overlay/VLAN segments for standalone load balancer use cases explained in the load
balancer Chapter 6. The Service Interface supports static routing. It is supported on both Tier-0 and
Tier-1 gateways configured in Active/Standby high-availability configuration mode,
explained in section 4.6.2. Note that a Tier-0 or Tier-1 gateway must have an SR
component to realize service interfaces. This interface was referred to as centralized
service interface in previous releases. Dynamic routing is not supported on Service
Interfaces.
• Loopback: The Tier-0 gateway supports loopback interfaces. A loopback interface is a
virtual interface, and it can be redistributed into a routing protocol.

Route Types on Tier-0 and Tier-1 Gateways


There is no dynamic routing between Tier-0 and Tier-1 gateways. The NSX platform takes care
of the auto-plumbing between Tier-0 and Tier-1 gateways. The following list details route types
on Tier-0 and Tier-1 gateways.
• Tier-0 Gateway:
o Connected – Connected routes on Tier-0 include external interface subnets, service
interface subnets, loopback and segment subnets connected to Tier-0. In FIGURE 4-12,
172.16.20.0/24 (Connected segment), 192.168.20.0/24 (Service Interface) and
192.168.240.0/24 (External interface) are connected routes for the Tier-0 gateway.
o Static – User configured static routes on Tier-0.
o NAT IP – NAT IP addresses owned by the Tier-0 gateway discovered from NAT rules
configured on Tier-0 Gateway.
o BGP – Routes learned via BGP neighbors.
o IPsec Local IP – Local IPsec endpoint IP address for establishing VPN sessions.
o DNS Forwarder IP – Listener IP for DNS queries from clients. Also used as the source
IP to forward DNS queries to the upstream DNS server.
o Inter-SR – SRs of the same Tier-0 gateway in the same edge cluster will create an
automatic iBGP peering adjacency between them to exchange routing information.
This is only supported with Active/Active topologies and with NSX
Federation.


• Tier-1 Gateway
o Connected – Connected routes on Tier-1 include segment subnets connected to Tier-
1 and service interface subnets configured on the Tier-1 gateway. In FIGURE 4-12,
172.16.10.0/24 (Connected segment) and 192.168.10.0/24 (Service
Interface) are connected routes for the Tier-1 gateway.
o Static– User configured static routes on Tier-1 gateway.
o NAT IP – NAT IP addresses owned by the Tier-1 gateway discovered from NAT rules
configured on the Tier-1 gateway.
o LB VIP – IP address of load balancing virtual server.
o LB SNAT – IP address or a range of IP addresses used for Source NAT by load balancer.
o IPsec Local IP – Local IPsec endpoint IP address for establishing VPN sessions.
o DNS Forwarder IP – Listener IP for DNS queries from clients. Also used as the source
IP to forward DNS queries to the upstream DNS server.

Route Advertisement on the Tier-1 and Tier-0 Logical Router


The Tier-0 gateway could use static routing or dynamic routing to connect to the physical
routers. The Tier-1 gateway cannot connect to physical routers directly; it must connect to a
Tier-0 gateway to provide N-S connectivity to the subnets attached to it. When a Tier-1 gateway
is connected to a Tier-0 gateway, a default route is automatically created on the Tier-1. That
default route is pointing to the RouterLink IP address that is owned by the Tier-0.
FIGURE 4-12 explains the route advertisement on both the Tier-1 and Tier-0 gateway.


Figure 4-12: Routing Advertisement

“Tier-1 Gateway” advertises connected routes to Tier-0 Gateway. Figure 4-12 shows an
example of connected routes (172.16.10.0/24 and 192.168.10.0/24). If there are other route
types, like NAT IP etc. as discussed in section 4.2.2, a user can advertise those route types as
well. As soon as "Tier-1 Gateway" is connected to "Tier-0 Gateway", the management plane
configures a default route on "Tier-1 Gateway" with the next hop IP address being the RouterLink
interface IP of "Tier-0 Gateway", i.e. 100.64.224.0/31 in the example above.

Tier-0 Gateway sees 172.16.10.0/24 and 192.168.10.0/24 as Tier-1 Connected routes (t1c) with
a next hop of 100.64.224.1/31. Tier-0 Gateway also has Tier-0 "Connected" routes
(172.16.20.0/24) in Figure 4-12.
Northbound, “Tier-0 Gateway” redistributes the Tier-0 connected and Tier-1 connected routes
in BGP and advertises these routes to its BGP neighbor, the physical router.

Fully Distributed Two Tier Routing


NSX provides a fully distributed routing architecture. The motivation is to provide routing
functionality closest to the source. NSX leverages the same distributed routing architecture
discussed in distributed router section and extends that to multiple tiers.


FIGURE 4-13 shows both logical and per transport node views of two Tier-1 gateways serving
two different tenants and a Tier-0 gateway. Per transport node view shows that the distributed
component (DR) for Tier-0 and the Tier-1 gateways have been instantiated on two hypervisors.

Figure 4-13: Logical Routing Instances

If "VM1" in tenant 1 needs to communicate with "VM3" in tenant 2, routing happens locally on
hypervisor "HV1". This eliminates the need to send traffic to a centralized location to route
between different tenants or environments.
Multi-Tier Distributed Routing with Workloads on the same Hypervisor
The following list provides a detailed packet walk between workloads residing in different
tenants but hosted on the same hypervisor.
1. “VM1” (172.16.10.11) in tenant 1 sends a packet to “VM3” (172.16.201.11) in tenant 2.
The packet is sent to its default gateway interface located on tenant 1, the local Tier-1
DR.
2. Routing lookup happens on the tenant 1 Tier-1 DR and the packet is routed to the Tier-0
DR following the default route. This default route has the RouterLink interface IP
address (100.64.224.0/31) as a next hop.
3. Routing lookup happens on the Tier-0 DR. It determines that the 172.16.201.0/24
subnet is learned from the tenant 2 Tier-1 DR (100.64.224.3/31) and the packet is
routed there.
4. Routing lookup happens on the tenant 2 Tier-1 DR. This determines that the
172.16.201.0/24 subnet is directly connected. L2 lookup is performed in the local MAC
table to determine how to reach “VM3” and the packet is sent.
The reverse traffic from "VM3" follows a similar process. A packet from "VM3" to destination
172.16.10.11 is sent to the tenant-2 Tier-1 DR, then follows the default route to the Tier-0 DR.
The Tier-0 DR routes this packet to the tenant 1 Tier-1 DR and the packet is delivered to "VM1".
During this process, the packet never left the hypervisor to be routed between tenants.


Multi-Tier Distributed Routing with Workloads on different Hypervisors


Figure 4-14 shows the packet flow between workloads in different tenants which are also
located on different hypervisors.

Figure 4-14: Logical routing end-to-end packet Flow between hypervisor

The following list provides a detailed packet walk between workloads residing in different
tenants and hosted on the different hypervisors.
1. “VM1” (172.16.10.11) in tenant 1 sends a packet to “VM2” (172.16.200.11) in tenant 2.
VM1 sends the packet to its default gateway interface located on the local Tier-1 DR in
HV1.
2. Routing lookup happens on the tenant 1 Tier-1 DR and the packet follows the default
route to the Tier-0 DR with a next hop IP of 100.64.224.0/31.
3. Routing lookup happens on the Tier-0 DR which determines that the 172.16.200.0/24
subnet is learned via the tenant 2 Tier-1 DR (100.64.224.3/31) and the packet is routed
accordingly.
4. Routing lookup happens on the tenant 2 Tier-1 DR which determines that the
172.16.200.0/24 subnet is a directly connected subnet. A lookup is performed in ARP
table to determine the MAC address associated with the “VM2” IP address. This
destination MAC is learned via the remote TEP on hypervisor “HV2”.
5. The “HV1” TEP encapsulates the packet and sends it to the “HV2” TEP, finally leaving the
host.
6. The "HV2" TEP decapsulates the packet and recognizes the VNI in the Geneve header. An
L2 lookup is performed in the local MAC table associated with the LIF where "VM2" is
connected.
7. The packet is delivered to “VM2”.


The return packet follows the same process. A packet from “VM2” gets routed to the local
hypervisor Tier-1 DR and is sent to the Tier-0 DR. The Tier-0 DR routes this packet to tenant 1
Tier-1 DR which performs the L2 lookup to find out that the MAC associated with “VM1” is on
remote hypervisor “HV1”. The packet is encapsulated by “HV2” and sent to “HV1”, where this
packet is decapsulated and delivered to “VM1". It is important to notice that in this use case,
routing is performed locally on the hypervisor hosting the VM sourcing the traffic.

Routing Capabilities
NSX supports static routing as well as dynamic routing protocols to provide connectivity to
IPv4 and IPv6 workloads.

Static routing
In a multi-tier routing architecture, as described previously, NSX automatically creates the
RouterLink segment, ports, and static routes to interconnect a Tier-0 gateway with one or
several Tier-1 gateways.
By default, the RouterLink segment is in the 100.64.0.0/16 IPv4 address range which is a shared
IPv4 address space that is compliant with the RFC 6598. A default static route pointing to the
Tier-0 gateway DR RouterLink port is automatically installed on the Tier-1 gateway routing
table.
Northbound, static routes can be configured on Tier-1 gateways with the next hop IP as the
RouterLink IP of the Tier-0 gateway (100.64.0.0/16 range or a range defined by the user for the
RouterLink interface when the Tier-0 gateway is being created). Southbound, static routes can
also be configured on Tier-1 gateways with a next hop as a layer 3 device reachable via a service
or downlink interface.
Tier-0 gateways can be configured with a static route toward external subnets with a next hop
IP of the physical router. Southbound, static routes can be configured on Tier-0 gateways with a
next hop of a layer 3 device reachable via Service interface.
ECMP is supported with static routes to provide load balancing, increased bandwidth, and fault
tolerance for failed paths or Edge nodes. Figure 4-15 shows a Tier-0 gateway with two
external interfaces leveraging Edge nodes EN1 and EN2, connected to two physical routers. Two
equal cost static default routes are configured for ECMP on the Tier-0 Gateway. Up to eight paths are
supported in ECMP.
Tier-0 northbound static routes can be protected via BFD. BFD timers depend on the Edge node
type. Bare metal Edge supports a minimum of 50ms TX/RX BFD keep alive timer while the VM
form factor Edge supports a minimum of 500ms TX/RX BFD keep alive timer.
It is recommended to always implement BFD when configuring static routing between the Tier-
0 gateway and the physical network. BFD will detect upstream physical network failures in
the absence of a dynamic routing protocol.
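As an illustration of the ECMP static routing described above, the hedged sketch below defines a default route with two equal-cost next hops on a Tier-0 gateway through the Policy API. The URI and field names follow the common Policy API pattern and should be verified against the NSX API documentation for the release in use; the manager address, credentials, object IDs, and next-hop addresses are placeholders, and BFD for static routes is configured separately.

    # Hedged sketch: two equal-cost static default routes (ECMP) on a Tier-0 gateway.
    # Verify the exact Policy API paths and fields for your NSX release; values are placeholders.
    import requests

    NSX = "https://nsx-mgr.example.com"   # placeholder manager address
    AUTH = ("admin", "password")          # placeholder credentials

    route = {
        "network": "0.0.0.0/0",
        "next_hops": [
            {"ip_address": "192.168.240.1", "admin_distance": 1},  # physical router 1
            {"ip_address": "192.168.250.1", "admin_distance": 1},  # physical router 2
        ],
    }

    resp = requests.patch(
        f"{NSX}/policy/api/v1/infra/tier-0s/t0-gw/static-routes/default-route",
        json=route,
        auth=AUTH,
        verify=False,  # lab only; use proper certificate validation in production
    )
    resp.raise_for_status()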


First hop redundancy protocols on the physical network and a virtual IP (VIP) on the Tier-0
Gateway can also be implemented to provide high availability, but they are less reliable and
slower than BFD, cannot provide ECMP, and a VIP only works with active/standby Tier-0
gateways. They should only be considered if enabling BFD is not an option.

Figure 4-15: Static Routing Configuration

Border Gateway Protocol (BGP)


BGP is the de facto protocol on the WAN and in most modern data centers. A typical leaf-spine
topology has eBGP running between leaf switches and spine switches.
Tier-0 gateways support eBGP and iBGP on the external interfaces with physical routers. BFD
can also be enabled per BGP neighbor for faster failover. BFD timers depend on the Edge node
type. Bare metal Edge supports a minimum of 50ms TX/RX BFD keep alive timer while the VM
form factor Edge supports a minimum of 500ms TX/RX BFD keep alive timer.
With the NSX 3.0 release, the following BGP features are supported:
• Two and four bytes AS numbers in asplain, asdot and asdot+ format.
• eBGP multi-hop support, allowing eBGP peering to be established on loopback
interfaces.
• iBGP


• eBGP multi-hop BFD


• ECMP support with BGP neighbors in same or different AS numbers. (Multi-path relax)
• BGP Allow AS in
• BGP route aggregation support with the flexibility of advertising a summary route only to
the BGP peer or advertise the summary route along with specific routes. A more specific
route must be present in the routing table to advertise a summary route.
• Route redistribution in BGP to advertise Tier-0 and Tier-1 Gateway internal routes as
mentioned in section 4.2.2.
• Inbound/outbound route filtering with BGP peer using prefix-lists or route-maps.
• Influencing BGP path selection by setting Weight, Local preference, AS Path Prepend, or
MED.
• Standard, Extended and Large BGP community support.
• BGP well-known community names (e.g., no-advertise, no-export, no-export-subconfed)
can also be included in the BGP route updates to the BGP peer.
• BGP communities can be set in a route-map to facilitate matching of communities at the
upstream router.
• Graceful restart (Full and Helper mode) in BGP.
• BGP peering authentication using plaintext or MD5.
• MP-BGP as the control plane protocol for VXLAN overlays or IPv6 address families.
Active/active ECMP services support up to eight paths. The ECMP hash algorithm is 5-tuple
northbound of the Tier-0 SR. The ECMP hash is based on the source IP address, destination IP address,
source port, destination port, and IP protocol. The hashing algorithm determines how incoming
traffic is forwarded to the next-hop device when there are multiple paths. By default, the ECMP
hashing algorithm from the DR to multiple SRs is 2-tuple and is based on the source IP address and
destination IP address. It can be set globally to 5-tuple hashing via the API.
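The difference between 2-tuple and 5-tuple hashing can be illustrated with a small sketch: the selected header fields are hashed and the result picks one of the equal-cost paths. The hash function below is for illustration only and is not the algorithm used by the NSX data path; flow values and uplink names are hypothetical.

    # Illustrative ECMP hashing sketch; not the actual NSX data-path algorithm.
    import hashlib

    def pick_path(flow: dict, paths: list, five_tuple: bool = True) -> str:
        """Hash either a 2-tuple or a 5-tuple of the flow and pick one equal-cost path."""
        if five_tuple:
            key = (flow["src_ip"], flow["dst_ip"], flow["proto"],
                   flow["src_port"], flow["dst_port"])
        else:  # default DR-to-SR behavior: source and destination IP only
            key = (flow["src_ip"], flow["dst_ip"])
        digest = hashlib.sha256(str(key).encode()).digest()
        return paths[int.from_bytes(digest[:4], "big") % len(paths)]

    flow = {"src_ip": "172.16.10.11", "dst_ip": "192.168.100.10",
            "proto": 6, "src_port": 40001, "dst_port": 443}
    uplinks = ["SR-uplink1", "SR-uplink2"]
    print(pick_path(flow, uplinks, five_tuple=True))
    print(pick_path(flow, uplinks, five_tuple=False))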
Graceful Restart (GR)
Graceful restart in BGP allows a BGP speaker to preserve its forwarding table while the control
plane restarts. It is recommended to enable BGP Graceful restart when the BGP peer has
multiple supervisors. A BGP control plane restart could happen due to a supervisor switchover
in a dual supervisor hardware, planned maintenance, or active routing engine crash. As soon as
a GR-enabled router restarts (control plane failure), it preserves its forwarding table, marks the
routes as stale, and sets a grace period restart timer for the BGP session to reestablish. If the
BGP session reestablishes during this grace period, route revalidation is done, and the
forwarding table is updated. If the BGP session does not reestablish within this grace period,
the router flushes the stale routes.


The BGP session will not be GR capable if only one of the peers advertises it in the BGP OPEN
message; GR needs to be configured on both ends. GR can be enabled/disabled per Tier-0
gateway. The GR restart timer is 180 seconds by default; it cannot be changed after a BGP
peering adjacency is in the established state without the peering being negotiated
again.

Open Shortest Path First version 2 (OSPF v2)


OSPFv2 is a dynamic routing protocol supported in NSX starting with version 3.1.1.
Based on RFC 2328, OSPFv2 provides fast convergence and scalability, and allows an NSX for vSphere
domain to be migrated seamlessly to NSX.
OSPF is considered an Interior Gateway Protocol (IGP) as routes are advertised within an
autonomous system. eBGP, on the other hand, is considered an Exterior Gateway Protocol
since routes are exchanged between different autonomous systems.
OSPF is a routing protocol where routers exchange their link status and information using
specific messages called Link State Updates (LSU), sent to all neighbors on an OSPF enabled segment.
The data structure of the LSU contains Link State Advertisements (LSA).
The routers in the OSPF domain flood the LSAs throughout the area and inject them into a Link
State Database (LSDb). Each LSDb in the area is identical and the routers compute the SPF
calculation using the Dijkstra algorithm.
OSPFv2 is currently not supported on the following architectures:
• Tier-0 VRF
• Federation
• EVPN
• Tier-0 or Tier-1 Service Interfaces.
OSPFv3 is currently not supported (no IPv6 support in the current implementation).

4.3.3.1 OSPF Adjacency


Routers need to establish an OSPF adjacency to exchange LSAs. An OSPF Router-ID (RID) is a 32-
bit dotted decimal value that is selected when the protocol is enabled to identify the router in the
OSPF domain. A Router ID must be unique within the OSPF domain.
In NSX 3.1.1, the RID is identical between BGP and OSPF.
In NSX 3.1.2, the RID is chosen using the following logic:
• If no loopback is configured, RID will be equal to the highest numeric interface IPv4
address.


• If a loopback is configured the RID will be equal to the highest numeric loopback IPv4
address.
NSX does not support manually configuring an OSPF Router-ID. There is no preemption when it
comes to calculating the Router ID, as that would trigger a reset of the OSPF adjacency and
disrupt traffic forwarding.
To change the RID, the OSPF process must be restarted, either using the UI or an API call. An
NSX Edge reboot will not change the RID. If the NSX Edge is being redeployed, the OSPF RID will
be recalculated.
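The selection logic described above can be expressed as a short illustrative sketch; the addresses used are hypothetical and the function is not NSX code.

    # Illustrative sketch of the OSPF Router ID selection logic described above; not NSX code.
    from ipaddress import IPv4Address

    def select_router_id(loopback_ips: list, interface_ips: list) -> IPv4Address:
        """Highest loopback IPv4 address if any loopback exists, else highest interface IPv4."""
        candidates = loopback_ips if loopback_ips else interface_ips
        return max(IPv4Address(ip) for ip in candidates)

    # Hypothetical Tier-0 SR addressing
    print(select_router_id([], ["192.168.240.3", "192.168.250.3"]))            # 192.168.250.3
    print(select_router_id(["10.0.0.1"], ["192.168.240.3", "192.168.250.3"]))  # 10.0.0.1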
Once the OSPF RID is chosen, OSPF hello messages are sent on the OSPF enabled external
uplinks using the multicast address “224.0.0.5” as represented on FIGURE 4-16.

Figure 4-16 - OSPF Adjacency between the Tier-0 SR and the physical network fabric

OSPF hellos perform multiple duties in the OSPF process:


• Discover the neighbors on a segment and try to establish an adjacency if one OSPF
router is detected.
• Check the OSPF configuration between neighbors.
• Monitor the adjacency.
OSPF can be enabled on two external uplink interfaces per Tier-0 Service Router as demonstrated
in FIGURE 4-17.


Figure 4-17: OSPF enabled on uplinks

As explained earlier, Hello messages are sent on the OSPF enabled external uplink to the
multicast address 224.0.0.5 (ALL SPF Routers). The Hello message will check the following
parameters before establishing the adjacency (a minimal sketch of this check follows the list):
• Interfaces between the OSPF routers must be in the same subnet.
• Interfaces between the OSPF routers must belong to the same OSPF area.
• Interfaces between the OSPF routers must have the same OSPF area type.
• Router ID must be unique between the OSPF Routers.
• OSPF timers (HelloInterval and RouterDeadInterval) must match between the OSPF
routers.
• Authentication process must be validated.
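A minimal sketch of this compatibility check is shown below. It is purely illustrative: the field names are hypothetical and the actual validation is performed by the OSPF implementations on both ends.

    # Illustrative sketch of the OSPF hello compatibility checks listed above; field names
    # are hypothetical and the real validation is performed by the OSPF implementation.
    from ipaddress import ip_interface

    def hello_compatible(local: dict, remote: dict) -> bool:
        same_subnet = (ip_interface(local["intf"]).network ==
                       ip_interface(remote["intf"]).network)
        return (same_subnet
                and local["area"] == remote["area"]
                and local["area_type"] == remote["area_type"]
                and local["router_id"] != remote["router_id"]
                and local["hello"] == remote["hello"]
                and local["dead"] == remote["dead"]
                and local["auth_ok"] and remote["auth_ok"])

    t0_sr = {"intf": "192.168.240.3/24", "area": "0.0.0.0", "area_type": "standard",
             "router_id": "192.168.250.3", "hello": 10, "dead": 40, "auth_ok": True}
    tor   = {"intf": "192.168.240.1/24", "area": "0.0.0.0", "area_type": "standard",
             "router_id": "192.168.240.1", "hello": 10, "dead": 40, "auth_ok": True}
    print(hello_compatible(t0_sr, tor))  # True -> adjacency can proceed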

OSPF Authentication
The Tier-0 gateway supports the following authentication methods:
• None
• Password: Password is sent in clear text over the external uplink interface
• MD5: Password is a message digest and sent over the external uplink interface
NSX supports area-wide authentication and does not support authentication per interface.
VMware recommends authenticating every dynamic routing protocol adjacency using the MD5
method to exchange routes in a secure way.


NSX currently supports up to 8 characters for the authentication password.

Maximum Transmission Unit (MTU) and NSX


Even though the MTU is not verified by the OSPF Hello process, interfaces involved in an OSPF
adjacency must have the same MTU value configured to exchange LSUs. While an adjacency is
being established, the database description is exchanged between peers using DB Description
packets. MTU is verified in the DB Description packet and if the MTU values do not match, the
OSPF adjacency is brought down. NSX does not support the “ignore MTU mismatch” feature.
OSPF Timers
Hello packets will be sent on the OSPF enabled uplink interfaces at a specific, configurable interval
(HelloInterval). They act as a keepalive to make sure that the neighbors are still present on the
segment and can exchange their network topology with each other. A neighbor will be declared
down if a router fails to receive OSPF hellos during the time specified by the DeadInterval timer.
By default, in OSPF, HelloInterval and DeadInterval are respectively 10s and 40s on both Point-
to-Point and Broadcast OSPF network type interfaces.

Figure 4-18: OSPF Timers

Bidirectional Forwarding Detection (BFD) and NSX


BFD can also be enabled per OSPF neighbor for faster failover. BFD timers depend on the Edge
node type. Starting with NSX version 3.0, bare metal Edge supports a minimum of 50ms TX/RX
BFD keep alive timer while the VM form factor Edge supports a minimum of 500ms TX/RX BFD
keep alive timer. VMware strongly recommends configuring BFD on every OSPF
interface to minimize north-south traffic disruption.
FIGURE 4-19 represents the BFD timers supported by NSX:


Figure 4-19: BFD Timers for OSPF

OSPF Adjacency States


An OSPF router can have multiple OSPF enabled interfaces and therefore different neighbors
identified by their OSPF Router-ID (ID). The routers need to keep track of each adjacency state
and data structure as specified in RFC 2328 – SECTION 10.1:
• Down State:
o Initial OSPF state. No information has been received from the neighbor.
• Init State:
o A Hello packet has been seen from the neighbor and its Router ID has been
identified. (The router has not seen its own Router ID in the OSPF Hello received
from the neighbor).
• 2-Way:
o The OSPF router has seen its own OSPF Router ID in the received Hello packet.
Bidirectional communication between the two peers is established. It is critical
to understand that on an OSPF Broadcast network type adjacency, OSPF
DROther routers can stay in this state indefinitely since the adjacency will be
fully established with the Designated Router (DR) and Backup Designated Router
(BDR).
• ExStart:
o In this state, the two neighbors negotiate the master/slave relationship and the
initial sequence number for the Database Description exchange; the router with
the highest OSPF Router-ID becomes the master. The NSX Tier-0 will never be elected
DR or BDR since its OSPF Priority is equal to "0". The Tier-0 SR is considered an
OSPF DROther router.
• Exchange:
o Each router sends a complete description of its LSDb using Database
Description packets. LSAs are requested based on the current knowledge
at this step. The Link State Update flooding process can start after the requests
have been made.
• Loading:
o Link State Request and Link-State-Update packets are exchanged between OSPF
neighbors so that they can have the same knowledge of the OSPF topology.


• Full:
o Routers have exchanged their LSDb and are fully adjacent.

After both routers have exchanged their LSDb, they run the Dijkstra algorithm. Since their LSDbs
are identical, they possess the same knowledge of the OSPF topology. The Dijkstra algorithm is
run so that each router can calculate its own best path to reach every destination.
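The SPF computation itself is the classic Dijkstra shortest-path algorithm. The sketch below runs it on a tiny, hypothetical link-state graph purely for illustration; it is not NSX code and the node names and costs are made up.

    # Classic Dijkstra SPF over a tiny, hypothetical link-state graph; illustrative only.
    import heapq

    def spf(graph: dict, root: str) -> dict:
        """Return the lowest total cost from 'root' to every node in the graph."""
        cost = {root: 0}
        pq = [(0, root)]
        while pq:
            c, node = heapq.heappop(pq)
            if c > cost.get(node, float("inf")):
                continue
            for neighbor, link_cost in graph[node].items():
                new_cost = c + link_cost
                if new_cost < cost.get(neighbor, float("inf")):
                    cost[neighbor] = new_cost
                    heapq.heappush(pq, (new_cost, neighbor))
        return cost

    # Hypothetical topology: Tier-0 SR and two physical routers with equal-cost uplinks.
    lsdb_graph = {
        "T0-SR": {"ToR1": 10, "ToR2": 10},
        "ToR1": {"T0-SR": 10, "Spine": 10},
        "ToR2": {"T0-SR": 10, "Spine": 10},
        "Spine": {"ToR1": 10, "ToR2": 10},
    }
    print(spf(lsdb_graph, "T0-SR"))  # {'T0-SR': 0, 'ToR1': 10, 'ToR2': 10, 'Spine': 20}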
Graceful Restart (GR)
Graceful restart in OSPF allows a neighbor to preserve its forwarding table while the control
plane restarts. It is recommended to enable OSPF Graceful restart when the OSPF peer has
multiple supervisors. An OSPF control plane restart could happen due to a supervisor
switchover in a dual supervisor hardware, planned maintenance, or active routing engine crash.
As soon as a GR-enabled router restarts (control plane failure), it preserves its forwarding table,
marks the routes as stale, and sets a grace period restart timer for the OSPF adjacency to
reestablish. If the OSPF adjacency reestablishes during this grace period, route revalidation is
done, and the forwarding table is updated. If the OSPF adjacency does not reestablish within
this grace period, the router flushes the stale routes.
OSPF Graceful restart helper mode is supported.

4.3.3.2 OSPF Network Types


NSX supports both Point to Point and Broadcast OSPF network types.
Point to multipoint, Point to Multipoint Non-Broadcast and Non-Broadcast network types are
considered deprecated in any modern Data Center architecture and are not supported.
Figure 4-20 represents the different OSPF network types supported in NSX 3.1.1.

Figure 4-20: Supported OSPF Network types

OSPF Point to Point Network Type:


The OSPF Point-to-Point network type is the recommended network type for most OSPF
deployments. It simplifies the overall OSPF peering topology. On a given ethernet segment, the
maximum number of routers fully establishing an OSPF adjacency using the Point-to-Point
network type is two. No DR or BDR election is performed using this OSPF network type. The
Tier-0 SR OSPF routers are considered "DROther" (OSPF priority of 0).
It is recommended to configure the Tier-0 SR uplink interfaces using a /31 network mask as
depicted on FIGURE 4-20:

Figure 4-20: OSPF Point to Point network type topology

In this example, the Tier-0 SR hosted on both the Edge Node 01 and Edge Node 02 will establish
an adjacency with both the physical router 1 and the physical router 2.
From a global Tier-0 perspective, all four adjacencies to the physical routers will be in the “Full”
state.
OSPF Broadcast Network Type:
On an ethernet segment using the OSPF broadcast network type, more than two OSPF routers can
exchange their LSAs. To reduce LSA flooding on that segment, the OSPF process will elect a
Designated Router and a Backup Designated Router. The router with the highest OSPF priority
(1-255) will be elected DR. The router with the second highest OSPF priority will be elected BDR.
All other routers on the segment will be considered OSPF DROthers.
If the priorities for either the DR or BDR role are equal, the router with the highest OSPF
Router ID will be elected DR.
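The election logic can be summarized with the following illustrative sketch: routers with priority 0, such as the Tier-0 SR, are ineligible; among the remaining routers the highest priority wins, with the Router ID as the tie-breaker. (A real OSPF election also honors an existing DR/BDR; this simplified sketch ignores that non-preemption behavior, and the Router IDs shown are hypothetical.)

    # Illustrative sketch of DR/BDR election on a broadcast segment; not an OSPF implementation.
    from ipaddress import IPv4Address

    def elect_dr_bdr(routers: list):
        """routers: [{'rid': '1.1.1.1', 'priority': 1}, ...]; priority 0 is ineligible."""
        eligible = [r for r in routers if r["priority"] > 0]
        ranked = sorted(eligible,
                        key=lambda r: (r["priority"], IPv4Address(r["rid"])),
                        reverse=True)
        dr = ranked[0]["rid"] if ranked else None
        bdr = ranked[1]["rid"] if len(ranked) > 1 else None
        return dr, bdr

    segment = [
        {"rid": "192.168.240.1", "priority": 1},   # physical router 1
        {"rid": "192.168.240.2", "priority": 1},   # physical router 2
        {"rid": "192.168.250.3", "priority": 0},   # Tier-0 SR: always DROther
    ]
    print(elect_dr_bdr(segment))  # ('192.168.240.2', '192.168.240.1')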


All DROther routers will establish an adjacency with both the DR and the BDR as demonstrated
on FIGURE 4-21. These adjacencies will be in the “Full” state and LSUs can be exchanged. The DR
and BDR also establish an adjacency between themselves.

Figure 4-21: OSPF adjacency in the Full State – Broadcast network type

The DROther routers will see each other in the 2-Way state but will not exchange their LSAs
directly between themselves; the DR and BDR will perform that function.

Figure 4-22: OSPF adjacency in the “2-Way” State – Broadcast network type


In this example, each router will have 2 adjacencies in the Full State and 2 adjacencies in the “2-
Way” state.
From an NSX perspective, the Tier-0 gateway is always a DROther as its OSPF priority is hard-
coded to 0. It is not possible to change that priority.
FIGURE 4-23 demonstrates the OSPF adjacencies in the "Full" state between the physical
networking fabric and the Tier-0 Service Routers.

Figure 4-23: OSPF adjacency in the Full State – Broadcast network type

FIGURE 4-24 demonstrates the OSPF adjacencies in the “2-Way” state between the Tier-0
Service Routers.


Figure 4-24: OSPF adjacencies in the “2-Way” state

When OSPF uses the broadcast network type, the Tier-0 SR (a DROther) sends its Link State
Updates to the Designated Router in the networking fabric using the multicast
address 224.0.0.6 (ALL OSPF DR Routers multicast address). The DR floods its Link State
Updates using the 224.0.0.5 multicast address (ALL OSPF Routers multicast address).
The number of adjacencies is increased using this network type, which can increase the
complexity of the OSPF topology.
The OSPF LSDb is identical throughout the entire area; therefore, there is no need to establish
an adjacency between the Tier-0 SRs.
As explained earlier, VMware recommends using the Point-to-Point OSPF network type over
the Broadcast network type to simplify the routing topology.

4.3.3.3 OSPF Areas


As mentioned previously, all the routers in an OSPF area must have the same knowledge of the
networking topology. Therefore, their LSDb must be identical. It contains all the LSA
information flooded by all the OSPF routers in the area. When there is a network change in the
area (e.g link failure or cost changed), the router that experienced the change will send newer
LSAs to update all the OSPF routers. This will trigger all the OSPF routers in the area to re-run
the Dijkstra algorithm and recalculate the best path to reach the destinations.


This can be problematic in a very large-scale environment where it would be unnecessary for all
routers to recompute the Dijkstra algorithm every time something changes in the network.
OSPF routers and links can be compartmented into different areas to limit the LSA flooding and
reduce the time the SPF algorithm is run.
Routers with links in both the backbone area and a standard area are considered Area
Border Routers (ABRs). The Tier-0 gateway cannot perform ABR duties since only one area per Tier-0
is supported.
FIGURE 4-25 represents the OSPF LSAs that are supported in an NSX OSPF architecture.

Figure 4-25: OSPF LSAs support

The OSPF areas supported by NSX are listed in FIGURE 4-26 and represented in FIGURE 4-27.

Figure 4-26: OSPF Area support table


Figure 4-27: OSPF Area support representation

Since a Tier-0 service router redistributes routes into the OSPF domain, it is considered an
Autonomous System Boundary Router (ASBR). An ABR in the OSPF topology will inject LSA type 4
into other areas to make sure the OSPF routers know how to reach the ASBR. When a Tier-0 SR
redistributes a prefix into OSPF, it uses an external metric of type 2:
• Total cost to reach the prefix is always the redistributed metric. Internal cost to reach the
ASBR is ignored.
The OSPF External type 1 is not supported by NSX when a prefix is redistributed.
FIGURE 4-28 represents the different external metric types used by the Tier-0 when different
area types are used:
• Standard and Backbone areas: Routes are redistributed using LSAs type 5 with an
external type of E2 and a cost of 20.
• When the Tier-0 gateway is connected to a not so stubby area, it redistributes its routes
using LSAs type 7 with an external type of N2 and a cost of 20.

The OSPF metric cannot be changed in NSX and has a hard-coded value of 20 throughout
the OSPF domain.
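The difference between the two external metric types can be shown with a simple, illustrative calculation: with type E2 the advertised external metric is used as-is, while type E1 (not supported by NSX) would add the internal cost to reach the ASBR. The cost values below are hypothetical.

    # Illustrative comparison of OSPF external metric types E1 and E2.
    def external_route_cost(metric_type: str, redistributed_metric: int, cost_to_asbr: int) -> int:
        if metric_type == "E2":   # used by NSX: external metric only, internal cost ignored
            return redistributed_metric
        if metric_type == "E1":   # not supported by NSX when redistributing
            return redistributed_metric + cost_to_asbr
        raise ValueError("unknown metric type")

    # The Tier-0 redistributes prefixes with a fixed external metric of 20.
    print(external_route_cost("E2", 20, 15))  # 20, regardless of the path to the ASBR
    print(external_route_cost("E1", 20, 15))  # 35, would vary with the internal topology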


Figure 4-28: OSPF LSA Types injected by a Tier-0 gateway

Not So Stubby Areas are recommended for very large topologies. Instead of redistributing the
prefixes using LSA type 5 in the OSPF domain, LSA type 7 will be used. This type of LSA is
originated by the ASBR and flooded in the NSSA only. Another ABR in the OSPF domain will
translate that LSA type 7 into an LSA type 5 that will be transmitted throughout the OSPF
domain.
With the advent of OSPF support in NSX 3.1.1, it is now possible to choose which protocol the
routes are redistributed into: prefixes can be redistributed into BGP or OSPF.
It is possible to use OSPF to learn the BGP peers' IP addresses in the case of an eBGP multi-hop
topology.
FIGURE 4-29 represents an example where OSPF and BGP can be used on the same Tier-0
gateway. OSPF is used as an IGP to provide connectivity between the loopback interfaces.
BGP can be set up between the physical routers' loopback interfaces and the Tier-0 loopback
interfaces. This solution should only be considered when direct single hop eBGP peering is not
an option. Chapter 7 provides guidance on the recommended routing configuration for BGP
peering on directly connected segments.


Figure 4-29: OSPF used to learn Loopback reachability to establish a E-BGP multi-hop peering

4.3.3.4 OSPF Topologies and ECMP


NSX supports OSPF on both the Active/Active and Active/Standby topologies.
Equal cost multipath is supported on OSPF towards the physical fabric on both the
Active/Active and Active/Standby topologies.
To leverage ECMP between the NSX Tier-0 gateway and the physical networking fabric, OSPF
routes must be learned with the same metric. When this condition is met, the Tier-0 SR can
install all the routes as "multipath" in the routing table if ECMP is enabled in the routing
protocol configuration.
With BGP and static routes, each Tier-0 SR can leverage up to 8 different links and next hops to
load balance the traffic efficiently. Two uplinks can be enabled for OSPF per Tier-0 SR. The number
of ECMP routes that can be installed per Tier-0 for each prefix learned by OSPF is therefore 2, as
demonstrated in FIGURE 4-27.
The edge node network kernel module supports hashing with 5 tuples (Source IP – Destination IP
– IP protocol – L4 Source Port – L4 Destination Port).


Figure 4-27: OSPF and ECMP

OSPF in an Active / Standby topology


In the active/standby topology represented in FIGURE 4-28, the Tier-0 gateway in
standby mode still runs the OSPF process and establishes an adjacency with the top of rack
switches. The interface between the standby Tier-0 SR and the Tier-0 DR is not active; as a
consequence, the DR will forward all the traffic to the SR on the active Edge Node (EN1 in the
example provided in FIGURE 4-28).


Figure 4-28: OSPF Architecture in Active/Standby Topology

Since the internal transit interface on the standby Tier-0 SR is administratively shutdown, the
physical fabric needs to send the traffic to the NSX domain via the active Tier-0 SR.
The standby Tier-0 SR will redistribute the internal NSX prefixes in OSPF and advertise them
with a cost of 65534 (External Type E2 or N2).
The active Tier-0 SR redistributes the same prefixes with a cost of 20; the networking fabric will
logically prefer the prefixes with the lower cost and therefore use the links via the active Tier-0
SR.
In this topology, ECMP with the top of rack switches can still be leveraged as demonstrated on
FIGURE 4-29. There is no ECMP between the Tier-0 DR and the Tier-0 SR.


Figure 4-29: ECMP between the physical fabric and the Active Tier-0 SR

OSPF in an Active / Active topology


When no stateful services are running on the Tier-0 gateway, the Active/Active high availability
mode can be selected. ECMP can be leveraged between the top of rack switches and the Tier-0
gateway as well as between the Tier-0 DR and the Tier-0 SR in a single tier routing architecture.
The topology demonstrated in FIGURE 4-30 shows the different levels of ECMP available in this topology:
• ECMP between the Tier-0 DR and the Tier-0 SRs. The Tier-0 DR has two default routes pointing to both internal transit segment interfaces on the Tier-0 SRs.
• ECMP between the Tier-0 SR and the physical networking fabric. OSPF can support up to
two ECMP paths.
This topology is recommended when no services are running on the Tier-0 gateway or when
predictable traffic flow is preferred.


Figure 4-30: ECMP architecture in Active/Active topology

4.3.3.5 OSPF Summarization


The Tier-0 Service Router is considered an ASBR because it redistributes routes into the OSPF domain. In OSPF, summarization is possible at both the ABR and ASBR level.
In networking, summarization usually solves scalability issues as it decreases the number of routes installed in the routing table.
Summarization also has OSPF-specific benefits: if one network covered by a summary disappears in an area, OSPF routers in the other areas do not have to remove that prefix from their routing tables, which lowers the number of CPU cycles needed for OSPF to run efficiently.
Lowering the number of routes in the routing table also reduces the physical resources (CPU, memory) consumed by the networking appliances.
In FIGURE 4-31, a large number of Tier-1 gateways are instantiated in the NSX domain. Each
gateway contains a single /24 network for each tenant. Without summarization, each Tier-0 SR
will advertise the Tier-1 connected segments with 256x Type 5 LSAs. Each Type 5 LSA will
advertise a /24 network to the physical fabric.


Figure 4-31: LSA type 5 advertisement without summarization

If the IP addressing schema has been designed properly for summarization (contiguous subnets), this large number of Type 5 LSAs can be replaced by a single summary prefix carried in a single Type 5 LSA, as depicted in Figure 4-32.

Figure 4-32: LSA type 5 advertisement with route summarization
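The gain can be verified with the Python standard library: 256 contiguous /24 tenant prefixes collapse into a single /16 summary, so one Type 5 LSA replaces 256 of them. The prefixes below are sample values only.

import ipaddress

tenant_prefixes = [ipaddress.ip_network(f"10.10.{i}.0/24") for i in range(256)]
summary = list(ipaddress.collapse_addresses(tenant_prefixes))

print(summary)                                        # [IPv4Network('10.10.0.0/16')]
print(f"{len(tenant_prefixes)} Type 5 LSAs -> {len(summary)} Type 5 LSA")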


When the area connected to the Tier-0 gateway is an NSSA, route summarization is supported
and the type of LSA injected in the area for summarization is an LSA Type 7.

4.3.3.6 OSPF Redistribution


Administrators can use route maps to control which prefixes are redistributed into OSPF. Route maps used with OSPF only allow or filter specific prefixes; it is not possible to apply set actions in these route maps, such as altering the metric, setting the metric-type, or applying a custom tag. The Tier-0 attaches a system-generated tag to the redistributed routes that cannot be modified.
Redistributing prefixes between OSPF and BGP is currently not supported.
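As a hedged illustration, the sketch below shows what a per-protocol redistribution rule could look like through the Policy API: Tier-1 connected prefixes are redistributed into OSPF only and filtered by a pre-created route map. The endpoint, enum values and field names (route_redistribution_types, destination, route_map_path) are assumptions to verify against the API documentation for the NSX version deployed.

import requests

NSX = "https://nsx-manager.example.com"      # hypothetical manager FQDN
AUTH = ("admin", "VMware1!")

locale_service = {
    "route_redistribution_config": {
        "redistribution_rules": [
            {
                "name": "t1-connected-into-ospf",
                "route_redistribution_types": ["TIER1_CONNECTED"],
                "destination": "OSPF",       # since NSX 3.1.1, BGP or OSPF can be targeted per rule
                # The route map only allows/denies prefixes; no set actions are honored.
                "route_map_path": "/infra/tier-0s/T0-GW/route-maps/allow-tenant-prefixes"
            }
        ]
    }
}

url = f"{NSX}/policy/api/v1/infra/tier-0s/T0-GW/locale-services/default"
requests.patch(url, json=locale_service, auth=AUTH, verify=False).raise_for_status()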

4.3.3.7 OSPF Default information originate


Advertising a default route from the Tier-0 gateway is supported as demonstrated in FIGURE
4-32.
Enabling the default information originate feature on a Tier-0 gateway is not a common
requirement because few designs require NSX to connect the data center to external resources.
When required, it should be performed carefully as it may impact the entire data center traffic.

Figure 4-32: Originating a default route into OSPF


Routing protocol recommendation in the data center.


While NSX supports static routing, HA VIP, and Proxy ARP, VMware recommends the use of a dynamic routing protocol to provide scalability and better redundancy planning in the data center.
With a dynamic routing protocol, the data center can grow and adapt to changes automatically. Statically configuring the routing components complicates the overall architecture as the data center scale increases.

From a dynamic routing perspective, there are two options when it comes to designing a modern data center routing architecture with NSX:
• OSPFv2
• BGP
BGP is well known in the networking industry as the best routing protocol when it comes to interoperability, as it exchanges prefixes between autonomous systems on the internet.
Modern and scalable data center architectures rely on BGP to provide connectivity because multiprotocol support is not possible with OSPFv2; an OSPFv2 implementation requires a different routing protocol or static routing to route IPv6 packets.
VMware recommends using BGP over OSPF as it provides more flexibility, has a richer feature set, and supports multiple address families.
OSPF should be considered when it is the only routing protocol running in the physical fabric and no feature available only with BGP is required. In such a scenario, OSPF avoids the implementation of route redistribution, leading to a simpler and more scalable design.
A comparison of features between OSPF and BGP is described in the Table 4-6.


Table 4-6: OSPF vs BGP comparison in the Data Center.


IPv6 Routing Capabilities


NSX also supports dual stack for the interfaces on a Tier-0 or Tier-1 Gateway. Users can leverage distributed services like distributed routing and distributed firewall for East-West traffic in a single-tier or multi-tiered topology for IPv6 workloads. Users can also leverage centralized services like Gateway Firewall for North-South traffic.
NSX Datacenter supports the following unicast IPv6 addresses:
• Global Unicast: Globally unique IPv6 address and internet routable
• Link-Local: Link specific IPv6 address and used as next hop for IPv6 routing protocols
• Unique local: Site specific unique IPv6 addresses used for inter-site communication but
not routable on internet. Based on RFC4193.
The following table shows a summarized view of supported IPv6 unicast and multicast address
types on NSX Datacenter components.

Figure 4-33: Type of IPv6 addresses supported on Tier-0 and Tier-1 Gateway components

FIGURE 4-34 shows a single tiered routing topology on the left side with a Tier-0 Gateway
supporting dual stack on all interfaces and a multi-tiered routing topology on the right side with
a Tier-0 Gateway and Tier-1 Gateway supporting dual stack on all interfaces. A user can either
assign static IPv6 addresses to the workloads or use a DHCPv6 relay supported on gateway
interfaces to get dynamic IPv6 addresses from an external DHCPv6 server.
For a multi-tier IPv6 routing topology, each Tier-0-to-Tier-1 peer connection is provided a /64 unique local IPv6 subnet from a pool (e.g., fc5f:2b61:bd01::/48). A user has the flexibility to


change this subnet range and use another subnet if desired. Similar to IPv4, this IPv6 addressing is auto-plumbed by the system in the background.

Figure 4-34: Single tier and Multi-tier IPv6 routing topology

Tier-0 Gateway supports following IPv6 routing features:


• Static routes with IPv6 Next-hop
• MP-eBGP with IPv4 and IPv6 address families
• Multi-hop eBGP
• IBGP
• ECMP support with static routes, EBGP and IBGP
• Outbound and Inbound route influencing using Weight, Local Pref, AS Path prepend and
MED.
• IPv6 Route Redistribution
• IPv6 Route Aggregation
• IPv6 Prefix List and Route map
• IPv6 Loopback Interfaces

Tier-1 Gateway supports following IPv6 routing features:


• Static routes with IPv6 Next-hop


IPv6 routing between the Tier-0 and Tier-1 Gateway is auto-plumbed, similar to IPv4 routing. As soon as a Tier-1 Gateway is connected to a Tier-0 Gateway, the management plane configures a default route (::/0) on the Tier-1 Gateway with the next hop IPv6 address set to the router link IP of the Tier-0 Gateway (fc05:2b61:bd01:5000::1/64, as shown in FIGURE 4-35). To provide reachability to subnets connected to the Tier-1 Gateway, the Management Plane (MP) configures routes on the Tier-0 Gateway for all the LIFs connected to the Tier-1 Gateway with the next hop IPv6 address set to the Tier-1 Gateway router link IP (fc05:2b61:bd01:5000::2/64, as shown in FIGURE 4-35). 2001::/64 & 2002::/64 are seen as "Tier-1 Connected" routes on the Tier-0.
Northbound, Tier-0 Gateway redistributes the Tier-0 connected and Tier-1 Connected routes in
BGP and advertises these to its eBGP neighbor, the physical router.

Figure 4-35: IPv6 Routing in a Multi-tier topology

Services High Availability


NSX Edge nodes run in an Edge cluster, hosting centralized services, and providing connectivity
to the physical infrastructure. Since the services are run on the SR component of a Tier-0 or
Tier-1 gateway, the following concept is relevant to SR. This SR service runs on an Edge node
and has two modes of operation – active/active or active/standby. When a Tier-1 gateway is
configured to be hosted on an Edge cluster, an SR is automatically instantiated even if no
services are configured or running on the Tier-1. When a Tier-1 SR is instantiated, the Tier-0 DR
is removed from the hypervisors and located on the edge nodes only.


Active/Active
Active/Active – This is a high availability mode where all SRs hosted on Edge nodes act as active forwarders. Stateless services such as layer 3 forwarding are IP based, so it does not matter which Edge node receives and forwards the traffic. This high availability mode has traditionally been available only on the Tier-0 gateway, but starting with NSX 4.0, Tier-1 gateways support active/active with stateful services. See SECTION 4.5.3 for more details.

Stateful services typically require tracking of connection state (e.g., sequence number check,
connection state), thus traffic for a given session needs to go through the same Edge node. Tier-
0 gateways can be implemented in stateful or stateless mode. In stateless mode a reduced
number of services are available. Such services include reflexive NAT and stateless firewall.
Left side of FIGURE 4-36 shows a Tier-0 gateway (configured in active/active high availability
mode) with two external interfaces leveraging two different Edge nodes, EN1 and EN2. Right
side of the diagram shows that the services router component (SR) of this Tier-0 gateway
instantiated on both Edge nodes, EN1 and EN2. A Compute host, ESXi is also shown in the
diagram that only has distributed component (DR) of Tier-0 gateway.

Figure 4-36: Tier-0 gateway configured in Active/Active HA mode

Note that Tier-0 SR on Edge nodes, EN1 and EN2 have different IP addresses northbound
toward physical routers and different IP addresses southbound towards Tier-0 DR.
Management plane configures two default routes on Tier-0 DR with next hop as SR on EN1
(169.254.0.2) and SR on EN2 (169.254.0.3) to provide ECMP for overlay traffic coming from
compute hosts.


North-South traffic from overlay workloads hosted on Compute hosts will be load balanced and
sent to SR on EN1 or EN2, which will further do a routing lookup to send traffic out to the
physical infrastructure.

A user does not have to configure these static default routes on the Tier-0 DR. Automatic plumbing of the default routes happens in the background depending upon the HA mode configuration.

Inter-SR Routing

To provide redundancy for physical router failure, Tier-0 SRs on both Edge nodes must establish
routing adjacency or exchange routing information with different physical router or TOR. These
physical routers may or may not have the same routing information. For instance, a route
192.168.100.0/24 may only be available on physical router 1 and not on physical router 2.

For such asymmetric topologies, users can enable Inter-SR routing. This feature is only available
on Tier-0 gateway configured in active/active high availability mode. FIGURE 4-37 shows an
asymmetric routing topology with Tier-0 gateway on Edge node, EN1 and EN2 peering with
physical router 1 and physical router 2, both advertising different routes.

When Inter-SR routing is enabled by the user, an overlay segment is auto plumbed between SRs
(similar to the transit segment auto plumbed between DR and SR) and each end gets an IP
address assigned in 169.254.0.128/25 subnet by default. An IBGP session is automatically
created between Tier-0 SRs and northbound routes (EBGP and static routes) are exchanged on
this IBGP session.

Figure 4-37: Inter-SR Routing
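A hedged configuration sketch of this feature is shown below: the inter_sr_ibgp flag is set on the BGP configuration of the Tier-0. The manager address, object IDs and exact field names are assumptions to validate against the Policy API reference for the NSX version in use.

import requests

NSX = "https://nsx-manager.example.com"      # hypothetical manager FQDN
AUTH = ("admin", "VMware1!")

bgp_config = {
    "enabled": True,
    "inter_sr_ibgp": True    # NSX auto-plumbs the overlay segment and iBGP sessions between Tier-0 SRs
}

url = f"{NSX}/policy/api/v1/infra/tier-0s/T0-GW/locale-services/default/bgp"
requests.patch(url, json=bgp_config, auth=AUTH, verify=False).raise_for_status()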


As explained in the previous figure, the Tier-0 DR has auto-plumbed default routes with next hops as
Tier-0 SRs and North-South traffic can go to either SR on EN1 or EN2. In case of asymmetric
routing topologies, a particular Tier-0 SR may or may not have the route to a destination. In
that case, traffic can follow the IBGP route to another SR that has the route to destination.

FIGURE 4-37 shows a topology where Tier-0 SR on EN1 is learning a default WAN route 0.0.0.0/0
and a corporate prefix 192.168.100.0/24 from physical router 1 and physical router 2
respectively. If “External 1” interface on Tier-0 fails and the traffic from compute workloads
destined to WAN lands on Tier-0 SR on EN1, this traffic can follow the default route (0.0.0.0/0)
learned via IBGP from Tier-0 SR on EN2. Traffic is sent to EN2 through the Geneve overlay.
After a route lookup on Tier-0 SR on EN2, this N-S traffic can be sent to physical router 1 using
“External interface 3”.

Graceful Restart and BFD Interaction with Active/Active – Tier-0 SR Only


If an Edge node is connected to a TOR switch that does not have the dual supervisor or the
ability to retain forwarding traffic when the control plane is restarting, enabling GR does not
make sense. There is no value in preserving the forwarding table on either end or sending
traffic to the failed or restarting device. In case of an active SR failure (i.e., the Edge node goes
down), physical router failure, or path failure, forwarding will continue using another active SR
or another TOR. BFD should be enabled with the physical routers for faster failure detection.
It is recommended to enable GR if the Edge node is connected to a single dual supervisor
system that supports forwarding traffic when the control plane is restarting. This will ensure
that forwarding table data is preserved and forwarding will continue through the restarting
supervisor or control plane. Enabling BFD with such a system would depend on the device-
specific BFD implementation. If the BFD session goes down during supervisor failover, then BFD
should not be enabled with this system. If the BFD implementation is distributed such that the
BFD session would not go down in case of supervisor or control plane failure, then enable BFD
as well as GR.
Suppose the edge node connects to multiple dual supervisor systems. In that case, the network
architect will have to choose what type of failover mechanism to prioritize between graceful
restart and traffic rerouting over a different link.

Active/Standby
Active/Standby – This is a high availability mode where only one SR acts as an active forwarder. This mode is required when stateful services are enabled. Services like NAT are in a constant state of sync between the active and standby SRs on the Edge nodes. This mode is supported on both Tier-1 and Tier-0 SRs. Preemptive and non-preemptive modes are available for both Tier-0 and Tier-1 SRs. The default mode for gateways configured in active/standby high availability configuration is non-preemptive.


A user can select the preferred member (Edge node) when a gateway is configured in active/standby preemptive mode. When enabled, preemptive behavior allows an SR to resume the active role on the preferred edge node as soon as it recovers from a failure.
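A hedged sketch of this configuration through the Policy API is shown below; the failover_mode value and the preferred_edge_paths attribute on the locale-services object are assumptions to verify against the API documentation, and all names and paths are sample values.

import requests

NSX = "https://nsx-manager.example.com"      # hypothetical manager FQDN
AUTH = ("admin", "VMware1!")
EDGE_CLUSTER = "/infra/sites/default/enforcement-points/default/edge-clusters/EC-1"

# Switch the Tier-1 gateway to preemptive mode (default is NON_PREEMPTIVE).
requests.patch(f"{NSX}/policy/api/v1/infra/tier-1s/T1-TENANT",
               json={"failover_mode": "PREEMPTIVE"},
               auth=AUTH, verify=False).raise_for_status()

# Pin the preferred edge node so the SR returns to it after a recovery.
locale_service = {
    "edge_cluster_path": EDGE_CLUSTER,
    "preferred_edge_paths": [f"{EDGE_CLUSTER}/edge-nodes/0"]
}
requests.patch(f"{NSX}/policy/api/v1/infra/tier-1s/T1-TENANT/locale-services/default",
               json=locale_service, auth=AUTH, verify=False).raise_for_status()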

For a Tier-1 Gateway, the active/standby SRs have the same IP addresses northbound. Only the active SR replies to ARP requests, while the standby SR's interfaces are operationally down so that they automatically drop packets.

For a Tier-0 Gateway, the active/standby SRs have different IP addresses northbound and both have BGP sessions or OSPF adjacencies established on their uplinks. Both Tier-0 SRs (active and standby) receive routing updates from the physical routers and advertise routes to them; however, the standby Tier-0 SR prepends its local AS three times in the case of BGP, or advertises routes with a higher metric in the case of OSPF, so that traffic from the physical routers prefers the active Tier-0 SR.
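The effect of the prepend can be visualized with a simplified model of the BGP decision process, reduced here to the AS_PATH length comparison only (this is an illustration, not a full best-path implementation):

# The standby Tier-0 SR prepends its local AS three times, so the path it
# advertises is longer and loses the AS_PATH length comparison upstream.
def best_path(candidates):
    return min(candidates, key=lambda route: len(route["as_path"]))

via_active  = {"learned_from": "active Tier-0 SR",  "as_path": [65100]}
via_standby = {"learned_from": "standby Tier-0 SR", "as_path": [65100, 65100, 65100, 65100]}

print(best_path([via_active, via_standby])["learned_from"])   # -> active Tier-0 SR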

Southbound IP addresses on active and standby Tier-0 SRs are the same and the operational
state of standby SR southbound interface is down. Since the operational state of southbound
Tier-0 SR interface is down, the Tier-0 DR does not send any traffic to the standby SR. FIGURE
4-38 shows active and standby Tier-0 SRs on Edge nodes “EN1” and “EN2”.

Figure 4-38: Active and Standby Routing Control with eBGP

The placement of the active and standby SRs in terms of connectivity to the TORs or northbound infrastructure becomes an important design choice, such that any component failure should not result in a failure of both the active and standby service. Diversity of connectivity to the TORs for bare metal edge nodes, and host-specific availability considerations for the hosts where Edge node VMs run, are equally important design choices. These choices are described in the section EDGE NODE AND SERVICES DESIGN in chapter 7.


4.5.2.1 Graceful Restart and BFD Interaction with Active/Standby


Active/standby services have an active/active control plane with active/standby data forwarding. In this redundancy model, BGP or OSPF is established on the active and standby Tier-0 SRs with their respective TORs. If the Edge node is connected to a system that does not have a dual supervisor or the ability to keep forwarding traffic when the control plane is restarting, enabling GR does not make sense. There is no value in preserving the forwarding table on either end, nor in sending traffic to the failed or restarting device. When the active Tier-0 SR goes down, the route advertised from the standby Tier-0 becomes the best route and forwarding continues using the newly active SR. If the TOR switch supports BFD, it is recommended to run BFD on both eBGP neighbors for faster failure detection.
It is recommended to enable GR if the Edge node is connected to a single dual supervisor system that supports forwarding traffic when the control plane is restarting. This will ensure that the forwarding table is preserved and forwarding will continue through the restarting supervisor or control plane. Enabling BFD with such a system depends on the hardware vendor's BFD implementation. If the BFD session goes down during supervisor failover, then BFD should not be enabled with this system; however, if the BFD implementation is distributed such that the BFD session would not go down in case of supervisor or control plane failure, then enable BFD as well as GR.

Suppose the edge node connects to multiple dual supervisor systems. In that case, the network
architect will have to choose what type of failover mechanism to prioritize between graceful
restart and traffic rerouting over a different link.

Stateful active/active gateways


By definition, a stateful service maintains a context, a “state” for the traffic it is dealing with.
The service also needs to see both directions of a flow in order to maintain the state, and this
requirement is typically not met when a gateway is implemented using multiple active service
routers (SRs), as illustrated IN FIGURE 4-39:


Figure 4-39: stateful service and A/A service routers

Here, VM A attached to an NSX overlay segment is establishing a connection to a physical host B. Northbound traffic is routed through T0SR3 on edge3. However, because of ECMP in the physical network, the return traffic from B to A is presented to T0SR1, on edge1. This SR does not have the state for the flow initiated by A and the stateful service inspection cannot be performed.
In most cases, NSX solves this issue by simply requiring that gateways hosting stateful services
run in active/standby. With this redundancy model, a single SR is responsible for forwarding all
the traffic, hence seeing both directions for all flows.
In some scenarios, it is desirable to scale out the number of edges forwarding traffic while
running stateful services, meaning that we need to make the topology described in FIGURE 4-39
work. NSX 4.1 introduces the capability of running stateful active/active gateway for this
purpose. This section describes at a high level how this feature is implemented as well as its
design implications.

4.5.3.1 Stateful Active/Active gateways, technical overview


When running a gateway in active/active, multiple SRs are active for this gateway. As we saw with FIGURE 4-39, ECMP in NSX and the physical infrastructure make it impossible to predict which SR is going to receive the packet for a given flow. It is not scalable to synchronize the state for all services in real time across all active SRs, so the solution is to redirect flows so that their traffic hits the same SR consistently in both directions. We will also see the impact this model has on redundancy.

4.5.3.1.1 Redirecting traffic to the appropriate SR


The following diagram introduces the additional switching infrastructure that was added to NSX
in order to allow redirecting traffic to the appropriate SR.


Figure 4-40: punt logical switches

The Tier0 represented in FIGURE 4-40 is running in a stateful active/active mode. Its SRs are
spread across four different edges. NSX has created two “punt” overlay segments and each SR
is attached to those two logical switches with a shadow port. In the above example, the shadow
port of SR4 connecting to the north punt logical switch is highlighted. SR4 will treat traffic
received on this shadow port as if it had been received on the Tier0 uplink of SR4. A similar
shadow port connects SR4 to the south punt logical switch. Traffic received on this shadow port
will be treated as if it had been received on the internal port (sometimes referred to as the
“backplane port”) connecting SR4 to other distributed routers (DRs).
This additional switching infrastructure will be used to redirect traffic for a specific flow to a
deterministic SR. The following FIGURE 4-41 illustrates how this is achieved in the “northbound”
direction. VM A sends traffic towards B and ECMP inside NSX forward the packets of this flow to
T0SR3, on edge3.


Figure 4-41: northbound traffic

When edge3 receives this packet from A, a hash is taken based on the destination IP address of
the traffic. The resulting value is used as an index that selects SR4 in a table including all the SRs
for the Tier0 gateway. Edge3 uses the south punt logical switch to switch the packet to the
shadow port on SR4. Here, SR4 treats this packet as if it had been received on its backplane port and routes it directly to its destination. Any packet to destination IP address B will be redirected
to SR4 with this model. Of course, the same mechanism is applied in the southbound direction,
as represented in the next diagram.

Figure 4-42: southbound traffic

In FIGURE 4-42, host B replies to VM A. Its packets are routed by the physical infrastructure to an
arbitrary active SR, here SR1. When entering SR1, a hash is taken on the source IP address this
time. This allows selecting the same SR in the table as the one that was selected using the
destination IP address in the northbound direction. SR1 immediately switches those packets on the northbound punt logical switch to the shadow port on SR4, where they are accepted as if they had been received on SR4's uplink.
This example is of course the worst-case scenario where traffic needs to be redirected in each
direction. Still, thanks to scaling out to four active SRs, the overall forwarding capability in/out
NSX is greatly enhanced compared to a single active SR.
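The symmetry of the pinning logic can be modeled with a few lines of Python. The hash function below is purely illustrative and is not the algorithm used by the edge datapath; the point is that hashing on the destination IP northbound and on the source IP southbound evaluates the same value (the external host address) and therefore selects the same SR.

import hashlib

T0_SRS = ["SR1", "SR2", "SR3", "SR4"]

def owner_sr(external_ip):
    # Index into the list of active SRs based on a hash of the external IP address.
    digest = hashlib.sha256(external_ip.encode()).digest()
    return T0_SRS[digest[0] % len(T0_SRS)]

host_b = "203.0.113.25"                       # sample external host address
print("northbound (hash on destination IP):", owner_sr(host_b))
print("southbound (hash on source IP):     ", owner_sr(host_b))   # same SR both ways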

4.5.3.2 Stateful Active/Active Tier1 gateways


So far, we have only represented active/active Tier0 gateways, with uplinks pointing to the
physical infrastructure. The new active/active stateful gateways feature also enables
active/active stateful Tier1 gateways. NSX typically provides great flexibility in the deployment


of Tier1 SRs. They could be on a dedicated edge cluster for example, and the user can explicitly
designate on which edges active and standby SRs will be located. Active/active stateful Tier1
gateways will only run in the very specific configuration where they share their edges with a
corresponding active/active Tier0 gateway, as represented below:

Figure 4-43: active/active stateful Tier1 gateway

This model simplifies the redundancy scheme by sharing the fate of the Tier1 SR and the Tier0 SR; it also allows using the same pair of north/south punt logical switches for both gateways.

4.5.3.3 Redundancy model


The hashing mechanism ensures that a given flow always hits the same SR. Now, what if this SR goes down? The state associated with the flow must be saved continuously to the SR that will take over for the failed one. To achieve that, NSX automatically splits the edge cluster into sub-clusters, each sub-cluster comprising two edges. Edges in the same sub-cluster sync their state and back each other up. Of course, for this model to work properly, the feature expects an even number of edges in the cluster.

Figure 4-44: sub-clusters

The user cannot explicitly select which edge is part of a sub-cluster; however, NSX attempts to group in a sub-cluster edges from different FAILURE DOMAINS.


4.5.3.4 Design considerations


Services are configured on interface groups.
In FIGURE 4-41, we could see that traffic received on an uplink of SR1 could be re-injected into SR4 as if it had been received on the uplink of SR4. But what if SR4 had multiple uplinks? Suppose that the Tier0 had two uplinks, one connecting to the Internet and one leading to a DMZ. The administrator configures SNAT, a stateful service, for the traffic toward the Internet only. When the traffic for a flow subject to SNAT is received on the "Internet" uplink of SR1 and redirected to SR4, how can NSX identify which of the two uplinks of SR4 corresponds to the connection to the Internet? There is no obvious relation between the "Internet" uplink of SR1 and the "Internet" uplink of SR4; they can be in completely different subnets. The administrator will thus have to explicitly provide this information by grouping uplinks into "interface groups". The following example represents the two interface groups that would be required when having separate uplinks going to the Internet and to a DMZ:

Figure 4-45: interface groups

When using a stateful active/active Tier1, an internal group representing the connectivity
between the Tier1 and the Tier0 is automatically created by the system.

Figure 4-46: auto-created interface group between Tier1 and Tier0

SRs must have identical connectivity: don’t count on inter SR routing.


Service interfaces are not available on stateful active/active gateways.


When a service is configured on an active/standby gateway, it is obvious that the active SR must be capable of providing this service. The same applies to a stateful active/active gateway
with multiple active SRs. Traffic can be arbitrarily distributed across all the SRs of that gateway.
As a result, each SR is supposed to have the capability of providing the service. This is also a
reason for the use of interface groups: a specific external uplink is only instantiated on one SR
and thus, it is not an appropriate object to which one can apply a service in the stateful
active/active model. Services can only refer to interface groups, ensuring the availability of an
appropriate uplink on all SRs. This is also a reason why service interfaces are not available on
stateful active/active gateways.
Another important consequence is that inter-SR routing with iBGP will not work for flows associated with stateful services. This is because the state for a given stateful flow is only
maintained on a single SR. Inter SR routing can redirect traffic to another SR if there is no local
external uplink leading to the destination, but this other SR will not have a state for the flow.

4.5.3.5 Design constraints


When a Tier0 is configured as stateful active/active, multiple design options for connected
Tier1s exist.
Suppose that a stateful active/active Tier0 has its SRs deployed on an edge cluster. A Tier1 connected to this Tier0 can either:
• be distributed only, meaning that it has no SR and cannot run stateful services.
• be stateful active/active, with its SRs deployed on the same edge cluster as the Tier0. The Tier1 SRs will end up running on the exact same edges as the Tier0 SRs.
• be active/standby, with its SRs deployed in a separate edge cluster from the Tier0.
The following table represents all the possible SR placement combinations for a Tier1 connected to a Tier0, including those that are not supported:


Tier0 mode           | Tier1 A/S, same cluster | Tier1 A/A (stateful), same cluster | Tier1 A/S, different cluster | Tier1 A/A (stateful), different cluster
Tier0 A/S            | Supported               | Not supported                      | Supported                    | Not supported
Tier0 A/A stateless  | Supported               | Not supported                      | Supported                    | Not supported
Tier0 A/A stateful   | Not supported           | Supported                          | Supported                    | Not supported

Figure 4-47: possible SR placements for Tier0 and Tier1

Introducing a stateful active/active gateway in an existing environment is subject to the following rules:
• An active/standby Tier0 cannot be converted to stateful active/active
• An active/standby Tier1 cannot be converted to stateful active/active
• A stateless active/active Tier0 with no attached Tier1 can be converted to stateful
active/active
The possible steps for introducing stateful active/active gateways are represented in the following diagram:

Figure 4-48: Introducing stateful active/active gateways.

Not all features are supported with this model.


Be aware that not all stateful services can be deployed on a stateful active/active edge. Please consult the release notes for the NSX version you intend to deploy in your environment for an updated list of features that are compatible with stateful active/active gateways.


4.5.3.6 Performance considerations


The drawback with the stateful active/active model is the additional overhead introduced by redirecting traffic on the punt logical switches. If ECMP distributes the traffic evenly across N edges, roughly a fraction 1/N of the traffic received by an edge does not need to be redirected. This fraction, and thus the efficiency of the model, decreases as additional edges are involved. But even in the worst-case scenario where all the traffic would be redirected, the model would provide the aggregate bandwidth of N/2 edges for traffic in/out of NSX. This always beats a single stateful edge and scales out linearly with the number of edges.
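The back-of-the-envelope arithmetic below illustrates this trade-off with sample numbers: the fraction of flows that must be punted grows with the number of edges, but even the worst case still yields roughly N/2 edges worth of aggregate bandwidth.

def stateful_aa_estimate(n_edges, edge_bandwidth_gbps):
    # With even ECMP distribution, about 1/N of the flows already land on their owner SR.
    punted_fraction = 1 - 1 / n_edges
    # Worst case (everything punted) still provides about N/2 edges of usable bandwidth.
    worst_case_aggregate_gbps = (n_edges / 2) * edge_bandwidth_gbps
    return punted_fraction, worst_case_aggregate_gbps

for n in (2, 4, 8):
    punted, aggregate = stateful_aa_estimate(n, edge_bandwidth_gbps=25)
    print(f"{n} edges: ~{punted:.0%} of flows punted, worst-case aggregate >= {aggregate:.0f} Gbps")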

NSX Edge High availability failover triggers


Two types of edge failover events exist. The first type works at the edge node level and causes
all the SRs running on the node to fail. In this case the SRs running on the corresponding peer
edge node will take over. Peer edge nodes are defined based on each Tier-0 or Tier-1 SR. Two
edge nodes are considered peers for a service if that service runs in active/standby or
active/active across those edge nodes. For each active/standby gateway, we have two peer
edge nodes. For an active/active gateway, we can have up to 8 edge peer nodes. The peer node
relationship is not exclusive. Edge node one can have edge node two as a peer for a Tier-0
Gateway but have edge node three as a peer for a Tier-1 gateway (see FIGURE 4-49).

The second type of failover event affects an individual service or SR. In case of a failure of an individual SR, the corresponding SR on the peer edge node will take over, while other SRs may still be running on the original node.

Figure 4-49: Edge HA - Peer edge node definition


Failover events that belong to the first category will cause all services and SRs to failover to the
corresponding peer edge node. Those conditions are:
1. Dataplane service is not running.
2. All TEP interfaces are down. This condition only applies to bare metal edge nodes as the
interface of an edge node VM should never go “physically” down. If a bare metal edge
node has more than one TEP interface, all need to be down for this condition to take
effect.
3. The edge node enters maintenance mode
4. Edge node TEPs run BFD with compute hosts and other edge nodes' TEPs to monitor the health of the overlay tunnels. If host transport nodes are present, when all the overlay tunnels are down to both remote Edges and compute hypervisors, all the SRs on the edge will be declared down. This condition does not apply if no host transport nodes are present.
5. Edge nodes in an Edge cluster exchange additional BFD keepalives on two interfaces, the management and TEP interfaces. Each edge monitors the status of its peer edges on those two interfaces; if both are down, it considers the peer down and makes the corresponding SRs active.

Failover events that belong to the second category are (only the specific SR is declared down and individually fails over to the peer edge without affecting other SRs):
6. Northbound routing on a Tier-0 SR is down. This is applicable to BGP, and OSPF with or
without BFD enabled and to static routes with BFD enabled. When this situation occurs,
the Tier-0 SR only will be declared failed. Other Tier-1 SRs on the same edge will still be
active. If the Tier-0 is configured with static routes with no BFD, northbound routing will
never be considered down.
7. Services are configured on SR. The health score of services on the SR is less than that on
the peer SR.

We will now review in more detail the failover triggers that require more careful consideration from a design perspective.

FIGURE 4-50 outlines condition 5. The BFD sessions between two edge nodes are lost on both
the management and the overlay network. Standby Tier-0 and Tier-1 SRs will become active.
This condition addresses the scenario when an edge node fails or is completely isolated. When
designing edge node connectivity, it is important to consider corner-case situations when this
condition may apply, but the edge node is not down or completely isolated. For example, uplink
connectivity could be up while the overlay and management networks could be down if the uplink
traffic used dedicated pNICs. Designing for fate sharing between uplink, management, and
overlay traffic will mitigate the risk of a dual-active scenario. If dedicated pNICs are part of the
design, providing adequate pNIC redundancy to management and overlay traffic will ensure
that condition 5 is only triggered because of the complete failure of an edge.


Figure 4-50: Failover triggers – BFD on management and overlay network down

FIGURE 4-51 below outlines condition 6. A northbound connectivity issue is detected by the
dynamic routing protocol timers or BFD. In this case, the affected Tier-0 only will undergo a
failover event. Any other Tier-1 SR on the same edge will remain active. Static routes without
BFD do not allow to detect a failure in the uplink network. For this reason, configurations
including static routes should always include BFD.

Figure 4-51: Failover triggers – northbound routing down

FIGURE 4-52 below outlines condition 4. In this case, edge node one did not fail, but it is not able
to communicate to any other transport node on the overlay network. This condition mitigates
situations when an edge node does not fail, but connectivity issues exist. All the SRs on the
affected edge node, including all Tier-0 and Tier-1 SRs, will failover. This condition only applies
to deployments including host transport nodes, which can help discriminate between the
failure of the peer edge and a network connectivity problem to it.


Figure 4-52: Failover triggers – All tunnels down (with host transport nodes)

FIGURE 4-53 outlines condition 4 again, but this time no host transport node is present in the
topology. The all overlay tunnel down condition does not put the SRs running on edge node 2 in
a failed state. In the scenario depicted in the diagram, edge node one failed, and the SRs on
edge two took over even if all the overlay tunnels on edge 2 were down. If the all tunnel
condition applied to deployment without host transport nodes, the SRs running on edge node
two would go into a fail state, blackholing the traffic. This topology applies to VLAN-only
deployments, where workloads are connected via service interfaces.
Note: VLAN-Only deployments with service interfaces must use a single TEP. Multi-TEP
configurations will cause the all overlay tunnel down condition to disable the surviving node.


Figure 4-53: Failover triggers – All tunnels down (without host transport nodes)

We should pay special attention to the implications of deploying multiple edge node VMs on the same host and/or on hosts with TEP interfaces on the same network (possible starting with NSX version 3.1) on the "all tunnels down" failover trigger. A failure of the upstream overlay transport network (pNIC or switch level failure) will not bring down the tunnels local to the host, those between edge VMs on the same host, or to the host TEP (see FIGURE 4-54). This situation may lead the edge node VMs not to trigger a failover if northbound routing connectivity is up. Traffic that the upstream network forwards to the affected edge node VMs will be blackholed because of the lack of access to the overlay network.


Figure 4-54: Failover Triggers - All tunnels Down - Tunnels local to ESXi host

This condition can be avoided by designing for fate sharing between the overlay and upstream
network. If that is not a possibility based on the design requirements, we should eliminate the
possibility of establishing overlay tunnels within a host with one of these options:
• No more than one edge node VM per host on hosts not part of the same overlay TZ
• No more than one edge node VM per host on hosts part of the same overlay TZ but edge
and host overlay VLAN and subnet are different (BFD keepalives will need to be
forwarded to the physical network)
• With more than one edge node VM per host, edge node VMs should have different
overlay VLANs and subnets (BFD keepalives will need to be forwarded to the physical
network)

VRF Lite

VRF Lite Generalities


Virtual Routing Forwarding (VRF) is a virtualization method that consists of creating multiple
logical routing instances within a physical routing appliance. VRF instances are commonly used
in enterprise and service providers networks to provide control and data plane isolation,
allowing several use cases such as overlapping IP addressing between tenants, isolation of
regulated workload, isolation of external and internal workload as well as hardware resources
consolidation. With NSX, it is possible to extend the VRFs present on the physical network onto
the NSX domain. Creating a development environment that replicates the production
environment is a typical use case for VRF.
Another representative use case for VRF is when multiple environments need to be isolated
from each other. As stated previously, VRF instances are isolated between each other by
default; allowing communications between these environments using the Route Leaking VRF


feature is possible. While this feature allows inter-VRF communications, it is important to emphasize that scalability can become an issue if a design permits all VRFs to communicate with each other. In this case, VRF might not be the right option. VRFs should not be used in lieu of the DFW construct.
FIGURE 4-55 pictures a traditional VRF architecture. A logical (Switch Virtual Interface – VLAN interface) or physical interface should be dedicated to a single VRF instance. In the following diagram, interfaces e1/1 and e2/1 belong to VRF-A while interfaces e1/2 and e2/2 belong to VRF-B. Each VRF runs its own dynamic routing protocol (or uses static routes).

Figure 4-55: Networking VRF architecture

With NSX 2.x, several multi-tenant designs are possible.


The first option was to deploy a Tier-1 gateway for each tenant while a shared Tier-0 provides
connectivity to the physical networking fabric for all tenants. In this configuration the Tier-0
gateway has a single routing table for all the tenants, requiring unique IP addresses for each
tenant or NAT implemented at the tenant Tier-1 gateway level when overlapping IPs exist.
FIGURE 4-56 diagrams that option.


Figure 4-56: NSX 2.x multi-tenant architecture – Shared Tier-0 Gateway

Another supported design is to deploy a separate Tier-0 gateway for each tenant on a
dedicated tenant edge node. This configuration allows for duplicated IP addresses between
tenants, but requires dedicated edge nodes per tenant (no more than a single Tier-0 SR can run
on each edge node). While providing the required separation, this solution may be limited from
a scalability perspective.
FIGURE 4-57 shows a traditional multi-tenant architecture using dedicated Tier-0 per tenant in
NSX 2.X .


Figure 4-57: NSX 2.x multi-tenant architecture. Dedicated Tier0 for each tenant

In traditional networking, VRF instances are hosted on a physical appliance and share the
resources with the global routing table. Starting with NSX 3.0, Virtual Routing and Forwarding
(VRF) instances configured on the physical fabric can be extended to the NSX domain. A VRF
Tier-0 gateway must be associated to a traditional Tier-0 gateway identified as the “Parent Tier-
0”. FIGURE 4-58 diagrams an edge node hosting a traditional Tier-0 gateway with two VRF
gateways. The control plane is completely isolated between all the Tier-0 gateway instances.

Figure 4-58: Tier-0 VRF Gateways hosted on a Parent Tier-0 Gateway

The parent Tier-0 gateway can be considered as the global routing table and must have
connectivity to the physical fabric. A unique Tier-0 gateway instance (DR and SR) will be created


and dedicated to a VRF. F IGURE 4-59 shows a detailed representation of the Tier-0 VRF gateway
with their respective Service Router and Distributed Router components.

Figure 4-59: Detailed representation of the SR/DR component for Tier-0 VRF hosted on an edge node

FIGURE 4-60 shows a typical single tier routing architecture with two Tier-0 VRF gateways
connected to their parent Tier-0 gateway. Traditional segments are connected to a Tier-0 VRF
gateway.

Figure 4-60: NSX 3.0 multi-tenant architecture. Dedicated Tier-0 VRF Instance for each VRF


Since the control plane is isolated between the Tier-0 VRF instances and the parent Tier-0 gateway, each Tier-0 VRF needs its own network and routing constructs:
• Segments
• Uplink Interfaces
• BGP configuration with dedicated peers or Static route configuration
NSX 3.0 supports BGP and static routes for the Tier-0 VRF gateway. It offers the flexibility to use
static routes on a particular Tier-0 VRF while another Tier-0 VRF would use BGP. The OSPF
routing protocol is not supported for VRF topologies.
FIGURE 4-61 shows a topology with two Tier-0 VRF instances and their respective BGP peers on
the physical networking fabric. It is important to emphasize that the Parent Tier-0 gateway has
a BGP peering adjacency with the physical routers using their respective global routing table
and BGP process.
From a data plane standpoint, 802.1q VLAN tags are used to differentiate traffic between the
VRFs instances as demonstrated in the following figure.

Figure 4-61: BGP peering Tier-0 VRF gateways and VRF on the networking fabric

When a Tier-0 VRF is attached to a parent Tier-0, multiple parameters are inherited by design and cannot be changed (a configuration sketch follows these lists):


• Edge Cluster
• High Availability mode (Active/Active – Active/Standby)
• BGP Local AS Number (Starting in NSX 4.1.0, VRFs can have an AS number that is
different from the parent T0 AS number)
• Internal Transit Subnet
• Tier-0, Tier-1 Transit Subnet.
All other configuration parameters can be independently managed:
• External Interface IP addresses
• BGP neighbor
• Prefix list, route-map, Redistribution
• Firewall rules
• NAT rules
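The configuration sketch below illustrates how a Tier-0 VRF could be attached to its parent through the Policy API; the vrf_config structure, object IDs and credentials are assumptions and sample values to validate against the API reference for the NSX version in use.

import requests

NSX = "https://nsx-manager.example.com"      # hypothetical manager FQDN
AUTH = ("admin", "VMware1!")

vrf_tier0 = {
    "display_name": "T0-VRF-A",
    "vrf_config": {
        # The parent Tier-0 provides the edge cluster, HA mode and transit subnets.
        "tier0_path": "/infra/tier-0s/T0-PARENT"
    }
}

requests.patch(f"{NSX}/policy/api/v1/infra/tier-0s/T0-VRF-A",
               json=vrf_tier0, auth=AUTH, verify=False).raise_for_status()

# Uplink interfaces, BGP neighbors, route maps, firewall and NAT rules are then
# configured on the VRF Tier-0 independently of the parent.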

VRF Lite HA
As mentioned previously, the Tier-0 VRF is associated with a Parent Tier-0 and follows the high availability mode and state of its Parent Tier-0.
Both the Active/Active and Active/Standby high availability modes are supported on Tier-0 VRF gateways. It is not possible to have an Active/Active Tier-0 VRF associated with an Active/Standby Parent Tier-0 and vice-versa.

In a traditional Active/Standby design, a Tier-0 gateway failover can be triggered if all northbound BGP peers are unreachable. Similar to the high availability construct between the Tier-0 VRF and the Parent Tier-0, the BGP peering design must match between the VRF Tier-0 and the Parent Tier-0.
Inter-SR routing is not supported in Active/Active Tier-0 VRF topologies. FIGURE 4-62 represents a BGP instance from both the parent Tier-0 and the Tier-0 VRF point of view. This topology is supported as each Tier-0 SR (on the parent and on the VRF itself) has a redundant path towards the network infrastructure. The Top of Rack switches in this case advertise the same prefixes towards the NSX Tier-0.
Both the Parent Tier-0 gateway and the Tier-0 VRF gateway peer with the same physical networking devices. The Parent Tier-0 gateway establishes BGP adjacencies with both top of rack switches in the global routing table, while the Tier-0 VRF gateways establish BGP adjacencies with both top of rack switches on another BGP process dedicated to the VRF. The Tier-0 VRF thus retains physical redundancy towards the networking fabric if one of its northbound links fails.


Figure 4-62: Supported BGP peering Design

FIGURE 4-63 represents an unsupported VRF Active/Active design where different routes are
learned from different physical routers. Both the Parent Tier-0 and VRF Tier-0 gateways are
learning their default route from a single physical router. At the same time they receive specific
routes from a different single BGP peer. This kind of scenario would be supported for traditional
Tier-0 architecture as Inter-SR would provide a redundant path to the networking fabric, but it
is not supported for VRFs because of the lack of inter-SR peering capability. This VRF
architecture is not supported in NSX 4.1.


Figure 4-63: Unsupported Active/Active Topology with VRF

FIGURE 4-64 demonstrates the traffic being blackholed when one internet router fails and the Tier-0 VRF gateways cannot leverage another redundant path to reach the destination. Since the Parent Tier-0 gateway still has an established BGP peering adjacency (the VRF's HA state is inherited from the parent), failover will not be triggered, and traffic will be blackholed on the Tier-0 VRF. This situation would not arise if each VRF SR peered with every ToR.

Figure 4-64: Unsupported Active-Active Topology with VRF – Failure


On the Parent Tier-0:


1. VM “172.16.10.0” sends its IP traffic towards the internet through the Tier-0 DR.
2. Since the Tier-0 topology is Active/Active, the Tier-0 DR can send the traffic to either Tier-0 SR1 or Tier-0 SR2 using a 2- or 5-tuple hashing algorithm.
3. From a Tier-0 SR1 point of view, the traffic that needs to be routed towards the internet is sent towards Tier-0 SR2, as there is an inter-SR BGP adjacency and Tier-0 SR2 learns the route from another internet switch.
4. Traffic is received by the Tier-0 SR2 and routed towards the physical fabric.

On the Tier-0 VRF:


1. VM “172.16.10.0” sends its IP traffic towards the internet through the Tier-0 DR.
2. Since the Tier-0 topology is Active/Active, the Tier-0 DR can send the traffic to either Tier-0 SR1 or Tier-0 SR2 using a 2- or 5-tuple hashing algorithm.
3. From a Tier-0 SR1 point of view, the traffic is blackholed as there is no inter-SR BGP adjacency between the Tier-0 VRF SRs.
Following the same BGP peering design and principle for Active/Standby topologies is also
mandatory for VRF architectures as the Tier-0 VRF will inherit the behavior of the parent Tier-0
gateway.
FIGURE 4-65 represents another unsupported design with VRF architecture.


Figure 4-65: Unsupported design BGP architecture Different peering with the networking fabric

In this design, traffic will be blackholed on the Tier-0 VRF SR1 when the internet router fails. Since the Tier-0 VRF shares its high availability running mode with the Parent Tier-0, it is important to note that the Tier-0 SR1 will not fail over to the Tier-0 SR2. The reason is that a failover is triggered only if all northbound BGP sessions change to the "down" state on the parent Tier-0 SR.
Since the parent Tier-0 gateway still has an active BGP peering with a northbound router on the physical networking fabric, failover will not occur and traffic will be blackholed on the VRF that has only one active BGP peer.
This situation would not arise if each VRF SR peered with every ToR.

Connecting Tier-1 gateways to VRFs


A multi-tier routing topology is supported to add granularity and flexibility to a multi-tenant data center architecture. Traditional Tier-1 gateways can be connected to a Tier-0 VRF gateway as demonstrated in FIGURE 4-66.


Figure 4-66: Multi-Tier routing architecture and VRF-lite

Stateful services can run on either a Tier-0 VRF gateway or a Tier-1 gateway, except for VPN and Load Balancing as these features are not supported on a Tier-0 VRF. The Tier-0 SR in charge of the stateful services for a particular VRF will be hosted on the same edge nodes as the Parent Tier-0 Gateway. It is recommended to run the stateful services on Tier-1 gateways and leverage an Active/Active Tier-0 gateway to send the traffic northbound to the physical fabric.
FIGURE 4-67 represents stateful services running on traditional Tier-1 gateway SRs.


Figure 4-67: Stateful Services supported on Tier-1

VRF route leaking


By default, data-plane traffic between VRF instances is isolated in NSX. By configuring inter-VRF routing, traffic can be exchanged between VRF instances. Before NSX version 4.1, exchanging routes between VRFs was already possible, but a convoluted procedure leveraging static routes was required. In NSX 4.1, a more elegant workflow is implemented, and it should be used in any new deployment. The new approach internally leverages BGP and Route Targets to exchange routes between VRFs. Still, such implementation details are hidden from the user, who simply has to enable inter-VRF routing in the BGP settings for the VRF and optionally control which routes will be leaked via routing filters.
The new workflow is outlined HERE .
The legacy workflow is still supported for backward compatibility, as it is outlined HERE .
In both scenarios stateful services such as gateway firewall and NAT cannot apply to inter-VRF
traffic.

VRFs shared Internet access over a common VLAN


In some situations, different VRF instances require access to the Internet via a shared
connection. This configuration can be accomplished on the physical network via VRF route-
leaking and VRF aware NAT.


If the intention is to minimize the physical network configuration and simply provide a common
VLAN where all the NSX VRFs can access the Internet, this configuration can be accomplished by
deploying a shared T0 Gateway dedicated to Internet access as demonstrated in the following
figure (FIGURE 4-68)

Figure 4-68: Different VRFs sharing Internet access via common Internet Tier-0

Directly connecting the different VRFs to a common VLAN is not allowed in NSX 4.1.
The VRFs can be interconnected to the shared T0 via dedicated overlay segments. The shared
Tier-0 gateway is providing internet access using a shared VLAN segment.
Internet-facing NAT can be implemented on the shared T0 if the VRFs are not carrying overlapping
IP spaces, or at the VRF level on the uplink facing the shared Tier-0 if they are.
eBGP peering between the VRFs Tier-0 Gateways and the shared Tier-0 gateway is
recommended to protect against an edge node failure.

Multi-Protocol BGP EVPN - VXLAN


In a multi-tenancy network architecture based on VRF, each instance needs to establish its own BGP peering with the corresponding VRF on the Top of Rack switches.

Each VRF instance needs its own network constructs:


• NSX Segments and VLANs
• Subnets and IP addresses
• BGP Configuration
• BFD timers
With a substantial number of tenants, the number of network objects needed both on the
physical network and within NSX can introduce an operational issue.
Using NSX and EVPN VXLAN overcomes that operational issue as all the VRF prefixes are
advertised over a single BGP session. BGP is one of the most popular routing protocols in Data
Centers as it provides an unmatched interoperability, granularity and flexibility over the learned
and advertised Network Layer Reachability Information (NLRI). MP-BGP is an extension of BGP
and leverages multiple address families like IPv4/IPv6 unicast, VPNv4/VPNv6 to distribute
routing information.
With the growing popularity of VXLAN in the Data Center, an additional address family and sub-address family have been implemented in the MP-BGP protocol to alleviate the native flood-and-learn behavior of VXLAN, which makes scalability an issue in large data centers. The EVPN NLRI is a newer addition to the MP-BGP protocol and is carried in the L2VPN Address Family (AFI 25) with the Subsequent Address Family EVPN (SAFI 70). The L2VPN/EVPN address family capability has to be exchanged during the initial BGP OPEN message.
FIGURE 4-69 compares the different NSX multi-tenancy architectures. The left end side of the
picture represents a native VRF-lite architecture requiring a BGP peering per tenant. The right
end side of the picture represents a VRF architecture based on EVPN using a single MP-BGP
peering for all tenants.


Figure 4-69: VRF Lite vs EVPN peering model

NSX supports two EVPN implementations in the data center:


• NSX EVPN Inline Mode
• NSX EVPN Route Server Mode
This design guide will focus on the EVPN inline mode. The EVPN Route server mode will have its
own design guide as it is targeted towards telco providers.

MP-BGP EVPN for the Control Plane


As stated previously, MP-BGP uses the L2VPN address family and the EVPN sub-address family.
In the NLRI field, different EVPN route types can be used to advertise MAC addresses and IP prefixes, as demonstrated in the next table. NSX EVPN Inline mode supports EVPN Route
Type 5 to advertise VRF prefixes between the Tier-0 gateway and the physical Top of Rack
Switch.


Figure 4-70: EVPN Route Types

This EVPN route type can advertise either IPv4 or IPv6 prefixes with a variable network mask length (0 to 32 bits for IPv4 and 0 to 128 for IPv6). This route type allows inter-subnet forwarding and overlapping IP address schemes with the help of route distinguishers.
FIGURE 4-71 represents the fields included in an EVPN Route Type 5.

Figure 4-71: EVPN Route Type 5 Fields
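
As a hedged illustration of the fields shown in FIGURE 4-71, the sketch below models an EVPN Route Type 5 NLRI as a simple data structure. The field list follows the standard IP Prefix route definition (RFC 9136); the class and field names are illustrative and not an NSX data model.

```python
from dataclasses import dataclass

@dataclass
class EvpnType5Route:
    """Illustrative model of an EVPN Route Type 5 (IP Prefix route) NLRI."""
    route_distinguisher: str   # e.g. "65001:10", makes overlapping prefixes unique
    ethernet_segment_id: int   # ESI, typically 0 for inter-subnet routing
    ethernet_tag_id: int       # usually 0 for this route type
    prefix_length: int         # 0-32 for IPv4, 0-128 for IPv6
    ip_prefix: str             # the tenant prefix being advertised
    gateway_ip: str            # optional overlay next hop
    vni: int                   # VXLAN VNI carried in the label field

# Example: a tenant prefix advertised by a Tier-0 VRF (values are made up).
route = EvpnType5Route("65001:10", 0, 0, 24, "172.16.10.0", "0.0.0.0", 75001)
print(route)
```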

When designing an EVPN architecture based on NSX, the following network constructs must be
taken into account:
• Segments used for the uplink interfaces


• Uplink interfaces for the Parent Tier-0


• MP-BGP configuration – L2VPN/EVPN Address Family
• Route Distinguisher
• Route Targets
• VXLAN Virtual Network Identifier

4.7.1.1 Uplink interfaces, Loopback and Segment


Since NSX supports a single VXLAN TEP on the parent Tier-0 gateway, it is recommended to use a loopback interface.
The BGP architecture has no requirements different from a traditional non-EVPN architecture: redundancy must be present in the BGP peering towards the physical fabric.
It is recommended to have an edge cluster with at least two nodes. The Tier-0 deployed on this edge cluster should have interfaces (two per edge node) connected to different Top of Rack switches. These interfaces should use /28 subnets to allow the design to scale up to 8 ECMP edge nodes, as demonstrated in FIGURE 4-72.

Figure 4-72: BGP Peering for EVPN

From a BGP standpoint, there are two main requirements to set up the basic MP-BGP peering:


• TEP interfaces (loopbacks) need to have reachability (Top of Rack loopback interfaces to the NSX Tier-0 loopbacks hosted on each edge node)
• L2VPN/EVPN address family negotiated between the Top of Rack switch and the Tier-0 Service Router.
To achieve connectivity between the loopbacks hosted on the Tier-0 SR and the Top of Rack switches, two designs are available, as demonstrated in FIGURE 4-73. The first design establishes BGP peering between the NSX Tier-0 SR and the Top of Rack switches over their uplink interfaces. The loopbacks must be redistributed into BGP so that the entire network fabric knows these TEP IP addresses and can reach the different VRF prefixes hosted in the NSX domain. The second design uses BGP multi-hop between the loopback interfaces. To achieve reachability between the loopbacks in this case, static routing or OSPF is needed, otherwise the BGP multi-hop peering will never reach the "established" state.
The first design with loopback redistribution is recommended as it simplifies the overall architecture and reduces the operational load.

Figure 4-73: Peering over uplink interfaces with loopback redistribution into BGP (recommended) vs multi-hop E-BGP and
peering over the loopback interfaces
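
For readers automating this setup, the sketch below outlines how the recommended design could be pushed through the NSX Policy API from Python: a loopback interface on the Tier-0 locale services (acting as the VXLAN TEP) and a BGP neighbor negotiating the L2VPN/EVPN address family. The endpoint paths, payload fields, IDs, and addresses are assumptions based on the Policy API and should be validated against the API reference for your NSX version.

```python
import requests

NSX = "https://nsx-manager.example.com"   # hypothetical manager FQDN
AUTH = ("admin", "password")              # use proper credentials/certificates in practice
T0 = "/policy/api/v1/infra/tier-0s/evpn-t0/locale-services/default"

# 1. Loopback interface used as the VXLAN TEP on the parent Tier-0 SR.
loopback = {
    "type": "LOOPBACK",
    "subnets": [{"ip_addresses": ["192.0.2.11"], "prefix_len": 32}],
    "edge_path": "/infra/sites/default/enforcement-points/default/"
                 "edge-clusters/ec-1/edge-nodes/en-1",
}
requests.patch(f"{NSX}{T0}/interfaces/loopback-en1",
               json=loopback, auth=AUTH, verify=False)

# 2. eBGP neighbor to the ToR, enabling the L2VPN/EVPN address family.
neighbor = {
    "neighbor_address": "10.10.10.1",
    "remote_as_num": "65001",
    "route_filtering": [{"address_family": "L2VPN_EVPN"}],
}
requests.patch(f"{NSX}{T0}/bgp/neighbors/tor-a",
               json=neighbor, auth=AUTH, verify=False)
```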

The Route Distinguisher field in a Type 5 EVPN route has the following functionalities and properties:
• It allows VRF prefixes to be considered unique. If two tenants advertise the same prefix, the Tier-0 (and the network fabric) must have a way to differentiate these two overlapping prefixes so that they are globally unique among the different isolated routing instances.
• As a result, the Route Distinguisher must be unique per VRF. It is an 8-byte identifier.
• NSX can auto-assign Route Distinguishers, or an administrator can configure them.
• Network engineers traditionally use the format "BGP_AS:VRF" to configure route distinguishers.
The following figure shows a multi-tenancy architecture based on a single MP-BGP EVPN session that advertises the same prefix owned by three different tenants, using route distinguishers to make the routes unique.
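
As a small companion to that example, the following sketch shows, under the "BGP_AS:VRF" convention mentioned above, how a per-VRF route distinguisher keeps an otherwise identical prefix unique across tenants. It is a conceptual illustration, not NSX code.

```python
# Three tenants advertise the same prefix; prepending a per-VRF route
# distinguisher makes each VPN route globally unique.
BGP_AS = 65001
tenants = {"tenant-red": 10, "tenant-green": 20, "tenant-blue": 30}
prefix = "172.16.10.0/24"

for name, vrf_id in tenants.items():
    rd = f"{BGP_AS}:{vrf_id}"        # "BGP_AS:VRF" convention from the text
    vpn_route = f"{rd}:{prefix}"     # RD + prefix = unique EVPN Type 5 route key
    print(name, "->", vpn_route)
# 65001:10:172.16.10.0/24, 65001:20:172.16.10.0/24, 65001:30:172.16.10.0/24
```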

The role of the Route Target configuration per VRF is as follows (a small matching sketch follows this list):

• A route target is a BGP extended community configured to import/export prefixes into specific VRFs.
• When a prefix is exported by NSX with a route target that matches the VRF import route target configured on the Top of Rack switch, the prefix is imported into that particular VRF routing table.
• When a prefix is exported by NSX with a route target that does not match the VRF import route target configured on the Top of Rack switch, the prefix is not imported into that particular VRF routing table.
• The route target import/export statements should match between the NSX Tier-0 and the Top of Rack switches and should be unique per VRF instance/tenant.
• It is possible to configure multiple route target import/export statements per VRF.
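
The minimal sketch below captures the import decision described in the list above: a received prefix is imported into a VRF only if one of its exported route targets matches an import route target configured on that VRF. It is a conceptual illustration only.

```python
def import_prefix(prefix_route_targets: set[str], vrf_import_targets: set[str]) -> bool:
    # A prefix is imported when at least one exported RT matches an import RT.
    return bool(prefix_route_targets & vrf_import_targets)

# A prefix exported by NSX with RT 65001:10 lands only in the VRF importing that RT.
print(import_prefix({"65001:10"}, {"65001:10"}))  # True  -> imported
print(import_prefix({"65001:10"}, {"65001:20"}))  # False -> ignored
```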


The role of the Virtual Extensible LAN (VXLAN) ID is:

• Each VRF should be mapped to a unique VNI to differentiate the network traffic between tenants from a data plane standpoint.
• Geneve is the overlay encapsulation protocol used internally within NSX. VXLAN is used as an overlay between the Top of Rack switches and NSX.
• Since NSX already uses its own Geneve overlay to allow communication between NSX TEPs, the VXLAN VNI range should be unique.
• The VXLAN VNI range must be between 75001 and 16777215.

Figure 4-74: Relationships between Route Distinguisher, Route Target and VXLAN ID

FIGURE 4-75 summarizes the networking constructs needed per VRF to design and implement
MP-BGP EVPN with VXLAN between the Tier-0 gateways and the Top of Rack switches.


Figure 4-75: Inline EVPN Topology

The NSX edge node uses Geneve TEP interfaces for all NSX overlay traffic. To exchange network traffic with the Top of Rack switches, the Tier-0 must have a different TEP interface enabled for VXLAN. Since only one VXLAN TEP is supported on each edge node, loopback interfaces acting as VXLAN TEP interfaces are recommended in an NSX EVPN architecture.

Multicast routing within NSX


This section details the capabilities of NSX for routing multicast traffic and provides a high-level
technical overview of the implementation of its data plane operations. This knowledge is
helpful in understanding the recommendations listed in the last part of this chapter.

Multicast in a virtual environment


When considering a virtual network environment, it is essential to notice that a significant part of it is handled by the physical infrastructure. NSX implements most networking functions, but in the end, physical switches and routers are responsible for forwarding traffic


between the transport nodes. FIGURE 4-76 illustrates this point in the context of multicast
routing.

Figure 4-76: multicast between hosts and within a host

The diagram represents a multicast source located in the physical infrastructure sending to
multiple virtual machines on different ESXi hosts. Two kinds of multicast are involved here:
• multicast between physical devices, represented in dark green,
• multicast handling on the virtual switch inside the ESXi hosts, represented in light green.
More specifically, you can see that within ESXi-3, traffic has been replicated to VM1 and
VM3, but filtered to VM2.
When running multicast in a VLAN-backed environment (NSX VLAN segments, or simple
dvportgroups, if NSX is not present in the picture), the physical infrastructure can directly snoop
the IGMP traffic sent by the virtual machines and take care of the multicast replication and
filtering between the hosts. In that scenario, there is no difference between NSX and regular
vSphere networking: the virtual switch is identifying the receivers using IGMP snooping and
delivers the multicast traffic to the appropriate vnics. We will not elaborate on this capability in
this guide.
For overlay segments however, the physical infrastructure is unaware of the traffic being
tunneled and it’s up to NSX to handle efficiently the distribution of multicast traffic to the
appropriate transport nodes. This design guide section is going to focus on that scenario: the
replication/filtering of multicast traffic between transport nodes with an NSX overlay.


NSX multicast capabilities


As of NSX 4.1, Tier0 and Tier1 Gateways are capable of routing IPv4 multicast traffic between
overlay segments (no IPv6 multicast.) This capability exists within the NSX domain, but also
between the Tier0 Gateway and the physical infrastructure, thus allowing multicast sources and
receivers inside and outside NSX.
Between NSX and the physical infrastructure:
• The Tier0 gateway must be explicitly enabled for multicast routing and have at least one
uplink configured for multicast routing.
• The Tier0 gateway runs PIM sparse mode on its multiple uplinks (other PIM flavors are
not supported.)
• The rendez-vous point (RP) must be in the physical infrastructure. NSX does not provide
the RP functionality.
• The RP can be identified within NSX using static RP (different RPs can be assigned
statically on different IP multicast address ranges) or via multicast PIM bootstrap (BSR), a
standard protocol included in PIMv2 that is used to automatically find the RP in a
multicast network.
Inside NSX:
• Segments must be enabled for multicast routing (this is the default) and attached to a
gateway with multicast routing enabled.
• Just like Tier0 gateways, Tier1 gateways must be enabled explicitly for multicast routing
and must be connected to a Tier0 with multicast routing enabled.
• NSX snoops IGMPv2 messages from the virtual machines to identify multicast receivers.
NSX does not support IGMPv3 yet and source specific multicast (SSM) is not possible at
that stage.


Figure 4-77: NSX multicast routing capabilities

FIGURE 4-77 summarizes the kind of multicast connectivity provided by NSX with multicast routing. Sources and receivers can be inside or outside the NSX domain; the Tier0 gateway runs PIM SM toward the physical infrastructure, where the RP is located.

Technical overview
NSX routes multicast traffic between overlay segments thanks to its Tier0 and Tier1 gateways.
This section focuses on the data plane operations within NSX: how NSX achieves efficient
multicast replication between transport nodes. This multicast traffic is carried on an overlay
network and the physical infrastructure cannot look inside the tunnels NSX creates.

4.8.3.1 Control plane


As mentioned above, the Tier0 gateway runs PIM sparse mode and relies on an external
rendez-vous point. Other than that, there is absolutely no multicast routing protocol running
inside NSX between transport nodes. NSX simply distributes the IGMP reports sent by the
virtual machines to all the transport nodes within a gateway routing domain (the routing
domain of a gateway is the collection of transport nodes where this gateway is instantiated) so
that they get a complete picture of where the receivers are disseminated.

4.8.3.2 Data plane operations between Tier0 gateway and physical infrastructure
The Tier0 gateway connecting NSX to the physical infrastructure is implemented as multiple
Tier0-SRs on multiple edge transport nodes. FIGURE 4-78 illustrates a Tier0 made of four


active/active Tier0-SRs, each configured with two physical uplinks running PIM sparse mode.
Thanks to ECMP, unicast traffic can be sent and received on all the uplinks of those Tier0-SRs.
NSX is also spreading multicast traffic across the different Tier0-SRs on a per-multicast group
basis.
For a given destination multicast IP address, NSX uses a hash to determine a unique "active" Tier0-SR that will be responsible for handling traffic in and out of the NSX domain for this group.
Different groups are hashed to different Tier0-SRs, thus providing some form of load balancing
for multicast traffic. There is no configuration option to manually select the multicast active
Tier0-SR for a given multicast group.
All Tier0-SRs must have a configuration and connectivity to the physical infrastructure capable
of handling all the groups. If a Tier0-SR fails, the multicast groups it was handling will be
assigned to an arbitrary Tier0-SR that is still active. No multicast routing state is synchronized
between the Tier0-SRs, and all the affected multicast groups will see their traffic disrupted until
PIM has re-populated its tables. Note that only the groups that were associated with the failed Tier0-SR are impacted; other groups remain associated with their existing active Tier0-SR.
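
The sketch below illustrates this per-group hashing behavior and the effect of an SR failure. The hash function and the naive re-hash are illustrative assumptions; NSX's actual selection algorithm is internal and, as noted above, only the groups owned by the failed SR are reassigned.

```python
import zlib

def active_sr(group_ip: str, available_srs: list[str]) -> str:
    # Hash the multicast group address to pick one active Tier0-SR for that group.
    return available_srs[zlib.crc32(group_ip.encode()) % len(available_srs)]

srs = ["Tier0-SR1", "Tier0-SR2", "Tier0-SR3", "Tier0-SR4"]
print(active_sr("239.1.1.1", srs))   # one SR handles all traffic for this group

# On an SR failure, its groups move to a surviving SR and PIM state for those
# groups must be rebuilt. (This naive modulo re-hash only illustrates the idea;
# in NSX, groups owned by healthy SRs stay where they are.)
srs.remove("Tier0-SR1")
print(active_sr("239.1.1.1", srs))
```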

Figure 4-78: multicast between edges and physical infrastructure, external sources, internal receivers

NSX will only join the source tree of a multicast source in the physical infrastructure on one of
the uplinks of the multicast active Tier0-SR for that group. That means that external multicast
traffic will always enter NSX through the multicast active Tier0-SR. In the above FIGURE 4-78, the
multicast traffic for G1 is entering NSX through Tier0-SR1, the multicast active Tier0-SR for


group G1. Multicast group G2 could be associated to another Tier0-SR, like Tier0-SR3 in the
diagram, thus spreading inbound traffic across multiple Tier0-SRs.

Figure 4-79: multicast between edges and physical infrastructure, internal sources, external receivers.

When the source for the group is located inside the NSX domain, the traffic pattern might be
slightly different. As soon as the IP address of the source is known to the physical
infrastructure, a PIM join will be sent toward the IP address of the source, in order to join the
receiver to the source tree.
This PIM join can be forwarded by the physical infrastructure on any path toward the source.
Thanks to unicast ECMP, the IP address of the source is reachable via any Tier0-SR. In FIGURE
4-79 above, the external receiver for group G1 ended up joining the source tree rooted at S1 G1
through Tier0-SR1, the Tier0-SR active for G1. This is purely by chance. By contrast, the receiver
for group G2 ended up trying to join the source tree rooted at S2 G2 via Tier0-SR4.
This is not the Tier0-SR active for G2. Tier0-SR4 will itself need to join the source tree rooted
inside NSX in order to provide connectivity to the receiver. Because G2 is associated with Tier0-
SR3, the multicast traffic to G2 is always sent to Tier0-SR3. From there, it will be forwarded to
Tier0-SR4 so that it can reach the receiver that joined the source tree through an uplink of
Tier0-SR4. In the “south-north” direction, multicast traffic can thus hairpin to the multicast
active Tier0-SR before reaching its destination.


4.8.3.3 Data plane operations within NSX for segments attached to a Tier0 gateway
We have seen that the multicast traffic going in and out of the NSX domain for group G is always routed through the multicast active Tier0-SR for G. The rest of this section is dedicated to the multicast traffic flow within NSX. We are only going to represent traffic using the active Tier0-SR1 in the diagrams, but of course, traffic could be going through any other Tier0-SR, depending on the group.
The following diagram shows the path for multicast traffic initiated in the physical infrastructure and distributed inside NSX via the multicast active Tier0-SR1. In this scenario, we assume that all the receivers are attached to segments directly connected to the Tier0 gateway (in other words, there is no Tier1 gateway in the picture).

Figure 4-80: external source, receivers on Tier0 segments

1. Multicast traffic from the external source S reaches the Tier0-SR multicast active uplink.
There are receivers in the NSX domain, so this packet needs to be replicated to one or
multiple ESXi transport nodes.
2. For performance reasons, Tier0-SR1 is not going to replicate the traffic to the interested
transport nodes in the NSX domain. Instead, it’s going to “offload” the replication to an
ESXi host, picked arbitrarily in the Tier0 routing domain (again, the routing domain is the
group of transport nodes where the Tier0 gateway spans.) The host selected for
replication does not need to have a local receiver for the traffic, but the destination
multicast IP address is part of the hash used to select the host. This way, this multicast
offload is distributed across all the hosts in the routing domain, based on multicast
group. A note on the format of the packet that is forwarded to the ESXi-1 transport


node: this is a multicast packet tunneled unicast to ESXi-1. The destination IP address of
the overlay packet is the unicast IP address of a TEP in ESXi-1.
3. In this example, ESXi-1 has been selected to replicate the multicast packet. First, the
local Tier0-DR routes the traffic to any receiver on local VMs.
4. Then the Tier0-DR initiates a "hybrid replication" to all the remote hosts in the Tier0 routing domain that have at least one receiver (and only those hosts.) The next section will detail what this hybrid replication means; at this stage, just assume that it is a way of leveraging the physical infrastructure to assist packet replication between ESXi hosts.
5. At the final step, the remote Tier0-DRs receive the multicast packet and route it to their local receivers.
FIGURE 4-81 details the multicast packets flow when a source is in the NSX domain (on a
segment attached to a Tier0) and there are receivers inside and outside NSX.

Figure 4-81: internal source on Tier0 segment

1. The VM sends multicast traffic on the segment to which it is connected. It reaches the
local Tier0-DR. Note: there is no receiver on ESXi-1. If there were some in the source
segment, they would directly receive the multicast traffic, forwarded at Layer 2 (with
IGMP snooping.) If there were some receivers on other segments in ESXi-1, the T0-DR
would route the traffic directly to them too.
2. Again, we assumed that Tier0-SR1 is multicast active for the group. It is thus considered
as being the multicast router and needs to receive a copy of every multicast packet
generated in the NSX domain for this group. The Tier0-DR on ESXi-1 sends a unicast copy
of the multicast packet to Tier0-SR1 (multicast traffic encapsulated in unicast overlay.)


3. The Tier0-SR receives the traffic and routes it to the outside world on one of its uplinks
(or potentially through another Tier0-SR, as we have seen earlier.)
4. The Tier0-DR of ESXi-1 also starts hybrid replication to forward the multicast traffic to all
other ESXi transport nodes with receivers in the routing domain (the part dedicated to
hybrid replication will detail how this is achieved.)
5. The remote Tier0-DRs forward the multicast traffic to their local receivers.

4.8.3.4 Data plane operations within NSX: Tier0 and Tier1 segments combined
Until NSX 3.0, multicast routing was only possible on Tier0 gateways. NSX 3.1 introduces the
capability of routing multicast on Tier1 gateways. The handling of the Tier1 gateways is an
extension of the model described in the previous part for the Tier0 gateways.
When multicast routing is enabled on a Tier1 gateway, a Tier1-SR must be instantiated on an
edge. This Tier1-SR will be considered as the multicast router for the Tier1 routing domain and
will receive a copy of all multicast traffic initiated on a segment attached to this Tier1 gateway.
Then this Tier1-SR will be attached as a leaf to the existing multicast tree of its Tier0 gateway
(remember that in order to run multicast, a Tier1 gateway must itself be attached to a Tier0
gateway running multicast.)
The following diagram represents the traffic path of some external multicast traffic that needs
to be replicated within NSX to receivers off a Tier0 gateway as well as two Tier1 gateways
(green and purple.) Steps 1-5 are about replicating the traffic to segments attached to the Tier0 gateway and are thus identical to the previous example.

Figure 4-82: external source, receivers on Tier0 and Tier1 segments


1. The multicast traffic is received by the multicast active Tier0-SR for the group (Tier0-
SR1.)
2. The Tier0-SR1 offloads replication to an arbitrary ESXi transport node, here ESXi-1.
3. The Tier0-DR of the offload transport node (ESXi-1) routes the multicast traffic to local
receivers.
4. The Tier0-DR of ESXi-1 starts hybrid replication to reach all the ESXi hosts that have
receivers on segments attached to the Tier0 gateway.
5. The Tier0-DRs of those remote ESXi transport nodes route the traffic to their local
receivers.
So far, those steps have been exactly like those described in the previous part.
6. This is the first step specific to multicast on Tier1 gateways: the multicast active Tier0-SR1 initiates a hybrid replication targeting all the Tier1-SRs with receivers for this group.
7. The Tier1-SRs (green and purple) offload the multicast replication to an arbitrary ESXi transport node in their routing domain. In this respect, they act exactly the same way as the multicast active Tier0-SR does for its routing domain.
8. The ESXi transport nodes selected for offloading their Tier1-SR (ESXi-3 for the green
Tier1 and ESXi-4 for the purple Tier1) route the traffic to their local receivers, if any.
9. The ESXi hosts selected for offloading their Tier1-SR initiate a hybrid replication for
reaching all the interested Tier1-DRs in their Tier1 routing domain.
10. Finally, the Tier1-DRs on remote transport nodes route the multicast traffic to their local
receivers.
For the sake of completeness, FIGURE 4-83 shows the traffic flow when a multicast source is in the routing domain of a Tier1 gateway and there are receivers off another Tier1 gateway, off the Tier0 gateway, and outside NSX.


Figure 4-83: internal source, external receiver and receivers on Tier0 and Tier1 segments

1. The source is located on ESXi-3, off a segment attached to the purple Tier1. The
multicast traffic is first received by the local purple Tier1-DR.
2. The purple Tier1-DR initiates a hybrid replication to reach every remote purple Tier1-DR.
3. The remote purple Tier1-DRs route the traffic to their local destinations.
4. The purple Tier1-DR of ESXi-3 also sends a copy of the multicast traffic to the multicast router of the routing domain, i.e. the purple Tier1-SR on edge3.
5. The purple Tier1-SR forwards the traffic to all SRs with receivers for this group (Tier0SR
and green T1SR). Note that the diagram is simplifying the real process. You don’t need
to understand the details, but for accuracy, let’s just mention that the purple Tier1-SR is
in fact sending a unicast copy (not represented) to Tier0SR1, its multicast router. Then it
initiates a hybrid replication that will reach all SRs with receivers. So, the T0SR receives
two copies: one is replicated on the uplink toward external receivers (step 6 below),
while the other will go down the tree to internal receivers on segments attached to the
Tier0 (part of step 7 below.)
6. The multicast active Tier0-SR routes the traffic on one of its uplinks toward the external
receiver(s).
7. The Tier0SR and green Tier1-SR now offload replication to an arbitrary ESXi transport
node in their routing domain.
8. The Tier0-DR on ESXi-1 and green Tier1-DR on ESXi-3 (the host selected to offload the
green Tier1-SR) route the multicast traffic to their local receivers, if any.


9. The Tier0-DR on ESXi-1 and green Tier1-DR on ESXi-3 initiate a hybrid replication to their
peers in their respective routing domain.
10. The remote Tier0-DR and green Tier1-DR route the traffic to their local receivers.
A few notes on the Tier1 multicast model:
• The use of a Tier1-SR for multicast has an impact on unicast traffic. Indeed, unicast traffic
that needs to be routed to other Tier1s or to the outside world will now transit through
this Tier1-SR.
• Multicast traffic for Tier1 attached segments follows a tree that is rooted on the
corresponding Tier1-SR. The consequence is that a host with receivers on segments that
are attached to different Tier0/Tier1s will get multiple copies of the same multicast
traffic, one for each Tier0-DR or Tier1-DR involved. Check ESXi-4 in the above FIGURE
4-83: it received three separate copies of the same multicast packet, one for each
gateway.

4.8.3.5 Stateful Active/Active Tier1 Gateways


NSX 4.1 introduces the capability of running stateful active/active gateways. As we have seen, multicast is not implemented as a stateful service and, as a result, multicast is handled the exact same way across "stateless" active/active Tier0 gateways and "stateful" active/active Tier0 gateways. However, the stateful active/active gateway feature also enables the deployment of multiple active Tier1 SRs, instead of the traditional active/standby model. On Tier1 gateways too, multicast is not a stateful service, but the availability of multiple Tier1 SRs allows spreading the multicast load across multiple edges, following the exact same model described earlier for multiple Tier0 SRs. For each Tier1 gateway, an ordered list of all the Tier1 SRs is distributed by the control plane to all the transport nodes. A hash taken on the IP address of a multicast group identifies a unique Tier1 SR that will be responsible for it across the whole network.
Using multiple Tier1 SRs might sound appealing for increasing the multicast throughput of a Tier1 gateway. However, multicast alone is not a sufficient reason for deploying active/active Tier1 gateways. The stateful active/active gateway model has significant design implications. The requirement to deploy a stateful active/active Tier0 and to have Tier0 SRs and Tier1 SRs collocated on the same edges makes multiple active Tier1 SRs for multicast much less palatable if this design is not accompanied by the need to run some stateful services too.

4.8.3.6 Hybrid replication


NSX implements its overlay segments by maintaining point-to-point tunnels between its transport nodes. If we kept the exact same model for implementing the multicast data path, NSX would be responsible for performing all the multicast replication work in software. Indeed, the physical infrastructure cannot help directly, as it is not aware in any way of the kind of traffic the NSX tunnels carry.


4.8.3.6.1 Hybrid replication concept


Hybrid replication is all about leveraging the replication capability of the physical infrastructure. To achieve that, NSX maps a multicast group in the overlay segment to an underlay multicast group. This de facto creates "multi-point" tunnels, where the overlay segment multicast traffic is encapsulated in a tunnel packet with a multicast destination address and replicated by the physical infrastructure to multiple hosts.
NSX does not assume that the physical network (the "underlay") is capable of routing multicast traffic though. NSX just assumes that if it sends IP multicast traffic on the physical infrastructure, this traffic will be replicated to all the interested devices in the same Layer 2 domain as the source. Ethernet switches can replicate multicast traffic in hardware, and most of them run IGMP snooping by default, ensuring adequate filtering of multicast traffic within a Layer 2 domain. NSX takes care of replicating the multicast traffic across the underlay IP subnets.

4.8.3.6.2 Hybrid replication example


When configuring NSX for multicast routing, the administrator is prompted for a "Replication Multicast Range". This is a range of multicast addresses reserved in the physical infrastructure for hybrid replication. This range is split evenly into two ranges; let's call them range1 and range2. Range1 is used for hybrid replication between ESXi transport nodes, while range2 is used for hybrid replication between service routers (Tier0-SRs and Tier1-SRs.)
Suppose that a transport node is interested in receiving multicast traffic sent to multicast group G in the overlay. This group G will be hashed by NSX into two multicast IP addresses M1 and M2 in the replication multicast range1 and range2 respectively. If the transport node is an ESXi host, it will join group M1 in the underlay. This means that a TEP on this ESXi host will send an IGMPv2 report with the intent of joining group M1. The physical infrastructure (running IGMP snooping) can act on this IGMP message and record that this ESXi host is a receiver for M1. If the transport node is a service router on an edge node, it joins multicast group M2 in the underlay instead. That is the reason for the even split of the replication multicast range: one half is for hybrid replication between ESXi transport nodes, and the other is for hybrid replication between edge transport nodes (you will notice in the previous multicast flow examples that there is no hybrid replication between edges and ESXi hosts, only unicast communication.)
Now that all the transport nodes interested in G have joined a common multicast group in the underlay, NSX can send tunnel traffic to multicast address M1 or M2 to replicate multicast traffic for group G efficiently between ESXi hosts or edges: within the same Layer 2 domain, the physical infrastructure will be able to take care of the replication. When traffic needs to be replicated across underlay subnets, NSX will send a single unicast copy of the multicast traffic to a remote transport node in each remote subnet. Don't worry if this sounds confusing at this stage; the following sketch and diagrams will hopefully make it clear!
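
As a minimal sketch of the mapping just described (the hash function, range size, and splitting logic are assumptions, not NSX internals), the following code maps an overlay group G to one address in range1 for ESXi TEPs and one in range2 for edge service routers:

```python
import ipaddress
import zlib

def map_overlay_group(group: str, replication_range: str):
    # Split the configured replication range evenly into range1 (ESXi) and range2 (edges).
    addresses = list(ipaddress.ip_network(replication_range))
    half = len(addresses) // 2
    range1, range2 = addresses[:half], addresses[half:]
    h = zlib.crc32(group.encode())
    m1 = range1[h % len(range1)]   # joined by ESXi transport nodes
    m2 = range2[h % len(range2)]   # joined by edge service routers
    return str(m1), str(m2)

# Overlay group 239.1.1.1 hashed into a hypothetical 239.100.0.0/24 replication range.
print(map_overlay_group("239.1.1.1", "239.100.0.0/24"))
```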
In this following scenario, let’s assume that a virtual machine on transport node TN-1 generates
multicast traffic to multicast group G, and that there are receivers on multiple ESXi transport


nodes. This example is not about repeating the multicast packet flow presented in the previous part (we are not going to consider the multicast traffic sent to the multicast router, for example); we are just going to show how hybrid replication can deliver this multicast IP packet to all the other transport nodes that have a receiver for G. In this diagram, those transport nodes with receivers are TN-2, TN-4, TN-5, TN-7, TN-8, TN-9, TN-10, and TN-11. The transport nodes have their TEPs in three different subnets. This is a common design: typically, hosts under the same top of rack switch (TOR) have their IP addresses in the same subnet. In this example, transport nodes TN-1 to TN-4 have their TEP IP addresses in subnet 10.0.0.0/24, TN-5 to TN-8 have TEPs in subnet 20.0.0.0/24, and TN-9 to TN-12 have TEP addresses in 30.0.0.0/24.
Let's assume that NSX has mapped group G to group M in the replication multicast range for ESXi transport nodes (referred to earlier as "range1".) All the ESXi hosts with a receiver for G in the overlay join group M in the physical infrastructure (that is what the "M" on the TEPs represents.)

Figure 4-84: hybrid replication

Transport node TN-1 knows the IP addresses of all the transport nodes interested in G (as mentioned earlier, this has been advertised by relaying IGMP packets.) More specifically, TN-1
knows that it has receivers in its local subnet and in two remote subnets.
1. TN-1 encapsulates the multicast traffic to G in an overlay packet with M as a destination and sends it on its uplink. The physical infrastructure sees a packet to M and knows that there are two hosts that have joined group M. The top of rack switch takes care of replicating this packet to TN-2 and TN-4. Note again that we don't expect the physical infrastructure to route multicast traffic across subnets. The packet sent to M is thus constrained to its source subnet 10.0.0.0/24 and not routed to the remote subnets 20.0.0.0/24 and 30.0.0.0/24 by the physical infrastructure.
2. TN-1 knows that receivers TN-5, TN-7 and TN-8 are all in the remote underlay subnet 20.0.0.0/24. It picks one of those three transport nodes (let's say TN-7) and forwards the multicast traffic to G as an overlay unicast to it. TN-1 repeats the operation for the


remote underlay subnet 30.0.0.0/24 and sends the multicast traffic to G as an overlay
unicast to TN-9.
3. TN-7 receives an encapsulated multicast packet from TN-1. In the metadata of this packet, TN-1 has set a bit indicating that this packet needs to be replicated to the other transport nodes in the underlay subnet (see the Overlay Encapsulation section). TN-7 delivers the multicast packet to its local receiver VMs and copies the multicast traffic to G on its uplink, encapsulated with an overlay destination multicast IP address M. The physical infrastructure then again takes care of replicating this packet to TN-5 and TN-8. The metadata of this overlay packet created by TN-7 does not indicate this time that the packet needs to be replicated to the other hosts in the same subnet, so TN-5 and TN-8 only forward the multicast traffic to their local receivers.
4. TN-9 receives the overlay unicast copy of the multicast traffic to G and performs the same operation as TN-7 in the previous step.
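
The steps above can be summarized by the small sketch below, which shows the decision a source transport node makes: one multicast-encapsulated copy for its local TEP subnet, plus one unicast copy to a proxy transport node in each remote TEP subnet. The data structures and addresses are illustrative assumptions.

```python
from collections import defaultdict
from ipaddress import ip_address, ip_network

def plan_replication(source_tep: str, receiver_teps: list[str], subnets: list[str]):
    """Return the copies a source transport node sends for one multicast packet."""
    nets = [ip_network(s) for s in subnets]
    by_subnet = defaultdict(list)
    for tep in receiver_teps:
        net = next(n for n in nets if ip_address(tep) in n)
        by_subnet[net].append(tep)

    source_net = next(n for n in nets if ip_address(source_tep) in n)
    copies = []
    for net, teps in by_subnet.items():
        if net == source_net:
            copies.append(("multicast-to-M", str(net)))   # local ToR replicates in the rack
        else:
            copies.append(("unicast-to-proxy", teps[0]))  # proxy re-replicates with group M
    return copies

receivers = ["10.0.0.2", "10.0.0.4", "20.0.0.5", "20.0.0.7", "20.0.0.8", "30.0.0.9"]
print(plan_replication("10.0.0.1", receivers,
                       ["10.0.0.0/24", "20.0.0.0/24", "30.0.0.0/24"]))
```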

NSX multicast design recommendations

4.8.4.1 Beware of the TTL


The NSX gateways behave like regular routers and decrement the TTL of the packets they forward. Multicast packets going through a Tier0 gateway and a Tier1 gateway will have their TTL decremented twice. Make sure that the multicast traffic generated has an appropriate TTL; we have seen multiple examples of customers migrating from NSX for vSphere losing multicast traffic because of a wrong TTL (in NSX for vSphere, the edge and distributed logical router represent a single logical router and only decrement the TTL once.)

4.8.4.2 Connecting the Tier0-SRs to the physical infrastructure


The administrator has no influence over which Tier0-SR is selected by NSX as the multicast
active for a specific group. When the design includes multiple active/active Tier0-SRs, make
sure they have consistent configuration.
Also, make sure that PIM is enabled on all the uplinks of the Tier0-SRs. This is not really an NSX
specific recommendation. Multicast routing must be enabled on all the unicast paths toward
the multicast source. As seen earlier, a PIM join from a receiver in the physical infrastructure
can reach the Tier0 gateway on any uplink leading to this source in the NSX domain.

4.8.4.3 Hybrid replication


This multicast section of the design guide included a relatively detailed description of the data
plane so that the reader could understand the following recommendations, tightly linked to the
way NSX uses hybrid replication.
The replication multicast range should be a dedicated non-routed range of multicast
addresses in the physical infrastructure


When configuring multicast, the administrator needs to provide a range of multicast IP addresses that will be used by NSX in the physical infrastructure for hybrid replication.
Theoretically, the overlay and underlay IP spaces are entirely independent, and you could potentially use the same multicast addresses in the overlay and the underlay. However, it is recommended to reserve a separate range of multicast IP addresses for the NSX replication range. First, it will make it easier to identify the traffic generated by NSX for its hybrid replication. Second, it is important to make sure that the physical infrastructure does not enable multicast routing for the replication range, because NSX is responsible for propagating overlay multicast between subnets. For example, suppose that in FIGURE 4-84 multicast routing was enabled for group M. In that scenario, the encapsulated multicast traffic sent to multicast address M by TN-1 (in step 1) would be routed to the remote subnets 20.0.0.0/24 and 30.0.0.0/24 by the physical infrastructure. Note that the underlay multicast traffic generated by NSX carries a TTL of 1, so the remote hosts would most likely not receive multiple copies of the multicast traffic already propagated by NSX between subnets (and even if they received them, they would drop them.) The result would nevertheless be suboptimal, as some bandwidth would be wasted transmitting traffic that would eventually be discarded.
Reserve the largest replication multicast range possible
There is a mapping between each multicast IP address used in the overlay and a unique pair of
multicast addresses in this replication multicast range.
Suppose that the virtual network uses 1000 different multicast groups, but that the replication multicast range only reserves 100 multicast addresses (leading to 50 addresses in range1 and range2.) Obviously, there is a 20:1 over-subscription and multiple multicast groups from the overlay address space will map to the same multicast address in the replication range. In the following diagram, due to a hashing collision, two overlay multicast groups G1 and G2 are mapped by NSX to the same underlay multicast group M.

Figure 4-85: hashing collision

When TN-1 sends traffic to group G1, it’s encapsulated in a multicast tunnel packet and sent to
multicast address M. As a result, both TN-2 and TN-4 receive this packet. This is sub-optimal


because TN-4 has no receiver for group G1. TN-4 just discards this packet. Note that, from a performance standpoint, this issue is not catastrophic: the additional replication work is done by the physical infrastructure, and the cost of discarding superfluous packets is not that great on TN-4, but it is still better to minimize this scenario.
Enable IGMP snooping in the physical infrastructure
As mentioned already in this document, NSX only uses the physical infrastructure for multicast
within the same Layer 2 domain. One of the reasons for this implementation choice is that we
did not want NSX to be dependent on configuring multicast routing in the physical
infrastructure (something that many of our customers are not able to do.) It is however
beneficial to have IGMP snooping enabled on the Layer 2 switches of the physical
infrastructure, in order to optimize the multicast replication within the rack.

Figure 4-86: benefits of IGMP snooping

Most switches default to IGMP snooping enabled; just check that your infrastructure is configured that way. Without IGMP snooping, IP multicast is treated like broadcast by Layer 2 switches, which results in sub-optimal replication.
Limit the number of TEP subnets
This one is a recommendation related to performance. With the hybrid replication model, NSX performs a unicast copy for each remote TEP subnet (see FIGURE 4-84.) This copy consumes CPU on the source transport node and wastes bandwidth on its uplink. The following diagram proposes a relatively extreme example comparing the replication of a single multicast frame to ten remote ESXi transport nodes. In the case represented at the top of the diagram, all the receiving transport nodes are in different subnets; in the bottom part of the diagram, the receiving transport nodes are in the same subnet as the source.


Figure 4-87: replication across multiple subnets vs. within a single subnet

In the top example, the source transport node (TN-1) needs to send 10 unicast copies of the
multicast packet, one for each and every receiving transport node. Practically, it means that the
bandwidth of the uplink of TN-1 has been divided by 10. There is a CPU cost to this 10-fold
replication too.
In the example represented at the bottom, the source transport node just needs to send a
single multicast copy of the multicast packet. It is received by the physical switch local to the
rack and directly replicated to the receivers. The source transport node thus only had to send a
single packet, and the burden of the replication has been entirely put on the hardware switch.
Of course, we are not recommending putting all your hosts in the same Layer 2 domain. In fact, one of the benefits of the NSX overlay is precisely that you can design your physical infrastructure without the need to extend Layer 2. The cost associated with replicating across subnets is difficult to avoid for ESXi hosts, which are bound to be spread across subnets. It might however be possible to group VMs joining the same multicast groups on hosts that are spread across a minimum number of racks.
Hybrid replication is also used between the edges hosting the Tier0-SRs and Tier1-SRs. The edges of your edge cluster should not depend on the TORs of a single rack, but it is perfectly possible to deploy all the edges of your edge cluster across two racks (and two TEP subnets.) This will adequately reduce the amount of unicast replication that the hybrid model needs to perform.
Using Tier1-SRs in the model has consequences, minimize the number of Tier1 routers with
multicast


This has already been mentioned in the section Data plane operations within NSX: Tier0 and Tier1 segments combined. An ESXi transport node will receive a copy of the same multicast packet for each Tier0 and Tier1 gateway with local receivers. The following diagram gives a simpler representation of this property:

Figure 4-88: host transport node receiving multiple copies of the same multicast packet

Here, the ESXi host at the bottom receives three copies of a multicast packet generated outside NSX, behind a Tier0 gateway. The multicast trees for the Tier1 gateways are rooted in their Tier1-SRs. You can see in the diagram that the ESXi host receives the multicast packet for the local receiver under the green Tier1-DR straight from the green Tier1-SR, not from the local Tier0-DR (as it would for unicast traffic.) This looks sub-optimal (multicast is all about avoiding having the same links carry the same multicast traffic multiple times), but it is mandatory for inserting stateful services on the Tier1 gateways in a future release. Just be aware that multiplying Tier1 gateways with receivers for the same groups will cause NSX to replicate the same traffic multiple times.
Using Tier1-SRs in the model has consequences, be aware of a Tier1-SR impact on unicast
traffic
The choice of implementing a Tier1-SR for multicast traffic also has an impact on unicast traffic:
unicast traffic can only enter/exit the Tier1 routing domain through this centralized Tier1-SR, as
represented in the figure below.


Figure 4-89: Tier1-SR impacts unicast traffic

The impact of inserting a Tier1-SR is well known, and not specific to multicast. It is still possible
to have unicast ECMP straight from the host transport node to the Tier0-SRs (and the physical
infrastructure) by attaching workloads directly to a segment off the Tier0 gateway.

Static RP vs. BSR


The definition of the rendez-vous point in NSX can be achieved by static RP configuration or
through BSR. For faster convergence, we recommend using static RP. NSX allows entering
multiple static RPs, each with an associated range of IP multicast addresses. Those ranges can
even be overlapping, as long as they don’t use the same subnet mask.
Static RP also allows for greater control on which groups can interact with the physical
infrastructure. By using static RP and disabling BSR, it is indeed possible to define exactly which
groups will have an associated RP. Groups with no RP will not be able to receive/transmit
multicast with the outside world.
Note also that static RP has precedence over BSR. If no RP is assigned to a group via static RP, it
can still be associated to an RP via BSR (if BSR is enabled, of course.)

Edge Node
Edge nodes are service appliances with pools of capacity, dedicated to running network and security services that cannot be distributed to the hypervisors. The Edge node also provides connectivity to the physical infrastructure. Previous sections mentioned that centralized services run on the SR component of Tier-0 or Tier-1 gateways. These features include:
• Connectivity to physical infrastructure (static routing / BGP / MP-BGP/OSPF)
• VRF-lite
• EVPN
• NAT
• DHCP server
• Metadata proxy
• Gateway Firewall
• NSX Native Load Balancer (No NSX Advanced Load Balancer)
• L2 Bridging
• Service Interface
• VPN
• TLS Decryption
• FQDN Analysis
• URL Filtering
• IDS/IPS (Also available as a Distributed Service)
• Malware Prevention (Also available as a Distributed Service)
As soon as one of these services is configured or an external interface is defined on the Tier-0 gateway, an SR is instantiated on the Edge node. The Edge node is also a transport node just like ESXi servers, and just like an ESXi server it can connect to more than one transport zone. The NSX Edge runs one or more N-VDSs managing the pNICs of a bare metal edge, or the vNICs of a virtual machine edge. A specific Edge node can be connected to only one overlay transport zone and, depending upon the topology, is connected to one or more VLAN transport zones for N-S connectivity.
There are two transport zones on the Edge:
• Overlay Transport Zone: Any traffic that originates from a VM participating in NSX
domain may require reachability to external devices or networks. This is typically
described as external North-South traffic. Traffic from VMs may also require some
centralized service like NAT, load balancer etc. To provide reachability for N-S traffic and
to consume centralized services, overlay traffic is sent from compute transport nodes to
Edge nodes. Edge node needs to be configured with a single overlay transport zone so
that it can decapsulate the overlay traffic received from compute nodes as well as


encapsulate the traffic sent to compute nodes. If the NSX deployment includes multiple
overlay transport zones, multiple sets of edge nodes must be deployed, one for each
overlay transport zone.
• VLAN Transport Zone: Edge nodes connect to the physical infrastructure using VLANs.
Edge node needs to be configured for VLAN transport zone to provide external or N-S
connectivity to the physical infrastructure. Depending upon the N-S topology, an edge
node can be configured with one or more VLAN transport zones.
An Edge node can have one or more N-VDS to provide the desired connectivity. Each N-VDS on the Edge node uses an uplink profile, which can be the same or unique per N-VDS. The teaming policies defined in the uplink profile define how the N-VDS balances traffic across its uplinks.
The uplinks can in turn be individual pNICs or LAGs. While supported, implementing LAGs on
edge nodes is discouraged and not included in any of the deployment examples in this guide.

Types of Edge Nodes


Edge nodes are available in two form factors – VM and bare metal. Both leverage the Data Plane Development Kit (DPDK) for faster packet processing and high performance. There are different VM sizes available. Each of them has a different resource footprint and addresses different usage guidelines, as detailed in the table below.

Small (4GB memory, 2 vCPU, 200 GB disk): PoC only. L7 rules for firewall and load balancing are not realized on a Tier-1 gateway if you deploy a small sized NSX Edge VM.
Medium (8GB memory, 4 vCPU, 200 GB disk): Suitable when only L2 through L4 features such as NAT, routing, L4 firewall, and L4 load balancer are required and the total throughput requirement is less than 2 Gbps.
Large (32GB memory, 8 vCPU, 200 GB disk): Suitable when only L2 through L4 features such as NAT, routing, L4 firewall, and L4 load balancer are required and the total throughput requirement is more than 2 Gbps.
Extra Large (64GB memory, 16 vCPU, 200 GB disk): Suitable when the total throughput required is multiple Gbps for L7 load balancer and VPN.
Bare metal Edge (32GB memory, 8 CPU cores, 200 GB disk): Suitable for production with centralized services like NAT, Gateway Firewall, load balancer etc. Typically deployed where higher performance at low packet size and sub-second N-S convergence is desired.

Table 4-2: Edge VM Form Factors and Usage Guidelines

The bare metal Edge resources specified above are the minimum required. For maximum performance, it is recommended to deploy an edge node on a bare metal server with the following specifications:
• Memory: 256GB
• CPU Cores: 24
• Disk Space: 200GB
When NSX Edge is installed as a VM, vCPUs are allocated to the Linux IP stack and DPDK. The
number of vCPU assigned to a Linux IP stack or DPDK depends on the size of the Edge VM. A
medium Edge VM has two vCPUs for Linux IP stack and two vCPUs dedicated for DPDK. This
changes to four vCPUs for Linux IP stack and four vCPUs for DPDK in a large size Edge VM.
Several Intel and AMD CPUs are supported for both the virtualized and bare metal Edge node form factors. Specifications can be found here.

Multi-TEP support on Edge Node


Starting with the NSX 2.4 release, Edge nodes support a multiple overlay tunnel (multi-TEP) configuration to load balance overlay traffic for overlay segments/logical switches. Multi-TEP is supported on both the Edge VM and bare metal form factors. FIGURE 4-90 shows two TEPs configured on the bare metal Edge. Each overlay segment/logical switch is pinned to a specific tunnel end point IP, TEP IP1 or TEP IP2. Each TEP uses a different uplink; for instance, TEP IP1 uses Uplink1, which is mapped to pNIC P1, and TEP IP2 uses Uplink2, which is mapped to pNIC P2. This feature offers a better design choice by load balancing overlay traffic across both physical pNICs and also simplifies the N-VDS design on the Edge.
Notice that a single N-VDS is used in this topology that carries both overlay and external traffic.


The in-band management feature is leveraged for management traffic. Overlay traffic gets load balanced by using the multi-TEP feature on the Edge, and external traffic gets load balanced using a "Named Teaming Policy" as described in the Teaming Policy section of chapter 3.

Figure 4-90: Bare metal Edge -Same N-VDS for overlay and external traffic with Multi-TEP

The following prerequisites must be met for multi-TEP support:


• TEP configuration must be done on one N-VDS only.
• All TEPs must use same transport VLAN for overlay traffic.
• All TEP IPs must be in same subnet and use same default gateway.

During a pNIC failure, the Edge performs a TEP failover by migrating the TEP IP and its MAC address to another uplink. For instance, if pNIC P1 fails, TEP IP1 along with its MAC address will be migrated to use Uplink2, which is mapped to pNIC P2. In that case, pNIC P2 will carry
the traffic for both TEP IP1 and TEP IP2. The BFD sessions from the individual Edge TEPs to other
transport nodes are not a reliable mechanism to detect and recover from partial failures (of a
single TEP for example). The exception is the “All tunnels down” condition when all the tunnels
from all edge TEPs to all the transport nodes are down. See the NSX EDGE HIGH AVAILABILITY
FAILOVER TRIGGERS section for more details.
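
A minimal sketch of the TEP-to-uplink pinning and failover behavior described above; the mapping structures are illustrative and do not represent the Edge datapath implementation:

```python
# Each Edge TEP is pinned to an uplink backed by a pNIC; on pNIC failure the
# TEP IP/MAC migrate to a surviving uplink so overlay traffic keeps flowing.
tep_to_uplink = {"TEP-IP1": "Uplink1", "TEP-IP2": "Uplink2"}
uplink_to_pnic = {"Uplink1": "P1", "Uplink2": "P2"}

def fail_pnic(failed_pnic: str):
    surviving = [u for u, p in uplink_to_pnic.items() if p != failed_pnic]
    for tep, uplink in tep_to_uplink.items():
        if uplink_to_pnic[uplink] == failed_pnic:
            tep_to_uplink[tep] = surviving[0]   # migrate the TEP to a healthy uplink

fail_pnic("P1")
print(tep_to_uplink)   # both TEPs now carried by Uplink2 / pNIC P2
```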

Bare Metal Edge Node


The NSX bare metal Edge runs on a physical server and is installed using an ISO file or PXE boot. Legacy BIOS mode is the only supported boot mode as of NSX-T 3.0. A bare metal Edge differs from the VM form factor Edge in terms of performance. It provides sub-second convergence, faster failover, and higher throughput at low packet sizes (discussed in chapter 8, NSX Performance & Optimization). There are specific hardware requirements, including CPU specifics and supported NICs, which can be found in the NSX Edge Bare Metal Requirements section of the NSX installation guide.


When a bare metal Edge node is installed, a dedicated interface is retained for management. If
redundancy is desired, two NICs can be used for management plane high availability. These
management interfaces can also be 1G. Bare metal Edge also supports in-band management
where management traffic can leverage an interface being used for overlay or external (N-S)
traffic.
Bare metal Edge node supports a maximum of 16 datapath physical NICs. For each of these 16
physical NICs on the server, an internal interface is created following the naming scheme “fp-
ethX”. These internal interfaces are assigned to the DPDK Fast Path. There is a flexibility in
assigning these Fast Path interfaces (fp-eth) for overlay or external connectivity.

Figure 4-37: Bare metal Edge Management Configuration Choices

VM Edge Node
The NSX Edge in VM form factor can be installed using an OVA, OVF, or ISO file. The NSX Edge VM is only supported on ESXi hosts.
Up to NSX 3.2.0, an NSX Edge VM has four internal interfaces: eth0, fp-eth0, fp-eth1, and fp-
eth2. Eth0 is reserved for management, while the rest of the interfaces are assigned to DPDK
Fast Path. Starting with NSX 3.2.1, four datapath interfaces are available: fp-eth0, fp-eth1, fp-
eth2, and fp-eth3.
These interfaces are allocated for external connectivity to TOR switches and for NSX overlay
tunneling. There is complete flexibility in assigning Fast Path interfaces (fp-eth) for overlay or
external connectivity. As an example, fp-eth0 could be assigned for overlay traffic with fp-eth1,
fp-eth2, or both for external traffic.


FIGURE 4-91 shows an edge VM where only two of the fast path interfaces are in use. They are managed by the same N-VDS and carry both overlay and VLAN traffic. This design is in line with the edge reference architecture and will be explained in detail in chapter 7, section Edge Connectivity Guidelines for Layer 3 Peering Use Case.

Figure 4-91: Edge VM

Edge Cluster
An Edge cluster is a group of Edge transport nodes. It provides scale out, redundant, and high-
throughput gateway functionality for logical networks. Scale out from the logical networks to
the Edge nodes is achieved using ECMP. There is a flexibility in assigning Tier-0 or Tier-1
gateways to Edge nodes and clusters. Tier-0 and Tier-1 gateways can be hosted on either same
or different Edge clusters.


Figure 4-92: Edge Cluster with Tier-0 and Tier 1 Services

Depending upon the services hosted on the Edge node and their usage, an Edge cluster could
be dedicated simply for running centralized services (e.g., NAT). FIGURE 4-93 shows two clusters
of Edge nodes. Edge Cluster 1 is dedicated for Tier-0 gateways only and provides external
connectivity to the physical infrastructure. Edge Cluster 2 is responsible for NAT functionality on
Tier-1 gateways.

Figure 4-93: Multiple Edge Clusters with Dedicated Tier-0 and Tier-1 Services

There can be only one Tier-0 gateway per Edge node; however, multiple Tier-1 gateways can be
hosted on one Edge node.
A Tier-0 gateway supports a maximum of eight equal cost paths per SR or DR component, thus
a maximum of eight Edge nodes are supported for ECMP. Edge nodes in an Edge cluster run
Bidirectional Forwarding Detection (BFD) on both tunnel and management networks to detect
Edge node failure. Edge VMs support BFD with minimum BFD timer of 500ms with three retries,
providing a 1.5 second failure detection time. Bare metal Edges support BFD with minimum BFD
TX/RX timer of 50ms with three retries which implies 150ms failure detection time.

Failure Domain
A failure domain is a logical grouping of Edge nodes within an Edge cluster. This feature can be enabled at the Edge cluster level via API.
As discussed in the high availability section, a Tier-1 gateway with centralized stateful services
runs on Edge nodes in active/standby or active/active HA configuration mode. When a user


assigns a Tier-1 gateway to an Edge cluster, NSX manager automatically chooses the Edge
nodes in the cluster to run the active and standby Tier-1 SR in case of active/standby HA mode
or the edge sub-clusters in case of active/active stateful HA mode (See section 4.5.3.3 to review
the role of sub-clusters in A/A Stateful configurations). The auto placement of Tier-1 SRs on
different Edge nodes considers several parameters like Edge capacity, active/standby HA state
etc.
Failure domains complement the auto placement algorithm and guarantee service availability in case of a failure affecting multiple edge nodes. The active and standby instances of a Tier-1 SR, or the members of a sub-cluster, always run in different failure domains.
FIGURE 4-94 shows an edge cluster comprised of four Edge nodes: EN1, EN2, EN3 and EN4. EN1 and EN2 are connected to two TOR switches in rack 1, and EN3 and EN4 are connected to two TOR switches in rack 2. Without failure domains, a Tier-1 SR could be auto placed on EN1 and EN2. If rack 1 fails, both the active and standby instances of this Tier-1 SR fail as well.
EN1 and EN2 are configured to be part of failure domain 1, while EN3 and EN4 are in failure domain 2. When a new Tier-1 SR is created, if the active instance of that Tier-1 is hosted on EN1, then the standby Tier-1 SR will be instantiated in failure domain 2 (EN3 or EN4).

Figure 4-94: Failure Domains

To ensure that all Tier-1 services are active on a set of edge nodes, a user can also enforce that
all active Tier-1 SRs are placed in one failure domain. This configuration is supported for Tier-1
gateway in preemptive mode only.
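
As an illustration of how failure domains are driven through the API, the following minimal sketch creates two failure domains with the NSX Manager API. The manager address, credentials, and display names are assumptions; assigning Edge transport nodes to a failure domain and enabling failure-domain-based allocation on the Edge cluster are separate updates whose exact payloads should be taken from the NSX API guide for the release in use.

```python
# Minimal sketch (assumptions: manager FQDN, credentials, display names).
import requests

NSX_MGR = "https://nsx-mgr.example.com"
AUTH = ("admin", "VMware1!VMware1!")   # prefer certificate or session auth in practice

session = requests.Session()
session.auth = AUTH
session.verify = False                 # lab only; use a trusted CA in production

# Create one failure domain per rack (names are arbitrary examples).
for name in ("FD-RACK-1", "FD-RACK-2"):
    resp = session.post(f"{NSX_MGR}/api/v1/failure-domains",
                        json={"display_name": name})
    resp.raise_for_status()
    print("Created failure domain:", resp.json().get("id"), name)

# List the failure domains to confirm. Mapping Edge nodes to a failure domain
# and enabling failure-domain-based allocation on the Edge cluster are done by
# updating the Edge transport node and Edge cluster objects (see the API guide).
print(session.get(f"{NSX_MGR}/api/v1/failure-domains").json())
```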


Other Network Services

Unicast Reverse Path Forwarding (uRPF)


A router forwards packets based on the value of the destination IP address field that is present
in the IP header. The source IP address field is generally not used when forwarding a packet on
a network (except when networks implement source-based routing).
Unicast Reverse Path Forwarding is defined in RFC 2827 and RFC 3704. It prevents packets with
a spoofed source IP address from being forwarded in the network. uRPF is generally enabled on
a router per interface and not globally. A uRPF-enabled router will inspect the source IP address
of each packet received on an interface and validate that the packet comes from a legitimate
source by inspecting the routing table. When a packet arrives on an interface, the router verifies
whether the receiving interface would be used to reach the source of the packet; it discards the
packet if the interfaces are different. This protection prevents spoofed source IP address attacks
that commonly send packets with random source IP addresses.
FIGURE 4-95 diagrams a physical network with uRPF enabled on the core router.
1. The core router receives a packet with a source IP address of 10.1.1.1 on interface
ethernet0/2.
2. The core router has the URPF feature enabled on all its interfaces and will check in its
routing table if the source IP address of the packet would be routed through interface
ethernet 0/2. In this case, 10.1.1.1 is the source IP address present in the IP header. The
core router has a longest prefix match for 10.1.1.0/24 via interface ethernet 0/0.
3. Since the packet does not come from the interface ethernet 0/0, the packet will be
discarded.


Figure 4-95: URPF

In NSX, uRPF is enabled by default on external, internal, and service interfaces. From a security
standpoint, it is a best practice to keep uRPF enabled on these interfaces. uRPF is also
recommended in architectures that leverage ECMP. On intra-tier and router link interfaces, a
simplified anti-spoofing mechanism is implemented: it checks that a packet is never sent back to
the interface it was received on.
It is possible to disable uRPF in complex routing architectures where the upstream BGP or OSPF
peers do not advertise the same networks.
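
For reference, uRPF is exposed as a per-interface setting in the Policy API. The sketch below assumes an existing Tier-0 gateway "Tier0-GW" with a locale service "default" and an external interface "uplink-1", and relaxes the mode from the default STRICT to NONE; the path and property names should be verified against the API documentation for the deployed release.

```python
# Minimal sketch: change the uRPF mode on a Tier-0 external interface.
# Gateway, locale service, interface IDs, and credentials are assumptions.
import requests

NSX_MGR = "https://nsx-mgr.example.com"
AUTH = ("admin", "VMware1!VMware1!")
IFACE = "/policy/api/v1/infra/tier-0s/Tier0-GW/locale-services/default/interfaces/uplink-1"

# PATCH only the property being changed; other interface settings are preserved.
resp = requests.patch(f"{NSX_MGR}{IFACE}", auth=AUTH, verify=False,
                      json={"urpf_mode": "NONE"})   # "STRICT" is the default
resp.raise_for_status()
print("uRPF mode updated")
```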

Network Address Translation


Users can enable NAT as a network service on NSX. This is a centralized service which can be
enabled on both Tier-0 and Tier-1 gateways.
Supported NAT rule types include:
● Source NAT (SNAT): Source NAT translates the source IP of the outbound packets to a
known public IP address so that the application can communicate with the outside
world without using its private IP address. It also keeps track of the reply.
● Destination NAT (DNAT): DNAT allows for access to internal private IP addresses from
the outside world by translating the destination IP address when inbound
communication is initiated. It also takes care of the reply. For both SNAT and DNAT,
users can apply NAT rules based on 5-tuple match criteria.
● Reflexive NAT: Reflexive NAT rules are stateless ACLs which must be defined in both
directions. These do not keep track of the connection. Reflexive NAT rules can be used
in cases where stateful NAT cannot be used due to asymmetric paths (e.g., user needs
to enable NAT on active/active ECMP routers).
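
The rule types above map to NAT rule objects in the Policy API. The sketch below creates a stateful SNAT rule on an assumed Tier-1 gateway "Tenant1-T1"; the gateway name, rule ID, subnets, and credentials are illustrative assumptions, and the payload should be checked against the NAT rule schema in the API guide.

```python
# Minimal sketch: create an SNAT rule on a Tier-1 gateway via the Policy API.
# Gateway name, rule ID, prefixes, and credentials are assumptions.
import requests

NSX_MGR = "https://nsx-mgr.example.com"
AUTH = ("admin", "VMware1!VMware1!")
RULE = "/policy/api/v1/infra/tier-1s/Tenant1-T1/nat/USER/nat-rules/snat-web"

snat_rule = {
    "display_name": "snat-web",
    "action": "SNAT",
    "source_network": "172.16.10.0/24",    # private subnet behind the Tier-1
    "translated_network": "20.20.20.10",   # public IP used for the translation
    "enabled": True,
}

resp = requests.patch(f"{NSX_MGR}{RULE}", auth=AUTH, verify=False, json=snat_rule)
resp.raise_for_status()
print("SNAT rule created/updated")
```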

Table 4-3 summarizes NAT rules and usage restrictions.

Type      | NAT Rule Type                             | Specific Usage Guidelines
Stateful  | Source NAT (SNAT), Destination NAT (DNAT) | Can be enabled on both Tier-0 and Tier-1 gateways.
Stateless | Reflexive NAT                             | Can be enabled on Tier-0 gateway; generally used when the Tier-0 is in active/active mode.

Table 4-3: NAT Usage Guideline

Table 4-4 summarizes the use cases and advantages of running NAT on Tier-0 and Tier-1
gateways.

Gateway | NAT Rule Type | Specific Usage Guidelines
Tier-0  | Stateful      | E-W routing between different tenants remains completely distributed.
Tier-1  | Stateful      | Recommended for high throughput ECMP topologies. Recommended for topologies with overlapping IP address space.

Table 4-4: Tier-0 and Tier-1 NAT use cases


NAT Service Router Placement


As a centralized service, whenever NAT is enabled, a service component or SR must be
instantiated on an Edge cluster. In order to configure NAT, specify the Edge cluster where the
service should run; it is also possible to pin the NAT service to a specific Edge node pair. If no
specific Edge node is identified, the platform will perform auto placement of the service
component on an Edge node in the cluster using a weighted round robin algorithm.

DHCP Services
NSX provides both DHCP relay and DHCP server functionality. DHCP relay can be enabled at the
gateway level and can act as a relay between a non-NSX managed environment and DHCP servers.
DHCP server functionality can be enabled to service DHCP requests from VMs connected to
NSX-managed segments. DHCP server functionality is a stateful service and must be bound to
an Edge cluster or a specific pair of Edge nodes, as with the NAT functionality.
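
As a minimal sketch of how this binding could look in the Policy API, the example below defines a DHCP server config attached to an assumed Edge cluster; the paths, IDs, and addresses are assumptions, and attaching the config to a segment (plus defining per-subnet DHCP ranges) is a separate step described in the product documentation.

```python
# Minimal sketch: define a DHCP server config bound to an Edge cluster.
# Manager address, credentials, edge cluster path, and IPs are assumptions.
import requests

NSX_MGR = "https://nsx-mgr.example.com"
AUTH = ("admin", "VMware1!VMware1!")

dhcp_config = {
    "display_name": "tenant1-dhcp",
    "server_addresses": ["10.10.10.2/24"],   # DHCP server address (field name may vary by release)
    "edge_cluster_path": "/infra/sites/default/enforcement-points/default/edge-clusters/EC-1",
}

resp = requests.patch(
    f"{NSX_MGR}/policy/api/v1/infra/dhcp-server-configs/tenant1-dhcp",
    auth=AUTH, verify=False, json=dhcp_config)
resp.raise_for_status()
print("DHCP server config created/updated")
```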

Metadata Proxy Service


With a metadata proxy server, VM instances can retrieve instance-specific metadata from an
OpenStack Nova API server. This functionality is specific to OpenStack use-cases only. Metadata
proxy service runs as a service on an NSX Edge node. For high availability, configure metadata
proxy to run on two or more NSX Edge nodes in an NSX Edge cluster.

Gateway Firewall Service


Gateway Firewall service can be enabled on the Tier-0 and Tier-1 gateway for North-South
firewalling. Table 4-5 summarizes Gateway Firewalling usage criteria.

Gateway Firewall | Specific Usage Guidelines
Stateful         | Can be enabled on both Tier-0 and Tier-1 gateways.
Stateless        | Can be enabled on Tier-0 gateway; generally used when the Tier-0 is in active/active mode.

Table 4-5: Gateway Firewall Usage Guideline

Since Gateway Firewalling is a centralized service, it needs to run on an Edge cluster or a set of
Edge nodes. This service is described in more detail in the security chapter 5.


Proxy ARP
Proxy ARP is a method that consists of answering an ARP request on behalf of another host. This
method is performed by a Layer 3 networking device (usually a router). The purpose is to
provide connectivity between two hosts when routing would not be possible for various reasons.

Proxy ARP in an NSX infrastructure can be considered in environments where IP subnets are
limited. Proofs of concept and VMware Enterprise PKS environments usually use proxy ARP to
simplify the network topology.
For production environments, it is recommended to implement proper routing between the
physical fabric and the NSX Tier-0 by using either static routes or Border Gateway Protocol with
BFD. If proper routing is used between the Tier-0 gateway and the physical fabric, BFD with its
sub-second timers will converge faster. In case of failover with proxy ARP, the convergence
relies on gratuitous ARP (broadcast) to update all hosts on the VLAN segment with the new
MAC address to use. If the Tier-0 gateway has proxy ARP enabled for 100 IP addresses, the
newly active Tier-0 SR needs to send 100 gratuitous ARP packets.

Edge Node HA     | Specific Usage Guidelines
Active / Standby | Supported
Active / Active  | Supported

Table 4-6: Proxy ARP Support

By enabling proxy-ARP, hosts on the overlay segments and hosts on a VLAN segment can
exchange network traffic together without implementing any change in the physical networking
fabric. Proxy ARP is automatically enabled when a NAT rule or a load balancer VIP uses an IP
address from the subnet of the Tier-0 gateway uplink.
FIGURE 4-96 presents the logical packet flow between a virtual machine connected to an NSX
overlay segment and a virtual machine or physical appliance connected to a VLAN segment
shared with the NSX Tier-0 uplinks.
In this example, the virtual machine connected to the overlay segment initiates networking
traffic toward 20.20.20.100.


Figure 4-96: Proxy ARP topology

1. The virtual machine connected to the overlay segment with an IP address of
172.16.10.10 sends a packet to the physical appliance “SRV01”, which has an IP address
of 20.20.20.100. The local DR hosted on the local hypervisor performs a routing lookup
and sends the traffic to the SR.
2. The Tier-1 SR hosted on an Edge node translates the source IP of 172.16.10.10 to a value
of 20.20.20.10 and sends the traffic to the Tier-0.
3. The Tier-0 SR has proxy ARP enabled on its uplink interface and sends an ARP request
(broadcast) on the VLAN segment to map the IP address of 20.20.20.100 to the correct
MAC address.
4. The physical appliance “SRV01” answers that ARP request with an ARP reply.
5. The Tier-0 SR sends the packet to the physical appliance with a source IP of 20.20.20.10
and a destination IP of 20.20.20.100.
6. The physical appliance “SRV01” receives the packet and sends an ARP broadcast on the
VLAN segment to map the IP address of the virtual machine (20.20.20.10) to the
corresponding MAC address.
7. The Tier-0 receives the ARP request for 20.20.20.10 (broadcast) and has the proxy ARP
feature enabled on its uplink interfaces. It replies to the ARP request with an ARP reply
that contains the Tier-0 SR MAC address for the uplink interface.
8. The physical appliance “SRV01” receives the ARP reply and sends a packet on the VLAN
segment with a source IP of 20.20.20.100 and a destination IP of 20.20.20.10.
9. The packet is received by the Tier-0 SR and routed to the Tier-1, which translates the
destination IP of 20.20.20.10 back to 172.16.10.10. The packet is sent to the overlay
segment and the virtual machine receives it.
It is crucial to note that in this case, the traffic is initiated by the virtual machine that is
connected to the overlay segment on the Tier-1. If the initial traffic had been initiated by a
server on the VLAN segment, a destination NAT rule would have been required on the
Tier-1/Tier-0 since the initial traffic would not match the SNAT rule configured previously.
FIGURE 4-97 represents an outage on an active Tier-0 gateway with proxy ARP enabled. The
newly active Tier-0 gateway will send a gratuitous ARP to announce the new MAC address to be
used by the hosts on the VLAN segment in order to reach the virtual machine connected to the
overlay. It is critical to understand that the newly active Tier-0 will send a gratuitous ARP for
each IP address that is configured for proxy ARP.

Figure 4-97: Edge node failover and Proxy ARP

Topology Consideration
This section covers a few of the many topologies that customers can build with NSX. NSX
routing components - Tier-1 and Tier-0 gateways - enable flexible deployment of multi-tiered
routing topologies. Topology design also depends on what services are enabled and where
those services are provided at the provider or tenant level.


Supported Topologies
FIGURE 4-98 shows three topologies with Tier-0 gateway providing N-S traffic connectivity via
multiple Edge nodes. The first topology is single-tiered where Tier-0 gateway connects directly
to the segments and provides E-W routing between subnets. Tier-0 gateway provides multiple
active paths for N-S L3 forwarding using ECMP. The second topology shows the multi-tiered
approach where Tier-0 gateway provides multiple active paths for L3 forwarding using ECMP
and Tier-1 gateways as first hops for the segments connected to them. Routing is fully
distributed in this multi-tier topology. The third topology shows a multi-tiered topology with
Tier-0 gateway configured in Active/Standby HA mode to provide some centralized or stateful
services like NAT, VPN etc.

Figure 4-98: Single tier and multi-tier routing topologies

As discussed in the two-tier routing section, centralized services can be enabled on Tier-1 or
Tier-0 gateway level. FIGURE 4-99 shows two multi-tiered topologies.
The first topology shows centralized services like NAT, load balancer on Tier-1 gateways while
Tier-0 gateway provides multiple active paths for L3 forwarding using ECMP.
The second topology shows centralized services configured on a Tier-1 and Tier-0 gateway.
Some centralized services are only available on Tier-1.


Figure 4-99: Stateful and Stateless (ECMP) Services Topologies Choices at Each Tier

FIGURE 4-100 shows a topology with Tier-0 gateways connected back to back. “Tenant-1 Tier-0
Gateway” is configured for a stateful firewall while “Tenant-2 Tier-0 Gateway” has stateful NAT
configured. Since stateful services are configured on both “Tenant-1 Tier-0 Gateway” and
“Tenant-2 Tier-0 Gateway”, they are configured in Active/Standby high availability mode. The
top layer of Tier-0 gateway, "Aggregate Tier-0 Gateway” provides ECMP for North-South traffic.
Note that only external interfaces should be used to connect a Tier-0 gateway to another Tier-0
gateway. Static routing and BGP are supported to exchange routes between two Tier-0
gateways and full mesh connectivity is recommended for optimal traffic forwarding and
failover. This topology provides high N-S throughput with centralized stateful services running
on different Tier-0 gateways. This topology also provides complete separation of routing tables
on the tenant Tier-0 gateway level and allows services that are only available on Tier-0
gateways (like VPN with redundant remote peers) to leverage ECMP northbound. Note that
VPN is available on Tier-1 gateways starting with the NSX 2.5 release. NSX 3.0 introduces new
multi-tenancy features such as EVPN and VRF-lite. These features are recommended and
suitable for a true multi-tenant architecture where stateful services need to run on multiple
layers of Tier-0 gateways.

Full mesh connectivity over a single overlay segment is recommended for optimal traffic
forwarding and failover.
Back-to-back Tier-0 peering involving Active/Standby Tier-0 Gateways presents risks. Some
partial failure conditions may cause the peering tier-0 gateway to forward traffic to the Standby
SR, causing a black hole. These topologies include both back-to-back A/S Tier-0 gateway, and an
A/S tier-0 Gateway peering with an A/A Tier-0 gateway. In order to minimize the risk of such a
condition, the underlying physical infrastructure should prevent the failure of individual BGP
sessions so that any adjacent Tier-0 gateway always has the preferred path for any destination
via the Active SR and not the Standby SR (which, by default performs an AS-Path prepend with a
length of 3). In such cases, we recommend establishing a full mesh EBGP peering over a single
overlay segment and actively monitoring the status of all the BGP sessions. In case a partial
failure of a subset of the BGP sessions occurs, the NSX operator should manually intervene to
reestablish the peering session or perform a manual failover making active an SR with all the
BGP sessions operational.
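
As an example of the kind of monitoring described above, the sketch below polls the realized BGP neighbor status of a Tier-0 through the Policy API so that an operator (or automation) can detect a partial peering failure. The Tier-0 and locale service IDs are assumptions, and the exact fields returned by the status call should be confirmed in the API guide.

```python
# Minimal sketch: poll BGP neighbor status on a Tier-0 to detect partial failures.
# Tier-0/locale-service IDs, credentials, and status field names are assumptions.
import requests

NSX_MGR = "https://nsx-mgr.example.com"
AUTH = ("admin", "VMware1!VMware1!")
BGP = "/policy/api/v1/infra/tier-0s/Aggregate-T0/locale-services/default/bgp"

session = requests.Session()
session.auth = AUTH
session.verify = False

# Enumerate the configured neighbors, then fetch the realized status of each.
neighbors = session.get(f"{NSX_MGR}{BGP}/neighbors").json().get("results", [])
for nbr in neighbors:
    status = session.get(f"{NSX_MGR}{BGP}/neighbors/{nbr['id']}/status").json()
    # Inspect the returned state per Edge node and alert on anything that is not
    # established so the operator can re-establish the session or fail over.
    print(nbr.get("neighbor_address"), status)
```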
Active/Active stateful Tier-0 gateways cannot be used as aggregation routers. Aggregation
routers forward traffic between external interfaces connected to different segments, while
Active/Active stateful Tier-0 gateways require traffic to flow between an external and a
downlink interface. Active/Active stateful gateways can be connected to an aggregation A/A
stateless Tier-0 gateway (preferred) or an A/S stateful Tier-0 gateway.

Figure 4-100: Multiple Tier-0 Topologies with Stateful and Stateless (ECMP) Services

FIGURE 4-101 shows another topology with Tier-0 gateways connected back-to-back. “Corporate
Tier-0 Gateway” on Edge cluster-1 provides connectivity to the corporate resources
(172.16.0.0/16 subnet) learned via a pair of physical routers on the left. This Tier-0 has stateful
Gateway Firewall enabled to allow access to restricted users only.

“WAN Tier-0 Gateway” on Edge-Cluster-2 provides WAN connectivity via WAN routers and is
also configured for stateful NAT.

“Aggregate Tier-0 gateway” on the Edge cluster-3 learns specific routes for corporate subnet
(172.16.0.0/16) from “Corporate Tier-0 Gateway” and a default route from “WAN Tier-0
Gateway”. “Aggregate Tier-0 Gateway” provides ECMP for both corporate and WAN traffic
originating from any segments connected to it or connected to a Tier-1 southbound. Full mesh
connectivity is recommended for optimal traffic forwarding.


Figure 4-101: Multiple Tier-0 Topologies with Stateful and Stateless (ECMP) Services

A Tier-1 gateway usually connects to a Tier-0 gateway, but for some use cases it is possible to
interconnect it to another Tier-1 gateway using service interfaces (SI) and downlinks as depicted
in FIGURE 4-102. Static routing must be configured on both Tier-1 gateways in this case, as
dynamic routing is not supported. The Tier-0 gateway must be aware of all the overlay
segment prefixes to provide connectivity.
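
A minimal sketch of the static routing required in this topology is shown below: it adds a static route on an assumed Tier-1 gateway "Tenant1-T1" pointing at the service interface IP of the peer Tier-1. All names, prefixes, and next-hop addresses are assumptions; an equivalent route is needed in the opposite direction.

```python
# Minimal sketch: add a static route on a Tier-1 gateway via the Policy API.
# Gateway name, route ID, prefix, and next hop are assumptions.
import requests

NSX_MGR = "https://nsx-mgr.example.com"
AUTH = ("admin", "VMware1!VMware1!")
ROUTE = "/policy/api/v1/infra/tier-1s/Tenant1-T1/static-routes/to-peer-t1"

static_route = {
    "display_name": "to-peer-t1",
    "network": "172.16.20.0/24",                    # prefixes reachable behind the peer Tier-1
    "next_hops": [{"ip_address": "10.100.0.2",      # peer Tier-1 service interface IP
                   "admin_distance": 1}],
}

resp = requests.patch(f"{NSX_MGR}{ROUTE}", auth=AUTH, verify=False, json=static_route)
resp.raise_for_status()
print("Static route created/updated")
```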


Figure 4-102: Supported Topology – T1 gateways interconnected to each other using Service Interface and Downlink

Unsupported Topologies
While the deployment of logical routing components enables customers to deploy flexible
multi-tiered routing topologies, FIGURE 4-103 presents topologies that are not supported. The
topology on the left shows that a tenant Tier-1 gateway cannot be connected directly to
another tenant Tier-1 gateway using downlinks exclusively.
The rightmost topology highlights that a Tier-1 gateway cannot be connected to two different
upstream Tier-0 gateways.

Figure 4-103: Unsupported Topologies


5 NSX Security
In addition to providing network virtualization, NSX also serves as an advanced security
platform, providing a rich set of features to streamline the deployment of security solutions.
This chapter focuses on core NSX security capabilities, architecture, components, and
implementation. Key concepts for examination include:

● NSX distributed firewall (DFW) provides stateful protection of the workload at the vNIC
level. For ESXi, the DFW enforcement occurs in the hypervisor kernel, helping deliver
micro-segmentation. The DFW also extends to physical servers and containers,
providing distributed policy enforcement.
● Uniform security policy model for on-premises and public cloud deployment
● Agnostic to compute domain - supporting hypervisors managed by different compute-
managers while allowing any defined micro-segmentation policy to be applied across
hypervisors spanning multiple vCenter environments.
● Support for Layer 2, Layer 3, Layer 4, Layer-7 APP-ID, & Identity based firewall policies
provides security via protocol, port, and/or deeper packet/session intelligence to suit
diverse needs.
● NSX Gateway firewall serves as a centralized stateful firewall service for N-S traffic.
Gateway firewall is implemented per gateway and supported at both Tier-0 and Tier-1.
Gateway firewall is independent of NSX DFW from policy configuration and enforcement
perspective, providing a means for defining perimeter security control in addition to
distributed security control.
● Gateway & Distributed Firewall Service insertion capability to integrate existing security
investments using integration with partner ecosystem products on a granular basis
without the need for interrupting natural traffic flows.
● Distributed IDS extends IDS capabilities to every host in the environment.
● Dynamic grouping of objects into logical constructs called Groups based on various
criteria including tag, virtual machine name or operating system, subnet, and segments
which automates policy application.
● The scope of policy enforcement can be selective, with application or workload-level
granularity.
● Firewall Flood Protection capability to protect the workload & hypervisor resources.
● IP discovery mechanism dynamically identifies workload addressing.
● SpoofGuard blocks IP spoofing at vNIC level.
● Switch Security provides storm control and security against unauthorized traffic.

NSX 3.2 introduced advanced security features such as security on vCenter dvpgs, distributed
and centralized malware protection, centralized IDS/IPS, URL Filtering, extensive next
generation firewall App Identification support, Network Traffic Anomaly Detection, and
Network Detection and Response (NDR). These new features will be covered in the new version
of the SECURITY DESIGN GUIDE. This document covers the NSX core security features.

NSX Security Use Cases


The NSX security platform is designed to address the security challenges faced by IT admins.
Although it started with firewalling, the NSX security feature set has since grown to encompass
Identity Firewalling, IPS, and many more. The NSX firewall is delivered as part of a distributed
platform that offers ubiquitous enforcement, scalability, line rate performance, multi-
hypervisor support, and API-driven orchestration. These fundamental pillars of the NSX firewall
allow it to address many different use cases for production deployment.
One of the leading use cases NSX supports is micro-segmentation. Micro-segmentation enables
an organization to logically divide its data center into distinct security segments down to the
individual workload level, then define distinct security controls for and deliver services to each
unique segment. This is all possible without the need to change underlying network
architecture or addressing. A central benefit of micro-segmentation is its ability to deny
attackers the opportunity to pivot laterally within the internal network, even after the
perimeter has been breached.
VMware NSX supports micro-segmentation as it allows for a centrally controlled, operationally
distributed firewall to be attached directly to workloads within an organization’s network. The
distribution of the firewall for the application of security policy to protect individual workloads
is highly efficient; rules can be applied that are specific to the requirements of each workload.
Of additional value is that NSX’s capabilities are not limited to homogeneous vSphere
environments. NSX supports the heterogeneity of platforms and infrastructure that is common
in organizations today.


Figure 5-1: Example of Micro-segmentation with NSX

Micro-segmentation provided by NSX is an essential element of zero trust; specifically, it
embodies “making the access control enforcement as granular as possible.” (NIST ZTA
publication). It establishes a security perimeter around each VM or container workload with a
dynamically defined policy which can be down to the user level of granularity. Legacy security
models assume that everything on the inside of an organization's network can be trusted; zero-
trust assumes the opposite - trust nothing and verify everything. This addresses the increased
sophistication of networks attacks and insider threats that frequently exploit the conventional
perimeter-controlled approach. For each system in an organization's network, trust of the
underlying network is removed. A perimeter is defined per system within the network to limit
the possibility of lateral (i.e., East-West) movement of an attacker.
Implementation of a zero-trust architecture with traditional network security solutions can be
costly, complex, and come with a high management burden. Moreover, the lack of visibility for
organization's internal networks can slow down implementation of a zero-trust architecture
and leave gaps that may only be discovered after they have been exploited. Additionally,
conventional internal perimeters may have granularity only down to a VLAN or subnet – as is
common with many traditional DMZs – rather than down to the individual system.


NSX DFW Architecture and Components


The NSX DFW architecture management plane, control plane, and data plane work together to
enable a centralized policy configuration model with distributed firewalling. This section will
examine the role of each plane and its associated components, detailing how they interact with
each other to provide a scalable, topology agnostic distributed firewall solution.

Figure 5-2: NSX DFW Architecture and Components

Management Plane
The NSX management plane is implemented through NSX Managers. NSX Managers are
deployed as a cluster of 3 manager nodes. Access to the NSX Manager is available through a
GUI or REST API framework. When a firewall policy rule is configured, the NSX management
plane service validates the configuration and locally stores a persistent copy. Then the NSX
Manager pushes user-published policies to the control plane service within Manager Cluster
which in turn pushes to the data plane. A typical DFW policy configuration consists of one or
more sections with a set of rules using objects like Groups, Segments, and application level
gateway (ALGs). For monitoring and troubleshooting, the NSX Manager interacts with a host-
based management plane agent (MPA) to retrieve DFW status along with rule and flow
statistics. The NSX Manager also collects an inventory of all hosted virtualized workloads on
NSX transport nodes. This is dynamically collected and updated from all NSX transport nodes.

Control Plane
The NSX control plane consists of two components - the central control plane (CCP) and the
Local Control Plane (LCP). The CCP is implemented on NSX Manager Cluster, while the LCP
includes the user space module on all of the NSX transport nodes. This module interacts with
the CCP to exchange configuration and state information.


From a DFW policy configuration perspective, NSX Control plane will receive policy rules pushed
by the NSX Management plane. If the policy contains objects including segments or Groups, it
converts them into IP addresses using an object-to-IP mapping table. This table is maintained by
the control plane and updated using an IP discovery mechanism. Once the policy is converted
into a set of rules based on IP addresses, the CCP pushes the rules to the LCP on all the NSX
transport nodes.
The CCP utilizes a hierarchy system to distribute the load of CCP-to-LCP communication. The
responsibility for transport node notification is distributed across the managers in the manager
clusters based on an internal hashing mechanism. For example, for 30 transport nodes with
three managers, each manager will be responsible for roughly ten transport nodes.

Data Plane
The NSX transport nodes comprise the distributed data plane with DFW enforcement done at
the hypervisor kernel level. Each of the transport nodes, at any given time, connects to only one
of the CCP managers, based on mastership for that node. On each of the transport nodes, once
the local control plane (LCP) has received policy configuration from CCP, it pushes the firewall
policy and rules to the data plane filters (in kernel) for each of the virtual NICs. With the
“Applied To” field in the rule or section which defines scope of enforcement, the LCP makes
sure only relevant DFW rules are programmed on relevant virtual NICs instead of every rule
everywhere, which would be a suboptimal use of hypervisor resources.

NSX Data Plane Implementation

Data Plane Components


NSX uses the VDS on ESXi hosts for connecting virtual workloads and manages it with the NSX
Manager application. The NSX DFW kernel space implementation for ESXi is the same as the
implementation of NSX for vSphere – it uses the VMware Internetworking Service Insertion
Platform (vSIP) kernel module and kernel IO chain filters. FIGURE 5-3: NSX MANAGEMENT PLANE
COMPONENTS ON ESXI provides details on the data plane components for the ESXi host.


Figure 5-3: NSX Management Plane Components on ESXi

NSX DFW Policy Lookup and Packet Flow


In the data path, the DFW maintains two tables: a rule table and a connection tracker table. The
LCP populates the rule table with the configured policy rules, while the connection tracker table
is updated dynamically to cache flows permitted by rule table. NSX DFW can allow for a policy
to be stateful or stateless with section-level granularity in the DFW rule table. The connection
tracker table is populated only for stateful policy rules; it contains no information on stateless
policies.
NSX DFW rules are enforced as follows:
• Rules are processed in top-to-bottom order.
• Each packet is checked against the top rule in the rule table before moving down the
subsequent rules in the table.
• The first rule in the table that matches the traffic parameters is enforced. The search is
then terminated, so no subsequent rules will be examined or enforced.
Because of this behavior, it is always recommended to put the most granular policies at the top
of the rule table. This will ensure more specific policies are enforced first. The DFW default
policy rule, located at the bottom of the rule table, is a catchall rule; packets not matching any
other rule will be enforced by the default rule - which is set to “allow” by default. This ensures
that VM-to-VM communication is not broken during staging or migration phases. It is a best
practice to then change this default rule to a “drop” action and enforce access control through
an explicit allow model (i.e., only traffic defined in the firewall policy is allowed onto the
network). FIGURE 5-4: NSX DFW POLICY LOOKUP diagrams the policy rule lookup and packet flow.

Figure 5-4: NSX DFW Policy Lookup

In the example shown above:

1. WEB VM initiates a session to APP VM by sending TCP SYN packet.


2. The TCP SYN packet hits the DFW on the vNIC, which does a flow table lookup first to see
if any state matches an existing flow. Given it is the first packet of a new session, the
lookup results in “Flow state not found”.
3. Since the flow table lookup missed, the DFW does a rule table lookup in top-down order
for a 5-tuple match.
4. The flow matches firewall rule 2, whose action is Allow, so the packet is sent out to the
destination.
5. In addition, the flow table is updated with the new flow state for the permitted flow as
“Flow 2”.
Subsequent packets in this TCP session are checked against this flow in the flow table for the
state match. Once the session terminates, the flow information is removed from the flow table.


NSX Security Policy - Plan, Design and Implement

Planning, designing, and implementing NSX security policy is a three-step process:


1. Policy Methodology – Decide on the policy approach - application-centric,
infrastructure-centric, or network-centric
2. Policy Rule Model – Select grouping and management strategy for policy rules by the
NSX DFW policy categories and sections.
3. Policy Consumption – Implement the policy rules using the abstraction through grouping
constructs and options provided by NSX.

Security Policy Methodology


This section details the considerations behind policy creation strategies to help determine
which capabilities of the NSX platform should be utilized as well as how various grouping
methodologies and policy strategies can be adopted for a specific design.
The three general methodologies reviewed in FIGURE 5-5: MICRO-SEGMENTATION Methodologies
can be utilized for grouping application workloads and building security rule sets within the NSX
DFW. This section will look at each methodology and highlight appropriate usage.


Figure 5-5: Micro-segmentation Methodologies

5.4.1.1 Ethernet
The Ethernet Section of the policy is a Layer 2 firewalling section. All rules in this section must
use MAC Addresses for their source or destination objects. Any rule defined with any other
object type will be ignored.

5.4.1.2 Application
In an application-centric approach, grouping is based on the application type (e.g., VMs tagged
as “Web-Servers”), application environment (e.g., all resources tagged as “Production-Zone”)
and application security posture. An advantage of this approach is the security posture of the
application is not tied to network constructs or infrastructure. Security policies can move with
the application irrespective of network or infrastructure boundaries, allowing security teams to
focus on the policy rather than the architecture. Policies can be templated and reused across
instances of the same types of applications and workloads while following the application
lifecycle; they will be applied when the application is deployed and destroyed when the
application is decommissioned.
moving towards a self-service IT model. In an environment where there is strong adherence to
a strict naming convention, the VM substring grouping option allows for simple policy
definition.
An application-centric model does not provide significant benefits in an environment that is
static, lacks mobility, and has infrastructure functions that are properly demarcated.


5.4.1.3 Infrastructure
Infrastructure-centric grouping is based on infrastructure components such as segments or
segment ports, identifying where application VMs are connected. Security teams must work
closely with the network administrators to understand logical and physical boundaries.
If there are no physical or logical boundaries in the environment, then an infrastructure-centric
approach is not suitable.

5.4.1.4 Network

Network-centric is the traditional approach of grouping based on L2 or L3 elements. Grouping
can be done based on MAC addresses, IP addresses, or a combination of both. NSX supports
this approach of grouping objects. A security team needs to be aware of the networking
infrastructure to deploy network-based policies. There is a high probability of security rule
sprawl, as grouping based on dynamic attributes is not used. This method of grouping works
well for migrating existing rules from an existing firewall.

A network-centric approach is not recommended in dynamic environments where there is a
rapid rate of infrastructure change or VM addition/deletion.

Security Rule Model


Policy rule models in a data center are essential to achieve optimal micro-segmentation
strategies. The first criteria in developing a policy model is to align with the natural boundaries
in the data center, such as tiers of application, SLAs, isolation requirements, and zonal access
restrictions. Associating a top-level zone or boundary to a policy helps apply consistent, yet
flexible control.
Global changes for a zone can be applied via single policy; however, within the zone there could
be a secondary policy with sub-grouping mapping to a specific sub-zone. An example
production zone might itself be carved into sub-zones like PCI or HIPAA. There are also zones
for each department as well as shared services. Zoning creates relationships between various
groups, providing basic segmentation and policy strategies.
A second criterion in developing policy models is identifying reactions to security events and
workflows. If a vulnerability is discovered, what are the mitigation strategies? Where is the
source of the exposure – internal or external? Is the exposure limited to a specific application or
operating system version?
When east-west security is first implemented in a brownfield environment, there are two
common approaches, depending on corporate culture: either an incremental zonal approach
where one application is secured before moving to the next, or a top-down iterative approach
where first prod and non-prod are divided then each of those areas are further subdivided.
Regardless of the chosen approach, there will likely be a variety of security postures taken
within each zone. A lab zone, for example, may merely be ring-fenced with a policy that allows
any traffic type from lab device to lab device and only allows basic common services such as
LDAP, NTP, and DNS to penetrate the perimeter in. On the other end of the spectrum, any zone
containing regulated or sensitive data (such as customer info) will often be tightly defined
traffic between entities, many types being further inspected by partner L7 firewall offerings
using Service Insertion.
The answers to these questions help shape a policy rule model. Policy models should be flexible
enough to address ever-changing deployment scenarios, rather than simply be part of the initial
setup. Concepts such as intelligent grouping, tags and hierarchy provide flexible yet agile
response capability for steady state protection as well as during instantaneous threat response.
The model shown in FIGURE 5-6: SECURITY RULE MODEL represents an overview of the different
classifications of security rules that can be placed into the NSX DFW rule table. Each of the
classifications shown represents a category in the NSX firewall table layout. The firewall table
categories align with the best practice of organizing rules to help admins group policies by
category. Each firewall category can have one or more policies within it to organize the firewall
rules under that category.

Figure 5-6: Security Rule Model

Security Policy - Consumption Model


NSX security policy is consumed by the firewall rule table, which is configured using the NSX
Manager GUI or REST API framework. When defining security policy rules for the firewall table,
it is recommended to follow these high-level steps:


• VM Inventory Collection – Identify and organize a list of all hosted virtualized workloads
on NSX transport nodes. This is dynamically collected and saved by NSX Manager as the
ESXi hosts are added as NSX transport nodes.
• Tag Workload – Use VM inventory collection to organize VMs with one or more tags.
Each designation consists of scope and tag association of the workload to an application,
environment, or tenant. For example, a VM tag could be “Scope = Prod, Tag = web” or
“Scope=tenant-1, Tag = app-1”. Often, these categories will dive several layers deep
including BU, project, environment, and regulatory flags. When following the iterative
approach of segmentation, categories and tags can be added to entities with existing
tags. In the application centric approach, new categories can be added with each
application.
• Group Workloads – Use the NSX logical grouping construct with dynamic or static
membership criteria based on VM name, tags, segment, segment port, IP’s, or other
attributes. NSX allows for thousands of groups based on tags, although rarely are more
than a dozen or so needed.
• Define Security Policy – Using the firewall rule table, define the security policy. Have
categories and policies to separate and identify emergency, infrastructure, environment,
and application-specific policy rules based on the rule model.
The methodology and rule model mentioned earlier would influence how to tag and group the
workloads as well as affect policy definition. The following sections offer more details on
grouping and firewall rule table construction with an example of grouping objects and defining
NSX DFW policy.
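
To illustrate the tagging step of this workflow, the sketch below applies scope/tag pairs to a VM using the NSX Manager inventory API so that the VM can then be matched by dynamic Groups. The manager address, credentials, VM name, and tag values are assumptions; the tag-update action replaces the VM's current tag set with the one supplied.

```python
# Minimal sketch: tag a VM in the NSX inventory so dynamic Groups can match it.
# Manager address, credentials, VM name, and tag values are assumptions.
import requests

NSX_MGR = "https://nsx-mgr.example.com"
AUTH = ("admin", "VMware1!VMware1!")

session = requests.Session()
session.auth = AUTH
session.verify = False

# Look up the VM in the inventory to obtain its external ID.
vms = session.get(f"{NSX_MGR}/api/v1/fabric/virtual-machines").json()["results"]
vm = next(v for v in vms if v["display_name"] == "web-01")

# Apply the scope/tag pairs (e.g., Scope = Prod, Tag = web).
payload = {
    "external_id": vm["external_id"],
    "tags": [{"scope": "Prod", "tag": "web"},
             {"scope": "tenant-1", "tag": "app-1"}],
}
resp = session.post(f"{NSX_MGR}/api/v1/fabric/virtual-machines",
                    params={"action": "update_tags"}, json=payload)
resp.raise_for_status()
print("Tags applied to", vm["display_name"])
```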

5.4.3.1 Group Creation Strategies


The most basic grouping strategy is creation of a group around every application which is
hosted in the NSX environment. Each 3-tier, 2-tier, or single-tier application should have its
own security group to enable faster operationalization of micro-segmentation. When combined
with a basic rule restricting inter-application communication to only essential shared services
(e.g., DNS, AD, DHCP server) this enforces granular security inside the perimeter. Once this
basic micro-segmentation is in place, the writing of per-application rules can begin.
Groups
NSX provides collection of referenceable objects represented in a construct called Groups. The
selection of a specific policy methodology approach – application, infrastructure, or network –
will help dictate how grouping construct is used. Groups allow abstraction of workload grouping
from the underlying infrastructure topology. This allows a security policy to be written for
either a workload or zone (e.g., PCI zone, DMZ, or production environment).


A Group is a logical construct that allows grouping into a common container of static (e.g.,
IPSet/NSX objects) and dynamic (e.g., VM names/VM tags) elements. This is a generic construct
which can be leveraged across a variety of NSX features where applicable.
Static criteria provide capability to manually include particular objects into the Group. For
dynamic inclusion criteria, Boolean logic can be used to create groups between various criteria.
A Group creates a logical grouping of VMs based on static and dynamic criteria. TABLE 5-1: NSX
OBJECTS USED FOR GROUPS shows one type of grouping criteria based on NSX Objects.

NSX Object   | Description
IP Address   | Grouping of IP addresses and subnets.
Segment      | All VMs/vNICs connected to this segment/logical switch segment will be selected.
Group        | Nested (sub-group) collection of referenceable objects; all VMs/vNICs defined within the Group will be selected.
Segment Port | This particular vNIC instance will be selected.
MAC Address  | The selected MAC sets container will be used. MAC sets contain a list of individual MAC addresses.
AD Groups    | Grouping based on Active Directory groups for the Identity Firewall (VDI/RDSH) use case.

Table 5-1: NSX Objects used for Groups

TABLE 5-2: VM PROPERTIES USED FOR GROUPS lists the selection criteria based on VM properties.

VM Property   | Description
VM Name       | All VMs whose name contains/equals/starts with/does not equal the specified string.
Tags          | All VMs that are tagged with the specified NSX security tags.
OS Name       | All VMs with a specific operating system type and version.
Computer Name | All VMs whose hostname contains/equals/starts with/does not equal the specified string.

Table 5-2: VM Properties used for Groups
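
The sketch below shows how a Group with dynamic membership criteria similar to the examples in this section might be created through the Policy API. The domain ("default"), group ID, and criteria values are assumptions, and the value format used for tag conditions should be confirmed in the API guide.

```python
# Minimal sketch: create a Group with dynamic, tag- and name-based membership.
# Domain, group ID, and criteria values are assumptions.
import requests

NSX_MGR = "https://nsx-mgr.example.com"
AUTH = ("admin", "VMware1!VMware1!")
GROUP = "/policy/api/v1/infra/domains/default/groups/SG-PCI-APP"

group = {
    "display_name": "SG-PCI-APP",
    "expression": [
        # VMs whose name contains "APP" ...
        {"resource_type": "Condition", "member_type": "VirtualMachine",
         "key": "Name", "operator": "CONTAINS", "value": "APP"},
        {"resource_type": "ConjunctionOperator", "conjunction_operator": "AND"},
        # ... AND carrying the tag "PCI" (tag conditions commonly use a "scope|tag" format).
        {"resource_type": "Condition", "member_type": "VirtualMachine",
         "key": "Tag", "operator": "EQUALS", "value": "PCI"},
    ],
}

resp = requests.patch(f"{NSX_MGR}{GROUP}", auth=AUTH, verify=False, json=group)
resp.raise_for_status()
print("Group created/updated")
```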


The use of Groups gives more flexibility as an environment changes over time. This approach
has three major advantages:
• Rules stay more constant for a given policy model, even as the data center environment
changes. The addition or deletion of workloads will affect group membership alone, not
the rules.
• Publishing a change of group membership to the underlying hosts is more efficient than
publishing a rule change. It is faster to send down to all the affected hosts and cheaper in
terms of memory and CPU utilization.
• As NSX adds more grouping object criteria, the group criteria can be edited to better
reflect the data center environment.

Using Nested Groups


Groups can be nested. A Group may contain multiple groups or a combination of groups and
other grouping objects. A security rule applied to the parent Group is automatically applied to
the child Groups. Nesting should be limited to 3 levels, although more are supported. This is to
ease troubleshooting, minimize unintentional policy results, and to optimize the computational
burden of publishing policy. Nothing prolongs downtime like trying to follow the logic of a
grouping nested 5 levels deep.
In the example shown in FIGURE 5-7: GROUP AND NESTED GROUP Example, three Groups have
been defined with different inclusion criteria to demonstrate the flexibility and the power of
grouping construct.
• Using dynamic inclusion criteria, all VMs with name starting by “WEB” are included in
Group named “SG-WEB”.
• Using dynamic inclusion criteria, all VMs containing the name “APP” and having a tag
“Scope=PCI” are included in Group named “SG-PCI-APP”.
• Using static inclusion criteria, all VMs that are connected to a segment “SEG-DB” are
included in Group named “SG-DB”.
Nesting of Groups is also possible; all three of the Groups in the list above could be children of a
parent Group named “SG-APP-1-AllTier”. This organization is also shown in FIGURE 5-7: GROUP
AND NESTED GROUP EXAMPLE.


Figure 5-7: Group and Nested Group Example

Efficient Grouping Considerations


Calculation of groups adds a processing load to the NSX management and control planes.
Different grouping mechanisms add different types of loads. Static groupings are more efficient
than dynamic groupings in terms of calculation. At scale, grouping considerations should
consider the frequency of group changes for associated VMs. A large number of group changes
applied frequently means the grouping criteria is suboptimal.

5.4.3.2 Define Policy using DFW Rule Table


The NSX DFW rule table starts with a default rule to allow any traffic. An administrator can add
multiple policies on top of the default rule under different categories based on the specific
policy model. The NSX distributed firewall table layout consists of categories like Ethernet,
Emergency, Infrastructure, Environment, and Application to help users organize security
policies. Each category can have one or more policies/sections with one or more firewall rules.
Please refer to the Security Rule Model section above to understand the best practices around
organizing the policies.
In the data path, the packet lookup will be performed from top to bottom order, starting with
policies from category Ethernet, Emergency, Infrastructure, Environment and Application. Any
packet not matching an explicit rule will be enforced by the last rule in the table (i.e., default
rule). This final rule is set to the “allow” action by default, but it can be changed to “block” if
desired.
The NSX DFW enables policy to be stateful or stateless with policy-level granularity. By default,
NSX DFW is a stateful firewall; this is a requirement for most deployments. In some scenarios
where an application has less network activity, the stateless section may be appropriate to
avoid connection reset due to inactive timeout of the DFW stateful connection table.
NSX Firewall policy can also be locked by a user to avoid losing any update to policy with
multiple people editing same policy at the same time.
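
As a minimal sketch of how such a policy could be pushed through the Policy API, the example below publishes a stateful policy in the Application category with two rules. The group paths, predefined service path, policy ID, and sequence numbers are assumptions; in practice the groups would be created first and the Applied To scope set as discussed later in this section.

```python
# Minimal sketch: publish a DFW security policy with rules via the Policy API.
# Group paths, service path, policy ID, and credentials are assumptions.
import requests

NSX_MGR = "https://nsx-mgr.example.com"
AUTH = ("admin", "VMware1!VMware1!")
POLICY = "/policy/api/v1/infra/domains/default/security-policies/app1-policy"

policy = {
    "display_name": "App1-Policy",
    "category": "Application",
    "stateful": True,
    "rules": [
        {"resource_type": "Rule", "display_name": "web-to-app", "sequence_number": 10,
         "source_groups": ["/infra/domains/default/groups/SG-WEB"],
         "destination_groups": ["/infra/domains/default/groups/SG-APP"],
         "services": ["/infra/services/HTTPS"],
         "scope": ["/infra/domains/default/groups/SG-WEB",     # "Applied To"
                   "/infra/domains/default/groups/SG-APP"],
         "action": "ALLOW"},
        {"resource_type": "Rule", "display_name": "block-other", "sequence_number": 20,
         "source_groups": ["ANY"], "destination_groups": ["ANY"], "services": ["ANY"],
         "scope": ["/infra/domains/default/groups/SG-APP"],
         "action": "DROP"},
    ],
}

resp = requests.patch(f"{NSX_MGR}{POLICY}", auth=AUTH, verify=False, json=policy)
resp.raise_for_status()
print("Security policy published")
```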


Name | ID | Source | Destination | Service | Profiles | Applied To | Action | Advanced Settings | Stats

Table 5-3: Policy Rule Fields

A rule within a policy is composed of the fields shown in Table 5-3: Policy Rule Fields; their
meaning is described below.
Rule Name: User field; supports up to 30 characters.
ID: Unique rule ID auto generated by System. The rule id helps in monitoring and
troubleshooting. Firewall Log carries this Rule ID when rule logging is enabled.
Source and Destination: Source and destination fields of the packet. This will be a GROUP
which could be static or dynamic groups as mentioned under Group section.
Service: Predefined services, predefined services groups, or raw protocols can be selected.
When selecting raw protocols like TCP or UDP, it is possible to define individual port numbers or
a range. There are four options for the services field:
• Pre-defined Service – A pre-defined Service from the list of available objects.
• Add Custom Services – Define custom services by clicking on the “Create New Service”
option. Custom services can be created based on L4 Port Set, application level gateways
(ALGs), IP protocols, and other criteria. This is done using the “service type” option in the
configuration menu. When selecting an L4 port set with TCP or UDP, it is possible to
define individual destination ports or a range of destination ports. When selecting ALG,
select supported protocols for ALG from the list. ALGs are only supported in stateful
mode; if the section is marked as stateless, the ALGs will not be implemented.
• Custom Services Group – Define a custom Services group, selecting from single or
multiple services. Workflow is similar to adding Custom services, except you would be
adding multiple service entries.

Profiles: Used to select and define Layer 7 Application ID and FQDN profiles for Layer 7 based
security rules.
Applied To: Defines the scope of rule publishing. The policy rule could be published to all
workloads (default value) or restricted to a specific GROUP. When a GROUP is used in Applied
To, it needs to be based on non-IP members like VM objects, Segments, etc. Not using the
Applied To field can result in very large firewall tables being loaded on vNICs, which will
negatively affect performance.
Action: Define enforcement method for this policy rule; available options are listed in TABLE
5-4: FIREWALL RULE TABLE – “ACTION” Values


Action | Description
Drop   | Silently block the traffic.
Allow  | Allow the traffic.
Reject | Send back to the initiator: RST packets for TCP connections; ICMP unreachable with the network administratively prohibited code for UDP, ICMP, and other IP connections.

Table 5-4: Firewall Rule Table – “Action” Values

Advanced Settings: Following settings are under advanced settings options:


• Logging: Enable or disable packet logging. When enabled, each DFW enabled host will
send DFW packet logs in a syslog file called “dfwpktlog.log” to the configured syslog
server. This information from the default rule will provide insight to traffic not currently
being caught by existing policy. Best practice for deploying east-west traffic in a
brownfield environment is to define policy with a default allow rule with logging. This
allows for the verification of traffic not currently caught by policy.
• Direction: This field matches the direction of the packet, default both In-Out. It can be
set to match packet exiting the VM, entering the VM, or both directions.
• IP Protocol: By default, both IPv4 & IPv6 protocols. Option to choose IPv4 or IPv6.
• Log Label: You can Label the rule; this will be sent as part of DFW packet log when traffic
hits this rule.
• Notes: This field can be used for any free-flowing string and is useful to store comments.
Best practice is to use this field for change control by pointing to a ticket ID.
• Stats: Provides packet/byte/session statistics along with the popularity index associated
with that rule entry. Stats per rule are polled and aggregated every 15 minutes from all
the transport nodes.
Examples of Policy Rules for 3-Tier Application
FIGURE 5-8: 3-TIER APPLICATION NETWORK TOPOLOGY shows a standard 3-tier application topology
used to define NSX DFW policy. Three web servers are connected to “SEG Web”, two
application servers are connected to “SEG App”, and two DB servers are connected to “SEG DB”.
A distributed gateway is used to interconnect the three tiers by providing inter-tier routing. NSX
DFW has been enabled, so each VM has a dedicated instance of DFW attached to its
vNIC/segment port.

Figure 5-8: 3-Tier Application Network Topology

In order to define micro-segmentation policy for this application, use the Application category
in the DFW rule table and add a new policy section and rules within it for each application.
The following use cases present policy rules based on the different methodologies
introduced earlier.

Example 1: Static IP addresses/subnets Group in security policy rule.

This example shows use of the network methodology to define policy rule. Groups in this
example are identified in TABLE 5-5: FIREWALL RULE TABLE - EXAMPLE 1 – GROUP Definition while
the firewall policy configuration is shown in TABLE 5-6: FIREWALL RULE TABLE - EXAMPLE 1- Policy.

Group Name   | Group Definition
Group-WEB-IP | IP Members: 172.16.10.0/24
Group-APP-IP | IP Members: 172.16.20.0/24
Group-DB-IP  | IP Members: 172.16.30.0/24

Table 5-5: Firewall Rule Table - Example 1 – Group Definition


Name        | Source       | Destination  | Service                  | Action | Applied To
Any to Web  | Any          | Group-WEB-IP | https                    | Allow  | All
Web to App  | Group-WEB-IP | Group-APP-IP | <Enterprise Service Bus> | Allow  | All
App to DB   | Group-APP-IP | Group-DB-IP  | SQL                      | Allow  | All
Block-Other | Any          | Any          | Any                      | Drop   | All

Table 5-6: Firewall Rule Table - Example 1 - Policy

The DFW engine is able to enforce network traffic access control based on the provided
information. To use this type of construct, exact IP information is required for the policy rule.
This construct is quite static and does not fully leverage dynamic capabilities with modern cloud
systems.
Example 2: Using Segment object Group in Security Policy rule.
This example uses the infrastructure methodology to define policy rule. Groups in this example
are identified in TABLE 5-7: FIREWALL RULE TABLE - EXAMPLE 2 – GROUP Definition while the firewall
policy configuration is shown in TABLE 5-8: FIREWALL RULE TABLE - EXAMPLE 2 – Policy.

Group Name    | Group Definition
Group-SEG-WEB | Static inclusion: SEG-WEB
Group-SEG-APP | Static inclusion: SEG-APP
Group-SEG-DB  | Static inclusion: SEG-DB

Table 5-7: Firewall Rule Table - Example 2 – Group Definition

Name        | Source        | Destination   | Service                  | Action | Applied To
Any to Web  | Any           | Group-SEG-WEB | https                    | Allow  | Group-SEG-WEB
Web to App  | Group-SEG-WEB | Group-SEG-APP | <Enterprise Service Bus> | Allow  | Group-SEG-WEB, Group-SEG-APP
App to DB   | Group-SEG-APP | Group-SEG-DB  | SQL                      | Allow  | Group-SEG-APP, Group-SEG-DB
Block-Other | Any           | Any           | Any                      | Drop   | Group-SEG-WEB, Group-SEG-APP, Group-SEG-DB

Table 5-8: Firewall Rule Table - Example 2 – Policy

Reading this policy rule table would be easier for all teams in the organization, ranging from
security auditors to architects to operations. Any new VM connected on any segment will be
automatically secured with the corresponding security posture. For instance, a newly installed
web server will be seamlessly protected by the first policy rule with no human intervention,
while a VM disconnected from a segment will no longer have a security policy applied to it. This
type of construct fully leverages the dynamic nature of NSX. It will monitor VM connectivity at
any given point in time, and if a VM is no longer connected to a particular segment, any
associated security policies are removed.
This policy rule also uses the “Applied To” option to apply the policy to only relevant objects
rather than populating the rule everywhere. In this example, the first rule is applied to the vNIC
associated with “SEG-Web”. Use of “Applied To” is recommended to define the enforcement
point for the given rule for better resource usage.
Security policy and IP Discovery
Both the NSX DFW and the Gateway Firewall (GFW) have a dependency on VM-to-IP discovery,
which is used to translate objects to IP addresses before rules are pushed to the data path. This
is mainly required when the policy is defined using grouped objects. This VM-to-IP table is
maintained by the NSX control plane and populated by the IP discovery mechanism. IP discovery
is used as a central mechanism to ascertain the IP address of a VM. By default, this is done using
DHCP and ARP snooping, with VMware Tools available as another mechanism on ESXi hosts.
These discovered VM-to-IP mappings can be overridden by manual input if needed, and multiple
IP addresses are possible on a single vNIC. The IP and MAC addresses learned are added to the
VM-to-IP table. This table is used internally by NSX for SpoofGuard, ARP suppression, and
firewall object-to-IP translation.


Intrusion Detection
Much like distributed firewalling changed the game on firewalling, NSX distributed IDS/IPS changes the game on intrusion detection and prevention by providing a distributed, ubiquitous enforcement plane. As with the DFW, the distributed nature of the service allows IDS/IPS capacity to grow linearly with compute capacity.

Beyond ubiquity, distributing IDS/IPS brings added context. Legacy network intrusion detection and prevention systems are deployed centrally in the network and rely either on traffic being hairpinned through them or on a copy of the traffic being sent to them via techniques like SPAN or TAPs. These sensors typically match all traffic against all (or a broad set of) signatures and have very little context about the assets they are protecting. Applying all signatures to all traffic is very inefficient, as IDS/IPS, unlike firewalling, needs to look at the packet payload, not just the network headers; each signature matched against the traffic adds inspection overhead and potential latency. Also, because legacy network IDS/IPS appliances just see packets without context about the protected workloads, it is very difficult for security teams to determine the appropriate priority for each incident. Obviously, a successful intrusion against a vulnerable production database server holding mission-critical data needs more attention than someone on the IT staff triggering an IDS event by running a vulnerability scan.

Because the NSX distributed IDS/IPS is applied to the vNIC of every workload, traffic does not need to be hairpinned to a centralized appliance, and NSX can be very selective as to which signatures are applied. Signatures related to a Windows vulnerability do not need to be applied to Linux workloads, and servers running Apache do not need signatures that detect an exploit of a database service. Through the Guest Introspection framework and in-guest drivers, NSX has access to context about each guest, including the operating system version, logged-in users, and running processes. This context can be leveraged to selectively apply only the relevant signatures, not only reducing the processing impact but, more importantly, reducing the noise and the quantity of false positives compared to what would be seen if all signatures were applied to all traffic with a traditional appliance. For a detailed description of IDS configuration, see the NSX Product Documentation.

Service Insertion
The value of NSX security extends beyond NSX to your pre-existing security infrastructure; NSX is the mortar that ties your security bricks together to build a stronger wall. Legacy security strategies were intolerant of pre-existing security infrastructure: anyone who had a Checkpoint firewall and wanted to move to a Palo Alto Networks firewall would run the two managers side by side until the transition was complete, and troubleshooting during this transition period required a lot of chair swiveling. NSX brings a new model that complements pre-existing infrastructure. Service Insertion is the feature which allows NSX firewalls (both Gateway Firewall and DFW) to send traffic to legacy firewall infrastructure for processing. This can be done as granularly as at the port level, without any modification to the existing network architecture. Service Insertion not only sends traffic to other services for processing; it also offers a deep integration which allows the exchange of NSX Manager objects with SI service managers. For example, an NSX group comprising VMs whose names contain the substring “web” would be shared with the SI service manager. When a new VM is spun up and becomes a member of that group, the NSX Manager sends the update to the SI Service Manager so that policy can be consistently applied across platforms.

Additional Security Features


NSX extends the security solution beyond DFW with additional features to enhance data center
security posture on top of micro-segmentation. These features include:
• SpoofGuard - Provides protection against spoofing with MAC+IP+VLAN bindings. This can
be enforced at a per logical port level. The SpoofGuard feature requires static or dynamic
bindings (e.g., DHCP/ARP snooping) of IP+MAC for enforcement.
• Segment Security - Provides stateless L2 and L3 security to protect segment integrity by
filtering out malicious attacks (e.g., denial of service using broadcast/multicast storms)
and unauthorized traffic entering segment from VMs. This is accomplished by attaching
the segment security profile to a segment for enforcement. The segment security profile has options to allow or block bridge protocol data units (BPDUs), DHCP server/client traffic, and non-IP traffic. It also allows rate limiting of broadcast and multicast traffic, both transmitted and received.

NSX Security Enforcement – Agnostic to Network Isolation


The NSX security solution is agnostic to network isolation and topology requirements. Below
are the different possible deployment options for adapting NSX micro-segmentation policies
based on different network isolation requirements.
The consumption of security policies requires no changes from a policy planning, design, and implementation perspective. This applies to all of the deployment options mentioned below. However, the following initial provisioning steps are required to enforce NSX security policies:
a) Prepare the compute hosts for NSX.
b) Create VLAN or overlay segments in NSX based on the network isolation requirements.
c) Move the relevant workloads to the relevant VLAN or overlay segments/networks on the compute hosts for policy enforcement.


NSX Distributed Firewall for VLAN Backed workloads


This is a very common use case for customers who look at NSX purely as a platform for the micro-segmentation security use case, without changing the existing VLAN-backed network isolation. It is the ideal use case for a brownfield deployment where the customer wants to enhance the security posture of existing applications without changing the network design.
The following diagrams depict this use case with logical and physical topologies.

Figure 5-9: NSX DFW Logical topology – VLAN Backed Workloads

Figure 5-10: NSX DFW Physical Topology – VLAN Backed Workloads

NSX Distributed Firewall for Mix of VLAN and Overlay backed workloads
This use case mainly applies to customers who want to apply NSX micro-segmentation policies to all of their workloads while adopting NSX network virtualization (overlay) for their application networking needs in phases. This scenario may arise when a customer starts either deploying new applications with network virtualization or migrating existing applications in phases from VLAN to overlay-backed networking to obtain the advantages of NSX network virtualization. It is also common where certain applications prevent overlay-backed networking from being adopted fully (as described in the bridging section above). The order of operations in this environment is as follows: on egress, DFW processing happens first and overlay network processing second; on arrival at a remote host, overlay network processing happens first and DFW processing happens before traffic reaches the VM.
The following diagrams depict this use case with logical and physical topologies.
The following diagram depicts this use case with logical and physical topology.

Figure 5-11: NSX DFW Logical Topology – Mix of VLAN & Overlay Backed Workloads

Figure 5-12: NSX DFW Physical Topology – Mix of VLAN & Overlay Backed Workloads

NSX Distributed Firewall for Overlay Backed workloads


In this use case, all virtualized applications are hosted on, or have been moved from VLAN to, NSX overlay-backed networking from a network isolation perspective. This could apply to a greenfield deployment or to the final phase of a brownfield deployment where all virtualized applications have been migrated from VLAN to NSX overlay-backed networking.


In summary, the NSX platform enforces micro-segmentation policies irrespective of the network isolation model – VLAN, overlay, or a mix – without requiring changes to policy planning, design, or implementation. A user can define the NSX micro-segmentation policy once for an application, and it will continue to work as the application is migrated from VLAN-based networking to NSX overlay-backed networking.

Gateway Firewall
The NSX Gateway firewall provides essential perimeter firewall protection which can be used in addition to a physical perimeter firewall. The Gateway firewall service is part of the NSX Edge node in both bare metal and VM form factors. The Gateway firewall is useful in developing PCI zones, multi-tenant environments, or DevOps-style connectivity without forcing the inter-tenant or inter-zone traffic onto the physical network. The Gateway firewall data path uses the DPDK framework supported on the Edge to provide better throughput.

Consumption
The NSX Gateway firewall is instantiated per gateway and supported on both Tier-0 and Tier-1 gateways. The Gateway firewall works independently of the NSX DFW from a policy configuration and enforcement perspective, although objects can be shared with the DFW. A user can consume the Gateway firewall using either the GUI or the REST API framework provided by NSX Manager. The Gateway firewall configuration is similar to the DFW policy; it is defined as a set of individual rules within a section. Like the DFW, Gateway firewall rules can use logical objects, tagging, and grouping constructs (e.g., Groups) to build policies. Similarly, for L4 services in a rule, it is valid to use predefined Services, custom Services, predefined service groups, custom service groups, or TCP/UDP protocols with ports. The NSX Gateway firewall also supports Application Level Gateways (ALGs); the user can select an ALG and its supported protocols by choosing the appropriate type when defining the service. The Gateway firewall supports only FTP and TFTP as ALGs, and ALGs are only supported in stateful mode; if the section is marked as stateless, the ALGs will not be implemented.
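As an illustration of the consumption model, the hedged sketch below creates a single Tier-1 gateway firewall rule through the Policy API. The gateway policy “tenant-1-gw-policy”, the Tier-1 gateway “tenant-1”, and the referenced group are assumed to exist already; all IDs and the manager address are placeholders.

    # Hedged sketch: allow HTTPS to the web group, enforced on a Tier-1 gateway.
    import requests

    NSX_MANAGER = "https://nsx-manager.example.com"
    AUTH = ("admin", "password")

    rule = {
        "resource_type": "Rule",
        "display_name": "allow-web-in",
        "source_groups": ["ANY"],
        "destination_groups": ["/infra/domains/default/groups/Group-SEG-WEB"],
        "services": ["/infra/services/HTTPS"],        # predefined system service
        "action": "ALLOW",
        "scope": ["/infra/tier-1s/tenant-1"],          # enforce on this gateway
        "logged": True,
    }

    resp = requests.put(
        f"{NSX_MANAGER}/policy/api/v1/infra/domains/default/"
        "gateway-policies/tenant-1-gw-policy/rules/allow-web-in",
        json=rule, auth=AUTH, verify=False,
    )
    resp.raise_for_status()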
When partner services are leveraged through service insertion, the implementation requires
registering the NSX Manager on the partner management console and the registration of the
partner management console in the NSX manager. Once the two managers are integrated,
they will share relevant objects, which will improve security policy consistency across the
board.

Implementation
The Gateway firewall is an optional centralized firewall implemented on NSX Tier-0 gateway uplinks and Tier-1 gateway links. It is implemented on the Tier-0/Tier-1 SR component, which is hosted on the NSX Edge. The Tier-0 and Tier-1 Gateway firewall supports stateful firewalling only with active/standby or active/active stateful HA mode. It can also be enabled on a Tier-0 gateway in active/active stateless mode, though it will then work only in stateless mode. The Gateway firewall uses a similar model to the DFW for defining policy, and NSX grouping constructs can be used as well. Gateway firewall policy rules are organized using one or more policy sections in the firewall table of each Tier-0 and Tier-1 gateway. Firewalling at the perimeter allows for a coarse-grained policy definition, which can greatly reduce the size of the security policy inside.

Deployment Scenarios
This section provides two examples for possible deployment and data path implementation.

Gateway FW as Perimeter FW at Virtual and Physical Boundary


The Tier-0 Gateway firewall is used as a perimeter firewall between the physical and virtual domains, mainly for N-S traffic between the virtualized environment and the physical world. In this case, the Tier-0 SR component, which resides on the Edge node, enforces the firewall policy before traffic enters or leaves the NSX virtual environment. The E-W traffic continues to leverage the distributed routing and firewalling capabilities which NSX natively provides in the hypervisor. In addition to firewalling, the Tier-1 gateway can perform per-tenant NAT, which is highly desirable in containerized environments to reduce the consumption of IP addresses.

Figure 5-13: Tier-0 Gateway Firewall – Virtual-to-Physical Boundary

Gateway FW as Inter-tenant FW
The Tier-1 Gateway firewall is used as an inter-tenant firewall within an NSX virtual domain. It is used to define policies between different tenants that reside within an NSX environment. The firewall is enforced for traffic leaving the Tier-1 gateway and uses the Tier-1 SR component, which resides on the Edge node, to enforce the firewall policy before sending traffic to the Tier-0 gateway for further processing. The intra-tenant traffic continues to leverage the distributed routing and firewalling capabilities native to NSX.

Figure 5-14: Tier-1 Gateway Firewall - Inter-tenant

Endpoint Protection with NSX


NSX provides the Endpoint Protection platform to allow 3rd party partners to run agentless Anti-
Virus/Anti-Malware (AV/AM) capabilities for virtualized workloads on ESXi. Traditional AV/AM
services require agents be run inside the guest operating system of a virtual workload. These
agents can consume small amounts of resources for each workload on an ESXi host. In the case
of Horizon, VDI desktop hosts typically attempt to achieve high consolidation ratios on the ESXi
host, providing 10s to 100s of desktops per ESXi host. With each AV/AM agent inside the
virtualized workload consuming a small amount of virtual CPU and memory, the resource costs
can be noticeable and possibly reduce the overall number of virtual desktops an ESXi host can
accommodate, thus increasing the size and cost of the overall VDI deployment. The Guest
Introspection platform allows the AV/AM partner to remove their agent from the virtual
workload and provide the same services using a Service Virtual Machine (SVM) that is installed
on each host. These SVMs consume much less virtual CPU and memory overall than running
agents on every workload on the ESXi host. Removing the agent also removes that processing
tax from the VDI, resulting in greater individual VDI performance. Many AV/AM partner
solutions have the ability to add tags to the workloads based on the result of the AV/AM scan.
This allows for an automated immediate quarantine policy based on the result of the AV/AM
scan with the definition of DFW security rules based on the partner tags.


Figure 5-15: Endpoint Protection Components

The Endpoint Protection platform for NSX follows a simple three-step process.

Figure 5-16: Endpoint Protection Steps

Registration
Registration of the VMware Partner console with NSX and vCenter.


Deployment
Creating a Service Deployment of the VMware Partner SVM and deploying it to the ESXi clusters. The SVMs require a management network with which to talk to the Partner Management Console; this can be handled by an IP Pool in NSX or by DHCP from the network. Management networks must be on a VSS or VDS switch.
Consumption
Consumption of the Endpoint Protection platform consists of creating a Service Profile, which references the Service Deployment, and then creating an Endpoint Protection Policy with an Endpoint Rule that specifies which Service Profile should be applied to which NSX Group of virtual machines.

Recommendation for Security Deployments


This list provides best practices and recommendations for the NSX DFW. These can be used as guidelines while deploying an NSX security solution.
● For individual NSX software releases, always refer to the release notes, compatibility guides, hardening guide, and recommended configuration maximums.

● Exclude management components like vCenter Server, and security tools from the DFW
policy to avoid lockout, at least in the early days of DFW use. Once there is a level of
comfort and proficiency, the management components can be added back in with the
appropriate policy. This can be done by adding those VMs to the exclusion list.
● Use the Applied To field in the DFW to limit the rule growth on individual vNICs.
● Choose the policy methodology and rule model to enable optimum groupings and
policies for micro-segmentation.
● Use NSX tagging and grouping constructs to group an application or environment to its
natural boundaries. This will enable simpler policy management.
● Consider the flexibility and simplicity of a policy model for Day-2 operations. It should
address ever-changing deployment scenarios rather than simply be part of the initial
setup.
● Leverage DFW category and policies to group and manage policies based on the chosen
rule model. (e.g., emergency, infrastructure, environment, application...)
● Use an explicit allow model: create explicit rules for allowed traffic and change the DFW default rule from “allow” to “drop” (a Policy API sketch of this and of the exclusion list follows this list).
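The sketch below illustrates two of these recommendations through the Policy API: adding a group of management components to the DFW exclusion list and flipping the default Layer 3 rule to DROP. The IDs “default-layer3-section”/“default-layer3-rule” are typical defaults but should be verified (for example with a GET on the security policies) before use, and the group path is hypothetical.

    # Hedged sketch: exclusion list and default rule action (verify IDs first).
    import requests

    NSX_MANAGER = "https://nsx-manager.example.com"
    AUTH = ("admin", "password")

    # 1. Exclude management components (grouped beforehand) from DFW enforcement.
    #    Note: PATCH replaces the member list; read-modify-write in real automation.
    requests.patch(
        f"{NSX_MANAGER}/policy/api/v1/infra/settings/firewall/security/exclude-list",
        json={"members": ["/infra/domains/default/groups/management-components"]},
        auth=AUTH, verify=False,
    ).raise_for_status()

    # 2. Change the default Layer 3 DFW rule from ALLOW to DROP (only after the
    #    explicit allow rules are in place and validated).
    requests.patch(
        f"{NSX_MANAGER}/policy/api/v1/infra/domains/default/security-policies/"
        "default-layer3-section/rules/default-layer3-rule",
        json={"action": "DROP", "logged": True},
        auth=AUTH, verify=False,
    ).raise_for_status()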


A Practical Approach to Building Micro-Segmentation Policy


The ideal way to achieve a least-privilege security model is to define security policies that explicitly allow all the applications within your data center and deny all other traffic by default. However, profiling tens or hundreds of applications and building security policies for each is a big challenge. Oftentimes, application owners do not know very much about their application’s details, which further complicates the process, and it may take some time to profile an application and come up with a port-defined security policy. Instead of waiting for the profiling of an application to be complete, one can start with a basic outside-in fencing approach, defining broader security policies that enhance the security posture, and then gradually move to the desired explicit allow model as the application profiling is completed. The tag model of NSX allows this to be done quite easily, as one can start with just Prod and Non-prod tags, then add more detailed tags as appropriate (PCI, HR, Finance, etc.) to the applicable VMs.
(Note: Regardless of approach, experience has shown that the best starting point is common services such as DNS, NTP, AD, SNMP, etc. A great second step is the backup infrastructure.)
In this section we walk through a practical approach to start securing data center workloads with a phased, outside-in fencing approach while you work on profiling your applications toward a zero-trust model.
First, we lay out the data center topology and requirements. Then we walk through an approach to micro-segmentation policies in phases. This approach can be applied to both brownfield and greenfield deployments.

5.11.1.1 Data Center Topology and Requirements:


The following data center topology is used, which matches most customer data centers. This approach can be applied to both brownfield and greenfield deployments.


Figure 5-17: Data Center Topology Example

The data center has the following characteristics:

1. Application deployment is split into two zones – production and development.
2. Multiple applications are hosted in both the DEV and PROD zones.
3. All applications access the same set of common services, such as AD, DNS, and NTP.
The data center network has the following characteristics:
1. The zones have been assigned dedicated IP CIDR blocks.
a) The development zone has the 10.1.16.0/20 and 10.1.32.0/20 IP CIDR blocks assigned.
b) The production zone has the 10.2.16.0/20 and 10.2.32.0/20 IP CIDR blocks assigned.
2. Infrastructure services have the 10.3.16.0/24 subnet assigned.
3. Applications within a zone are given IP subnets from that zone’s CIDR block.
4. In most cases, application VMs belonging to the same application share the same L2 segments. In some cases, they have separate L2 segments, especially for databases. (In brownfield environments where L2 segments may contain mixed populations of workloads, one can easily create static groups of workloads.)
The data center security has the following requirements:
1. All applications need to be allowed to communicate with the common infrastructure services.
2. Between zones – workloads should not be allowed to communicate with each other.
3. Within a zone – application VMs belonging to a certain application should not be talking to other applications’ VMs.
4. Some applications within a zone share common database services which run within that zone.
5. Log all unauthorized communication between workloads for monitoring and compliance.

5.11.1.2 Phased approach for NSX micro-segmentation policies:

Phase-1: Define common-services policy.


Here are the suggested steps:


1. Define NSX Groups for each of the infrastructure services. The following example shows the groups for the DNS and NTP servers, with the IP addresses of the respective servers as group members.

Figure 5-18: NSX Groups Example

2. Define a policy for the common services, such as DNS and NTP, as in the figure below.
a) Define this policy under the Infrastructure category, as shown below.
b) Add two rules allowing all workloads to access the common services, using the groups created in step 1 above.
c) Use the Layer 7 context profiles DNS and NTP in the rules to further enhance the security posture.
d) Add a catch-all deny rule to deny any other destination for the common services, with logging enabled for compliance and for monitoring any unauthorized communication.
Note: If the management entities are not in an exclusion list, this section would need to have rules allowing the required protocols between the appropriate entities. See https://ports.vmware.com/home/vSphere for the ports for all VMware products. (A Policy API sketch of this phase follows the figure below.)


Figure 5-19: Common Services Policy Example
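The sketch below shows how the group and policy from this phase could be expressed through the Policy API. The server IP addresses, object IDs, and manager address are hypothetical; the predefined service and context-profile paths are the usual system ones but should be verified against your NSX version.

    # Hedged sketch of Phase-1: DNS server group + Infrastructure-category policy.
    import requests

    NSX_MANAGER = "https://nsx-manager.example.com"
    AUTH = ("admin", "password")
    BASE = f"{NSX_MANAGER}/policy/api/v1/infra/domains/default"

    # Group of DNS servers defined by IP address (example addresses from 10.3.16.0/24).
    requests.put(f"{BASE}/groups/Infra-DNS-Servers", auth=AUTH, verify=False, json={
        "display_name": "Infra-DNS-Servers",
        "expression": [{
            "resource_type": "IPAddressExpression",
            "ip_addresses": ["10.3.16.10", "10.3.16.11"],
        }],
    }).raise_for_status()

    # Infrastructure policy: allow DNS to the servers, drop DNS to anything else.
    requests.put(f"{BASE}/security-policies/infra-common-services", auth=AUTH,
                 verify=False, json={
        "display_name": "Infra-Common-Services",
        "category": "Infrastructure",
        "rules": [
            {"resource_type": "Rule", "display_name": "any-to-dns", "sequence_number": 1,
             "source_groups": ["ANY"],
             "destination_groups": ["/infra/domains/default/groups/Infra-DNS-Servers"],
             "services": ["/infra/services/DNS", "/infra/services/DNS-UDP"],
             "profiles": ["/infra/context-profiles/DNS"],   # L7 context profile
             "action": "ALLOW"},
            {"resource_type": "Rule", "display_name": "deny-other-dns", "sequence_number": 2,
             "source_groups": ["ANY"], "destination_groups": ["ANY"],
             "services": ["/infra/services/DNS", "/infra/services/DNS-UDP"],
             "action": "DROP", "logged": True},
        ],
    }).raise_for_status()

An equivalent group and pair of rules would be created for NTP.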

Phase-2: Define Segmentation around ZONES - by having an explicit allow policy between
ZONES
As per the requirements, define a policy between zones that denies any traffic between the zones. This can be done using IP CIDR blocks, as the data center zones have pre-assigned IP CIDR blocks. Alternatively, it can be done using workload tags or other approaches. However, the IP-group-based approach is simpler (the administrator has a pre-assigned IP CIDR block per zone), requires no additional workflow to tag workloads, and puts less load on the NSX Manager and control plane than the tag-based approach. A tag-based approach may add burden on the NSX Manager to compute and update policies in an environment with scale and churn. As a rule of thumb, the larger the IP block that can be defined in a rule, the more the policy can be optimized using CIDR blocks. In cases where there is no convenient CIDR block to group workloads, static groupings may be used to create entities without churn on the NSX Manager.
Here are the suggested steps:
1. Define two NSX Groups, one for each zone (Development and Production), say DC-ZONE-DEV-IP and DC-ZONE-PROD-IP, with the IP CIDR blocks associated with the respective zones as members.


Figure 5-20: Policies Between Zones Example

2. Define a policy in the Environment category, using the IP groups created in step 1, to restrict all communication between the Development and Production zones.
3. Enable logging for this policy to track all unauthorized communication attempts. (Note: In many industries, it is sufficient to log only the default action for troubleshooting purposes. In others, there may be a compliance mandate to log every action. Logging requirements are driven by the balance between storage costs and compliance requirements.) A Policy API sketch of this zone fencing follows the figure below.

Figure 5-21: Policy Example
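A possible Policy API rendering of this phase is sketched below; the group and policy IDs are illustrative, and the CIDR blocks are those assigned to the zones earlier in this section.

    # Hedged sketch of Phase-2: IP-block zone groups + Environment-category policy.
    import requests

    NSX_MANAGER = "https://nsx-manager.example.com"
    AUTH = ("admin", "password")
    BASE = f"{NSX_MANAGER}/policy/api/v1/infra/domains/default"

    zones = {
        "DC-ZONE-DEV-IP":  ["10.1.16.0/20", "10.1.32.0/20"],
        "DC-ZONE-PROD-IP": ["10.2.16.0/20", "10.2.32.0/20"],
    }
    for name, blocks in zones.items():
        requests.put(f"{BASE}/groups/{name}", auth=AUTH, verify=False, json={
            "display_name": name,
            "expression": [{"resource_type": "IPAddressExpression",
                            "ip_addresses": blocks}],
        }).raise_for_status()

    requests.put(f"{BASE}/security-policies/zone-segmentation", auth=AUTH,
                 verify=False, json={
        "display_name": "Zone-Segmentation",
        "category": "Environment",
        "rules": [
            {"resource_type": "Rule", "display_name": "deny-dev-to-prod", "sequence_number": 1,
             "source_groups": ["/infra/domains/default/groups/DC-ZONE-DEV-IP"],
             "destination_groups": ["/infra/domains/default/groups/DC-ZONE-PROD-IP"],
             "services": ["ANY"], "action": "DROP", "logged": True},
            {"resource_type": "Rule", "display_name": "deny-prod-to-dev", "sequence_number": 2,
             "source_groups": ["/infra/domains/default/groups/DC-ZONE-PROD-IP"],
             "destination_groups": ["/infra/domains/default/groups/DC-ZONE-DEV-IP"],
             "services": ["ANY"], "action": "DROP", "logged": True},
        ],
    }).raise_for_status()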

Phase-3: Define Segmentation around every Application, one at a time


This is a two-step approach to building a policy for an application. The first step is to put a fence around the application to establish a security boundary. The second step is to profile the application further in order to plan and build more granular, port-defined security policies between tiers.

• Start with the DEV zone first and identify an application to be micro-segmented, say DEV-ZONE-APP-1.
• Identify all VMs associated with the application within the zone.
• Check whether the application has its own dedicated network segments or IP subnets.
o If yes, you can leverage a segment- or IP-based group.
o If no, tag the application VMs with unique zone- and application-specific tags, say ZONE-DEV and APP-1.
• Check whether this application requires any communication other than infrastructure services and communication within the group. For example, the application is accessed from outside on HTTPS.
Once you have the above information about DEV-ZONE-APP-1, create segmentation around the application with the following steps:
1- Apply two tags, ZONE-DEV and APP-1, to all the VMs belonging to APP-1 in the DEV zone.

Figure 5-22: Segmentation Example

2- Create a group, say “ZONE-DEV-APP-1”, with criteria to match on tags equal to “ZONE-DEV” and “APP-1” (see the sketch after the figure below).


Figure 5-23: Group Example
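A hedged Policy API sketch of this tag-based group is shown below; the tag values match the ones applied in step 1, while the manager address, credentials, and group ID are placeholders.

    # Hedged sketch: dynamic group matching VMs carrying both tags.
    import requests

    NSX_MANAGER = "https://nsx-manager.example.com"
    AUTH = ("admin", "password")

    group = {
        "display_name": "ZONE-DEV-APP-1",
        "expression": [
            {"resource_type": "Condition", "member_type": "VirtualMachine",
             "key": "Tag", "operator": "EQUALS", "value": "ZONE-DEV"},
            {"resource_type": "ConjunctionOperator", "conjunction_operator": "AND"},
            {"resource_type": "Condition", "member_type": "VirtualMachine",
             "key": "Tag", "operator": "EQUALS", "value": "APP-1"},
        ],
    }
    requests.put(
        f"{NSX_MANAGER}/policy/api/v1/infra/domains/default/groups/ZONE-DEV-APP-1",
        json=group, auth=AUTH, verify=False,
    ).raise_for_status()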

3- Define a policy under the Application category with three rules, as in Figure 5-24: Application Policy Example.
a. Set “Applied To” to “ZONE-DEV-APP-1” to limit the scope of the policy to the application VMs only.
b. The first rule allows all internal communication between the application VMs. Enable logging for this rule to profile the application tiers and protocols. (Each log entry will contain 5-tuple details about every connection.)
c. The second rule allows access to the front end of the application from outside. Use the L7 context profile to allow only SSL traffic. The example below excludes sources from within the zone, so that the application is accessible only from outside and from the application’s own VMs (per rule one).
d. The default rule denies all other communication to the “ZONE-DEV-APP-1” VMs. Enable logging for compliance and for monitoring any unauthorized communication.

Figure 5-24: Application Policy Example

Phase-4: Review Logs for Application Profile.


Log entries identify the direction (In/Out) as well as the protocol, source IP address/port, and destination IP address/port for each flow. If using the log files for policy definition, it is often advisable to process them in a spreadsheet to sort the traffic. Typically, two sheets are created, one for IN traffic and one for OUT traffic. Each sheet is then sorted by port first and then by IP address (for IN traffic by destination IP address, for OUT traffic by source address). This sorting methodology allows for the grouping of multiple servers serving or accessing the same traffic. For each of these groupings, a rule can be inserted above rule 1 for the application, which prevents the known traffic from appearing in the log. Once sufficient confidence is gained that the application is completely understood (typically when the logs are empty), the original rule for ZONE-DEV-APP-1 can be removed. At this point, the security model has transitioned from zone-based to micro-segmentation. (Note: Certain environments – such as labs – may be best served by ring fencing, whereas other environments may wish to add service insertion for certain traffic types on top of micro-segmentation, such as sensitive financial information. The value of NSX is that it provides the means to implement the appropriate security in one environment without impacting the other.)
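The short sketch below illustrates this sorting workflow in script form. It assumes the log entries have already been exported to a simple CSV with columns direction, protocol, src_ip, src_port, dst_ip, dst_port (the raw dfwpktlogs format varies by version, so the extraction step is not shown); the file and column names are hypothetical.

    # Hedged sketch: group IN flows by destination port/IP and OUT flows by port/source IP.
    import csv
    from collections import defaultdict

    inbound, outbound = defaultdict(set), defaultdict(set)

    with open("dfw_flows.csv") as f:              # hypothetical export file
        for row in csv.DictReader(f):
            key = (row["protocol"], row["dst_port"])
            if row["direction"] == "IN":
                inbound[key].add(row["dst_ip"])   # servers offering the same service
            else:
                outbound[key].add(row["src_ip"])  # app VMs consuming the same service

    for title, table in (("IN", inbound), ("OUT", outbound)):
        print(f"== {title} ==")
        for (proto, port), ips in sorted(table.items()):
            print(f"{proto}/{port}: {', '.join(sorted(ips))}")

Each printed grouping is a candidate for an explicit rule above rule 1 of the application policy.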
Phase-5: Repeat Phase-3 for other applications and ZONES.
Repeat the same approach as in Phase-3 for the other applications, so that there is a security boundary around every application within ZONE-DEV and ZONE-PROD. Note that the securing of each of these applications can happen asynchronously, without impact to the others. This accommodates application-specific maintenance windows, where required.
Phase-6: Define Emergency policy, Kill Switch, in case of Security Event
An emergency policy is mainly leveraged for the following use cases and is enforced at the top of the firewall table:
1- To quarantine vulnerable or compromised workloads in order to protect other workloads.
2- To explicitly deny known bad actors by their IP subnet, based on geolocation or reputation.
This policy is defined in the Emergency category as shown:
1- The first two rules quarantine all traffic from workloads belonging to the group GRP-QUARANTINE.
a. “GRP-QUARANTINE” is a group which matches all VMs with a tag equal to “QUARANTINE”. (If guest introspection is implemented, the AV/AM tags can be used to define different quarantine levels.)
b. In order to enforce this policy on vulnerable VMs, add the tag “QUARANTINE” to isolate them and allow only administrators to access the hosts to fix the vulnerability.
2- The other two rules use a group with known bad IPs to stop any communication with those IPs.


Figure 5-25: Emergency Category Example

In creating these policies, the iterative addition of rules to the policy is something that can be
done at any time. It is only when the action of the default rule changes from allow to
deny/drop that a maintenance window is advised. As logging has been on throughout the
process, it is highly unusual to see an application break during the window. What is most
frequently the case is that something within the next week or month may emerge as an
unforeseen rule that was missed. For this reason, it is advised that even in environments where
compliance does not dictate the collection of logs, the Deny All rule be set to logging. Aside
from the security value of understanding the traffic that is being blocked, the Deny All rule logs
are very useful when troubleshooting applications.
At this point you have a basic level of micro-segmentation policy applied to all workloads to shrink the attack surface. As a next step, you further break each application into its tiers and their communication by profiling application flows using firewall logs or by exporting IPFIX flows to a Network Insight platform. This helps to group the application workloads based on their function within the application and to define policy based on the associated ports and protocols used. Once these groupings and protocols are identified for a given application, update the policy for that application by creating additional groups and rules with the right protocols, producing granularly defined rules one application at a time.
With this approach you start micro-segmentation with outside-in fencing and finally arrive at a granular, port-based micro-segmentation policy for all applications.

NSX Firewall- For All Deployment Scenario


The NSX firewall provides different security controls – Distributed Firewall, Distributed IDS/IPS, Gateway Firewall, and Bridge Firewall – as options for providing firewalling in different deployment scenarios.


A typical data center has different workloads – VMs, containers, and physical servers – and a mix of NSX-managed and non-managed workloads. These workloads may also use a combination of VLAN-based networks and NSX overlay-based networks.

The following Figure 5-26: NSX Firewall for All Deployment Scenarios summarizes the different data center deployment scenarios and the associated NSX firewall security controls that best fit each design. The same NSX Manager can be used as a single pane of glass to define security policies for all of these scenarios using the different security controls.

1- NSX-managed workloads with standard VLAN-based networking:
• The NSX Distributed Firewall can be used to protect NSX-managed VM, container, and physical server workloads.
2- NSX-managed workloads with NSX overlay networking:
• The NSX Distributed Firewall can be used to protect NSX-managed VM, container, and physical server workloads.
3- Non-NSX-managed workloads on a traditional VLAN-based network:
• The NSX Gateway Firewall can provide inter-VLAN routing and firewalling. The Service Interface on the NSX gateway is used as the gateway and firewall for all non-NSX-managed VLAN workloads.
4- NSX-managed overlay workloads bridged to a non-NSX-managed VLAN:
• This is the bridge scenario, where an overlay network is extended at Layer 2 into a VLAN network using the NSX Bridge. In this case, NSX-managed overlay workloads can use the DFW/distributed IDS, and the Bridge Firewall can secure traffic at the boundary between the VLAN and overlay networks.


Figure 5-26: NSX Firewall For all Deployment Scenario


6 NSX Load Balancer


NSX Load Balancer Roadmap
This chapter presents an overview of the NSX Native Load Balancer. Starting with NSX version
3.2, NSX native load balancing features will not be added or enhanced, and NSX platform
enhancements will not be extended to the NSX native load balancer. The official announcement
is part of the NSX 3.2 release notes. The release notes include the following recommendations:
• It is recommended that new deployments with NSX take advantage of VMware NSX
Advanced Load Balancer (Avi) using release v20.1.6 or later and not use the native NSX
Load Balancer.
• If you are using Load Balancing in NSX, you are advised to migrate to VMware NSX
Advanced Load Balancer (Avi), which provides a superset of the NSX load balancing
functionality.
• If you have purchased NSX Advanced, NSX Enterprise Plus, or NSX Enterprise, you are entitled to the Basic edition of VMware NSX Advanced Load Balancer (Avi), which has feature parity with NSX LB.
• It is recommended that you purchase VMware NSX Advanced Load Balancer (Avi)
Enterprise to unlock enterprise grade load balancing, GSLB, advanced analytics,
container ingress, application security, and WAF.
The NSX Design Guide for Avi Vantage describes the integration between NSX Advanced Load Balancer (Avi) and NSX Data Center. The process outlined in that document is the recommended integration path, and any new deployments should follow its guidelines.
NSX 3.2 introduced additional integrations between NSX Datacenter and NSX Advanced Load
Balancer at the UI/API level. Such functionalities include:
1. An installation workflow that allows deploying NSX Advanced Load Balancer through the NSX Data Center UI.
2. Launching the VMware NSX ALB (Avi) UI from the NSX Manager.
3. Configuring VMware NSX Advanced Load Balancer (Avi) from within NSX Manager.
4. Using the NSX Policy API to manage the NSX Advanced Load Balancer configuration of virtual services and their dependent objects (since NSX 3.1.1).
Integrations 3 and 4 have been deprecated and will be removed in a future release. For this reason, we recommend avoiding making them part of any deployment. The official deprecation announcement is available in this knowledge base article.


Load Balancing Overview


A load-balancer defines a virtual service, or virtual server, identified by a virtual IP address (VIP)
and a UDP/TCP port. This virtual server offers an external representation of an application while
decoupling it from its physical implementation: traffic received by the load balancer can be
distributed to other network-attached devices that will perform the service as if it was handled
by the virtual server itself. This model is popular as it provides benefits for application scale-out
and high-availability:
• Application scale-out:
The following diagram represents traffic sent by users to the VIP of a virtual server,
running on a load balancer. This traffic is distributed across the members of a pre-
defined pool of capacity.

Figure 6-1: Load Balancing Offers Application Scale-out

The server pool can include an arbitrary mix of physical servers, VMs or containers that
together, allow scaling out the application.
• Application high-availability:
The load balancer is also tracking the health of the servers and can transparently
remove a failing server from the pool, redistributing the traffic it was handling to the
other members:

Figure 6-2: Load Balancing Offers Application High-availability

Modern applications are often built around advanced load balancing capabilities, which go far
beyond the initial benefits of scale and availability. In the example below, the load balancer
selects different target servers based on the URL of the requests received at the VIP:


Figure 6-3: Load Balancing Offers Advanced Application Load Balancing

Thanks to its native capabilities, modern applications can be deployed in NSX without requiring
any third party physical or virtual load balancer. The next sections in this part describe the
architecture of the NSX load balancer and its deployment modes.

NSX Load Balancing Architecture


In order to make its adoption straightforward, the different constructs associated with the NSX load balancer have been kept similar to those of a physical load balancer. The following diagram shows a logical view of those components.


Figure 6-4: NSX Load Balancing Main Components

Load Balancer
The NSX load balancer is running on a Tier-1 gateway. The arrows in the above diagram
represent a dependency: the two load balancers LB1 and LB2 are respectively attached to the
Tier-1 gateways 1 and 2. Load balancers can only be attached to Tier-1 gateways (not Tier-0
gateways), and one Tier-1 gateway can only have one load balancer attached to it.
Virtual Server
On a load balancer, the user can define one or more virtual servers (the maximum number depends on the load balancer form factor – see the NSX Administrator Guide for load balancer scale information). As mentioned earlier, a virtual server is defined by a VIP and a TCP/UDP port number, for example IP 20.20.20.20 and TCP port 80. The diagram represents four virtual servers: VS1, VS2, VS5, and VS6. A virtual server can have basic or advanced load balancing options, such as forwarding specific client requests to specific pools (see below), redirecting them to external sites, or even blocking them.
Pool
A pool is a construct grouping the servers hosting the same application. Grouping can be configured using server IP addresses or, for more flexibility, using Groups. NSX provides advanced load balancing rules that allow a virtual server to forward traffic to multiple pools. In the above diagram, for example, virtual server VS2 could load balance image requests to Pool2 while directing other requests to Pool3.
Monitor
A monitor defines how the load balancer tests application availability. Those tests can range
from basic ICMP requests to matching patterns in complex HTTPS queries. The health of the
individual pool members is then validated according to a simple check (server replied), or more
advanced ones, like checking whether a web page response contains a specific string.
Monitors are specified per pool: a single pool can use only one monitor, but the same monitor can be used by different pools.
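The sketch below ties these constructs together as Policy API objects (load balancer service, pool with monitor, and virtual server). It assumes a Tier-1 gateway named “tenant-1” already exists; the default application-profile and monitor paths shown are the usual system defaults but should be verified against your NSX version, and all other IDs are placeholders. (For new designs, prefer NSX Advanced Load Balancer, as noted at the beginning of this chapter.)

    # Hedged sketch: LB service, pool, and virtual server via the Policy API.
    import requests

    NSX_MANAGER = "https://nsx-manager.example.com"
    AUTH = ("admin", "password")
    BASE = f"{NSX_MANAGER}/policy/api/v1/infra"

    # LB service attached to the Tier-1 gateway (one LB per Tier-1).
    requests.put(f"{BASE}/lb-services/lb1", auth=AUTH, verify=False, json={
        "display_name": "LB1",
        "connectivity_path": "/infra/tier-1s/tenant-1",
        "size": "SMALL",
    }).raise_for_status()

    # Pool of web servers with an HTTP health monitor.
    requests.put(f"{BASE}/lb-pools/web-pool", auth=AUTH, verify=False, json={
        "display_name": "web-pool",
        "members": [{"ip_address": "172.16.10.11", "port": "80"},
                    {"ip_address": "172.16.10.12", "port": "80"}],
        "active_monitor_paths": ["/infra/lb-monitor-profiles/default-http-lb-monitor"],
    }).raise_for_status()

    # Virtual server (VIP) tied to the LB service and the pool.
    requests.put(f"{BASE}/lb-virtual-servers/web-vip", auth=AUTH, verify=False, json={
        "display_name": "web-vip",
        "ip_address": "20.20.20.20",
        "ports": ["80"],
        "lb_service_path": "/infra/lb-services/lb1",
        "pool_path": "/infra/lb-pools/web-pool",
        "application_profile_path": "/infra/lb-app-profiles/default-http-lb-app-profile",
    }).raise_for_status()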

NSX Load Balancing deployment modes


The NSX load balancer is flexible and can be deployed in either traditional in-line or one-arm topologies. This section goes over each of those options and examines their traffic patterns.

In-line load balancing


In in-line load balancing mode, the clients and the pool servers are on different sides of the load balancer. In the design below, the clients are on the Tier-1 uplink side, and the servers are on the Tier-1 downlink side:

Figure 6-5: In-Line Load Balancing


Because the traffic between clients and servers necessarily goes through the load balancer, there is no need to perform any LB Source-NAT (load balancer network address translation at the virtual server VIP).
The in-line mode is the simplest load balancer deployment model. Its main benefit is that the pool members can directly identify the clients from the source IP address, which is passed unchanged (step 2). The load balancer being a centralized service, it is instantiated on a Tier-1 gateway SR (Service Router). The drawback of this model is that, because the Tier-1 gateway now has a centralized component, East-West traffic for segments behind different Tier-1 gateways will be pinned to an Edge node in order to get to the SR. This is the case even for traffic that does not need to go through the load balancer.

One-arm load balancing


In one-arm load balancing mode, both client traffic (client traffic to the load balancer VIP) and server traffic (load balancer to server) use the same load balancer interface. In that case, LB-SNAT is used to make sure that the traffic from the servers back to the clients indeed goes through the load balancer. There are two variations on this one-arm load balancing scenario: a case where both clients and servers are on the same subnet, and a case where they are on different subnets. In both cases the solution leverages load balancer source NAT to ensure that traffic from a server to its clients is directed to the load balancer. As a result, the server will not see the real IP address of the clients. Note that the load balancer can inject an “X-Forwarded-For” header for HTTP/HTTPS traffic in order to work around this issue.

6.4.2.1 Clients and servers on the same subnet


In the design below, the clients and servers are on the same Tier-1 gateway downlink.


Figure 6-6: One-Arm Load Balancing with Clients and Servers on the same segment

The need for a Tier-1 SR for the centralized load balancer service results in East-West traffic for segments behind different Tier-1 gateways being pinned to an Edge node. This is the same drawback as for the in-line model described in the previous part.

6.4.2.2 Load Balancer One-Arm attached to Segment


In the design below, the blue Tier-1 gateway does not run any load-balancer service. Instead,
the load-balancer has been deployed as a standalone Tier-1 gateway (represented in orange in
the diagram), with a single Service Interface. This gateway is acting as an appliance instantiating
a load-balancer. This way, several segments below the blue Tier-1 gateway can have their own
dedicated load-balancer.

Figure 6-7: Load Balancer One-Arm attached to segment overlay

This design allows for better horizontal scale, as an individual segment can have its own dedicated load balancer service appliance(s). This flexibility in the assignment of load balancing resources comes at the expense of potentially instantiating several additional Tier-1 SRs on several Edge nodes. Because the load balancer service has its own dedicated appliance, East-West traffic for segments behind the Tier-1 gateway (the blue Tier-1 gateway in the above diagram) can still be distributed. The diagram above represents a Tier-1 one-arm load balancer attached to an overlay segment.


Figure 6-8: Load Balancer One-Arm attached to segment VLAN

A Tier-1 one-arm load balancer can also be attached to physical VLAN segments, as shown in the above figure, thus offering load balancing services even for applications on VLANs. In this use case, the Tier-1 interface is also a Service Interface, but this time connected to a VLAN segment instead of an overlay segment.

NSX load-balancing technical details


This section provides additional details on how the load-balancer components are physically
implemented in an NSX environment. Even if it’s not necessary for implementing the designs
described in the previous part, understanding the traffic flow between the components, the
high-availability model or the way the monitor service is implemented will help the reader
optimize resource usage in their network.

Load-balancer high-availability
The load-balancer is a centralized service running on a Tier-1 gateway, meaning that it runs on a
Tier-1 gateway Service Router (SR). The load-balancer will thus run on the Edge node of its
associated Tier-1 SR, and its redundancy model will follow the Edge high-availability design.


Figure 6-9: Load Balancing High-Availability

The above diagram represents two Edge nodes hosting three redundant Tier-1 SRs with a load-
balancer each. The Edge High Availability (HA) model is based on periodic keep alive messages
exchanged between each pair of Edges in an Edge Cluster. This keepalive protects against the
loss of an Edge as a whole. In the above diagram, should Edge node 2 go down, the standby
green SR on Edge node 1, along with its associated load-balancer, would become active
immediately.
There is a second messaging protocol between the Edges. This one is event driven (not
periodic), and per-application. This means that if a failure of the load-balancer of the red Tier-1
SR on Edge node 1 is detected, this mechanism can trigger a failover of just this red Tier-1 SR
from Edge node 1 to Edge node 2, without impacting the other services.
The active load balancer service will always synchronize the following information to the
standby load balancer:
• State Synchronization
• L4 Flow State
• Source-IP Persistence State
• Monitor State
This way, in case of failover, the standby load balancer (and its associated Tier-1 SR) can
immediately take over with minimal traffic interruption.


Load-balancer monitor
The pools targeted by the virtual servers configured on a load balancer have their monitor services running on the same load balancer. This ensures that the monitor service cannot fail without the load balancer failing itself (fate sharing). The left part of the following diagram represents the same example of the relationship between the different load balancer components as the one used in part 6.3. The right part of the diagram provides an example of where those components would be physically located in a real-life scenario.

Figure 6-10: NSX Load Balancer Monitor

Here, LB1 is a load-balancer attached to Tier-1 Gateway 1 and running two virtual servers VS1
and VS2. The SR for Tier-1 Gateway 1 is instantiated on Edge 1. Similarly, load-balancer LB2 is
on gateway Tier-1 Gateway 2, running VS5 and VS6.
Monitor1 and Monitor2 protect the server pools Pool1, Pool2, and Pool3 used by LB1. As a result, both Monitor1 and Monitor2 are implemented on the SR where LB1 resides. Monitor2 also polls servers used by LB2, so it is also implemented on the SR where LB2 is running. The Monitor2 example highlights the fact that a monitor service can be instantiated in several physical locations and that a given pool can be monitored from different SRs.


Load-balancer Layer4 or Layer7 load balancing


The NSX LB offers Layer 4 (L4) and Layer 7 (L7) load balancing. An L4 VIP load balances UDP and TCP connections. The client connection is load balanced by the VIP, and the connection is terminated by one of the pool members.

Figure 6-11: NSX L4 VIP

An L7 VIP load balances HTTP or HTTPS connections. The client connection is terminated by the VIP, and once the client’s HTTP or HTTPS request is received, the load balancer establishes another connection to one of the pool members. If needed, some specific load balancing configuration can also be done by the L7 VIP, like selecting specific pool members based on the request.
(Figure: the L7 VIP 30.30.30.30:80 directs requests for www.mysite.com to the “www” pool and requests for blog.mysite.com to the “blog” pool.)

Figure 6-12: NSX L7 VIP

For L7 HTTPS VIPs, the NSX LB offers three modes: HTTPS Off-Load, HTTPS End-to-End SSL, and SSL Passthrough.
HTTPS Off-Load decrypts the HTTPS traffic at the VIP and forwards the traffic in clear HTTP to the pool members. It offers the best balance between security, performance, and LB flexibility:
• Security: traffic is encrypted on the external side.
• Performance: web servers don’t have to run encryption.
• LB flexibility: all advanced configuration on HTTP traffic is available, such as URL load balancing.



Figure 6-13: NSX L7 HTTPS Off-Load VIP

HTTPS End-to-End SSL decrypts the HTTPS traffic at the VIP and re-encrypts the traffic in another HTTPS session to the pool members. It offers the best security and LB flexibility:
• Security: traffic is encrypted end to end.
• Performance: this mode has lower performance, with traffic decrypted and re-encrypted twice.
• LB flexibility: all advanced configuration on HTTP traffic is available, such as URL load balancing.


Figure 6-14: NSX L7 HTTPS End-to-End VIP

SSL Passthrough does not decrypt the HTTPS traffic at the VIP, and the SSL connection is terminated on the pool members. It offers the best security and performance, but with limited LB flexibility:
• Security: traffic is encrypted end to end.
• Performance: highest performance, since the LB does not terminate SSL traffic.
• LB flexibility: advanced configuration based on HTTP traffic is not available; only advanced configuration based on SSL traffic is available, such as SSL SNI load balancing.



Figure 6-15: NSX L7 SSL Passthrough VIP

Load-balancer IPv6
Like many other NSX network and security services, the NSX LB offers its service to both IPv4 and IPv6 clients.
(Figure: with an IPv4 VIP, IPv4 clients are load balanced to IPv4 pool members; with an IPv6 VIP, IPv6 clients are load balanced to IPv6 pool members.)

Figure 6-16: NSX LB IPv4 and IPv6

Load-balancer traffic flows


A Tier-1 gateway looks like a single entity from a logical point of view. However, as mentioned several times already, when a load balancer is configured on a Tier-1 gateway, it is physically instantiated on a Tier-1 gateway Service Router. This part explores the scenarios where the logical representation of a Tier-1 gateway, hiding the distinction between SR and DR, can lead to confusion.

6.5.5.1 The in-line model


With the in-line model, traffic between the clients and the servers necessarily goes through the load balancer. Thanks to this property, there is no need for source LB-SNAT in order to make sure that traffic goes through the load balancer both ways. The following diagram shows both the logical and physical representations of a Tier-1 gateway used to host a load balancer operating in-line. Clearly, traffic from the clients must go through the Tier-1 SR where the load balancer is instantiated in order to reach the servers, and vice versa:

Figure 6-17: In-Line Model: Logical and Expanded View

The following diagram represents another scenario that, from a logical standpoint at least,
looks like an in-line load-balancer design. However, source LB-SNAT is required in this design,
even if the traffic between the clients and the servers cannot apparently avoid the Tier-1
gateway where the load-balancer is instantiated.

Figure 6-18: Load Balancing VIP IP@ in Tier-1 Downlink Subnet – Tier-1 Expanded View


The following expanded view, where the Tier-1 SR and DR are represented as distinct entities hosted physically in different locations in the network, clarifies the reason why source LB-SNAT is mandatory:

Figure 6-19: Load Balancing VIP IP@ in Tier-1 Downlink Subnet – Tier-1 Expanded View

Traffic from server to client would be switched directly by the Tier-1 DR without going through
the load-balancer on the SR if source LB-SNAT was not configured. This design is not in fact a
true in-line deployment of the load-balancer and does require LB-SNAT.

6.5.5.2 One-arm model


From a logical standpoint, the VIP of a virtual server belongs to the subnet of the downlink of the Tier-1 gateway associated with the load balancer. The following diagram represents a load balancer on a Tier-1 gateway with a downlink to subnet 10.0.0.0/24. The Tier-1 gateway interface has the IP address 10.0.0.1, and a virtual server with VIP 10.0.0.6 has been configured on the load balancer.


Figure 6-20: Load Balancing VIP IP@ in Tier-1 Downlink Subnet – Logical View

The diagram below offers a possible physical representation of the same network, where the
Tier-1 gateway is broken down between an SR on an Edge Node, and a DR on the host where
both client and servers are instantiated (note that, in order to simplify the representation, the
DR on the Edge was omitted.)

Figure 6-21: Load Balancing VIP IP@ in Tier-1 Downlink Subnet – Tier-1 Expanded View


This representation makes it clear that, because the VIP is not physically instantiated on the DR even though it belongs to the subnet of the Tier-1 gateway downlink, some additional “plumbing” is needed to make sure that traffic destined to the load balancer reaches its destination. Thus, NSX configures proxy ARP on the DR to answer local requests for the VIP and adds a static route for the VIP pointing to the SR (represented in red in the diagram).

Load-balancing combined with SR services (NAT and Firewall)


Since NSX 2.4, the load balancing service can be inserted in a service chain along with NAT and the centralized firewall. In the case of service chaining, the order is NAT first, then the centralized firewall, and finally load balancing.

Figure 6-22: LB + NAT + FW Services Chaining


7 NSX Design Considerations


This section examines the technical details of a typical NSX-based enterprise data center design.
It looks at the physical infrastructure and requirements and discusses the design considerations
for specific components of NSX. Central concepts include:
• NSX Deployment models
• Physical network fabric considerations
• Connectivity of management and control plane components (NSX Manager.)
• Design for connecting the compute hosts running ESXi
• Design for the NSX Edge and Edge clusters.
• Organization of compute domains and NSX resources.
• Review of sample deployment scenarios.

NSX Deployment Models

Figure 7-1: NSX Deployment Models

The first step in planning an NSX installation is understanding the deployment model to adopt. NSX can be adopted in three modes. They are not mutually exclusive, as they can coexist in the same NSX installation and, in some cases, even in the same vSphere cluster. The three deployment models map to different NSX use cases: virtual workload security, network virtualization, and bare metal server security. NSX can help in a variety of more advanced use cases, such as multi-location connectivity, disaster recovery, and cloud native applications, but here we focus on the core use cases to create a design framework.
The three deployment models are:
1. Distributed security model for virtualized workloads
2. Network virtualization with distributed security for virtualized workloads
3. Bare metal workloads security via gateway firewall (centralized security) or NSX bare
metal agent
In the distributed security model, NSX provides distributed security services such as distributed
firewall and distributed IDS/IPS. Still, the physical network retains a significant part of the
switching and routing responsibilities. The only NSX component participating in the networking
functionalities is the VMware VDS, responsible for switching the traffic between the VMs
running in the hypervisor and the physical uplinks. Once the packets are delivered to the
upstream physical switch, it is the responsibility of the physical fabric to deliver it to the
appropriate destination. This networking model is the most common in vSphere deployments
that are not running NSX, making the adoption of NSX distributed security entirely transparent
for the physical network in brownfield environments.
The networking and security model provides the full benefit of NSX network and security
virtualization. Distributed security services operate as in the security-only model, being
orthogonal to the adoption of the NSX network virtualization services. NSX network
virtualization decouples the physical network from the topology requirements of the
applications and services that run on top of it. The physical fabric can be designed and operated
based on a simpler and more stable model that does not require following the application's
lifecycle. The role of adapting to the ever-changing application requirements is offloaded to the
NSX network virtualization layer. At the same time, the physical fabric design can be centered
around maximizing throughput and high availability, minimizing latency, and optimizing
operational efficiency.

Overlay Networking Benefits

Area                    VLAN Only Networking                          NSX Overlay Networking

VM Mobility             VM mobility is limited to the span of a       VM mobility across layer 3 boundaries –
                        VLAN (usually one rack)                       any topology

Rack Availability       Achieving rack availability via vSphere HA    Easy to achieve rack availability via
                        is not possible without extending VLANs       vSphere HA for compute workloads
                        between racks

Resource Pooling        Resource pooling is limited by the VLAN       Heterogeneous islands of compute are
                        span                                          easily integrated

Infrastructure Agility  Slow application deployment, dependent on     The physical network is abstracted; no
                        physical network reconfigurations             changes are required to support the
                                                                      application lifecycle

Change Management       Network changes require reconfiguration of    Single point of management for the entire
                        multiple devices                              logical network; the network can embrace
                                                                      an Infrastructure as Code (IaC) approach,
                                                                      where network constructs follow the
                                                                      application lifecycle

Disaster Recovery       Disaster recovery requires re-IP              Easy disaster recovery without re-IP

L2 Scale                Fundamentally tied to MAC learning and L2     Programmed from vSphere, so no MAC
                        discovery over different types of underlay    learning; abstracted from the physical
                        technology (e.g., traditional L2, EVPN)       network, reducing leaf switch scale
                                                                      requirements

L2 Security             Physical fabric is directly exposed to L2     NSX overlay isolates the physical fabric
                        frames generated by the VMs (e.g., STP        from the layer 2 frames generated by the
                        BPDU)                                         endpoints

Support of Advanced     Tied to physical ASICs and limited by the     Software-defined, allowing the networking
Networking (NAT,        switch vendor                                 and security stack to scale
Multicast, IPv6, etc.)

Elastic BW              BW is tied to the VLAN/subnet on a rack       BW is scalable via the rapid deployment
                        switch uplink; tied to uplink                 of software gateways (160 gateways per
                        oversubscription per rack and to spine        NSX domain)
                        scale

Figure 7-2: Network Virtualization Benefits

The following sections list the physical network requirements for both deployment models and
provide considerations around specific network fabric designs.
The centralized security (or gateway firewall) model is a third, less common but emerging
deployment model. In this option, NSX Edges act as the perimeter firewall for the data center or
a branch location and protect non-virtualized workloads, operating as a next-generation firewall
appliance. Physical server security via the NSX bare metal agent is covered later in this document.

Role of overlay networking in the modern datacenter


Overlay networking allows the decoupling of network topology from the physical infrastructure.
The physical network only requires a basic set of VLANs and subnets common to all the vSphere
deployments: management, storage, and vMotion, plus an additional VLAN/subnet for the NSX
overlay traffic. Figure 7-3 below shows an example where four racks, representing four
independent pods of computing resources, are separated by layer-3 boundaries. Each rack has
an independent set of those subnets: management, vMotion, storage, and overlay (Geneve).
VLAN IDs are kept consistent across the racks for a more standardized configuration, but it is
not required. Figure 7-3 also shows an addressing schema, where the third octet of the IP
ranges represents the rack or compute pod. Again, this is an approach we recommend to
encourage standardization, but technically not required.
Once this basic network connectivity is in place, we can place VMs on overlay networks (VNIs)
spanning all the racks. No VLAN in the physical network is required to span between racks.
vSphere clusters can span multiple racks without impacting vSphere HA, vMotion, and DRS.
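The per-rack addressing convention described above is easy to capture programmatically. The following minimal Python sketch only illustrates the convention (the third octet of each infrastructure subnet encodes the rack number); the VLAN IDs and the 10.x supernets used here are illustrative assumptions, not values required by NSX.

```python
# Illustrative only: VLAN IDs and supernets are assumptions, not NSX requirements.
# The convention shown is the one described above: the third octet encodes the rack.
INFRA_NETWORKS = {"management": 10, "vmotion": 20, "storage": 30, "overlay": 40}

def rack_plan(rack_id: int) -> dict:
    """Return the infrastructure VLAN/subnet plan for one rack (compute pod)."""
    return {
        name: {"vlan": vlan, "subnet": f"10.{100 + vlan}.{rack_id}.0/24"}
        for name, vlan in INFRA_NETWORKS.items()
    }

for rack in (1, 2, 3, 4):
    print(f"rack {rack}:", rack_plan(rack))
```

Keeping the same VLAN IDs in every rack, as noted above, also keeps the switch configuration templates identical across racks.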
Modern data centers do not grow organically. Physical network and compute resources have
different lifecycles and growth patterns. Physical network atomic units of scale are usually racks
and the top of the rack switches in them. When the switches' ports or the rack's space are
depleted, a new rack with new top-of-rack switches is deployed. In a vSphere deployment, the
atomic units of scale are the hosts pooled in clusters often dedicated for different purposes. In
an ideal world, the growth patterns for computing and physical networks coincide. In reality, it
rarely happens.
Looking at Figure 7-3, we can see the erratic growth pattern for the different vSphere clusters
that required the three clusters to be striped non-uniformly across all four racks. If we
performed capacity planning correctly and the budget required to purchase those compute
resources was available on day 1, the distribution could have been more orderly. Each rack
could have been dedicated to a cluster, or maybe two for rack redundancy, to limit the span of
VM mobility. In the real world, different lines of business can fund their initiatives in different
phases, and IT operators must expand their clusters on demand leveraging the available
network resources.

Figure 7-3: Network Overlays for elastic and non-uniform compute requirements

The role of the physical network is to provide reliable high throughput connectivity between
the hosts regardless of their nature. Dedicating a set of physical switches or racks to specific
workloads (vSphere clusters) is not a valid approach in data centers required to reach cloud
scale. It represents a waste of resources when the racks and switches are underutilized, and it
imposes an upper boundary on workload scale. What happens when you need to add a host to a
cluster with a dedicated rack and the ports or the rack space are unavailable? NSX overlays
allow the different vSphere clusters to grow inorganically while preserving connectivity
and VM mobility.
So far, we have explored the benefits and flexibility that overlays bring to private clouds
requiring elastic compute and network connectivity properties. We can extend the model to
north-south bandwidth requirements and network services. NSX edge nodes are responsible for
scaling the software-defined network solution over those dimensions. Figure 7-4 shows how
edge hosts (ESXi hosts running NSX edge VMs) can be added to the design over time based on
changing requirements. The only additional requirement on the physical network is two
additional subnets/VLANs to establish BGP peering in each rack. As for the other infrastructure
VLANs, those are local and do not stretch across racks.
A single NSX deployment can horizontally scale up to 640 edge nodes. They can be distributed
over multiple racks to reduce the physical fabric oversubscription ratio between the leaf and
spine layers.

Figure 7-4: Network Overlays for elastic North-South Bandwidth and network services

Distributed security model, dvpgs vs NSX VLAN segments

Figure 7-5: Security only vs Network & Security cluster preparation

When planning a distributed security deployment without overlay, two options exist:
1. Prepare the vSphere hosts for NSX Security only, and leave the virtual machines
connected to the virtual distributed port-groups managed by vCenter
2. Prepare the vSphere hosts for NSX Networking & Security, and migrate the virtual
machines to NSX VLAN port-groups
Option 1 does not require migrating the VMs, and the entire networking configuration is
retained by vCenter. All virtual machines connected to dvpgs are automatically protected by
the NSX DFW. This is the most straightforward and recommended approach; however, as of
NSX 3.2 it does not allow the implementation of NSX overlays in the same cluster. If it is
expected to implement NSX overlays soon, option 2 should be considered. Migrating from a
security-only deployment to a network & security deployment requires uninstalling NSX from
the vSphere cluster. Option 1 is available with NSX 3.2 or later, vSphere 6.5 or later, and VDS 6.6
or later.
Option 2 requires placing the workloads on NSX-managed distributed virtual port-groups. In a
brownfield, we can create NSX segments with the same VLAN as the dvpgs where the VMs
reside and then migrate them, reconfiguring the vNIC with no disruption. VMs on dvpgs are
excluded by the NSX distributed firewall. With this option, we can adopt overlays at a later
stage. Option 2 requires NSX 3.0 or later, vSphere 7, and VDS 7.
The two deployment models can coexist within the same installation on a per cluster basis. A
vSphere cluster can be prepared for security only, another for network & security. This is
valuable for existing deployments where one can deploy NSX for security only while, in parallel,
building overlay-based clusters that provide a path for a future migration.

Type of connectivity: DVPG on VDS (connects guest VMs to VLANs)
  Transport Zone: VLAN TZ implicitly defined by the system
  DFW: Yes; cannot have NSX overlay segments in the same cluster
  Service Interface: VMs connected to a dvpg can have an SI as the default gateway. The SI is connected to an Edge VLAN segment.

Type of connectivity: ESXi NSX VLAN segment (connects guest VMs to VLANs)
  Transport Zone: ESXi VLAN transport zone (different than the one for NSX Edge TNs)
  DFW: Yes; can have NSX overlay segments in the same cluster
  Service Interface: VMs connected to an NSX VLAN segment can have an SI as the default gateway.

Type of connectivity: NSX Edge VM VLAN segment (routing peering or service interfaces)
  Transport Zone: Edge VLAN transport zone (different than the one for ESXi TNs)
  DFW: N/A; NSX Edge VMs are automatically excluded from DFW
  Service Interface: The SI is connected to an NSX Edge VLAN segment when it serves as the default gateway for VMs on dvpgs or VLAN segments. The VLAN ID must match. Dynamic routing over an SI is only supported in EVPN route server mode.

Type of connectivity: NSX overlay segment (connects guest VMs to NSX overlays)
  Transport Zone: Single NSX overlay transport zone common with the NSX edge nodes
  DFW: Yes
  Service Interface: VMs connected to an NSX overlay can have an SI as the default gateway. The SI is connected to the same overlay (not recommended; use a regular downlink interface).

Physical Infrastructure of the Data Center


An essential characteristic of NSX is that it is agnostic to the physical network configuration,
allowing for great flexibility in adapting to a variety of underlay fabrics and physical network
topologies.
NSX deployments can be generally categorized into two models: distributed security and
network and security.

Physical Network Requirements


The security-only and the network and security deployment models share the following basic
requirements for the physical network fabric.

Distributed Security and Network and Security deployment models:


• IP Connectivity – IP connectivity between all components of NSX and the compute hosts.
For the security-only deployment model, this includes only the ESXi host management
interfaces and the NSX Managers, as NSX edge nodes are not part of it. For a network and
security deployment model, IP connectivity should also extend to the edge node VMs or
bare metal edge servers.
• IP and MAC mobility across switches in the same rack – This is not NSX specific. It is
based on vSphere networking. When ESXi are connected to different switches, vSphere
networking assumes that moving the IP and MAC of VMs and VMK interface from one
uplink to another is transparent to the physical network. This is how vSphere teaming
policies protect against the failure of a pNIC or an upstream switch. If the two switches
where the ESXi is connected are not layer two adjacent, this mechanism cannot work.
Most network fabrics provide this capability; the exceptions consist in pure layer three
spine and leaf architectures. In such a scenario, the only option (not recommended) is to
single connect the servers and provide high availability at a different level.
Once the above requirements are met, an NSX security-only deployment is agnostic to any
physical fabric topology and configuration. NSX network virtualization has an additional MTU
requirement.

Network and Security deployment model:


• Jumbo Frame Support – The minimum required MTU is 1600. However, an MTU of 1700 bytes
is recommended to accommodate the full variety of functions and to future-proof the
environment for an expanding Geneve header. As the recommended MTU for
NSX is 9000, the underlay network should support this value. This value should be set
as the VDS MTU in vSphere and as the Tunnel Endpoint MTU setting in NSX.
• The VM MTU – The VM MTU should be set 100 bytes or more (200 preferred) lower than the
MTU of the physical fabric. Typical deployments carry a 1500-byte MTU for the guest VMs,
so no change is required for the VM MTU if the physical fabric has an MTU of 1700 bytes
or higher. To improve throughput, one can increase the MTU up to 8800 (a ballpark
number to accommodate bridging and future header expansion). Modern operating
systems support a TCP/IP stack with a feature called Path MTU (PMTU) discovery. PMTU
allows the discovery of the underlying network's capability to carry the message size, measured
by MSS (maximum segment size), for a given TCP session. Every TCP session will
continue to make this discovery periodically, which allows the maximum efficient payload
for a given session. This value should be set on every data center workload and should be
consistent with the Gateway Global MTU in NSX. In NSX, the global gateway MTU should
be 200 bytes lower than the setting for the Tunnel Endpoint MTU (i.e., 8800 vs 9000). In
case the ToR SVIs used for the dynamic routing peering to the NSX Tier-0 gateway have a
different MTU than the one specified in the NSX Gateway Global MTU setting, we
recommend adjusting the MTU of the individual Tier-0 external interfaces to match that
of the ToR SVI in order to minimize the risk of any issue affecting the dynamic routing
peering.
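The MTU relationships above can be summarized in a small validation helper. This is a minimal sketch based only on the rules stated in this section (1600-byte minimum, 100/200-byte guest margin, gateway global MTU 200 bytes below the TEP MTU); the sample values are the recommended ones, not the only valid combination.

```python
def validate_mtu_plan(fabric_mtu: int, tep_mtu: int, gateway_global_mtu: int, guest_vm_mtu: int) -> list:
    """Check an MTU plan against the guidance in this section; returns a list of findings."""
    issues = []
    if fabric_mtu < 1600:
        issues.append("Fabric MTU is below the 1600-byte minimum for Geneve (1700+ recommended).")
    if tep_mtu > fabric_mtu:
        issues.append("Tunnel Endpoint (VDS) MTU cannot exceed the physical fabric MTU.")
    if guest_vm_mtu > fabric_mtu - 100:
        issues.append("Guest VM MTU should be at least 100 bytes (200 preferred) below the fabric MTU.")
    if gateway_global_mtu > tep_mtu - 200:
        issues.append("Gateway global MTU should be 200 bytes below the Tunnel Endpoint MTU (e.g., 8800 vs 9000).")
    return issues or ["MTU plan is consistent with the guidance above."]

# Recommended values from this section: 9000 in the fabric and on the TEPs, 8800 elsewhere.
print(validate_mtu_plan(fabric_mtu=9000, tep_mtu=9000, gateway_global_mtu=8800, guest_vm_mtu=8800))
```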
Once the above requirements are met, an NSX network virtualization deployment is agnostic to
any underlay topology and configurations, specifically:
• Any physical topology – core/aggregation/access, leaf-spine, etc.
• On any switch from any physical switch vendor, including legacy switches.
• With any underlying technology. IP connectivity can be achieved over an end-to-end
layer 2 network as well as across a fully routed environment.
For an optimal design and operation of NSX, well-known baseline standards are applicable.
These standards include:
• Device availability (e.g., host, TOR, rack-level)
• TOR bandwidth - both host-to-TOR and TOR uplinks
• Fault and operational domain consistency (e.g., localized peering of Edge node to the
northbound network, separation of host compute domains, etc.)

Physical fabric considerations


This design guide uses a routed leaf-spine architecture for most examples. This model is a
superset of other network topologies and fabric configurations, so its concepts also apply to
layer two and non-leaf-spine topologies. This section will provide some high-level examples of
deployments and considerations for different types of network fabrics to clarify the nuances of
deploying NSX over different fabric topologies.

7.2.2.1 Layer 3 spine and leaf with layer2 adjacent ToRs.


In this topology, the spine switches are pure layer three devices and connect to the leaf
switches via routed point-to-point links. The leaf switches are deployed in pairs in each rack.
They share a common layer two domain over a direct link (most likely a redundant port-
channel) to satisfy the vSphere networking requirements for MAC and IP mobility within a
single rack. Leaf switches may or may not implement an MLAG proprietary solution. The L2/L3
demarcation point is at the leaf level.

The following considerations apply to this topology in an NSX security-only deployment (Figure 7-6):
• Different racks require different VLANs and network configurations. Rack 1 and 2
(yellow) host the compute clusters, rack 3 (green) hosts management workloads, rack 4
(blue) is dedicated to WAN connectivity, all with different VLAN requirements.
• While WAN and Management blocks are usually static, the compute rack switches may
require frequent configuration updates based on the applications lifecycle.
• Virtual machine mobility is not available across racks. This may limit the agility and
elasticity properties of the design and reduce the consolidation ratio in the data center.
• vSphere clusters cannot be striped across racks, which may limit high availability as the
infrastructure cannot protect against a rack failure.


Figure 7-6: NSX Security Only Deployment – L3 fabric

The following considerations apply to this topology in an NSX network and security
deployment (Figure 7-7):
• The ToR VLAN configuration is more consistent than in a security-only deployment, and it
is agnostic to the application lifecycle. Generally, three configuration templates must be
provided for the ToR switches, one for the compute block (yellow), one for the
management block (green), and one for the edge block (blue). Those configurations
templates tend to be static as they do not require modification when new NSX virtual
networks are deployed.
• Virtual machine mobility is available across racks. NSX overlay segments extend layer two
networks across the layer three boundaries of each rack.
• Compute clusters can be stretched across racks for rack level high availability.
• Resource pooling is streamlined as hosts in different racks (or datacenter rooms) can be
added to the same vSphere cluster.
• While some infrastructure management VMs such as the vRealize suite components can
be deployed on NSX overlay segments, NSX managers and vCenter should be deployed
on a VLAN network. This limits the span of the management vSphere cluster to a single
rack. The management VLAN must be extended across racks to provide rack high
availability. The NSX manager cluster appliances can be deployed in different IP subnets
to avoid extending layer two between racks, but no solution is currently available for the
vCenter server.

Figure 7-7: NSX Network and Security Deployment – L3 fabric

Figure 7-8 below outlines a sample VLAN and IP design for an NSX network and security
deployment over a layer three fabric. Because each rack represents a unique layer two domain,
we can use the same VLAN IDs across racks to improve consistency.
The compute racks generally require only four VLANs (ESXi Management, vMotion, Storage, and
NSX Overlay), and they can share the same VDS. Management racks may or may not be
prepared for NSX. When they are not, they do not require the NSX Overlay VLAN.
The management racks require a VM Management VLAN to place the management
infrastructure components such as the vCenter Server and NSX Manager. This VLAN is generally
separate from the ESXi management network as it may be stretched between racks to provide
rack high availability for the management VMs. The stretching of the VM management VLAN
requires layer two links or physical fabric overlays. The ESXi management VLAN never has this
requirement.
The hosts in the edge racks are never prepared for NSX (unless edge and compute blocks are
merged) and require two VLANs dedicated to layer three peering between the physical fabric
and the NSX edges (VMs or Bare Metal). Edge vSphere clusters rarely need to be stretched
across racks. The edge node VMs themselves can participate in the same NSX edge cluster (and,
for example, support the same T0 Gateway) even if they are deployed on different vSphere
clusters in different racks. Edges in different racks can use layer 3 peering VLANs local to each
rack, so those VLANs do not need to be stretched. The layer three peering subnets should
support ECMP to 8 edge nodes, so /28 subnets or larger are recommended.

Figure 7-8: Sample VLAN and IP Subnets schema for NSX Network Virtualization deployment on a Layer 3 fabric
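As a quick sanity check of the /28 sizing suggested above for the layer 3 peering VLANs, the following sketch counts the usable addresses in a /28. The sample subnet and the assumption of one IP per edge node plus a ToR SVI (and an optional gateway VIP) are illustrative, not prescriptive.

```python
import ipaddress

peering_subnet = ipaddress.ip_network("172.16.230.0/28")   # sample value, not prescriptive
usable = peering_subnet.num_addresses - 2                   # minus network and broadcast addresses
needed = 8 + 1 + 1                                          # up to 8 edge nodes + ToR SVI + optional VIP (assumption)
verdict = "fits" if usable >= needed else "does not fit"
print(f"{usable} usable addresses; {verdict} {needed} endpoints")
```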

7.2.2.2 Layer 2 fabrics


In a pure layer two fabric, all the connections between the spine and the leaf switches are layer
two links and carry all or a subset of the VLANs in the fabric. Redundant links between the spine
and leaf layers are handled by spanning tree, a proprietary implementation of multi-chassis link
aggregation protocol (MLAG), or less frequently a routed layer two protocol (e.g., Trill or Cisco
FabricPath). The type of layer two fabric does not impact the NSX design. The L2/L3
demarcation point is generally at the spine level, where the WAN or core network blocks are
usually connected.

The following considerations apply to this topology in an NSX security-only deployment (Figure 7-9):
• At minimum, VLANs are extended between the switches in the same rack, meeting the IP
and MAC mobility requirement of vSphere networking. VLANs can be extended across
racks, increasing VM mobility and allowing the possibility to stripe vSphere clusters
across racks for increased high availability.
• While possible, extending VLANs across multiple racks increases the size of a layer two
fault domain.
• In a dynamic environment requiring frequent provisioning of new networks to support
the applications lifecycle, the network administrator may need to perform frequent and
widespread changes to the physical network configuration.
• The network architect will be required to balance the above considerations. Maximum
flexibility in VM placement via a larger VLAN span will negatively impact the size of
failure domains and the overall manageability. A more segmented approach will instead
limit VM mobility and the overall elasticity of the design.

Figure 7-9: Security Only Deployment – L2 Fabric

The following considerations apply to this topology in an NSX network and security
deployment (Figure 7-10 and Figure 7-11):
• The network architect can limit the size of the layer two domains to a single rack by
pruning the VLANs appropriately. VM mobility is not impacted as NSX overlays extend
VM networks across layer three boundaries. This configuration makes the layer two
physical fabric properties very similar to those of a layer three fabric concerning the NSX
design.
• Management VMs VLAN can easily be extended between racks enabling the striping of
the management cluster.
• Small environments may benefit from a simpler VLAN/IP schema where the
infrastructure VLANs (management, vMotion, storage, overlay) are unique across the
fabric.
• The configuration of the physical devices becomes standardized and static. New
networks and topologies to support the applications lifecycle are provisioned in NSX
transparently to the physical fabric.


• Bare metal edge nodes, or the ESXi hosts where edge node VMs reside, should be
connected to layer three capable devices. It usually means the spine or aggregation layer
switches in a layer two fabric. When implementing such a design is not desirable,
dedicated layer 3 ToR switches may be deployed to interconnect the edge nodes.

Figure 7-10: Network and Security Deployment – L2 fabric – Edge Nodes connected to spine switches


Figure 7-11: Network and Security Deployment – L2 fabric – Edge Nodes connected to dedicated L3 ToRs

Figure 7-12 below outlines a sample VLAN/IP schema for a layer two fabric where VLANs have
been extended across the whole physical network. This design is appropriate for smaller
deployments. A large implementation may benefit from the segmentation of layer two
domains. In such cases, the VLAN/IP schema may approach the one outlined for the layer three
fabric example.

Figure 7-12: Sample VLAN and IP Subnets schema for NSX Network Virtualization deployment on a Layer 2 fabric

7.2.2.3 Layer 3 spine and leaf fabric with Overlay


A layer three spine and leaf fabric with overlay consists of a pure spine and leaf topology where
leaf switches are only connected to the spine layer via routed point-to-point links. No
connection exists between leaf switches in the same rack. On top of this routed topology, the
physical fabric implements an overlay (independent and transparent to vSphere and NSX),
making the same layer two segments available on any desired leaf switch. Examples of such
fabrics are EVPN based fabrics and Cisco ACI.


From an NSX and vSphere perspective, a layer three fabric with overlay is equivalent to a layer
two fabric. As such, similar considerations apply. For NSX security-only deployments (Figure 7-13), the points to emphasize are:
• VLANs can be extended across racks via the physical fabric overlays, increasing the span
of VM mobility and providing the possibility to stripe vSphere clusters across racks for
increased high availability.
• The span of the VLANs is less of a concern regarding fault domains size compared to
Layer 2 fabrics because network overlays are in use.
• In a dynamic environment requiring frequent provisioning of new networks to support
the applications lifecycle, the network administrator may need to perform frequent and
widespread changes to the physical network configuration.

Figure 7-13: Security Only Deployment – L3 fabric with Overlays

For NSX network and security deployments (Figure 7-14), the points to emphasize are:
• Larger physical fabrics can adopt a simplified VLAN/IP schema where unique VLANs are
common across the environment.
• NSX Geneve encapsulation will work over the physical fabric overlay encapsulation
without any problem as long as the virtual machines and physical fabric MTU account for
the additional headers.


• The configuration of the physical devices becomes standardized and static. New
networks and topologies to support the application lifecycle are provisioned in NSX
transparently to the physical fabric.
• Edge nodes are connected to edge leaf switches with access to the WAN network.
• Edge node VMs should not be deployed on clusters striped across racks as it would
require extending the L3 peering VLANs via the physical fabric overlay. Edge nodes
should peer to the leaf switches in the same rack on dedicated local VLANs. Edges in
different racks and vSphere clusters can be grouped in the same NSX edge cluster.

Figure 7-14: Network and Security Deployment – L3 fabric with Overlays

Figure 7-15 outlines a sample VLAN/IP schema for a layer three fabric with an overlay where
NSX network overlays are also in use. The number of IP segments is minimized as the physical
networks overlay allows each IP network to be available on every rack. Note the four physical to
virtual transit segments (VLAN 106-108) in the edge racks, two per rack. They are not extended
between racks and only provide connectivity to the local edge nodes.

Figure 7-15: Sample VLAN and IP Subnets schema for NSX Network Virtualization deployment on a Layer 3 fabric with overlays

NSX Infrastructure Component Connectivity

Overview
NSX Manager Appliances (bundling manager and controller functions) are mandatory NSX
infrastructure components. Their networking requirement is basic IP connectivity with the
other NSX components. The detailed requirements for these communications are listed at
https://ports.vmware.com/home/NSX-Data-Center.

NSX Manager Appliances are typically deployed on a hypervisor and connected to a VLAN
backed port group on a vSS or vDS; there is no need for the colocation of the three appliances
in the same subnet or VLAN. There are no dependencies on MTU or encapsulation
requirements as the NSX Manager appliances send management and control plane traffic over
the management VLAN only. The NSX Manager appliances can be deployed on ESXi hypervisors.
Figure 7-16 shows three ESXi hypervisors in the management rack hosting three NSX Manager
appliances.

Figure 7-16: ESXi Hypervisor in the Management Rack

The ESXi management hypervisors are configured with a VDS/VSS with a management port
group mapped to a management VLAN. The management port group is configured with two
uplinks using physical NICs “P1” and “P2” attached to different top of rack switches. The uplink
teaming policy has no impact on NSX Manager operation, so it can be based on existing
VSS/VDS policy.


In a typical deployment, the NSX management components should be deployed on a VLAN. This
is the recommended best practice. Deploying NSX management components on the software
defined overlay requires elaborate considerations and is thus beyond the scope of this document.

NSX Manager Node Availability and Hypervisor interaction


The NSX Management cluster represents a scale-out distributed system where each of the
three NSX Manager nodes is assigned a set of roles that define the type of tasks that the node
can implement. The cluster must have three nodes for regular operation; however, the cluster
can operate with reduced capacity in the event of a single node failure. The cluster requires a
majority of NSX Manager Nodes (i.e., two out of three) to be fully operational. At a minimum, it
is recommended to spread the deployment of the NSX Manager Nodes across separate
hypervisors to ensure that the failure of a single host does not cause the loss of a majority of
the cluster. More generally, it is recommended to spread the deployment of the NSX Manager
Nodes across separate failure domains to ensure that a single failure does not cause the loss of
a majority of the cluster. A failure domain at a minimum should address the failure of a single
host and may extend to a data store, vSphere cluster, or even cabinet or rack. NSX does not
natively enforce this design practice.
When a single vSphere-based management cluster is available, deploy the NSX Managers in the
same vSphere cluster and leverage the native vSphere Distributed Resource Scheduler (DRS)
and anti-affinity rules to avoid instantiating more than one NSX manager nodes on the same
ESXi server. For more information on how to create a VM-to-VM anti-affinity rule, refer to the
VMware documents on VM-to-VM and VM-to-host rules. For a vSphere-based design, it is
recommended to leverage vSphere HA functionality to ensure that a single NSX Manager node can
recover from the loss of a hypervisor. Furthermore, NSX Manager should be installed on
shared storage. vSphere HA requires shared storage so that VMs can be restarted on another
host if the original host fails.
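A hedged pyVmomi sketch of such a VM-to-VM anti-affinity rule is shown below. The vCenter address, credentials, cluster name, and NSX Manager VM names are assumptions for illustration; the same rule can be created through the vSphere Client or PowerCLI, which may be preferable operationally.

```python
# A minimal pyVmomi sketch, assuming illustrative vCenter, cluster, and VM names.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def find_obj(content, vimtype, name):
    """Return the first managed object of the given type with a matching name."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    try:
        return next(o for o in view.view if o.name == name)
    finally:
        view.Destroy()

ctx = ssl._create_unverified_context()                     # lab only; use proper certificates in production
si = SmartConnect(host="vcenter.corp.local", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ctx)
content = si.RetrieveContent()

cluster = find_obj(content, vim.ClusterComputeResource, "mgmt-cluster")
managers = [find_obj(content, vim.VirtualMachine, n)
            for n in ("nsx-mgr-01", "nsx-mgr-02", "nsx-mgr-03")]

# Non-mandatory ("should") rule: DRS keeps the managers apart whenever possible,
# but vSphere HA can still restart them on a shared host if capacity is scarce.
rule = vim.cluster.AntiAffinityRuleSpec(name="nsx-manager-separation",
                                        enabled=True, mandatory=False, vm=managers)
spec = vim.cluster.ConfigSpecEx(rulesSpec=[vim.cluster.RuleSpec(info=rule, operation="add")])
cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)
Disconnect(si)
```

A non-mandatory rule is used in this sketch so that vSphere HA can still restart a manager on a host already running another manager when no other capacity is available.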
Additional considerations apply for the management cluster with respect to storage availability and
IO consistency. A failure of a datastore should not trigger a loss of Manager Node majority, and
the IO access must not be oversubscribed such that it causes unpredictable latency where a
Manager node goes into read-only mode due to lack of write access. If a single VSAN datastore
is being used to host the NSX Manager cluster, additional steps should be taken to reduce the
probability of a complete datastore failure, some of which are addressed in Physical placement
considerations for the NSX Manager Nodes below. It is also recommended to reserve
resources in CPU and memory according to their respective requirements. Please refer to the
following links for details
NSX Manager Sizing and Requirements:
https://docs.vmware.com/en/VMware-NSX/4.1/installation/GUID-AECA2EE0-90FC-48C4-8EDB-66517ACFE415.html
NSX Manager Cluster Requirements with HA, Latency and Multi-site:
https://docs.vmware.com/en/VMware-NSX/4.1/installation/GUID-509E40D3-1CD5-4964-A612-0C9FA32AE3C0.html

Physical placement considerations for the NSX Manager Nodes


This section covers the NSX Manager cluster deployment options in relation to the underlying
vSphere infrastructure. The NSX Manager Nodes that are part of the NSX Management Cluster are not
required to exist in a single vSphere cluster or even under a single vCenter as long as the IP
connectivity requirements between the nodes are met. The most common placement of the NSX
Manager Nodes, however, is into a single vSphere cluster. More rarely,
when high availability requirements demand it, they are spread across three vSphere clusters to
increase availability.

7.3.3.1 Single vSphere Cluster for all NSX Manager Nodes in a Manager Cluster
When all NSX Manager Nodes are deployed into a single vSphere cluster, it is important to
design that cluster to meet the needs of the NSX Managers with at least the following:
1. At least four vSphere Hosts. This is to adhere to best practices around vSphere HA and
vSphere Dynamic Resource Scheduling (DRS) and allow all three NSX Manager Nodes to
remain available during proactive maintenance or a failure scenario.
2. The hosts should all have access to the same data stores hosting the NSX Manager
Nodes to enable DRS and vSphere HA.
3. Each NSX Manager Node should be deployed onto a different data store (this is
supported as a VMFS, NFS, or other data store technology supported by vSphere)
4. DRS Anti-Affinity rules should be put in place to prevent, whenever possible, two NSX
Manager VMs from running on the same host.
5. During lifecycle events of this cluster, each node should be independently put into
maintenance mode, moving any running NSX Manager Node off the host prior to
maintenance, or any NSX Manager Node should be manually moved to a different host.
6. If possible, rack-level failures and failures of critical infrastructure (e.g., power, HVAC, ToRs) should also be
taken into account to protect this cluster from any single failure event taking down the
entire cluster at once. In many cases, this means spreading this cluster across multiple
cabinets or racks and connecting the hosts in it to a diverse set of physical switches, etc.
Rack level redundancy requires extending the management VLAN between the racks as
the NSX Managers generally reside on a vCenter managed VLAN dvpg. Properly planned
rack-level redundancy includes a cluster striped across three racks and DRS rules placing
the NSX Managers on hosts in different racks because the failure of a cabinet with more
than one NSX manager causes the loss of the quorum.
7. NSX Manager backups should be configured and pointed to a location running outside of
the vSphere Cluster that the NSX Manager Nodes are deployed on.


Figure 7-17: Single vSphere Cluster 1:1 Data Store

7.3.3.2 Single vSphere Cluster when leveraging VSAN as the storage technology
When all NSX Manager Nodes are deployed into a single vSphere cluster where VSAN is the
storage technology in use, we should take additional steps to protect the NSX Manager
Cluster’s availability since VSAN will present only a single data store. A few VSAN-specific
parameters, including Primary Level of Failures to Tolerate (PFTT), Secondary Level of Failures
to Tolerate (SFTT), and Failure Tolerance Mode (FTM), govern the availability of the resources
deployed on the VSAN datastore (when only a single site is configured in VSAN, PFTT is set to 0
and SFTT is usually referred to as just FTT). The following configurations should be made to
improve the availability of the NSX Management and Control Planes:

• FTT >= 2
• FTM = Raid1
• Number of hosts in the VSAN cluster >= 5

This will dictate that for each object associated with the NSX Manager Node, two or more
copies of the data and a witness are available even if two failures occur, allowing the objects to
remain in a healthy state. This will accommodate both a maintenance event and an outage
occurring without impacting the integrity of the data on the datastore. This configuration
requires at least five (5) hosts in the vSphere Cluster.
Cabinet Level Failures can be accommodated as well, by distributing the cluster horizontally
across multiple cabinets. No more hosts than the number of failures that can be tolerated (a
maximum of FTT=3 is supported by VSAN) should exist in each rack. This means a minimum of
five (5) hosts spread across three (3) racks in a 2-2-1 pattern, and DRS rules preferentially
placing the NSX Managers in different racks.
Implementing VSAN Stretched Clusters to provide cabinet level protection is not
recommended. While appealing because of the reduced number of hosts required (2 hosts and
a witness VM in 3 different failure domains are the minimum requirements), the failure of a
rack may cause the loss of the NSX manager cluster quorum because two NSX manager
appliances are placed in the same rack.
It is strongly recommended that hosts, any time they are proactively removed from service,
vacate the storage and repopulate the objects on the remaining hosts in the VSAN cluster.
Providing a VSAN cluster with five hosts is the recommended approach, but in some situations
fewer hosts are available. VSAN 7 Update 2 introduced the Enhanced Data Durability
feature, which may reduce the risk of running the NSX Manager cluster with a
storage policy with FTT=1. With this feature, if a failure occurs when one of the nodes has been
taken out of service for maintenance, VSAN can rebuild the up-to-date data once the host is out
of maintenance mode.
Additional VSAN settings and best practices that we should consider are:
• Enable Operations reserve and Host rebuild reserve. This helps monitor the reserve
capacity threshold, generates alerts when the threshold is reached, and prevents further
provisioning.
• Ensure that the VM storage policy applied to the NSX Managers is compliant.
• Use the Data migration pre-check tool before carrying out host or cluster
maintenance.
• Review the vSAN Skyline Health checks regularly.

Figure 7-18: Single vSphere Cluster Single VSAN Data Store


7.3.3.3 Placement of NSX Manager Nodes Across Multiple vSphere Clusters


Depending on the design of the underlying infrastructure, it may become necessary or
advantageous to place the three NSX Manager Nodes across multiple vSphere Clusters. This
type of design tends to become relevant when a single vSphere cluster cannot provide the
resiliency or availability for all failure scenarios that are required to be addressed. This can
occur, for instance, if the physical infrastructure is preferred to be deployed across multiple
facilities or computer rooms or if storage or other underlying infrastructure is required to be
diverse. In this scenario:
• If the same IP space is not available to each vSphere cluster, NSX Managers may have IP
addresses in different IP subnets. In such a scenario, we cannot provide management
plane availability via a VIP, and an external load balancer may be part of the solution.
Section Deployment Options for NSX Management Cluster describes the management
plane's high availability options and their requirements.
• If VSAN is part of this design, we recommend avoiding HCI Mesh and use the local VSAN
datastore presented to each vSphere Cluster for the NSX Manager Node that is running
in it.

Figure 7-19: 3x vSphere Clusters Each in a Different Infrastructure Failure Domain


Deployment Options for NSX Management Cluster


Three NSX manager appliances form a cluster. The database is replicated to all the nodes in the
cluster, so there is a single database cluster instance. The benefit of clustering is that it provides
high availability of all services, including GUI and API access to the cluster. Combining the NSX
Manager, policy, and central controller reduces the number of appliances deployed and allows
greater flexibility with availability models.
NSX Manager appliance serves two critical areas. The first one is the consumption by external
systems and user access. Many different endpoints (such as HTTP, vRA, Terraform, Network
Container Plugin, custom automation modules) can consume NSX Manager northbound via an
IP address. It is the single point of entry for the entire system configuration. The second is the
communication with the other NSX components (controllers and transport nodes). The
controller communication to the transport node is done via the controller role (element) within
the NSX appliance node. We described the interaction between the controller role and the NSX
components in chapter 2 MANAGEMENT P LANE AND CONTROL P LANE . The controller availability
model is majority based, requiring at least two nodes to be available for normal operations.
The NSX Manager cluster can expose northbound API and GUI access via multiple availability
models. There are three different configuration modes available for northbound access to the
NSX Manager cluster.
• Default deployment (each node uniquely addressable, no common IP address)
• Cluster VIP based deployment
• External Load Balancer based deployment

7.3.4.1 Default Deployment


Deploying a 3-node cluster without any additional configuration is the default and the simplest
option with the least number of IP addresses consumed.
Different endpoints (users and automation systems) can access different nodes. Because the
configuration is always synchronized, the choice of the node does not matter. Each node is
accessible via a distinct IP address (or FQDN). The availability is driven by the external system by
choosing a different IP address (FQDN) in case of a node failure. For example, if vRA or an API
script uses the FQDN of node A and node A fails, there has to be some manual intervention to
restore the functionality. The same applies to GUI access via a browser. The NSX administrator
must repoint the browser to a different node. The options are to change the FQDN in your API
script or browser or update the FQDN entry in the DNS. Using round-robin DNS may result in
intermittent connectivity in a failure scenario. In terms of topology requirements, as long as
there is IP connectivity between all the nodes, this model will work.


7.3.4.2 Cluster VIP based deployment


The second deployment option is based on a simple active/standby redundancy model, in
which we configure a virtual IP address on the management cluster. The cluster VIP
configuration option provides node-level redundancy through a virtual IP address on the cluster
itself. This virtual IP (like VRRP/HSRP) provides redundancy in accessing the cluster via a single
FQDN. In other words: the entire northbound access to the NSX Manager is available via this
FQDN (or IP). Since a single FQDN name is available to all endpoints, all the GUI and API
requests are accessed through a single node owning the VIP. Essentially, it is a simple
availability model, where external intervention is not required in case of a node failure.
The cluster virtual IP address is an IP address floating between the cluster nodes. One of the
cluster nodes is assigned as the owner of the cluster VIP. In the case of failure, the system will
assign a new owner. Since the cluster VIP must remain the same, it assumes that other nodes
are available in the same subnet. The cluster VIP feature uses gratuitous ARP to update the
mac-address and the ARP table of the upstream network devices. Thus, it is mandatory to have
all NSX Managers in the cluster in the same subnet. From the physical topology perspective, the
placement of the NSX Manager nodes can vary. In a L2 physical fabric topology, the NSX
Manager appliances can be in the same or different racks. For a L3 topology, all nodes must be
in the same rack, assuming that VLAN/subnets are confined to a rack.


Figure 7-20: NSX Manager Appliances Availability with Cluster VIP

The NSX Manager availability has improved compared to the previous option; however, it is
important to clarify the distinction between node availability and load-balancing. In the case of
cluster VIP, all the API and GUI requests go to one node, and it's not possible to achieve load-
balancing of GUI and API sessions. In addition, the sessions that were established on a failed
node will need to be re-authenticated and re-established at the new owner of the cluster VIP.
The cluster VIP model is also designed to address the failure of certain NSX Manager services
but cannot guarantee recovery in certain corner-case service failures.
It is important to emphasize that the northbound clients' connectivity happens via the cluster
VIP. Still, the communication between the NSX Manager nodes and the NSX transport node
components happens on the individual node IPs.
The cluster VIP is the preferred and recommended option for high availability of the NSX
Manager appliance nodes.
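For reference, the cluster VIP can be configured through the NSX API as well as the UI. The sketch below is a hedged example using Python and the requests library; the manager FQDN, credentials, and VIP value are assumptions, and the endpoint should be verified against the NSX API guide for the release in use.

```python
# A hedged sketch, assuming illustrative manager FQDN, credentials, and VIP address.
import requests

NSX = "https://nsx-mgr-01.corp.local"          # any one manager node (assumption)
AUTH = ("admin", "VMware1!VMware1!")            # assumption; use a secrets store in practice
VIP = "10.1.1.10"                               # must be in the subnet shared by the managers

resp = requests.post(
    f"{NSX}/api/v1/cluster/api-virtual-ip",
    params={"action": "set_virtual_ip", "ip_address": VIP},
    auth=AUTH,
    verify=False,                               # lab only; use proper CA trust in production
)
resp.raise_for_status()
print(resp.json())                              # should echo back the configured virtual IP
```

As explained above, the VIP must belong to the subnet shared by the three NSX Manager appliances.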

7.3.4.3 External Load Balancer based deployment


The third deployment option is to keep the same configuration as the first option but add an
external load balancer. A VIP on the load balancer will represent the manager nodes as the
physical servers in the server pool. Then the UI and API access to the management cluster will
go through the VIP on the load balancer. The advantage of this option is that the endpoint
access to the NSX Manager nodes will have load-balancing, and the cluster management access
is highly available via a single IP address. The external load-balancer option also makes the
deployment of NSX manager nodes independent of underlying physical topology (agnostic to L2
or L3 topology).
Additionally, one can distribute nodes to more than one rack to achieve rack redundancy (in L2
topology, it is the same subnet but a different rack, while in L3 topology, it will be a distinct
subnet per rack). The downside of this option is that it requires an external load balancer and
additional configuration complexity based on the load-balancer model. The make and model of
load balancer are left to the user's preference; however, one can also use the NSX native load
balancer included as part of the system and managed by the same cluster of NSX Manager
nodes that are load-balanced, or the standalone NSX Advanced Load Balancer (formerly Avi
Networks) offered as a part of the NSX Data Center portfolio.

Figure 7-21: NSX Manager Appliances with External Load Balancer

Figure 7-21 presents a scenario where the NSX manager nodes are load-balanced by an
external load balancer, and session persistency is based on the source IP of the client. With this
configuration, the load balancer will redirect all requests from the same client to the same NSX
manager node.
NSX Manager can authenticate clients in four ways: HTTP basic authentication, client
certificate authentication, vIDM, and session-based authentication. API-based clients can use all four forms of
authentication, while web browsers use session-based authentication. Session-based
authentication typically requires an LB persistence configuration, while API-based access does not
mandate that. Figure 7-21 represents a VIP with an LB persistence configuration for both browser
(GUI) and API-based access.

While one can conceive an advanced load-balancing schema with a dedicated VIP with LB
persistence for browser access and another VIP without LB persistence for API access, this option
may have limited value in terms of scale and performance differentiation while complicating
access to the system. For this reason, it is highly recommended to first adopt the basic option of
LB persistence based on source IP with a single VIP for all types of access. The overall
recommendation is to start with a cluster VIP and move to an external LB only if real needs exist.

7.3.4.4 Singleton NSX Manager


The resources required to run a cluster of three NSX Managers may represent a challenge in
small environments. Starting with NSX version 3.1, VMware supports deploying a single NSX
manager in production environments. This minimal deployment model relies on vSphere HA
and the backup and restore procedure to maintain an adequate level of high availability.
vSphere HA will protect against the failure of the physical host where the NSX manager is
running. vSphere HA will restart NSX Manager on a different available host. Enough resources
must be available on the surviving hosts; vSphere HA admission control can help ensure they
are available in case of failure.
Backup and restore procedures help in case of failure of the NSX manager itself. The SFTP
server where the backup is stored should not be placed on an infrastructure shared by the
single NSX Manager node.


Compute Cluster Design


This section covers ESXi compute hypervisors. Compute hypervisors host the application VMs in
the data center.
In a typical enterprise design where both NSX network and security virtualization are adopted,
compute hosts will carry at least two kinds of traffic, typically on different VLANs –
management and overlay. The overlay VLAN will carry encapsulated traffic for all the
application networks, minimizing the number of VLANs defined on the physical fabric. Because
overlay traffic is involved, the uplinks are subject to the MTU requirement mentioned earlier in
section Physical Network Requirements. Security-only designs are not impacted by the MTU
requirement but will generally include multiple workload VLANs based on the application
requirements.
Additionally, based on the type of hypervisor, compute hosts may carry additional types of
infrastructure traffic such as storage (VSAN, NFS, and iSCSI) and vMotion. The ESXi hypervisor
defines specific VMkernel interfaces for this infrastructure traffic, typically connected to
separate VLANs.
NSX offers choices for the management of infrastructure and guest VM traffic (overlay or VLAN)
through the flexibility of uplink profiles and teaming policies as described in chapter 3 (Uplink
Profile). This chapter presents configuration choices and capabilities based on the
requirements and best practices prevalent in existing data-center deployments. Compute hosts
network requirements are centered around availability, performance, and manageability.
Availability is generally the most important design quality and in most cases it is achieved by
connecting the hosts to two different switches and ensuring that traffic is redirected on a
different path in case of a failure. Performance and manageability may lead the design in
different directions based on specific requirements and the network architect preference. In
this context two different design patterns emerge:
Optimization across all available physical NICs– With this choice, all traffic types share all
available pNICs. The assumption is that by making all pNICs available to all traffic, we can avoid
traffic hot spots during a peak/burst, and the probability of contention is reduced due to the
number of links and speed offered. This type of traffic management is typically suitable with 25
Gbps or greater speed links. In the case of lower-speed pNIC, it may be necessary to enable
traffic management tools such as NIOC to provide adequate bandwidth to critical traffic. One
example is VSAN traffic. The teaming type offered in NSX that enables this behavior is called
“Load Balanced Source Teaming.”
Deterministic traffic per pNIC– A certain traffic type is only carried on a specific pNIC, thus
allowing dedicated bandwidth for a given traffic type. Additionally, it allows deterministic
failure as one or more links could be in standby mode. Based on the number of pNICs, one can
design a traffic management schema that avoids the contention of two high bandwidth flows
(e.g., VSAN and vMotion). The teaming type offered in NSX that enables this behavior is called
“Failover Order.” We can build a teaming failover policy with a single pNIC or more. When a
single pNIC is used, its failure is not covered by a standby pNIC. Such a scenario is not common,
but it can be an appropriate choice when other high availability mechanisms like vSphere HA or
application-level redundancy are in place.
A scenario where we may leverage deterministic traffic per pNIC is in the case of multiple
distinct physical networks, e.g., DMZ, Storage, or Backup Networks, where the physical
underlay differs based on pNIC.
Designing a traffic management schema that utilizes both design patterns is possible. This
design guide covers both scenarios based on specific requirements and makes generalized
recommendations.

ESXi-Based Compute Hypervisor with two pNICs


This section provides recommendations for deployments of computing hypervisors with the
following parameters:
• Two pNICs
• All host traffic (VM and infrastructure traffic) shares the same two pNICs
• Each host traffic type has a dedicated IP subnet and VLAN
We will cover the configuration options for two kinds of designs in the next sections, one that
enforces deterministic traffic per pNIC and a second that optimizes link utilization by leveraging
all links for all types of traffic.


7.4.1.1 Deterministic Traffic per pNIC

Figure 7-22: ESXi Compute Rack Failover Order Teaming with One Teaming Policy

In Figure 7-22, a single VDS is used in a two-pNIC design and carries both
infrastructure and VM traffic. Physical NICs “P1” and “P2” are attached to the different top of
rack switches. The teaming policy selected is failover order active/standby; “Uplink1” is active
while “Uplink2” is standby.
When the virtual switch is a VDS with NSX, only the NSX DVPG traffic will follow the NSX
teaming policy. The infrastructure traffic will follow the teaming policy defined in their
respective VDS DVPGs configured in vCenter.
The top-of-rack switches are configured with a first-hop redundancy protocol (e.g., HSRP or
VRRP), providing the active default gateway for all the VLANs on “ToR-Left .” The VMs are
attached to overlay or VLAN segments defined in NSX, and they will follow the default teaming
policy regardless of the type of segment they are connected to. With the use of a single
teaming policy, the above design allows for a simple configuration of the physical infrastructure
and simple traffic management at the expense of leaving an uplink completely unused.
It is, however, easy to load balance traffic across the two uplinks while maintaining the
deterministic nature of the traffic distribution. The following example shows the same hosts,
this time configured with two separate failover order teaming policies: one with P1 active, P2
standby, and the other with P1 standby, P2 active. Then, individual traffic types can be assigned
a preferred path by mapping it to either teaming policy.

Figure 7-23: ESXi Compute Rack Failover Order Teaming with two Teaming Policies

In Figure 7-23, storage and vMotion traffic follow a teaming policy setting P1 as primary active,
while management and VM traffic are following a teaming policy setting P2 as primary active.
When the virtual switch is a VDS with NSX, the above design is achieved with a default teaming
policy P2 active & P1 standby. Then, the DVPGs for infrastructure traffic need to be configured
individually: storage and vMotion will have a failover order teaming policy setting P1 active &
P2 standby, while the management DVPG will be configured for P2 active & P1 standby.
The ToR switches are configured with a first-hop redundancy protocol (FHRP), providing an
active default gateway for storage and vMotion traffic on “ToR-Left,” management, and overlay
traffic on “ToR-Right” to limit interlink usage. Multiple teaming policies allow the utilization of
all available pNICs while maintaining deterministic traffic management. This is the
recommended approach when adopting a deterministic traffic-per-pNIC design.
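As an illustration, the sketch below creates an uplink profile whose default teaming policy is failover order with P2 active and P1 standby, matching Figure 7-23, through the NSX manager API (a Policy API equivalent exists under /policy/api/v1/infra/host-switch-profiles). The manager FQDN, credentials, profile name, and uplink labels are assumptions; the teaming for the infrastructure DVPGs (e.g., storage and vMotion preferring P1) is still configured separately on the VDS in vCenter, as described above.

```python
# A hedged sketch, assuming illustrative manager FQDN, credentials, and uplink names.
import requests

NSX = "https://nsx-mgr-01.corp.local"      # assumption
AUTH = ("admin", "VMware1!VMware1!")        # assumption; use a secrets store in practice

profile = {
    "resource_type": "UplinkHostSwitchProfile",
    "display_name": "compute-failover-order-p2-active",
    "teaming": {
        "policy": "FAILOVER_ORDER",
        "active_list": [{"uplink_name": "uplink-2", "uplink_type": "PNIC"}],
        "standby_list": [{"uplink_name": "uplink-1", "uplink_type": "PNIC"}],
    },
}

resp = requests.post(f"{NSX}/api/v1/host-switch-profiles", json=profile,
                     auth=AUTH, verify=False)   # lab only; use proper CA trust in production
resp.raise_for_status()
print(resp.json()["id"])                        # identifier of the newly created uplink profile
```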

7.4.1.2 Optimization of traffic across all the available physical NICs


NSX supports two source-based teaming policies providing load balancing: load balancing based on
the source port and load balancing based on the source MAC address. They are the exact
equivalents of the corresponding teaming policies available in vSphere.

Figure 7-24 shows a two-pNIC design as in the previous example, with load balance based on
source port teaming policy. Notice that NSX instantiates one TEP on every uplink specified in
the teaming policy configuration to be able to send overlay traffic on all the corresponding
physical ports. With this kind of policy, potentially both uplinks are utilized based on the hash
value generated from the source port originating the traffic or the source MAC address of the
traffic. In case of a pNIC failure, the virtual switch teaming policy moves the corresponding TEP
interface to a still operational uplink. That means that in a failure scenario multiple TEPs are still
operational but send and receive traffic over the same physical link.
Load balancing based on source port is generally sufficient and recommended over a MAC-
based hash because each VM usually generates traffic from a single MAC address. Both
infrastructure and guest VM traffic benefit from this policy, allowing the use of all available
uplinks on the host. Infrastructure traffic is generally handled by different VMK interfaces
dedicated to each traffic type, making it possible for the hashing algorithm to place different
kinds of traffic on different links. Each traffic type will flow over a single link, though. vSphere
features such as multi-nic vMotion can address the requirement for load balancing the same
kind of traffic across different physical NICs.

Figure 7-24: ESXi Compute Rack Load Balanced Source Teaming

Additionally, one can utilize a mix of the different teaming policy types such that infrastructure traffic (vSAN, vMotion, management) leverages "failover order" for deterministic bandwidth and failover, while VM traffic uses a "load balance source" teaming policy, spreading VM traffic across both pNICs.


On a host running VDS with NSX, overlay traffic will follow the default teaming policy defined in
NSX. Infrastructure traffic will follow the individual teaming policies configured on their DVPGs
in vCenter.
Based on the requirement of underlying applications and preference, one can select a type of
traffic management as desired for the infrastructure traffic. However, for the overlay traffic, it
is highly recommended to use one of the “load balanced source” teaming policies as they’re the
only ones allowing overlay traffic on multiple active uplinks and thus provide better throughput
in/out of the host for VM traffic.

7.4.1.3 LAG based traffic distribution


An alternate method to distribute traffic over multiple pNICs is by implementing LAG. This
scenario requires the ESXi hosts to be connected to separate ToRs forming a single logical LAG
based on a proprietary multi-chassis link aggregation solution (MLAG, VPC, etc.) to provide ToR
redundancy. This introduces troubleshooting complexity and potential support coordination
challenges across multiple vendors. Relying on this kind of technology is not recommended
because it's vendor dependent and typically involves significant vendor-specific limitations that
VMware has not validated. A fundamental principle of the private cloud is the decoupling of the physical network from the NSX SDN topology, allowing a flexible lifecycle in both domains; the use of LAG from the host is therefore strongly discouraged and should be avoided whenever possible.
However, existing deployments may carry this type of teaming, and the operations team may have the knowledge and have accepted the risk of supporting LAG-based teaming. In such situations, LAG is an acceptable solution for compute-only hosts.
If the compute hosts carry edge VMs for North-South traffic, and the topology requires peering
over LAG, a change of the design is highly recommended, as support of the routing peer over
the LAG (specifically MLAG, LAG to the same switch is acceptable) is not recommended and not
validated by VMware. One option is to decouple the edge and compute hosts in a separate
vSphere cluster and not implement LAG on the hosts where the edge node resides. When a
separate vSphere edge cluster is not desirable, providing additional pNICs not part of the LAG
for the edge VMs L3 peering is the most preferred option.

ESXi-Based Compute Hypervisor with Four (or more) pNICs


In most cases, a host with four or more pNICs can be configured the same way as a 2-pNIC host.
There are just more uplinks available for consumption by the different teaming policies. We will
focus on a few scenarios where four pNICs or more are necessary.

7.4.2.1 Simple install or traffic on separate uplinks for policy reason


With a minimum of four pNICs, an ESXi host can run two virtual switches, each with redundant
uplinks. This type of installation has the following benefits:


• The 4-pNIC design provides a non-disruptive and straightforward installation of NSX on existing ESXi hypervisors. We can deploy an N-VDS (or a new VDS prepared for NSX) with two dedicated uplinks for VM traffic while leaving an existing VSS/VDS running on two separate uplinks to handle the infrastructure traffic.
• On top of the benefits from this simple NSX installation, the model also provides a strict
separation between VM and infrastructure traffic. There are scenarios where this
separation is mandated by policy. The fact that NSX is deployed on its dedicated virtual
switch ensures that no misconfiguration could ever lead to VM traffic being sent on the
uplinks dedicated to infrastructure traffic.
The following FIGURE 7-25 presents an example of a host with 4 uplinks and two virtual
switches:
• A VDS is dedicated to infrastructure traffic.
• A second NSX prepared VDS handles the VM traffic.

Figure 7-25: ESXi Compute Rack 4 pNICs – VDS and NSX virtual switch

The VDS is configured with pNICs "P1" and "P2". Each port group is configured with different pNICs in active/standby to use both pNICs. However, the choice of teaming mode on
VDS is left to the user based on the considerations we outlined in the two pNIC design section.
The virtual switch running NSX owns pNICs “P3” and “P4”. To leverage both pNICs, the VDS is
configured in load balance source teaming mode. Each type of host traffic has dedicated IP
subnets and VLANs.


Because we can deploy NSX directly on a VDS, we can easily achieve the same functionality with
a single VDS running NSX and owning the four uplinks, as represented below.

Figure 7-26: ESXi Compute Rack 4 pNICs – VDS and NSX virtual switch

You can achieve this configuration by simply mapping the four pNICs of the host to 4 uplinks on
the VDS. Then, install NSX using an uplink profile that maps its uplinks (Uplink1/Uplink2 in the
diagram) to the VDS uplinks P3 and P4. This model still benefits from the simple, non-disruptive
installation of NSX. At the same time, the NSX component can only work with the two uplinks
mapped to the uplink profile. This means that VM traffic can never flow over P1/P2.
The final added benefit of this model is that the administrator can manage a single virtual
switch and has the flexibility of adding additional uplinks for overlay or infrastructure traffic.

7.4.2.2 Multiple virtual switches as a requirement


Specific scenarios still call for multiple virtual switches on a host. For example:
• Compliance-driven topologies, e.g., PCI, HIPAA, etc., often necessitate separate and
dedicated infrastructure components (e.g., pNIC, operational controls, etc.)
• Some cloud provider models require internal or external-facing infrastructure on
separate virtual switches.


• There is a specific use case for NFV (Network Function Virtualization) where two pNICs
are dedicated to a standard virtual switch for overlay and the other two pNICs for an
“enhanced datapath” virtual switch. The “enhanced mode” is not discussed here. Please
refer to VMware NFV documentation.
In the following scenario, FIGURE 7-27 shows that each virtual switch is built to serve specific
topology or provide traffic separation based on enterprise requirements. The segments cannot
extend between two virtual switches as each is tied to a different transport zone.

Figure 7-27: ESXi Compute Rack 4 pNICs with to two VDSs prepared for NSX

Below, we list a series of use cases and configurations where the dual VDS design is relevant. In
any of those cases, the infrastructure traffic will be carried on the first virtual switch. Here are
some examples:
• The first two pNICs are exclusively used for infrastructure traffic, and the remaining two pNICs are for overlay VM traffic. This allows dedicated bandwidth for overlay application traffic. One can select the appropriate teaming mode as discussed in the two-pNIC design section (ESXI-BASED COMPUTE HYPERVISOR WITH TWO PNICS).
• The first two pNICs are dedicated to "VLAN only" micro-segmentation, and the other two are dedicated to overlay traffic.
• Building multiple overlays for separation of traffic. Starting with NSX 3.1, a host can have
virtual switches part of different overlay transport zones and the TEPs on each virtual
switch can be on different VLAN/IP subnets (still, all the TEPs for an individual switch
must be part of the same subnet). When planning such a configuration, it is important to remember that while hosts can be part of different overlay transport zones, edge nodes
cannot. The implication is that when implementing multiple overlay transport zones,
multiple edge clusters are required, one per overlay transport zones as a minimum.
• Building regulatory compliant domains with VLAN only or overlay.
• Building traditional DMZ type isolation.
The two virtual switches running NSX must attach to different transport zones. See details in section SEGMENTS AND TRANSPORT ZONES.
=====================================================
Note: The two virtual switches can use different datapath modes. See chapter 9 for a description of the available modes.
=====================================================


Edge Node and Services Design


Edge nodes are available in two form factors – VM and bare metal server. While both form
factors offer the same functionality, their physical infrastructure connectivity is quite different; however, they share common requirements around the different types of IP networks that must be provided:
• Management – Accessing and controlling the Edge node
• Overlay (TEP) - Creating tunnels with peer transport nodes
• External (N-S) - Peering with the physical networking infrastructure to provide
connectivity between the NSX virtual components and the external network
Edge nodes provide a pool of capacity for running centralized services in NSX. Edge nodes are
always active in the context of connectivity and control plane. They host Tier-0 and Tier-1
routing services, installed in either active/active or active/standby mode. If a Tier-0 or Tier-1
router enables stateful services, it can only be deployed in active/standby or active/active
stateful mode. The status of active or standby mode is within the context of a gateway data
plane forwarding rather than related to the Edge node itself. The Edge node connectivity
options discussed in this section are independent of type of services enabled on any given
node.
The design considerations for Edge nodes have changed since NSX 2.5, with two critical areas of enhancement:
• Multi-TEP support for Edge – Just like an ESXi transport node, an Edge node can support multiple TEPs, each associated with a different uplink, providing the following advantages:
o Removes the topology restriction with bare metal (straight-through LAG)
o Allows the use of multiple pNICs for the overlay traffic in both the bare metal and VM form factors
o An Edge VM supporting multiple TEPs can have two uplinks from the same N-VDS, allowing utilization of both pNICs and efficient load sharing from host to Edge VM
• Multiple teaming policies per N-VDS – default and named teaming policies (see TEAMING POLICY)
o Allows a specific uplink to be designated or pinned for a given VLAN
o Allows uplinks to be active/standby or active-only to drive a specific behavior for a given traffic type while co-existing with other traffic types following entirely different paths on the same N-VDS


• Normalization of N-VDS configuration – All edge node form factor deployments can use a single N-VDS, like a host hypervisor: a single teaming policy for overlay (Load Balanced Source) and multiple named teaming policies for N-S peering.

This section is divided into the following major areas:

• Bridge use case


• Layer 3 peering use case
• Stateful Services use case
• Gateway Firewall for VLAN networks use case
• Edge services and resources design considerations

Bridging Use Case


The chapter 3 section B RIDGING O VERLAY TO VLAN WITH THE EDGE BRIDGE covers the basic
functionality and availability model requirements. The next section covers the bridging design
guidelines.
The Edge Bridge is available on both Edge form factors - bare metal or Virtual Machine (VM).
The use of the Bridge in the bare metal form factor is relatively straightforward: the bridged
traffic is sent on the uplinks of the N-VDS selected by VLAN transport zone specified on the
Bridge Profile. There is no Bridge-specific configuration necessary on the physical infrastructure
where the bare metal Edge attaches.

7.5.1.1 Bridging Design Considerations for VM Edge

7.5.1.1.1 Edge VM vNIC Configuration Requirement with Bridging


For the VM form factor, it is important to remember that the Edge Bridge will end up sourcing traffic from several different MAC addresses on its VLAN vNIC. This means that the uplink vNIC must be connected to a DVPG port group allowing:
• Forged transmit

• MAC learning or promiscuous mode


Neither capability is supported on a VSS, while both are supported on a VDS. It is therefore strongly recommended to use a VDS when deploying Edge nodes. MAC learning is available on the VDS as of vSphere 6.7. However, there is no vCenter GUI capability to configure MAC learning as of this writing; it can be enabled via the API or PowerCLI (see https://www.virtuallyghetto.com/2018/04/native-mac-learning-in-vsphere-6-7-removes-the-need-for-promiscuous-mode-for-nested-esxi.html).


If the deployment is running vSphere 6.5, where MAC learning is not available, the only other way to run bridging is to enable promiscuous mode. Promiscuous mode should typically not be enabled system wide. Thus, either enable promiscuous mode only for the DVPG associated with the bridge vNIC, or consider dedicating an Edge VM to the bridged traffic so that other kinds of traffic to/from the Edge do not suffer from the performance impact related to promiscuous mode.
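As a hedged illustration of the API route, the sketch below uses pyVmomi against a vSphere 6.7+ VDS portgroup; the class and property names reflect the vSphere API as understood here and should be validated against your release, and the portgroup lookup and session handling are omitted.

# Hedged pyVmomi sketch: enable MAC learning and forged transmit on an existing
# distributed portgroup object "pg" (vim.dvs.DistributedVirtualPortgroup)
# obtained elsewhere. Validate class/property names against your vSphere release.
from pyVmomi import vim

def enable_mac_learning(pg):
    mac_learning = vim.dvs.VmwareDistributedVirtualSwitch.MacLearningPolicy(
        inherited=False, enabled=True, allowUnicastFlooding=True, limit=4096)
    mac_mgmt = vim.dvs.VmwareDistributedVirtualSwitch.MacManagementPolicy(
        inherited=False, forgedTransmits=True, macLearningPolicy=mac_learning)
    port_cfg = vim.dvs.VmwareDistributedVirtualSwitch.VmwarePortConfigPolicy(
        macManagementPolicy=mac_mgmt)
    spec = vim.dvs.DistributedVirtualPortgroup.ConfigSpec(
        configVersion=pg.config.configVersion, defaultPortConfig=port_cfg)
    # Returns a vSphere task; wait for completion with your preferred helper.
    return pg.ReconfigureDVPortgroup_Task(spec)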

7.5.1.1.2 Edge VM: Virtual Guest Tagging


The Edge Bridge will be sending bridged traffic with an 802.1Q tag on its VLAN uplink. That
means that this Edge VM vNIC will have to be attached to a port group configured for Virtual
Guest Tagging (VGT, i.e., the DVPG shows as VLAN Trunk in the vCenter UI). Refer to VLAN TAG REQUIREMENTS for more information.

7.5.1.1.3 Edge VM dedicated to Bridging example


The following FIGURE 7-28 represents an Edge VM dedicated to bridging and following the rules
enunciated earlier in this section. The Edge VM has four vNICs, but this design only uses 3:
• vNIC1 is dedicated to management traffic
• vNIC2 is the uplink of N-VDS1, the N-VDS that will be used for overlay traffic. On the ESXi host, it is attached to a Transport DVPG configured active/standby over both pNICs of the host for redundancy.
• vNIC3 is the uplink of N-VDS2 that is attached to the VLAN transport zone “N-VDS2”
where the bridged traffic will be sent. On the ESXi host, the “Bridge VLAN” DVPG has the
following configuration:
o Virtual Guest Tagging is enabled so that it is possible to bridge to several
segments to different VLAN IDs
o Forged transmit and MAC learning, so that the bridge can send traffic sourced from different MAC addresses. If MAC learning is not possible, promiscuous mode can be configured instead at the expense of degraded performance.
o Active standby teaming policy leveraging the same pNICs (but not necessarily in
the same order) as the overlay DVPG. That last point is important and will be
justified in the next part.


Figure 7-28: Dedicated Edge VM for Bridging

7.5.1.1.4 Edge VM: Edge uplink protection


As we have seen, the Edge Bridge sends/receives two kinds of traffic on its uplinks: overlay
traffic and VLAN traffic. This part discusses how to protect both against failure in the data path
on the host.
The Edge HA mechanism is exchanging BFD hellos over their TEP interfaces between the
different Edges in the Edge Cluster. As a result, overlay traffic is protected against failure in the
data path. In FIGURE 7-28 above, if both P1 and P2 went down on the host, all the tunnels
between this Edge VM and its peers would go down. As a result, this Edge VM would be
considered as failed by Edge HA and another Edge would take over the services it was running
(including, but not limited to, the bridge service.)
Let us now focus on the VLAN uplink redundancy. There is no keepalive mechanism protecting
the VLAN traffic and the edge VM does not detect when the physical uplinks are going down.
The "Bridge VLAN" dvPortGroup has a teaming policy that protects against a single physical
uplink of the host going down. Should both physical uplinks go down on the host, then overlay
traffic would be affected, the edge itself would be considered as down and the standby edge
would takeover. In a way, the VLAN traffic benefits from the keepalive mechanism run on the
overlay. For this to happen, it is important to ensure that the VLAN traffic can use all the
physical uplinks where overlay traffic is forwarded.

7.5.1.2 Bridging Design Considerations for both VM and Bare Metal

7.5.1.2.1 Redundant VLAN connectivity


The Edge Bridge HA mechanism does not protect against connectivity problem in the VLAN
infrastructure beyond the Edge physical uplink.


Figure 7-29: Physical bridging infrastructure must be redundant

In the above scenario, the failure of the uplink of Edge 1 to physical switch S1 would trigger an
Edge Bridge convergence where the Bridge on Edge 2 would become active. However, the
failure of the path between physical switches S1 and S3 (as represented in the diagram) would
have no impact on the Edge Bridge HA and would have to be recovered in the VLAN L2 domain
itself. Here, we need to make sure that the alternate path S1-S2-S3 would become active
thanks to some L2 control protocol in the bridged physical infrastructure.

7.5.1.2.2 Preemptive vs. non-preemptive


The choice is between precise bandwidth allocation on the uplinks and minimum disruption. It also lets one decide the placement of the bridge for capacity control of either the host or the Edge node.
The preemptive model allows making sure that, when the system is fully operational, a Bridge is
using a specified uplink for its VLAN traffic. This is required for scaling out the solution, precisely
distributing the load across several Edge Bridges and getting more aggregate bandwidth
between virtual and physical by doing Segment/VLAN load balancing.
The non-preemptive model maximizes availability. If the primary fails then recovers, it will not
trigger a re-convergence that could lead to unnecessary packet loss (by preempting the
currently active backup.) The drawback is that, after a recovered failure, the bridged traffic remains polarized on one Edge, even if several Bridge Profiles were defined on this pair of Edges for Segment/VLAN load balancing. Also note that, up to NSX release 2.5, a failover cannot
be triggered by user intervention. As a result, with this design, one cannot assume that both
Edges will be leveraged for bridged traffic, even when they are both available and several
Bridge Profiles are used for Segment/VLAN load balancing. This is perfectly acceptable if
availability is more important than available aggregate bandwidth.
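The trade-off can be summarized with a small conceptual model of the failover decision; it is not NSX code, simply an illustration of the preemptive versus non-preemptive behavior described above.

# Conceptual model of Edge Bridge failover for one Bridge Profile (not NSX code).
def active_bridge(primary_up, backup_up, preemptive, currently_active):
    """Return which edge ("primary"/"backup"/None) forwards bridged traffic."""
    if currently_active == "primary" and not primary_up:
        currently_active = "backup" if backup_up else None
    elif currently_active == "backup" and not backup_up:
        currently_active = "primary" if primary_up else None
    # Only the preemptive model moves traffic back once the primary recovers,
    # restoring the planned load distribution at the cost of one extra failover.
    if preemptive and primary_up:
        currently_active = "primary"
    return currently_active

# Primary fails, then recovers:
print(active_bridge(False, True, preemptive=True,  currently_active="primary"))  # backup
print(active_bridge(True,  True, preemptive=True,  currently_active="backup"))   # primary again
print(active_bridge(True,  True, preemptive=False, currently_active="backup"))   # stays on backup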

7.5.1.2.3 Performance: scaling up vs. scaling out


The performance of the Edge Bridge directly depends on the Edge running it. NSX thus offers
the option to scale up the Edge Bridge from a small form factor Edge VM running several other
centralized services to a powerful bare metal Edge node dedicated to bridging.
Scaling out is also possible, as a complement to or instead of scaling up. By creating two
separate Bridge Profiles, alternating active and backup Edge in the configuration, the user can
easily make sure that two Edge nodes simultaneously bridge traffic between overlay and VLAN.
The diagram below shows two Edges with two pairs (numbered 1 and 2) of redundant Edge
Bridges. The configuration defines the Primary 1 on Edge 1 and Primary 2 on Edge 2. With
preemption configured, this ensures that when both Edges are available, both are active for
bridging traffic.

Figure 7-30: Load-balancing bridged traffic for two Logical Switches over two Edges (Edge Cluster omitted for clarity.)

Further scale out can be achieved with more Edge nodes. The following diagram shows an
example of three Edge Nodes active at the same time for three different Logical Switches.


Figure 7-32: Load-balancing example across three Edge nodes (Bridge Profiles not shown for clarity.)

Note that while several Bridge Profiles can be configured to involve several Edge nodes in the bridging activity, a given Bridge Profile cannot specify more than two Edge nodes.

7.5.1.3 Recommended bridging designs


The following sections present the recommended design for the edge bridge use case for both
the edge node VM and bare metal form factor in the two or four pNICs configurations. The
designs have been selected based on the following characteristics:
• Availability: the designs protect against the failure of a pNIC, edge node, ESXi host, or
ToR switch
• Performance:
o traffic is distributed across all the available paths to maximize throughput
o the design can be scaled up by adding host resources (CPU, pNICs) or scaled out
via additional nodes
• Manageability:
o Configurations are agnostic to the physical network and do not require any
specific configuration of the ToR switches (e.g., LAG).
o The Layer 3 peering use case is compatible with the presented designs
o Avoid server cabling specific to the bridge use case.

7.5.1.3.1 4 pNICs server Bare Metal Form Factor


A 4-pNIC design is preferred over a 2-pNIC design to maximize server resource utilization and avoid the physical NICs becoming the limiting performance factor. The bridging design for a four-pNIC server in the bare metal form factor follows these guidelines:


• Overlay traffic is load-balanced across two physical links leveraging the multi-TEP edge capability. Multi-TEP provides load balancing and better edge node resource utilization compared to a single TEP.
• Because, for each bridging instance, VLAN traffic can be forwarded over a single uplink at a time, load balancing is achieved by pinning different VLANs to different pNICs.
• VLAN and overlay traffic are forwarded over different pNICs. While sharing the pNICs between VLAN and overlay traffic may lead to better performance and higher availability (e.g., configuring 4 TEPs and connecting the edge to 4 ToRs), we chose a simpler design with a deterministic traffic pattern.
The design presented in FIGURE 7-32 below includes a single NVDS per edge node. The NVDS
manages both the overlay and VLAN traffic. The diagram presents two bridging instances, for
VLAN and overlay X, and VLAN and overlay Y, but more can be configured. The VLAN traffic is
pinned to pNIC 2 or pNIC 3. Additional VLANs could be pinned to either interface. In this case,
the pinning is achieved via a bridging teaming policy, implying that any bridging instance must
be associated with a named teaming policy. Overlay traffic for any bridging instance is load-
balanced across pNIC0 and pNIC1 via the default teaming policy.
We should pay special attention to the cabling and VLAN to pNIC association, which is not the
same for the two bare metal edge nodes. The reason for this asymmetry is that while overlay
traffic can failover between the two pNICs on the same edge node, the same is not true for
VLAN traffic, which is pinned to a single uplink because of the limitation discussed in chapter 3.
If that single pNIC or the ToR where it is connected fails, the bridge instance should fail over to
the second edge, which must have a viable path for the associated VLANs. We want to maintain
a standard cabling configuration, where two pNICs are connected to one ToR and two to the
other consistently while at the same time allowing for the recovery of the traffic after the
failure of a ToR. For this reason, we crossed the mapping between pNIC2 and pNIC3 and uplink2
and uplink3 on the two bare metal edges.
FIGURE 7-31 shows a sample configuration. If this configuration is not desirable, connecting
pNIC2 to ToR2 and pNIC3 to ToR1 for bare metal edge-2 will have the same effect.
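One way to read the crossed mapping is as simple data, shown below; the uplink numbers and ToR assignments are inferred from the figures and text and are illustrative only.

# Illustration of the crossed uplink-to-pNIC mapping: standard cabling on both
# edges (pNIC2 -> ToR-Left, pNIC3 -> ToR-Right), uplinks crossed on edge 2 so
# that a VLAN pinned to a given uplink reaches a different ToR on each edge.
uplink_to_pnic = {
    "bm-edge-1": {"Uplink2": ("pNIC2", "ToR-Left"),  "Uplink3": ("pNIC3", "ToR-Right")},
    "bm-edge-2": {"Uplink2": ("pNIC3", "ToR-Right"), "Uplink3": ("pNIC2", "ToR-Left")},
}
# A bridging instance pinned to "Uplink2" therefore survives the loss of either
# ToR: its standby instance on the other edge exits through the surviving ToR.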


Figure 7-31: Bridge Design – 4 pNIC BM edge – Single NVDS – uplink to pNIC mapping


Figure 7-32: Bridge Design – 4 pNIC BM edge – Single NVDS

We can achieve the same design without named teaming policies by deploying multiple NVDSs on the same bare metal edge. The functionalities provided are the same, and the decision
between the two options may be based on the preferred configuration workflows for new
bridging instances. This configuration option is presented in the diagram below.


Figure 7-33: Bridge Design – 4 pNIC BM edge – 3 NVDSs

So far, we have shown how to achieve optimal high availability, CPU, and link utilization across
a single bare metal edge. The previous diagrams show Bare Metal Edge 2 running standby
bridging instances only. It is possible to spread the active bridging instances across multiple
bare metal edges via bridge profiles. Because the traffic for each bridging instance is always handled by a single edge node at a time, and VLAN traffic is pinned to a single pNIC, at least
four bridging instances are required to utilize all available links. This configuration is depicted in
FIGURE 7-34.


Figure 7-34: Bridge Design – 4 pNIC BM edge – Bridge profiles

7.5.1.3.2 2 pNICs server Bare Metal Form Factor


The bridging design for a 2-pNIC server follows the same principles as the 4-pNIC server design. Specific considerations for this design are:
• The number of pNICs may represent the limiting factor from a performance perspective.
• Crossing the uplink-to-pNIC mapping or the cabling configuration must be performed so that VLAN traffic is not interrupted by a single ToR failure.
• Because the servers have only 2 pNICs, a single NVDS must be deployed and named teaming policies are required for the VLAN traffic.
Both the overlay and the VLAN traffic flow over the same set of pNICs in this case.
FIGURE 7-35 depicts the recommended bridging design for a 2 pNIC bare metal edge.


Figure 7-35: Bridge Design – 2 pNIC BM edge

7.5.1.3.3 2 pNICs server VM Form Factor


The bridging design for a 2-pNIC server running edge nodes in the VM form factor follows these guidelines:
• Overlay traffic is load-balanced across two physical links leveraging the multi-TEP edge capability. Multi-TEP provides load balancing and better edge node resource utilization than a single TEP.
• VLAN traffic is load-balanced across the same two physical links leveraging the VDS load
balancing based on MAC teaming capability.
From a configuration perspective, overlay traffic is load-balanced over vNIC1 and vNIC2 on the
edge node VM by the default teaming policy set to load-balance based on the source port.
vNIC1 and vNIC2 are connected to different VDS dvpg configured with a mirrored
active/standby teaming policy. These dvpgs may or may not be configured as trunk as they
carry a single VLAN for overlay traffic (this is different than in the L3 peering design where the
dvpg must carry both overlay traffic and the routing peering traffic to the physical network).
The bridge VLAN traffic is carried by the edge VM vNIC3 only. VLAN traffic presents different source MAC addresses (the MACs of all the VMs on the overlay) that can be used by the
upstream VDS for the load-balancing hash.
MAC learning and forged transmit must be enabled on the bridge dvpg carrying VLAN
traffic. It is not required to configure MAC learning on the Overlay1 and Overlay2 dvpgs as
the failures of vNIC1 or vNIC2 are not realistic occurrences and should not be part of a test
plan.
This configuration can be achieved with a single N-VDS and a named teaming policy as
depicted in FIGURE 7-36.

Figure 7-36: Bridge Design – 2 pNIC VM Edge – 1 NVDS

It can also be achieved without named teaming policies by deploying two NVDSs per edge VM. The
design may be preferable to avoid the need to associate a teaming policy to each bridging
instance. Doing so may lower the probability of a misconfiguration.


Figure 7-37: Bridge Design – 2 pNIC VM Edge – 2 NVDSs

7.5.1.3.4 4 pNICs VM Form Factor


A 4-pNIC design is preferred over a 2-pNIC design to maximize server resource utilization and avoid the physical NICs becoming the limiting performance factor. When the edge node is deployed in the VM form factor, a single VM is not likely to fully utilize the resources of the underlying hardware. Two edge node VMs on a 4-pNIC physical host (4 x 25G recommended) generally represent a more optimal configuration.
The 4-pNIC design is an extension of the 2-pNIC design for the edge VM form factor. The four pNICs are managed by a single VDS that provides connectivity to both edge node VMs. Wiring and teaming policy configuration are the same as in the 2-pNIC model; the design is fundamentally replicated for the second edge VM and the two additional pNICs. The diagram below presents the recommended configuration. The two edge node VMs do not share the same dvpg for the management interfaces. This choice enforces fate sharing between the management and overlay networks of each edge VM (i.e., if pNIC 2 and pNIC 3 go down, the edge VM using them is completely isolated for both overlay and management). While not required, this configuration may provide additional protection in case of specific failure scenarios in certain topologies (refer to the Edge HA section in Chapter 4).
The design proposed in this section requires the careful planning and configuration of bridging
profiles so that the active bridging instances are distributed across different edge VMs.
Specifically:
• Active/Standby instances for the same bridge profiles should not be associated to edge
VMs running on the same host.


• A deployment incorporating two hosts with 4 edge VMs (2 per host) requires two bridge
profiles for an active/standby configuration (1 host active, 1 host standby), and 4 bridge
profiles for an active/active configuration (both hosts active).
Note: neither hosts nor edge VMs are active or standby; it is the bridging instances running on top of the edge VMs that are. Designating hosts as active or standby is a design abstraction that is helpful when building a logical model hiding implementation complexities.

Figure 7-38: Bridge Design – 4 pNIC VM Edge – 1 NVDS

FIGURE 7-38 presents a 4 pNICs ESXi server hosting two edge VMs dedicated to bridging. The
two edge VMs are associated with different bridge profiles so that they will never run an
active/standby pair for any VLAN-Overlay pair. The failure of the host should not cause both the
active and standby instance to fail at the same time. Edge VMs running on a different host
provide high availability for the bridging instances depicted in the diagram. vSphere DRS rules
should be implemented to achieve the desired design and placement.


Edge Connectivity Guidelines for Layer 3 Peering Use Case


The goal of the layer 3 N-S connectivity design is simple: deterministic and redundant
configuration without incurring any dependencies on Spanning-tree related configuration. This
means the peering VLAN is confined to a specific ToR and does not span across ToRs. This topology choice also allows a direct mapping from Edge node uplinks to the physical NICs of the devices (bare metal or ESXi host) and eventually to ToR interfaces. This creates a 1:1 mapping between physical connectivity and logical peering connectivity from the Edge node to the physical router, resulting in simple, operationally deterministic N-S traffic forwarding and troubleshooting. The operator always knows which peering is impacted during the failure of a pNIC, ToR, or Edge uplink.
In the typical enterprise design, the Edge nodes in FIGURE 7-39 are assigned to a Tier-0 gateway. This Tier-0 gateway peers with the physical infrastructure using eBGP, with two adjacencies per Edge node over two VLAN segments, each connecting to a distinct "External VLAN" per ToR. FIGURE 7-39 represents the logical view of the BGP peering. One common misconception is that separating TEP and BGP traffic on two different virtual switches somehow increases the resiliency or availability of the system. That is not the case at all. In fact, configuring both TEP and BGP on the same virtual switch enables consistent failover of both TEP and BGP at the same time. The dual TEP and BGP combination also allows a consistent configuration with either an Edge VM or a bare metal Edge. However, there are corner use cases where an asymmetric bandwidth ratio of E-W vs. N-S traffic exists; in those cases a specific design optimization may be appropriate.
From the perspective of the physical networking devices, the Tier-0 router looks like a single
logical router; logically, the adjacency to “Router1” is hosted on Edge node “EN1”, while “EN2”
is implementing the peering to “Router2”.
Those adjacencies are protected with BFD, allowing for quick failover should a router or an uplink fail. See the specific Graceful Restart and BFD interaction with active/standby behavior, based on the type of services – ECMP or stateful services – enabled on the Edge node and the type of physical routers in use.


[Figure content: EN1 and EN2 host the Tier-0 SR, with eBGP peering to Router1 over External1-VLAN 100 and to Router2 over External2-VLAN 200; the Tier-0/Tier-1 DR components and the Web/App workloads remain distributed on the compute hosts.]

Figure 7-39: Typical Enterprise Bare Metal Edge Node Logical View with Overlay/External Traffic

7.5.2.1 Bare Metal Edge Design


The bare metal Edge node is a dedicated physical server that runs a special version of the NSX
Edge software. The bare metal Edge node requires a NIC supporting DPDK. VMware maintains a
LIST OF THE COMPATIBILITY WITH VARIOUS VENDOR NICS.

This design guide covers commonly deployed bare metal configurations.

7.5.2.1.1 Bare metal Design with 2 Datapath pNICs


FIGURE 7-40 shows two-pNIC bare metal Edges using a single N-VDS design for the data plane. The left side of the diagram shows a bare metal Edge with four physical NICs, where management traffic has two dedicated physical NICs (P1 & P2) configured in active/standby mode.
This topology uses a single N-VDS "Overlay and External N-VDS" that carries both overlay and
External traffic. Overlay traffic to and from different overlay segments/logical switches gets
pinned to TEP IP1 or TEP IP2 and gets load balanced across uplinks, Uplink1, and Uplink2. Notice
that both TEP IPs use the same transport VLAN, i.e., VLAN 200, configured on both top of rack
switches.
Two VLANs segments, i.e., "External VLAN Segment 300" and "External VLAN Segment 400,"
provide northbound connectivity to the Tier-0 gateway. External traffic from these VLAN segments is load-balanced across uplinks using a named teaming policy, which pins a VLAN segment to a
specific uplink.
We can also use the same VLAN segment to connect Tier-0 Gateway to TOR-Left and TOR-Right.
However, it is not recommended because of inter-rack VLAN dependencies leading to spanning
tree-related convergence and the inability to load-balance the traffic across different pNICs.
This topology provides redundancy for management, overlay, and external traffic, in the event
of a pNIC failure on Edge node/TOR and a complete TOR Failure.
The right side of the diagram shows a two-pNIC bare metal Edge configured with the same N-VDS "Overlay and External N-VDS" carrying overlay and external traffic as in the example above, but leveraging in-band management.

[Figure content: left panel – bare metal Edge with four physical NICs (2 x 1G for out-of-band management, 2 x 10G for datapath); right panel – bare metal Edge with two 10G physical NICs using in-band management. Both run a single "Overlay and External" N-VDS with two TEP IPs and external VLAN segments 300 and 400 attached to the Tier-0 gateway.]

Figure 7-40: Bare metal Edge configured for Multi-TEP - Single N-VDS for overlay and external traffic

Both topologies use the same uplink profile, as shown in FIGURE 7-41. This configuration shows
a default teaming policy that uses both Uplink1 and Uplink2. This default policy is used for all
the overlay segments/logical switches created on this N-VDS.
Two additional teaming policies, "Vlan300-Policy" and "Vlan400-Policy," have been defined to
override the default teaming policy and send traffic to "Uplink1" and "Uplink2" only,
respectively.
"External VLAN segment 300" is configured to use the named teaming policy "Vlan300-Policy"
that sends traffic from this VLAN only on "Uplink1". "External VLAN segment 400" is configured to use a named teaming policy "Vlan400-Policy" that sends traffic from this VLAN only on
"Uplink2".
Based on these teaming policies, TOR-Left will receive traffic for VLAN 100 (Mgmt.), VLAN 200
(overlay) and VLAN 300 (Traffic from VLAN segment 300). Similarly, TOR-Right will receive
traffic for VLAN 100 (Mgmt.), VLAN 200 (overlay) and VLAN 400 (Traffic from VLAN segment
400). A sample configuration screenshot is shown below.

Figure 7-41: Bare metal Edge Uplink Profile
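A hedged sketch of the corresponding uplink profile payload is shown below; the field names follow the NSX uplink profile API as understood here and should be verified against the API guide for your release.

# Hedged sketch: uplink profile with a default load-balance-source teaming over
# both uplinks plus two named teaming policies pinning a VLAN to a single uplink.
uplink_profile = {
    "display_name": "bm-edge-uplink-profile",
    "transport_vlan": 200,                       # overlay (TEP) VLAN
    "teaming": {                                 # default policy: overlay segments
        "policy": "LOADBALANCE_SRCID",
        "active_list": [
            {"uplink_name": "Uplink1", "uplink_type": "PNIC"},
            {"uplink_name": "Uplink2", "uplink_type": "PNIC"},
        ],
    },
    "named_teamings": [
        {   # pins "External VLAN segment 300" to Uplink1 / TOR-Left
            "name": "Vlan300-Policy",
            "policy": "FAILOVER_ORDER",
            "active_list": [{"uplink_name": "Uplink1", "uplink_type": "PNIC"}],
        },
        {   # pins "External VLAN segment 400" to Uplink2 / TOR-Right
            "name": "Vlan400-Policy",
            "policy": "FAILOVER_ORDER",
            "active_list": [{"uplink_name": "Uplink2", "uplink_type": "PNIC"}],
        },
    ],
}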

FIGURE 7-42 shows a logical and physical topology where a Tier-0 gateway has four external
interfaces. External interfaces 1 and 2 are provided by bare metal Edge node “EN1”, whereas
External interfaces 3 and 4 are provided by bare metal Edge node “EN2”. Both the Edge nodes
are in the same rack and connect to the TOR switches in that rack. Both the Edge nodes are
configured for Multi-TEP and use named teaming policy to send traffic from VLAN 300 to TOR-
Left and traffic from VLAN 400 to TOR-Right. Tier-0 Gateway establishes BGP peering on all four
external interfaces and provides 4-way ECMP.
[Figure content: logical topology (left) and physical topology (right) – Tier-0 external interfaces External-1 192.168.240.2/24 and External-2 192.168.250.2/24 on EN1, External-3 192.168.240.3/24 and External-4 192.168.250.3/24 on EN2, peering over VLAN 300 (ToR-Left SVI 192.168.240.1/24) and VLAN 400 (ToR-Right SVI 192.168.250.1/24).]

Figure 7-42: ECMP using multiple bare metal edges
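For illustration, one of the four external interfaces and its BGP neighbor could be expressed as NSX Policy API payloads along the lines of the hedged sketch below; the URIs, object paths, manager FQDN, credentials, and ASN are assumptions that must be adapted to the actual environment and validated against the NSX API guide.

# Hedged sketch: EN1's external interface on VLAN 300 and its BGP neighbor,
# using assumed Policy API paths/fields. Addresses follow Figure 7-42.
import requests

NSX = "https://nsx-manager.example.com"     # hypothetical manager FQDN
AUTH = ("admin", "VMware1!")                # placeholder credentials

external_if = {
    "type": "EXTERNAL",
    "subnets": [{"ip_addresses": ["192.168.240.2"], "prefix_len": 24}],
    "segment_path": "/infra/segments/external-vlan-300",
    "edge_path": "/infra/sites/default/enforcement-points/default/"
                 "edge-clusters/edge-cluster-1/edge-nodes/en1",
}
bgp_neighbor = {
    "neighbor_address": "192.168.240.1",    # TOR-Left SVI on VLAN 300
    "remote_as_num": "65001",               # hypothetical ToR ASN
}

base = NSX + "/policy/api/v1/infra/tier-0s/t0-gw/locale-services/default"
requests.patch(base + "/interfaces/en1-vlan300", json=external_if, auth=AUTH, verify=False)
requests.patch(base + "/bgp/neighbors/tor-left-en1", json=bgp_neighbor, auth=AUTH, verify=False)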


For most environments, the performance provided by a 2-pNIC bare metal edge is comparable to that of an edge node VM when the physical server has only 2 pNICs, if those interfaces are 10Gbps or 25Gbps. The pNIC bandwidth usually represents the bottleneck, and the VM form factor is generally recommended for its easier lifecycle management. Situations when a bare metal edge node should be considered are:
• Requirement for line rate services
• Higher bandwidth pNIC (25 or 40 Gbps)
• Traffic profile is characterized by small packet size (e.g., 250 Bytes)
• Sub-second link failure detection between physical network and the edge node
• Network operation team retains responsibility for the NSX edges and has a preference
for an appliance-based model
• In a multiple Tier-0 deployment model where the top Tier-0 is deployed on bare metal
edges and drives the throughput with higher speed (40 Gbps) pNICs.
• Management interface redundancy is not always required but is a good practice. The in-band option is the most practical deployment model when a limited number of interfaces is available.

7.5.2.1.2 Bare metal Design with more than 2 pNICs


FIGURE 7-43 shows an NSX bare metal Edge with six physical NICs. Management traffic has two dedicated pNICs configured in active/standby. Two pNICs, P3 and P4, are dedicated to overlay traffic, and two pNICs (P5 and P6) are dedicated to external traffic.
This topology uses a single N-VDS which carries both overlay and External traffic. However,
different uplinks are used to carry overlay and external traffic. Multi-TEP is configured to
provide load balancing for the overlay traffic on Uplink1 (mapped to pNIC P3) and Uplink2
(mapped to pNIC P4). Notice that, both TEP IPs use same transport VLAN i.e. VLAN 200 which is
configured on both top of rack switches.
FIGURE 7-43 also shows a configuration screenshot of named teaming policy defining two
additional teaming policies, “Vlan300-Policy” and “Vlan400-Policy”.
"External VLAN segment 300" is configured to use a named teaming policy “Vlan300-Policy”
that sends traffic from this VLAN only on Uplink3 (mapped to pNIC P5). "External VLAN segment
400" is configured to use a named teaming policy “Vlan400-Policy” that sends traffic from this
VLAN only on Uplink4 (mapped to pNIC P6). Hence, BGP traffic from Tier-0 on VLAN 300 always
goes to TOR-Left and BGP traffic from Tier-0 on VLAN 400 always goes to TOR-Right.
This topology provides redundancy for management, overlay and external traffic. This topology
also provides a simple, high bandwidth and deterministic design as there are dedicated physical
NICs for different traffic types (overlay and External traffic).


Figure 7-43: Bare metal Edge with six pNICs - Same N-VDS for Overlay and External traffic

The bare metal configuration with more than two pNICs is the most practical and recommended design, because configurations with four or more pNICs offer substantially more bandwidth than the equivalent Edge VM configurations. The same reasons for choosing bare metal apply as in the two-pNIC configurations discussed above.

7.5.2.2 Edge Node VM


Starting with NSX 2.5 release, Edge nodes support Multi-TEP configuration to load balance
overlay traffic. Similar to the bare metal Edge one N-VDS design, we recommend configuring
the Edge VM with a single N-VDS for overlay and external traffic.
FIGURE 7-44 shows an Edge VM with one N-VDS to carry both overlay and external traffic. Multi-
TEP is configured to provide load balancing for overlay traffic on “Uplink1” and “Uplink2”.
“Uplink1” and “Uplink2” are mapped to use vNIC2 and vNIC3 respectively. Based on this
teaming policy, overlay traffic will be sent and received on both vNIC2 and vNIC3 of the Edge
VM. Load balancing across the two TEPs is based on a segment hash. Traffic to and from the
same overlay segment is forwarded by the same TEP. Both TEP IPs use same transport VLAN
(i.e. VLAN 200 in the diagram) which is configured on both top of rack switches.
Similar to the bare metal edge design, the Tier-0 Gateway connects to the physical
infrastructure leveraging VLAN segments “External VLAN Segment 300” and “External VLAN
Segment 400” respectively. The routing adjacency between the T0 Gateway and the TOR
switches happens on those two external VLANs. In this example, “External VLAN Segment 300”
and “External VLAN Segment 400” are configured with a VLAN tag, 300 and 400 respectively.
External traffic received on VDS port groups "Trunk1 PG" and "Trunk2 PG" is VLAN tagged; hence, these port groups should be configured in VGT (Virtual Guest Tagging) mode and allow those specific VLANs.
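As a hedged illustration, the trunk configuration for such a port group could be applied with pyVmomi along these lines; the VLAN ranges, class names, and property names are assumptions to validate against your vSphere release, and the portgroup lookup is omitted.

# Hedged pyVmomi sketch: configure "Trunk1 PG" for Virtual Guest Tagging by
# trunking the overlay VLAN (200) and the peering VLAN (300) as numeric ranges.
from pyVmomi import vim

def set_vlan_trunk(pg, vlan_ranges=((200, 200), (300, 300))):
    trunk = vim.dvs.VmwareDistributedVirtualSwitch.TrunkVlanSpec(
        inherited=False,
        vlanId=[vim.NumericRange(start=s, end=e) for s, e in vlan_ranges])
    port_cfg = vim.dvs.VmwareDistributedVirtualSwitch.VmwarePortConfigPolicy(vlan=trunk)
    spec = vim.dvs.DistributedVirtualPortgroup.ConfigSpec(
        configVersion=pg.config.configVersion, defaultPortConfig=port_cfg)
    return pg.ReconfigureDVPortgroup_Task(spec)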


A named teaming policy is also configured to force external traffic onto specific edge VM vNICs. FIGURE 7-44 also shows the named teaming policy configuration used for this topology. "External
VLAN segment 300" is configured to use a named teaming policy “Vlan300-Policy” that sends
traffic from this VLAN on “Uplink1” (vNIC2 of Edge VM). "External VLAN segment 400" is
configured to use a named teaming policy “Vlan400-Policy” that sends traffic from this VLAN on
“Uplink2” (vNIC3 of Edge VM). Based on this named teaming policy, external traffic from
“External VLAN Segment 300” will always be sent and received on vNIC2 of the Edge VM.
North-South or external traffic from “External VLAN Segment 400” will always be sent and
received on vNIC3 of the Edge VM.
Overlay or external traffic from Edge VM is received by the VDS DVPGs “Trunk1 PG” and
“Trunk2 PG”. The teaming policy used on the VDS port groups defines how this overlay and
external traffic coming from Edge node VM exits the hypervisor. For instance, “Trunk1 PG” is
configured to use active uplink as “VDS-Uplink1” and standby uplink as “VDS-Uplink2”. “Trunk2
PG” is configured to use active uplink as “VDS-Uplink2” and standby uplink as “VDS-Uplink1”.
This configuration ensures that the traffic sent on “External VLAN Segment 300” (i.e., VLAN 300)
always uses vNIC2 of Edge VM to exit the Edge VM. This traffic then uses “VDS-Uplink1” (based
on “Trunk1 PG” configuration) and is sent to the left TOR switch. Similarly, traffic sent on VLAN
400 uses “VDS-Uplink2” and is sent to the TOR switch on the right.
In case of a failure of an ESXi host pNIC, the active/standby teaming policy on the trunk dvpg
will redirect the traffic to the surviving pNIC (set the teaming policy to “Explicit Failover” with
one VDS uplink active and one standby). Overlay traffic (VLAN 200 in the diagram) will flow
over the operational ESXi pNIC. This failover event is entirely transparent to the edge node VM
N-VDS, which keeps load balancing overlay traffic over the two TEPs and vNICs. Traffic
originated or destined to both TEPs flows over a single ESXi pNIC. External VLAN traffic will not
failover to the surviving ESXi pNIC because, while the overlay VLAN is defined on both switches,
the two external VLANs are defined on TOR-Left or TOR-Right. A routing protocol
reconvergence will redirect the traffic to the surviving peering VLAN and corresponding pNIC.
The TOR switch configuration must not define the two peering VLANs on both switches;
otherwise, it would be possible to have the failed neighborship restored over the inter-switch
link, which is not desirable because of the spanning tree dependency and the lack of univocal
mapping between the physical links and the routing paths.
The BFD sessions from the individual Edge TEPs to other transport nodes are not a reliable
mechanism to detect and recover from partial failures (of a single TEP for example). This is the
reason for requiring an active/standby teaming policy to protect TEP traffic in case of a host
pNIC failure. The exception is the “All tunnels down” condition when all the tunnels from all
edge TEPs to all the transport nodes are down. See the NSX EDGE HIGH AVAILABILITY FAILOVER
TRIGGERS section for more details.

This design does not consider the failure of the edge vNICs because it does not represent a realistic failure, the vNIC being a virtual component. Disabling the edge vNICs will black-hole the traffic. It is not recommended to enable MAC learning and/or forged transmit on
the upstream VDS dvpg to support this failure scenario.

Figure 7-44: VLAN tagging on Edge node

Starting with NSX release 2.5, single N-VDS deployment mode is recommended for both bare
metal and Edge VM. Key benefits of single N-VDS deployment are:
• Consistent deployment model for both Edge VM and bare metal Edge with one N-VDS
carrying both overlay and external traffic.
• Load balancing of overlay traffic with Multi-TEP configuration.
• Ability to distribute external traffic to specific TORs for distinct point to point routing
adjacencies.
• No change in DVPG configuration when new service interfaces (workload VLAN
segments) are added.
• Deterministic North South traffic pattern.

7.5.2.3 VLAN Backed Service Interface on Tier-0 or Tier-1 Gateway


A service interface is an interface connecting VLAN backed segments/logical switches to provide connectivity and access to NSX edge centralized services (e.g., NAT or Gateway Firewall) to VLAN backed physical or virtual workloads. This interface acts as a gateway for these VLAN backed workloads and is supported on both Tier-0 and Tier-1 Gateways configured in active/standby HA mode.
Service interface is realized on Tier-0 SR or Tier-1 SR. This implies that traffic from a VLAN
workload needs to go to Tier-0 SR or Tier-1 SR to consume any centralized service or to
communicate with any other VLAN or overlay segments. The Tier-0 SR or Tier-1 SR is always hosted on an Edge node (bare metal or Edge VM).
In the recommended single N-VDS design for layer 3 peering, no change in the DVPGs
configuration is required when new service interfaces (workload VLAN segments) are added.
FIGURE 7-45 shows a VLAN segment “VLAN Seg-500” that is defined to provide connectivity to
the VLAN workloads. “VLAN Seg-500” is configured with a VLAN tag of 500. Tier-0 gateway has
a service interface “Service Interface-1” configured leveraging this VLAN segment and acts as a
gateway for VLAN workloads connected to this VLAN segment. In this example, if the workload
VM, VM1 needs to communicate with any other workload VM on overlay or VLAN segment, the
traffic will be sent from the compute hypervisor (ESXi-2) to the Edge node (hosted on ESXi-1).
This traffic is tagged with VLAN 500 and hence the DVPG receiving this traffic (“Trunk-1 PG” or
“Trunk-2 PG”) must be configured in VST (Virtual Switch Tagging) mode. Adding more service
interfaces on Tier-0 or Tier-1 is just a matter of making sure that the specific VLAN is allowed on
DVPG (“Trunk-1 PG” or “Trunk-2 PG”).

Figure 7-45: VLAN tagging on Edge node with Service Interface

A service interface on a Tier-1 Gateway can also be connected to overlay segments for standalone load balancer use cases; this is explained in the load balancer section of CHAPTER 6. Connecting a service interface to an overlay segment to act as the default gateway for the VMs on that segment is supported but not recommended. The service interface resides on the edge node
SR, meaning that any East-West traffic will traverse the edge node and distributed routing
capabilities are not available. Overlay segments should always be connected to downlink
interfaces (default behavior when creating a segment and connecting it to a Gateway).
In some corner case scenarios, service interfaces belonging to different Gateways (Tier-0 or
Tier-1) must be connected to the same segment. This configuration is supported, with the
caveat that if the segment is VLAN backed, the edge nodes hosting the different gateways must
belong to different edge clusters. This limitation does not apply to overlay segments.

7.5.2.4 Edge VM Design


This section covers the Edge VM design in various combinations. This design is solely based on
single N-VDS per Edge VM for basic overlay and N-S connectivity. This design is consistent with
the design discussed for the bare metal edge and remains the same for the 2-pNIC or 4-pNIC designs. The design pattern provides the following benefits:
• Symmetric bandwidth offered to both overlay and N-S
• Ability to accommodate asymmetric bandwidth requirement (i.e., 90% South to North
Traffic)
• Deterministic layer 3 peering with univocal 1:1 mapping between logical and physical
paths
• Repetitive design that can scale with pNIC number growth

7.5.2.4.1 Dedicated Host for Edge VM Design - 2 pNICs


FIGURE 7-46 below shows the Edge VM connectivity with a 2-pNIC ESXi host. The basic configuration for overlay and N-S is described in detail in the section EDGE NODE VM. The diagram shows how two edge VMs can share the same two host pNICs.


Figure 7-46: Single N-VDS per Edge VM - Two Edge Node VM on Host

The key design considerations are as follows:


• For pure North/South packets forwarding configurations (Edge VM running T0 Gateway
only with no services) and common traffic profiles, the pNICs are generally the most
evident performance-limiting factor. In most situations, placing two edge nodes on the
same pNIC pair has a limited performance benefit.
• For shared edge clusters configurations, where the same edge node VM provides L3
forwarding and stateful services e.g., T0 or T1 Gateway firewall or NAT), CPU might be
the limiting performance factor. Connecting multiple edges on the same two pNICs might
better utilize the underlying hardware without saturating the pNIC capacity.
• Because of the tight relationship between edge VM performance and underlying
hardware (specifically pNICs), we must carefully plan the placement of the edge VMs on
the vSphere host. We should influence the automatic placement by vSphere DRS via DRS rules or disable it altogether. DRS can help in balancing CPU-intensive edge VMs (i.e.,
running TLS decryption or other CPU intensive services) across the vSphere hosts in a
cluster, but when pNICs are the main source of contention, manual placement is
generally more appropriate.
• In some designs placing multiple edge node VMs on the same ESXi host is driven by
topology requirements and not performance or high availability. An example would be a multi-tenant design where each tenant receives a dedicated T0 Gateway (dedicated edge VM). Higher consolidation ratios can be obtained by placing multiple edge VMs on
the same host. Oversubscription is a risk to be mitigated by monitoring pNIC and CPU
resource utilization. DRS can help mitigate CPU oversubscription in dense and/or
dynamic environments with high churn.
The key implementation best practices are as follows.
• Transport zones – one overlay and one VLAN – consistent compared to the three N-VDS design, where the external VLANs required two specific VLAN transport zones due to a unique N-VDS per peering
• N-VDS (derived from the matching transport zone name – both overlay and VLAN) defined with dual uplinks that map to unique vNICs, which map to unique DVPGs on the VDS – duality is maintained end-to-end
• N-VDS carries multiple VLANs per vNIC – overlay and BGP peering
• The overlay VLAN must be the same on both N-VDS uplinks with source ID teaming.
• BGP Peering VLAN is unique to each vNIC as it carries 1:1 mapping to ToR with a named
teaming policy with only one active pNIC in its uplink profile
• VDS DVPG uplinks are active-standby (Failover Order teaming for the trunked DVPG) to
leverage faster convergence of TEP failover. The failure of either pNIC/ToR will force the
TEP IP to register (via GARP) on an alternate pNIC and TOR. The BGP peering recovery is
not needed as alternate BGP peering is alive. The BGP peering over the failed pNIC will
be timed out based on either protocol timer or BFD detection.
• The recommendation is not to use the same DVPG for non-edge VM traffic. The allowed VLANs should be those of the routing peering, overlay, and any optional service interface.
• BFD configuration is recommended for fast link failure detection.
• If the physical fabric doesn't support BFD, fast failure detection can be achieved by setting the routing timers as low as 1/3 sec. (hello/dead); a hedged configuration sketch follows this list.
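The following hedged sketch shows how the failure-detection settings from the list above might look as additional fields on a Tier-0 BGP neighbor payload; field names and units reflect the NSX Policy API as understood here and should be verified against your release.

# Hedged sketch: BGP neighbor payload with BFD enabled, plus aggressive
# protocol timers as a fallback when the fabric does not support BFD.
bgp_neighbor_fast_detect = {
    "neighbor_address": "192.168.240.1",
    "remote_as_num": "65001",                       # hypothetical ToR ASN
    "bfd": {"enabled": True, "interval": 500, "multiple": 3},  # interval in ms
    # Fallback only, when BFD is unavailable on the physical fabric:
    "keep_alive_time": 1,                           # hello, seconds
    "hold_down_time": 3,                            # dead, seconds
}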
When adding bridging services to an edge VM configured for North/South peering, it is
straightforward to layer the bridging topology presented in the bridging use case section: 2
PNICS SERVER VM FORM FACTOR. The third edge VM vNIC can be enabled and leveraged for the
VLAN side of the bridge. A dedicated NVDS or the one already used for the routing peering
together with a named teaming policy can be used. FIGURE 7-47 presents a scenario with a
dedicated NVDS.


Figure 7-47: Two Edge VMs per host with 2 pNICs – Routing and bridging

7.5.2.4.2 Dedicated Host for Edge VM Design - 4 pNICs


The design choice with four pNICs is optimal for having multiple Edge nodes per host and
minimizing pNIC oversubscription. In most cases (except when a host is oversubscribed with
other VMs or resources like management, multi-tenant edges, services intense edges, etc.), it is
not the host CPU, but the number of pNICs available on the host that determines the optimal
number of Edge node per host. We should customize the physical hardware dedicated to the
edge node VMs considering the specific traffic profile of the environment. A good starting point
for a balanced hardware configuration across the different relevant dimensions (CPU, pNICs,
number, and size of the edge node VMs) and optimized for L3 forwarding only (Single T0
Gateway only, no services) is:
• 4 x 25Gbps pNICs (RSS capable)
• 32 X CPU Cores
• 2 X Large Edge Node VMs
More information about the rationale for this design is provided in the performance chapter 8.
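As a rough sanity check on such a configuration, the back-of-the-envelope arithmetic below (using assumed figures only; validate against chapter 8 and the actual traffic profile) illustrates why the pNICs, rather than the CPU, tend to be the limiting dimension for L3-forwarding-only edge VMs:

# Back-of-the-envelope sizing for a dedicated edge host (illustrative assumptions only).
PNICS = 4
PNIC_GBPS = 25
HOST_CORES = 32
LARGE_EDGE_VCPU = 8          # large edge node VM form factor
EDGE_VMS = 2                 # starting point recommended above

host_bandwidth_gbps = PNICS * PNIC_GBPS
edge_vms_by_cpu = HOST_CORES // LARGE_EDGE_VCPU   # ignores hypervisor and management overhead

print(f"Aggregate pNIC bandwidth: {host_bandwidth_gbps} Gbps")
print(f"Edge VMs that would fit by vCPU alone: {edge_vms_by_cpu}")
print(f"Bandwidth available per edge VM with {EDGE_VMS} edges: {host_bandwidth_gbps / EDGE_VMS:.0f} Gbps")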


From a configuration perspective, the 2 PNIC design is replicated “side-by-side” with dedicated
pNICs and dvpgs. We create a separate management dvpg carrying traffic for the same VLAN
and IP space of the first edge VM, but with a different teaming policy that forces the
management traffic for the second edge VM over P3 and P4. We perform this configuration to
enforce fate sharing across management, overlay, and VLAN peering networks for each edge
VM. Refer to the edge HA section in chapter 4 for the rationale behind this recommendation.

Figure 7-48: Two Edge VMs per host with 4 pNICs

7.5.2.5 Edge node VM connectivity to N-VDS (or VDS prepared for NSX)
In all the previous scenarios, we connected the edge node VMs to a vSphere VDS not prepared
for NSX, which is the most common scenario when a dedicated vSphere edge cluster is
available. When edge node VMs share the ESXi host with regular workload VMs requiring NSX
services (Overlay and/or DFW), it may be desirable to connect the edge VMs and the workload
VMs to the same virtual switch (NVDS or VDS prepared for NSX).
This type of design is the only available option when only two pNICs are available on the host.
When four or more pNICs are available, it is possible to have multiple virtual switches on the
ESXi host and connect edge VMs and workloads VMs to different ones.
Before NSX 3.1, it was required to place the ESXi TEPs and the Edge TEPs in different subnets
and VLANs when leveraging the same two pNICs for both. FIGURE 7-49 below presents such a
design. We can transport the edge overlay traffic over trunk NSX segments or regular vCenter
managed trunk dvpgs. If the ESXi host virtual switch consists of an NVDS, NSX VLAN segments
are the only choice.


Figure 7-49: Edge VM connected to NSX Prepared VDS – Host and Edge TEP on different IP/VLAN

Starting with NSX version 3.1, edge and host TEPs can reside on the same VLAN because the
host now can process Geneve traffic internal to the host itself. We must transport edge VM
overlay traffic over an NSX Segment in this case. If the edge TEPs are connected to a vCenter
managed dvpg, tunnels between the host and the edge will not come up. This design is
presented in FIGURE 7-50 below:


Figure 7-50: Edge VM connected to NSX Prepared VDS – Host and Edge TEP on the same IP/VLAN - NSX 3.1 or later

When connecting edge VMs to an NSX prepared VDS or NVDS, please keep in mind the following recommendations:
• Hosts and Edges should be part of different VLAN transport zones. This ensures a clear boundary between the transport segments on the host and those used for the routing peering on the edges. The edge segment traffic is transported by the host segments, configured as a trunk.
• When implementing a single TEP VLAN design like in FIGURE 7-50, the VDS trunk port groups transporting the edge TEP traffic must be NSX managed segments and cannot be created in vCenter (see the sketch after this list).
• Follow the canonical recommendations regarding VLAN trunking and teaming policy configuration for overlay and VLAN peering traffic described in section 7.5.2.2.
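For illustration, the sketch below creates such an NSX-managed trunk segment through the Policy API. The segment ID, VLAN transport zone path, and named teaming policy are hypothetical placeholders, and the field names should be checked against the Segment schema of the NSX version in use.

import requests

NSX = "https://nsx-mgr.corp.local"     # hypothetical NSX Manager FQDN
AUTH = ("admin", "VMware1!")

# NSX-managed VLAN trunk segment transporting edge VM TEP traffic on the host virtual switch.
payload = {
    "display_name": "edge-trunk-a",
    "transport_zone_path": ("/infra/sites/default/enforcement-points/default"
                            "/transport-zones/host-vlan-tz"),             # assumed VLAN TZ ID
    "vlan_ids": ["0-4094"],                                               # trunk; restrict to the required VLANs if preferred
    "advanced_config": {"uplink_teaming_policy_name": "uplink1-active"},  # assumed named teaming policy
}

resp = requests.patch(NSX + "/policy/api/v1/infra/segments/edge-trunk-a",
                      json=payload, auth=AUTH, verify=False)
resp.raise_for_status()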
The design with different VLAN/IP subnets per TEP is still valid and can be used with any NSX version, including 3.1 or later. The single TEP VLAN design is appealing for its simplicity; however, for most deployments, having separate VLANs for Edge and Host TEPs is recommended due to the following considerations:
• When Edges and hosts share the same TEP VLAN, they also share the span of that VLAN. While it is usually desirable to limit the host TEP VLAN to a rack, edge VMs may require mobility across racks or even sites (e.g., in the VCF stretched cluster design). Separate VLANs allow managing the span of host and edge TEP networks individually.
• An edge and the host where the edge is running will never lose TEP connectivity if they share the same TEP network, regardless of a pNIC failure. This means that the edge node VM will never incur an all-tunnels-down HA condition, limiting its ability to react to specific failures. Please refer to the EDGE HA SECTION IN CHAPTER 4 for more information. (Note: a design that matches FIGURE 7-50 should not incur any issue as management, overlay, and VLAN peering networks share the same pNICs).

Edge Connectivity Guidelines for Services Only Use Case


Some network virtualization deployments may require deploying edge nodes for services only. Those are edge nodes where no L3 peering to the physical network is required, see FIGURE 7-51. They are leveraged for stateful services, usually running on Tier-1 gateways, but at times on a Tier-0 gateway in back-to-back Tier-0 configurations (see TOPOLOGY CONSIDERATION ). In such situations, we keep the configuration of the edge connectivity consistent with the layer three peering and bridging use cases, but with fewer required configurations. The edge nodes dedicated to stateful services require only connectivity to the overlay network, so no VLAN transport zone or VLAN segments must be provisioned for the BGP peering.
Key implementation guidelines that apply to both bare metal and VM form factors are:
• Single NVDS deployment with multi-TEP configuration
• The edge node is only part of the overlay transport zone
• No VLAN connectivity is required (no VLAN TZ, no VLAN segments)
FIGURE 7-51 shows an example of bare metal edge node configuration where edge node 1 is a
shared node where both Layer 3 peering and services are implemented. Bare metal edge node
2 is dedicated for services only. You can notice the absence of the VLAN transport zone and
VLAN connectivity on the NVDS of bare metal edge 2.


Figure 7-51: Services Only BM Edge

FIGURE 7-52 shows a detailed example of services only edge node VM connectivity. Again, you
can notice the absence of VLAN connectivity requirements for the services only edges deployed
on the ESXi host on the right.
Specific implementation considerations apply to the edge node VM form factor:
• VLAN tagging is not required on edge NVDS as the upstream VDS dvpg only carries
overlay traffic and could be configured for a specific VLAN rather than as a trunk.
• Tagging overlay traffic on the edge node VM NVDS and configuring the dvpgs as trunks is
recommended for consistency with the other use cases and possibly adopting them at a
later stage on the same edge.
Specific design considerations apply to the edge node VM form factor:
• In a layer 2 physical fabric (or layer 3 with overlays), edge node VMs dedicated to T1
Gateway services can have a larger mobility span than edges providing peering
functionalities and can be deployed on clusters striped across multiple racks.


• While the ESXi pNICs are usually the limiting performance factor for edge node VMs
running T0 Gateway for layer three peering, the host CPU may represent the bottleneck
for service-intensive edges. In such cases connecting multiple edge VMs to the same
pNIC may be appropriate (see the right ESXi in FIGURE 7-52)

Figure 7-52: Services only Edge VM


NSX Edge Resources Design


The previous section covered the Edge node wiring for bare metal and VM form factors. It
explains how overlay, N-S peering, and bridging connectivity can be implemented based on
design considerations regarding the number of pNICs and the availability model. This section
provides details about building services (e.g., ECMP, FW, NAT, LB, and VPN) with either bare
metal or VM edge nodes. In addition, we include several considerations about optimizing the
resource footprint for such services. There are two major design areas affecting the
implementation of NSX service clusters: the type of services enabled and the clustering model.

7.5.4.1 Edge Services Requirements


The first step in designing an edge cluster is gathering the requirements for the solution. We should group those requirements into functional and non-functional requirements.
Functional requirements describe what the system should do, and in the context of the edge
cluster design, they should list, as a minimum the services in scope. Ideally, they should be
mapped to the relevant workloads that will consume such services. An example of a weak
functional requirement is:
- The edge cluster will provide NAT, gateway Firewall, and VPN services
Stronger functional requirements are:
- Tenant 1 requires SNAT functionalities
- Tenant 2 requires layer 3 VPN services to redundant remote peers
- Tenant 3 requires access to the hosted workloads without any NAT
We use the word tenant here to represent a group of workloads. The considerations are general and valid for both an enterprise and a service provider design. Note that such requirements are derived from higher level design decisions or business requirements, for example:
- Tenant 1 will migrate their current applications to the cloud platform without changing the IP addresses, so IP conflicts with other tenants should be managed (a hedged SNAT rule sketch for this requirement follows this list)
- Tenant 2 considers direct private links to the cloud provider too expensive and opted for IPSEC VPN
- Tenant 3 users deploy VMs in the private cloud environment with the expectation that they are fully accessible from any internal location with minimal complexity
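The fragment below is a minimal, hedged illustration of how a per-tenant SNAT requirement such as Tenant 1's could be expressed on a dedicated Tier-1 gateway through the Policy API. The gateway ID, rule ID, and address ranges are hypothetical, and the field names should be verified against the PolicyNat schema for the NSX version in use.

import requests

NSX = "https://nsx-mgr.corp.local"     # hypothetical NSX Manager FQDN
AUTH = ("admin", "VMware1!")

# SNAT rule on a tenant-dedicated Tier-1 gateway: the tenant's overlapping internal
# range is translated to a routable address when leaving the tenant topology.
payload = {
    "action": "SNAT",
    "source_network": "10.0.0.0/16",        # tenant internal (possibly overlapping) range
    "translated_network": "192.0.2.10",     # routable SNAT address (example range)
    "enabled": True,
}

resp = requests.patch(NSX + "/policy/api/v1/infra/tier-1s/t1-tenant1/nat/USER/nat-rules/tenant1-snat",
                      json=payload, auth=AUTH, verify=False)
resp.raise_for_status()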
Non-functional requirements describe how the edge services cluster work as a system and the
level of service it will provide. Non-functional requirements can be grouped in five categories:
Availability, Manageability, Performance, Recoverability, and Security. We will provide some
examples for each category.
Availability:


• General SLA for the service itself. We could have an SLA for all the services, e.g., X for all services to be available, or we could have different SLAs for different services (e.g., X for layer 3 peering, Y for VPN, Z for TLS decryption), or even a specific one per tenant and service (e.g., tenant 1 requires A for TLS decryption, tenant 1 requires B for Gateway IPS functionality, tenant 2 requires C for TLS decryption). SLA requirements affect the edge cluster design in terms of the number of edge nodes deployed, where they are deployed (same rack, different rack, different rooms), and the availability requirements of the underlying infrastructure (e.g., the vSphere cluster or the datastore).
• Max recovery time in case of a network failure. While the general SLA considers the total downtime over an extended period, usually a year, requirements should outline the expectation for fast recovery in case of common hardware network failures such as the failure of a pNIC, a host, or a top of rack switch. The reason for such requirements is that, while multiple short interruptions may not violate the SLA for the year, they may represent a problem for applications with short timeouts or applications incapable of reconnecting automatically. This type of requirement affects, among other things, the edge node form factor, the routing protocols, and the BFD timer implementation. These requirements should always be derived from specific application needs, and not provided as blanket statements.
Manageability:
• Lifecycle. Different edge services should be able to undergo maintenance at different times. Requirements in this area may lead to deploying edges in different edge clusters, different vSphere clusters, and even different NSX Manager domains.
• Scalability. How elastic should the service cluster be? This consideration may not apply to all the services in the same way. For example, the throughput of the TLS decryption service may need to grow as new tenants are onboarded, while the VPN service is not affected by this change. Requirements in this area may lead to dedicating edge clusters to specific services, and to designing the underlying infrastructure with different characteristics depending on the services (e.g., hosts with different pNIC, CPU, or memory resources).
• RBAC. Who is entitled to configure each service? Is it a self-service offering? This area has an impact on the integration with directory services and on user privileges.
• Object-based RBAC. Do different users need to be able to manage different objects of the same type? For example, each tenant should be able to manage the Gateway Firewall rules for their dedicated T0 or T1 Gateway. This area may lead to considerations around separating the NSX deployment into multiple NSX Manager domains, the integration with the most appropriate cloud management platform, or the adoption of NSX multi-tenancy via NSX Projects and/or VPCs.
• Monitoring. What are the metrics that I need to collect to ensure I am proactive in
addressing performance and scalability issues? Separating services on different edge
VMs may provide granular reporting around the resource consumption of specific
services.
Performance:
• North/South throughput. The amount of traffic that the NSX edges should support between the physical and the virtual environments is commonly estimated in bits per second (bps), e.g., 20 Gbps. Such requirements tend to be misleading as the throughput is highly influenced by the packet size, so the traffic profile should be taken into consideration when stating the expected North/South throughput for the system. A more general and accurate characterization of the capability of the provided design would be stating the North/South throughput in packets per second (PPS); a simple conversion example follows this list. Requirements in this area affect the edge node form factor, the number of edges deployed in ECMP, the host resources in terms of pNICs and CPU, and others.
• Latency and jitter. Some applications may have stringent latency and jitter requirements. Those requirements should be evaluated and may lead to the provisioning of dedicated resources such as T0 Gateways, pNICs, or hosts to avoid resource contention between mice and elephant flows.
• New connections per second (CPS). This parameter has an impact on the stateful services design; it may require spreading services across multiple edge nodes.
• Throughput per service. This is different from the north/south throughput, which is generally an aggregate of all the services' traffic. Different services may be more or less resource intensive and may require the allocation of dedicated resources. For example, NAT and Gateway Firewall are in most cases light on resources and can coexist on the same edge dedicated to the layer 3 peering without a noticeable performance impact. Other services (e.g., VPN and TLS decryption) may have a noticeable impact if deployed at scale. Requirements in this area will impact the segmentation of services in different edge clusters, and the choice of the edge form factor for some services.
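As a simple illustration of why packet size matters (using assumed average packet sizes, not measured values), the conversion below shows how the same bits-per-second target translates into very different packet rates:

# Convert a North/South throughput target into packets per second for different
# average packet sizes (illustrative values only; use the actual traffic profile).
TARGET_GBPS = 20

for avg_packet_bytes in (128, 512, 1500, 9000):
    pps = (TARGET_GBPS * 1e9) / (avg_packet_bytes * 8)
    print(f"{TARGET_GBPS} Gbps at {avg_packet_bytes}-byte packets ~ {pps / 1e6:.2f} Mpps")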
Recoverability:
• Recovery Time Objective (RTO). This is different than the requirements for high
availability in the sense that we want to specify how much time we have to restore the
services after a major outage. The outage in scope depends on the context of the design
and may refer for example to a rack failure, a datacenter room failure, or an entire site
failure. Requirements in this area may affect the striping of an edge cluster across
different availability zones, the protection mechanisms we want to rely on for the
recovery (e.g., edge native HA vs vSphere HA), and the recovery procedure in place (fully
automatic, vs. manual, vs. scripted)
• Recovery Point Objective (RPO). While RPO usually refers to workload data, in the
context of NSX services it may help us formalize how much configuration it is acceptable
to lose because of a failure. While this requirement may not be stringent in a manually provisioned environment, NSX is often a core building block of self-service platforms where applications and the supporting infrastructure services are continuously
provisioned. Requirements in this area may affect the NSX Manager cluster deployment
model, the backup settings, and the underlying storage technology.
Security:
• Compliance. Compliance requirements vary and may lead to various degrees of separation. When distributed firewall is an acceptable segregation mechanism, compliance has no impact on edge services and network topologies. When an organization requires stronger "physical" boundaries, compliance may lead to segmenting the environment with multiple gateways, edge clusters, transport zones, or even NSX Manager domains.

7.5.4.2 Edge Services Technology rule set


Once the edge services requirements have been outlined, it is important to identify the relevant technology constraints. A technology constraint is not a negative attribute of the technology in use, in this case NSX; it is simply a description of how the system behaves based on the way it has been designed. We need to consider the boundaries of the product capabilities to produce a valid design. Chapters one to six provide a detailed explanation of how NSX works. In this section we will emphasize the functionalities and their limits that are important for the edge services design.
• A single T0 Gateway can be deployed on each edge node.
• A T0 or T1 Gateway cannot span multiple NSX edge clusters (NSX Edge VM clusters can
span multiple vSphere clusters)
• T1 gateways should be deployed as DR-only (no edge cluster configured) unless stateful services are required (NAT/Gateway FW); see the sketch after this list.
• ECMP. ECMP provides load sharing across multiple layer 3 next-hops. ECMP happens on
each T0 Gateway component. It means that the Distributed Router (DR) component on
the ESXi host is capable of forwarding the outbound traffic to eight edge nodes. The T0
service router (SR) component on each edge node is capable of load sharing the traffic
across eight upstream physical devices.
• Multiple Edge clusters can be deployed within a single NSX Manager, allowing for the
creation of pool of capacity that can be dedicated to specific services (e.g., NAT at Tier-0
vs. NAT at Tier-1)
• Within a single Edge cluster, all Edge nodes should be the same type – either bare metal or VM. Edge node VMs of different sizes can be mixed in the same Edge cluster, as can bare metal Edge nodes of different performance levels based on pNICs; however, such combinations are discouraged and only supported to accommodate upgrades and other lifecycle operations. A mixture of different sizes/performance levels within the same Edge cluster can have the following effects:
○ With two Edge nodes hosting a Tier-0 configured in active/active mode, traffic will
be spread evenly. If one Edge node is of lower capacity or performance, half of the
traffic may see reduced performance while the other Edge node has excess
capacity.
○ For two Edge nodes hosting a Tier-0 or Tier-1 configured in active/standby mode,
only one Edge node is processing the entire traffic load. If this Edge node fails, the
second Edge node will become active but may not be able to meet production
requirements, leading to slowness or dropped connections.
• Services available on a T1 Gateway and not on a T0 Gateway. Some services are available on T1 Gateways only. Designs requiring Gateway IDS/IPS/Malware Prevention, Gateway Identity Firewall, URL Filtering, TLS Inspection, or NSX Load Balancer need to include T1 Gateways.
• Services on T1 imply the hairpinning of the traffic between tenants connected to different T1 Gateways.
• VPN remote peer redundancy. VPNs can be enabled on T0 or T1 Gateways with exactly the same functionalities except for the ability to establish BGP peering over an IPSEC routed VPN. If remote peer redundancy is required, VPNs on T0 Gateways are the only option.
• Because Edge nodes can be part of a single overlay transport zone, if the NSX deployment has been segmented into multiple overlay transport zones, as a minimum, a corresponding number of edge clusters should be deployed to provide physical to virtual connectivity to each overlay transport zone.
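To make the DR-only versus service Tier-1 distinction concrete, the hedged sketch below shows both configurations through the Policy API: a Tier-1 with no edge cluster reference stays fully distributed, while referencing an edge cluster in its locale-services instantiates an SR for stateful services. All IDs and paths are hypothetical, and field names should be verified against the NSX API reference for the version in use.

import requests

NSX = "https://nsx-mgr.corp.local"     # hypothetical NSX Manager FQDN
AUTH = ("admin", "VMware1!")
EDGE_CLUSTER = ("/infra/sites/default/enforcement-points/default"
                "/edge-clusters/ec-services")                      # assumed edge cluster path

# DR-only Tier-1: no edge cluster referenced, routing remains fully distributed.
requests.patch(NSX + "/policy/api/v1/infra/tier-1s/t1-dr-only",
               json={"display_name": "t1-dr-only",
                     "tier0_path": "/infra/tier-0s/t0-gw"},
               auth=AUTH, verify=False).raise_for_status()

# Service Tier-1: referencing an edge cluster creates an SR on the edge nodes,
# enabling stateful services such as NAT or Gateway Firewall.
requests.patch(NSX + "/policy/api/v1/infra/tier-1s/t1-services",
               json={"display_name": "t1-services",
                     "tier0_path": "/infra/tier-0s/t0-gw"},
               auth=AUTH, verify=False).raise_for_status()
requests.patch(NSX + "/policy/api/v1/infra/tier-1s/t1-services/locale-services/default",
               json={"edge_cluster_path": EDGE_CLUSTER},
               auth=AUTH, verify=False).raise_for_status()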

7.5.4.3 Preemptive vs Non-preemptive Mode with Active/Standby Services


Each stateful service can be configured for either preemptive or non-preemptive mode. The design choice is between deterministically balancing services among the available resources (bandwidth and CPU) versus reducing the disruption of services.
• The preemptive model makes sure that, when the system is fully operational, the pool of edge resources (bandwidth and CPU) always gets rebalanced after the restoration of a host or Edge VM. However, preemptive mode triggers a switchover of the Edge VM running the services, leading to a secondary disruption that causes packet loss. Operationally, intentionally triggering such a switchover may not be acceptable.
• The non-preemptive model maximizes availability and is the default mode for the service deployment. If the active fails and then recovers, it will be in standby mode; it will not trigger a re-convergence that could lead to unnecessary packet loss (by preempting the currently active instance). The drawback is that, after a recovered failure, the services remain polarized on a host or within an edge cluster. This leads to an oversubscription of a host or uplink bandwidth. If availability is more important than oversubscription (bandwidth and CPU), this mode is perfectly acceptable. Typically, one can restore the rebalancing of the services during an off-peak or planned maintenance window. A hedged configuration sketch follows.
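For illustration, the failover mode is set per gateway; the minimal sketch below switches a hypothetical active/standby Tier-1 to non-preemptive mode through the Policy API (field names to be verified against the API reference for the NSX version in use).

import requests

NSX = "https://nsx-mgr.corp.local"     # hypothetical NSX Manager FQDN
AUTH = ("admin", "VMware1!")

# Non-preemptive failover: the recovered node stays in standby instead of
# triggering a second switchover of the stateful services.
requests.patch(NSX + "/policy/api/v1/infra/tier-1s/t1-services",
               json={"failover_mode": "NON_PREEMPTIVE"},
               auth=AUTH, verify=False).raise_for_status()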

7.5.4.4 Edge cluster deployment options, shared vs dedicated


Services can be enabled on either shared or dedicated Edge nodes.
Shared Mode: Shared mode is the most common deployment mode and a starting point for building the edge services model. A Tier-0 or Tier-1 not enabled with a stateful service runs in distributed mode by default. In shared mode, typically the Tier-0 runs the ECMP service while the Tier-1s can run stateful services (i.e., either in active or standby mode for those services). If the Edge node fails, all the services within that node fail. FIGURE 7-53: SHARED SERVICE EDGE NODE Cluster below shows the flexibility in deploying ECMP services along with a variable services model for Tier-1. The Tier-0, represented by the green node, runs the ECMP service on all four Edge nodes, providing aggregated multi-Gbps throughput for the combined Tier-1 services. The Tier-1 services are deployed based on tenant or workload needs. For example, a few Tier-1 services (red, black, and blue) are stateful over a pair of Edge nodes, while many services can be spread over the four nodes. In other words, they do not have to be deployed on the same nodes, so the services can scale. The stateful services have an SR component running on the Edge nodes. The active SR component is shown with a solid color Tier-1 router icon while the standby one is shown with a light faded icon. The Tier-1 routing function can also be entirely distributed (the stateless yellow Tier-1), and thus it does not have an SR component; it exists on each Edge node by the nature of its distributed router component, and its depiction in the figure below is for illustration purposes only. Note that active/standby services all have a distributed routing (DR) component running on all Edge nodes, but traffic will always go through the active service. This services configuration is applicable to both bare metal and Edge VM form factors.

Figure 7-53: Shared Service Edge Node Cluster

The shared mode provides simplicity in allocating services in an automated fashion, as NSX tracks which Edge node is already provisioned with a service and deprioritizes that Edge node as a potential target for the next service deployment. However, each Edge node shares its CPU, and thus bandwidth is shared among services. In addition, if the Edge node fails, all the services inside that Edge node fail together. Shared edge mode, if configured with preemption for the services, leads to service-related secondary convergence only. On the other hand, it provides an optimized footprint of CPU capacity per host. If high dedicated bandwidth per service and granular services control are not a priority, then use the shared mode of deployment for Edge services.
Dedicated Mode: In this mode, an Edge node runs either ECMP or stateful services, but not both. This mode is important for building a scalable and performance-oriented services edge cluster. Separating services on dedicated Edge nodes allows a distinct operational model for ECMP versus stateful services. Scaling either ECMP or stateful services can be achieved via the choice of bare metal or multiple Edge VMs.

Figure 7-54: Dedicated Service Edge Node Cluster

FIGURE 7-54: DEDICATED SERVICE EDGE NODE Cluster describes dedicated modes per service, either ECMP or stateful services. One can further refine the configuration by deploying a specific service per Edge node; in other words, each of the services in EN3 and EN4 gets deployed on an independent Edge node. It is the most flexible model, however not a cost-effective one, as each Edge node reserves CPU. In this mode of deployment one can choose preemptive or non-preemptive mode for each service individually if each service is deployed on a dedicated Edge VM. In the figure above, if preemptive mode is configured, all the services in EN3 will experience secondary convergence. However, if one segregates each service to a dedicated Edge VM, one can control which services are preemptive or non-preemptive. Thus, it is a design choice of availability versus load-balancing the edge resources. A dedicated edge node, either per service or grouped for a set of services, allows deploying a specific Edge VM form factor: one can use a larger form factor (8 vCPU) for ECMP-based Edge VMs, allowing dedicated CPU for the high bandwidth needs of the NSX domain. Similar design choices can be adopted by using a smaller Edge VM form factor if the services do not require line rate bandwidth. Thus, if the multi-tenant services do not require high bandwidth, one can construct very high density per-tenant Edge node services with just 2 vCPUs per edge node (e.g., VPN services or an LB deployed with DevTest/QA). The LB service with container deployments is one clear example where adequate planning of host CPU and bandwidth is required. A dedicated edge VM or cluster may be required as each container service can deploy an LB, quickly exhausting the underlying resources.


Another use case for dedicated services nodes is a multi-tier Tier-0 design or a Tier-0 level multi-tenancy model, which is only possible by running multiple dedicated Edge node instances (Tier-0) for each tenant or service; for this, the Edge VM deployment is the most economical and flexible option. For a startup design one should adopt the Edge VM form factor; later, as bandwidth or services demand grows, one can selectively upgrade Edge node VMs to the bare metal form factor. For Edge VM host convertibility to bare metal, the host must be compatible with the BARE METAL REQUIREMENT . If the design choice is to be immune from most future capacity and predictive bandwidth considerations, going with bare metal by default is the right choice (either for ECMP or stateful services). The decision to go with VM versus bare metal also hinges on the operational model of the organization: if the network team owns the lifecycle and wants to remain relatively agnostic to workload design, adopting a cloud model by providing generalized capacity, then bare metal is also a right choice.

7.5.4.5 Services Availability Considerations with Edge Node VM


The availability and service placement options for the Edge node VM depend on multiple factors due to the nature of its flexibility as a VM. Design choices revolve around shared versus dedicated services deployment, the number of Edge nodes, in-rack versus multi-rack availability, the number of pNICs available in the host, growth, capacity, and finally the bandwidth required as a whole and per service. In addition, restrictions exist when the Edge node VM coexists with compute VMs on the same host. In this section the focus is mostly in the context of a dedicated edge cluster and availability with two pNICs.
There are two forms of availability to consider for the Edge VM. The first is the availability of the Edge node as a VM, and the second is that of the services running inside the Edge VM. Typically, the Edge node VM availability falls into two models: in-rack versus multi-rack. In-rack availability implies a minimum of two hosts are available (for both ECMP and stateful services), and the failure of a host will trigger either a re-deployment of the Edge node to an available host or a restart of the Edge VM, depending on the availability of the underlying hypervisor. For multi-rack availability, the recommendation is to keep the availability model for Edge node recovery/restoration/redeployment restricted to the rack, avoiding any physical fabric related requirement (independent of L2 or L3 fabric).
The simplest and most common form of Edge deployment is shown in FIGURE 7-55: ECMP BASE EDGE NODE CLUSTER GROWTH Pattern below, in which the entire NSX domain only requires Tier-0 active/active (ECMP) services. In this case, it is important to remember that all Tier-1s are distributed by default and thus do not require any specific consideration. The services enablement assumes a single rack deployment, with a minimum of two Tier-0 (ECMP only) Edge nodes. The growth pattern starts with two hosts to avoid a single point of failure, then grows by adding hosts with two Edge nodes per host, leading to four hosts with eight Edge nodes delivering eight ECMP forwarding paths. Notice the edge cluster striping is vertical, and how it is striped does not have much impact. If there is a requirement to support multi-tenancy with each tenant requiring a dedicated Tier-0, one must stripe more than one cluster, vertically, for which the minimum unit of deployment is two hosts with two Edge nodes (this is not shown in the diagram below, but one can imagine that arrangement of ECMP services).


Figure 7-55: ECMP Base Edge Node Cluster Growth Pattern

For deployments that require stateful services, the most common mode of deployment is the shared Edge node mode (see FIGURE 7-53: SHARED SERVICE EDGE NODE Cluster), in which both ECMP Tier-0 services and stateful services at Tier-1 are enabled inside an Edge node, based on per-workload requirements. FIGURE 7-56: SHARED SERVICES EDGE NODE CLUSTER GROWTH Patterns below shows a shared edge node for services at Tier-1: the red Tier-1 is enabled with a load balancer, while the black Tier-1 is enabled with NAT. In addition, one can enable multiple active-standby services per Edge node; in other words, one can optimize services such that two services run on separate hosts complementing each other (e.g., in the two-host configuration below one can enable the Tier-1 NAT active on host 2 and standby on host 1), while in the four-host configuration dedicated services are enabled per host. The workloads that have dedicated Tier-1 gateways are not shown in the figure as those gateways are in distributed mode; they all get the ECMP service from the Tier-0. For the active-standby services consideration, in this in-rack deployment mode one must ensure the active and standby service instances are deployed on two different hosts. This is obvious with two edge nodes deployed on two hosts as shown below, as NSX will deploy them on two different hosts automatically. The growth pattern is simply adding two more hosts, and so on. Note that here there is only one Edge node instance per host, with the assumption of two 10 Gbps pNICs. Adding an additional Edge node to the same host may oversubscribe the available bandwidth, as one must not forget that the Edge node not only runs the ECMP Tier-0 but also serves other Tier-1s that are distributed.


Figure 7-56: Shared Services Edge Node Cluster Growth Patterns

The choice of adding additional Edge nodes per host to the above configuration is possible with a higher bandwidth pNIC deployment (25/40 Gbps) or, even better, a higher number of pNICs. In the case of a four Edge node deployment on two hosts, it is required to ensure that active and standby instances do not end up on Edge nodes on the same host. One can prevent this condition by building horizontal Failure Domains as shown in FIGURE 7-57: TWO EDGE NODES PER HOST – SHARED SERVICES CLUSTER GROWTH Pattern below. The failure domains in the figure below make sure that the active and standby instances of any stateful service are placed on Edge nodes in different failure domains (a hedged API sketch follows the figure).

Figure 7-57: Two Edge Nodes per Host – Shared Services Cluster Growth Pattern
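Failure domains are defined through the NSX API and then associated with the edge transport nodes. The sketch below is only an illustration under assumptions: the /api/v1/failure-domains endpoint is part of the Manager API, but the exact property used to attach an edge transport node to a failure domain (shown here as failure_domain_id) and the full update workflow should be verified against the NSX API reference for the version in use.

import requests

NSX = "https://nsx-mgr.corp.local"     # hypothetical NSX Manager FQDN
AUTH = ("admin", "VMware1!")

# Create one failure domain per group of edge nodes sharing a fault boundary (e.g., a host or rack).
for name in ("edge-fd-1", "edge-fd-2"):
    requests.post(NSX + "/api/v1/failure-domains",
                  json={"display_name": name},
                  auth=AUTH, verify=False).raise_for_status()

# Assumed association step: read the edge transport node, set its failure domain,
# and write it back (property name and workflow to be validated for your version).
node = requests.get(NSX + "/api/v1/transport-nodes/<edge-node-uuid>",
                    auth=AUTH, verify=False).json()
node["failure_domain_id"] = "<failure-domain-uuid>"
requests.put(NSX + "/api/v1/transport-nodes/<edge-node-uuid>",
             json=node, auth=AUTH, verify=False).raise_for_status()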

An edge cluster design with a dedicated Edge node per service is shown in FIGURE 7-58: DEDICATED SERVICES PER EDGE NODES GROWTH Pattern below. In dedicated mode, the Tier-0 running only ECMP services belongs to the first edge cluster, while the Tier-1s running active-standby services are on the second edge cluster. Both of these configurations are shown below.

Figure 7-58: Dedicated Services per Edge Nodes Growth Pattern

Notice that each cluster is striped vertically to make sure each service gets deployed on a separate host. This is especially needed for active/standby services. For the ECMP services, the vertical striping is needed when the same host is also used for deploying stateful services. This avoids over-deployment of Edge nodes on the same host; otherwise, the arrangement shown in FIGURE 7-55: ECMP BASE EDGE NODE CLUSTER GROWTH Pattern is a sufficient configuration.
The multi-rack Edge node deployment is the best illustration of the Failure Domain capability. With a two Edge node deployment, it is obvious that each Edge node must be on a separate hypervisor in a separate rack. The case described below is the dedicated Edge node per service. The figure below shows the growth pattern evolving from two to four nodes in tandem across each rack. The four-host case assumes two Edge VMs (one for ECMP and the other for services) per host, with two hosts in each of two different racks. In that configuration, the ECMP Edge nodes are striped across the two racks with their own Edge cluster; placement and availability are not an issue since each node is capable of servicing traffic equally. The Edge nodes where services are enabled must use failure domains striped vertically as shown in the figure below. If failure domains are not used, the cluster configuration will mandate a dedicated Edge cluster for each service, as there is no guarantee that active and standby services will be instantiated on Edge nodes residing in two different racks. This mandates a minimum of two edge clusters where each cluster consists of Edge node VMs from two racks, providing rack availability.


Figure 7-59: Dedicated Services per Edge Nodes Growth Pattern

Finally, the standby edge relocation capability (only available to Tier-1 gateways) allows building multiple availability zones such that a standby edge VM can be instantiated automatically after a minimum of 10 minutes of failure detection. If the Edge node that fails is running the active logical router, the original standby logical router becomes the active logical router and a new standby logical router is created. If the Edge node that fails is running the standby logical router, a new standby logical router replaces it.
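As an illustration, standby relocation is enabled per Tier-1 gateway; the hedged sketch below (hypothetical gateway ID, field name to be checked against the Policy API reference) turns it on:

import requests

NSX = "https://nsx-mgr.corp.local"     # hypothetical NSX Manager FQDN
AUTH = ("admin", "VMware1!")

# Allow NSX to re-create the standby SR on another edge node after a prolonged failure.
requests.patch(NSX + "/policy/api/v1/infra/tier-1s/t1-services",
               json={"enable_standby_relocation": True},
               auth=AUTH, verify=False).raise_for_status()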

Several other combinations of topologies are possible based on the SLA requirements described at the beginning of this section. The reader can build the necessary models to meet the business requirements from the above choices.

7.5.4.6 Edge Node VM Physical Host Considerations for Resiliency


As discussed above, there are numerous technical solutions in place to ensure that if/when an
Edge Node fails that the services running on that Edge Node are recovered and continued in a
reasonable amount of time. That said, there are steps that can be taken when designing the
environment hosting the Edge Node VMs to limit the impact of a failure event. An Edge Node VM inherits its resiliency from the underlying platform that is hosting it, so it is imperative that the hosting platform is configured in a way that removes single points of failure whenever possible, such as:
• Redundant Power and UPS infrastructure
• Redundant Physical pNICs
• Redundant northbound switches
• Fault tolerant storage platform for hosting the Edge Node VM
In addition to removing the single points of failures from the individual host, limiting the
potential for cascading failures across a single NSX Edge Cluster is also a key design
consideration. Some of the important steps that should be taken to achieve this are:
• Deployment of Edge Nodes in an NSX Edge Cluster across multiple northbound switches
preventing a single set of ToRs from taking down all NSX Edge Services.
• Deployment of Edge Node VMs across multiple datastores to prevent a single datastore
event (datastore corruption, array availability, etc.) from bringing down multiple Edge
Nodes at the same time.
• Configuration of NSX Failure domains (as described in the previous section) where some
Edge Nodes do share some single points of failure to ensure the availability of the
services in the event of a failure.
• Deployment of additional NSX Edge Node capacity to address the recovery of NSX
Services from a failed Edge Node to other Edge Nodes.
• Proactively choose IP Subnets for N/S Peering that can accommodate additional T0 Peering IPs in the event that additional Edge Nodes must be brought online to restore additional T0 connectivity.
• Enable Standby relocation for T1 Gateways with services. If the Edge node where the
active or standby logical router is running fails, a new standby logical router is created on
another Edge node to maintain high availability. If the Edge node that fails is running the
active logical router, the original standby logical router becomes the active logical router
and a new standby logical router is created. If the Edge node that fails is running the
standby logical router, the new standby logical router replaces it.
• Limit NSX Edge VM mobility to a single rack by using an independent vSphere cluster per
rack (preferred choice). This will prevent Edge VMs to move to a different rack and losing
the BGP peering with ToR switches. Peering should always be established to the local
ToR switches only. The per rack vSphere cluster model with an NSX Edge cluster
deployed across racks is depicted in FIGURE 7-60. DRS rules can be used to limit Edge VM
mobility if the vSphere cluster is striped horizontally as in FIGURE 7-62.


Figure 7-60: Single Edge Cluster

7.5.4.7 Edge Node VMs on VSAN storage


While most of these recommendations can be implemented in any environment, in Hyper-Converged Infrastructure or VSAN based environments the second recommendation in the previous list (deployment of Edge Node VMs across multiple datastores) may become challenging. With VSAN, it is strongly recommended that the following configurations are made:
• Edge Node VMs in a single NSX Edge Cluster should be placed across a minimum of two
(2) vSphere Clusters and therefore two (2) or more VSAN datastores. When the NSX Edge
cluster is placed across multiple VSAN datastores, it is acceptable to implement FTT=1.
• Each Edge Node VM hosted in a single vSphere Cluster using VSAN would then be
configured as a Failure Domain in NSX to ensure placement of Active/Standby Services
across the two datastores.
• The number of edge node VMs in a vSphere Cluster used to host the NSX Edges should provide sufficient throughput for the T0 Gateways in the event of a failure of one of the vSphere Clusters.


Figure 7-61: Single NSX Edge cluster across two vSphere/VSAN clusters

If deployment of the NSX Edge Cluster across multiple VSAN datastores is not feasible then the
following configurations can be made to minimize the potential of a cascading failure due to the
VSAN datastore:
1. The vSphere Cluster should be configured with an FTT setting of at least 2 and an FTM of RAID-1.
This dictates that, for each object associated with the NSX Edge Node VMs, a witness and two copies of the data (or components) remain available even if two failures occur, allowing the objects to remain in a healthy state. This accommodates both a maintenance event and an outage occurring without impacting the integrity of the data on the datastore. This configuration requires at least five (5) hosts in the vSphere Cluster.
2. ToR and cabinet level failures can be accommodated as well. This can be accomplished in multiple ways: either by leveraging VSAN's PFTT capability, commonly referred to as Failure Domains for VSAN Stretched Clusters, together with a Witness VM running in a third failure domain, or by distributing the cluster horizontally across multiple cabinets where no more hosts exist in each rack or cabinet than the number of failures that can be tolerated (a maximum of FTT=3 is supported by VSAN).


3. When the vSphere cluster hosting NSX Edges is spread across multiple cabinets or ToRs, it is critical to set DRS rules to better distribute the NSX Edge Nodes across the cabinets and have predictability of which one is running where. Even if the physical fabric allows for IP mobility across racks, edge VM mobility should be limited to a single rack to avoid the BGP peering to the local ToR dropping upon vMotion. BGP peering across racks is highly discouraged.
4. It is strongly recommended that Hosts, any time they are proactively removed from
service, vacate the storage and repopulate the objects on the remaining hosts in the
vSphere Cluster.

Figure 7-62: Single NSX Edge Cluster, Each Rack NSX Failure Domain

For a realistic example of edge nodes deployed on a single vSphere cluster striped across
multiple racks refer to section USE CASE : IMPLEMENTING A REPEATABLE RACK DESIGN FOR A SCALABLE
PRIVATE CLOUD PLATFORM IN A LAYER 3 FABRIC

While VSAN Stretched Cluster and other Metro-Storage Cluster technologies provide a very
high level of storage availability, NSX Edge Nodes provide an “Application Level” availability
through horizontal scaling and various networking technologies. If a dedicated vSphere Cluster
is planned to host the Edge Node VMs, using two independent clusters that are in diverse
locations as opposed to a single vSphere Cluster stretched across those locations should
seriously be considered as it is the best solution in most circumstances.


NSX Platform Design Consideration


NSX enables an operational model that supports compute domain diversity, allowing multiple vSphere domains to operate alongside one another as part of the same networking and security solution. A variety of different use cases can be implemented as part of the same architecture, including IaaS (e.g., via vRealize Automation), VDI (via Horizon), CaaS (e.g., via TKG), and PaaS (e.g., via TAS or OpenShift).

Figure 7-63: Single Architecture for Heterogeneous Compute and Cloud Native Application Framework

There are essential factors to consider when evaluating how to best design these workload
domains, as well as how the capabilities and limitations of each component influence the
arrangement of NSX resources. Designing multi-domain compute requires considering the
following key factors:
● Use Cases
○ IaaS
○ CaaS
○ VDI
○ PaaS
○ Other
● Type of Workloads
○ Enterprise applications, QA, DevOps
○ Regulation and compliance
○ Performance requirements
○ Security


● Management
○ RBAC
○ Inventory of objects and attributes controls
○ Lifecycle management
○ Ecosystem support – applications, storage, and staff knowledge
● Scale and Capacity
○ Compute hypervisor scale
○ NSX Platform Scale limits
○ Network services design
○ Bandwidth requirements, either as a whole compute or per compute domains
● Availability and Agility
○ Cross-domain mobility
○ Cross-domain connectivity

NSX domains design guidelines


It is important to understand that NSX is a component of the overall enterprise architecture. Its
role and design are influenced by the requirements of the solutions that consume its services.
This section covers the design considerations for the NSX platform as part of a larger private
cloud infrastructure initiative, single and multi NSX domain designs, and the different levels of
multitenancy.

7.6.1.1 Single NSX Domain


FIGURE 7-64 shows a single NSX installation providing services to multiple independent solutions
served by dedicated vCenter servers and vSphere resources. This design provides a unified
networking and security model with the following benefits:
• Single pane of glass for networking and security
• Easy to create security policy spanning multiple compute domains, for example, a policy allowing access from the VDI environment to an application deployed on a Tanzu Kubernetes cluster.
• Higher efficiency and resource utilization. Resources can be shared if required.
• Easier troubleshooting (e.g., end to end trace flow)
• Unified lifecycle management
The single NSX domain solution is the most appropriate for the majority of the real-world
customer implementations.
The single NSX domain solution has the following implications:
• Unified lifecycle management. While it may benefit many situations, specifically smaller deployments, having independent lifecycle management per use case (VDI vs. Tanzu) or environment (Test vs. Prod) allows shorter and less risky maintenance operations. Upgrades motivated by features or bug fixes required by a specific solution (e.g., TKG-S) now affect the whole infrastructure (e.g., VDI), requiring more planning, testing, and an overall higher burden on the operations team.
• Separation of management responsibility. Different teams may be responsible for different solutions. The organization may be segmented per solution rather than for technology (e.g., the VDI team has NSX expertise, same as the team managing the Tanzu platform. No single NSX team manages NSX across the different internal IT offerings).
• Scalability. The NSX platform scalability limits affect all the solutions. Capacity planning becomes more complex as the growth of all the different environments must be taken into consideration.
• The mixing of manual and programmatic consumption (potentially by multiple plug-ins, integrations, or tools) on the same NSX manager creates the risk of configuration conflicts and errors.

Figure 7-64: Single NSX Domain Design

7.6.1.2 Single NSX Domain with Multi-tenancy


In a single NSX domain design, management plane multi-tenancy can be provided via the
Projects and VPCs as part of the NSX MULTI-TENANCY FRAMEWORK .
FIGURE 7-65 presents a sample conceptual layout. A single NSX Manager domain is segmented into multiple projects. Projects can be backed by dedicated compute resources, as in the case of Project 1, which has a dedicated compute manager and all its associated resources, or Projects 2 and 3, where each project has a dedicated set of vSphere clusters within the same compute manager.


Projects (and VPCs) also work on shared infrastructure where multiple tenants (Project 4,5, and
6) share the same vSphere clusters. Tenant isolation is logically performed via logical routing
and distributed and gateway firewall.
Projects (and VPCs) are agnostic to the tenants’ compute infrastructure physical layout as they
were designed to work and provide logical separation on a shared infrastructure. If physical
separation is desired, it can be implemented at the vSphere level, but it will be transparent to
NSX multi-tenancy.
NSX multi-tenancy requires a single overlay transport zone (All projects and VPC segments are
created as part of the default overlay transport zone). It means that all project and VPC
segments will be available on all the vSphere clusters regardless of the tenant the cluster may
be associated with. vSphere RBAC can be implemented to enforce that the vSphere users part
of each tenant can connect VMs only to appropriate NSX projects (and VPC) segments.

Figure 7-65:Single NSX Domain with management plane multi-tenancy
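As an illustration of how a tenant could be onboarded in this model, the hedged sketch below creates a Project through the multi-tenancy Policy API and associates it with a shared Tier-0 and an edge cluster. The org path, project ID, referenced paths, and field names reflect the Project schema as understood by the author and must be verified against the NSX API reference for the version in use.

import requests

NSX = "https://nsx-mgr.corp.local"     # hypothetical NSX Manager FQDN
AUTH = ("admin", "VMware1!")

# Project (tenant) referencing the provider Tier-0 and the edge cluster it may consume.
payload = {
    "display_name": "tenant-1",
    "tier_0s": ["/infra/tier-0s/provider-t0"],                        # assumed shared provider Tier-0
    "site_infos": [{
        "edge_cluster_paths": ["/infra/sites/default/enforcement-points/default"
                               "/edge-clusters/ec-tenants"],           # assumed edge cluster path
    }],
}

resp = requests.patch(NSX + "/policy/api/v1/orgs/default/projects/tenant-1",
                      json=payload, auth=AUTH, verify=False)
resp.raise_for_status()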

Before implementing this topology, carefully check the NSX features required by the design, as not all NSX functionalities are supported in Projects or VPCs. For the most up to date list, refer to the DOCUMENTATION PAGE .
VMware and third-party solutions may start adopting and consuming the new multi-tenancy framework. Until such integrations are ready, it is possible for Projects and VPCs to coexist in the same NSX domain with solutions that are agnostic to the multi-tenancy framework. Those integrations will keep working in the NSX default space.


FIGURE 7-66 shows an example where projects are implemented on the same NSX installation
where the Tanzu and the vRealize Automation integration consume NSX in the default space.
The diagram presents the different use cases leveraging the same shared infrastructure. It’s
certainly possible to segment the infrastructure and dedicate physical resources to each use
case (i.e., dedicated vSphere clusters, compute managers, or edge nodes), but be aware of the
requirement for any multitenancy networking object (Tier-0s, Tier-1, segment) to be part of the
default overlay transport zone. Dedicating a transport zone to different use case requires the
external solutions to support non default transport zones.

Figure 7-66: NSX multi-tenancy framework and default space integration coexistence

7.6.1.3 Multiple NSX Domains


This section presents a few designs incorporating multiple NSX domains. In each scenario, the separation is motivated by different drivers. It is essential to carefully evaluate and document the requirements that lead to breaking the NSX design into multiple domains. An excessive level of separation leads to inefficiencies and management overhead when we underestimate the implications on day-to-day operations and lifecycle management.

7.6.1.3.1 NSX Domain per Use Case


FIGURE 7-67 shows a design where one NSX domain is allocated per use case. Examples of
reasons for this decision are:
Tanzu environment requires the latest release, and frequent and timely software upgrades
are expected
VDI environment runs an internally certified version of NSX and does not require frequent
updates. Each new NSX version needs to go through an internal certification cycle.

The IaaS environment includes dynamically created configuration objects (Gateways,
segments, DFW rules, Groups) by vRealize Automation. The frequent manual configuration
changes required by the security policies in the VDI environment should be carefully
planned to avoid conflicts, and manual configuration errors may lead to applications
downtime.
Examples of design implications for the NSX domain separation are:
• When the administrator creates security policies in the VDI environment to provide access to applications in the IaaS environment, groups based on IP sets must be used; intelligent grouping is not available for resources outside of the NSX domain (a hedged example follows this list). Policies may have to be duplicated in the destination NSX domain (the duplication may be avoided by creating a wider allow rule on the destination environment permitting the entire VDI pool range; granular policies can then be enforced at the source only).
• Troubleshooting may be more complex as native tools such as trace flow or live traffic analysis have a single NSX domain scope.
• Micro-segmentation planning may be more complex as NSX Intelligence has a single NSX domain scope.
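A hedged sketch of such an IP-based group, created through the Policy API with a hypothetical group ID and CIDR range (field names to be verified against the API reference for the NSX version in use):

import requests

NSX = "https://nsx-mgr.corp.local"     # hypothetical manager of the destination NSX domain
AUTH = ("admin", "VMware1!")

# Group based on an IP address expression, since dynamic (tag/VM-based) membership
# cannot span NSX domains; the CIDR below represents the remote VDI desktop pool.
payload = {
    "display_name": "vdi-desktop-pool",
    "expression": [{
        "resource_type": "IPAddressExpression",
        "ip_addresses": ["10.20.0.0/22"],
    }],
}

requests.patch(NSX + "/policy/api/v1/infra/domains/default/groups/vdi-desktop-pool",
               json=payload, auth=AUTH, verify=False).raise_for_status()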

Figure 7-67: NSX Domain per Use Case

7.6.1.3.2 NSX Domain per Environment


FIGURE 7-68 shows a design where the different NSX domains support the same use case, in this
example: Horizon VDI. Examples of drivers for the separation decision are:
Test and Production environments have different lifecycle management. New versions and
features are validated in the test environment before being adopted in production.
More relaxed change control management in the Test environment


The production environment has been segmented in multiple PODs for scalability reasons.
Examples of design implications for the NSX domain separation in this scenario are:
• VDI to VDI communication across PODs may require implementing IP Based Groups and not intelligent grouping.
• Consistent security policies across the entire production environment must be enforced manually, via custom automation, or by implementing NSX Federation.

Figure 7-68: NSX Domain Per Environment

7.6.1.3.3 NSX Domain per Tenant


FIGURE 7-69 presents a design where the NSX domain separation is motivated by a multi-
tenancy requirement. Different levels of multi-tenancy can be achieved within a single NSX
installation with a variety of tools. We can segment different environments very easily by
implementing distributed firewall security policies, or we can separate the data plane
altogether via tenant dedicated segments, Tier-1s, Tier-0s gateways, or VRFs. Physical resources
such as vSphere clusters and ESXi hosts pNICs can be separated by implementing multiple
transport zones.
What we cannot provide within a single NSX installation is an effective separation of the
management plane in a proper multi-tenant solution. Relying on the native multi-tenancy
capability of an interoperable cloud management solution such as VMware vRealize
Automation or VMware vCloud director is an option, but such solutions may not expose all the
required NSX functionalities or direct access to the NSX GUI/API may be preferred. In such
situations dedicating an NSX manager per tenant may be an appropriate solution. Typical
examples are VMware Cloud designs such as VMConAWS.
Examples of drivers for the separation decision are:

• Management plane multi-tenancy
• Independent lifecycle management per tenant
• Self-service IaaS solution
Examples of design implications for the NSX domain separation in this scenario are:
• NSX Manager sprawl. Hard to manage at scale without custom automation.
• Infrastructure resources overhead in small environments. NSX Manager VMs must be deployed for each tenant, even if the environment includes few hosts, at times one in DRaaS use cases. The singleton NSX Manager deployment may alleviate the issue.

Figure 7-69: NSX Domain per Tenant

7.6.1.3.4 Multiple NSX Domains in a single vCenter server domain


In most cases a single NSX Manager domain is associated with one or more compute managers (vCenter servers). Starting with NSX 4.1.0 (and NSX 3.2.2), it is possible to have a single compute manager shared across multiple NSX instances. To make the coexistence of multiple NSX instances possible, the platform enforces the following constraint: each vSphere cluster can be managed by a single NSX instance. Such a constraint means that the overlay networks and the distributed firewall rules for each cluster are managed by a specific NSX instance. NSX instances can deploy NSX components such as edge nodes and management plane nodes on clusters managed by other NSX instances or where NSX is not installed.


The primary reason to implement this topology (FIGURE 7-70) is to provide NSX management
plane multi-tenancy within a single vCenter Server deployment. Tenants have access to their
own dedicated NSX instance and any configuration they perform will only affect the workload
running on their respective cluster.
The benefits of this multi-tenancy model are the following:
• Complete management plane multi-tenancy
• Fewer vCenter Servers to manage and lifecycle
• Tenants have access to every NSX feature.
The implications of this multi-tenancy model are the following:
• NSX Manager sprawl. Hard to manage at scale without custom automation.
• Infrastructure resources overhead in small environments. NSX Manager VMs must be deployed for each tenant, even if the environment includes few hosts, at times one in DRaaS use cases. The singleton NSX Manager deployment may alleviate the issue.
• Physical compute resources are dedicated to each tenant. A shared compute infrastructure model is not supported.


Figure 7-70: Multiple NSX Domains in a Single vCenter Server Deployment

Besides multi-tenancy, another use case for the multi-NSX feature is the creation of a shared
pool of compute resources to host NSX components across multiple NSX domains. This scenario
is presented in FIGURE 7-71. Multiple NSX domains can deploy edge nodes on the same set of
clusters via the compute manager integration. This type of topology was supported before, but
it required the edge nodes to be deployed manually through the OVA.


Figure 7-71: Multi-NSX compute manager leveraged as an edge node pool of resources

The official documentation for the multi-NSX feature is available HERE.

7.6.1.4 Single NSX Domain with one or multiple overlay transport zones
This section presents examples of how the NSX transport zone can be used to segment the
infrastructure for different use cases (VDI vs. IaaS) or environments (Test vs. Prod) while
providing a unified networking and security consumption model at the management plane
level.
Transport zones are not security boundaries and should not be implemented for the purpose of
separating workloads at the data plane level. Distributed firewall policies and dedicated routed
topologies are the appropriate NSX tools to achieve segmentation.
Transport zones provide a hard boundary for the span of NSX objects such as segments and
gateways. They provide a means to avoid mixing objects specific to a use case or environment
into compute managers or clusters serving other solutions. The implication is that vCenter server
domains, or as a minimum, vSphere clusters, should be dedicated to specific use cases or
environments and not shared between them. This type of design leads to potential
inefficiencies and lower consolidation ratios, but it may be motivated by compliance or risk
management reasons.
Edge nodes can be part of a single overlay transport zone. This implies that dedicated edge
clusters and the corresponding peering configuration to the physical network must be deployed if
multiple overlay transport zones are part of the design. If this is a concern, the multiple peering
configuration problem can be solved via a two-level Tier-0 architecture, but the edge node
sprawl cannot be mitigated.

7.6.1.4.1 Single NSX Overlay Transport Zone Design


FIGURE 7-72 presents a single overlay transport zone design catering to multiple use cases.
Compared to the multiple transport zone solution, this design has the following advantages:
 Single shared edge cluster deployed on dedicated vSphere resources. Dedicated edge clusters per use case are possible.
 Single Tier-0 gateway and peering configuration to the physical network. Multiple Tier-0 Gateways are possible too.
 Ability to share compute resources between the use cases.
The drawback of the single overlay transport zone design is:
 Segments and gateways are available for all the use cases or environments. It means that
the VDI administrator could mistakenly connect a VDI to a network dynamically created by
TKG or vRA. If transport zones separate environments (Test vs Prod), production segments
are visible on test vSphere clusters.

Figure 7-72: NSX Design with Single Overlay Transport Zone Design


7.6.1.4.2 Separate NSX Overlay Transport Zones Per Use Case Design
FIGURE 7-73 presents a design where vSphere clusters are allocated to each use case. The
rationale for the separation may be based on requirements driving compliance or lifecycle
management. From an NSX perspective, we can map the same separation by segmenting the
vSphere clusters into different overlay transport zones based on the associated use case. The
consumption model of cloud automation platforms (e.g., vRA), container plug-ins (e.g., NCP in
the Tanzu use case), or automation tools (e.g., Terraform) usually assumes the existence of a
pre-created Tier-0 Gateway and an associated overlay transport zone. When setting up the
integration, the administrator passes the Tier-0 gateway, the edge cluster, and the transport
zone where dynamically created objects should be placed. The automation integration will
create segments as part of the specified overlay transport zone and will connect the dynamically
created Tier-1 Gateways to the specified Tier-0. This behavior matches the desired resource
placement based on the vSphere clusters and NSX edge clusters dedicated to each use case.
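As an illustration of this consumption flow, the following minimal Python sketch mimics what such an integration does via the NSX Policy API. It is not taken from any specific product integration; the manager address, credentials, object names, and the transport zone and Tier-0 paths are placeholder assumptions, and exact payloads may differ between NSX versions.

# Hedged sketch: create a Tier-1 connected to a pre-created Tier-0, assign it to an edge
# cluster, and place a segment in the overlay transport zone dedicated to the use case.
# All names, paths, and credentials below are illustrative assumptions.
import requests

NSX = "https://nsx-mgr.example.com"   # placeholder NSX Manager address
TZ_PATH = "/infra/sites/default/enforcement-points/default/transport-zones/<iaas-overlay-tz-id>"
EDGE_CLUSTER_PATH = "/infra/sites/default/enforcement-points/default/edge-clusters/<iaas-edge-cluster-id>"
T0_PATH = "/infra/tier-0s/iaas-t0"    # pre-created Tier-0 referenced by the integration

s = requests.Session()
s.auth = ("admin", "VMware1!VMware1!")  # use a dedicated principal identity in practice
s.verify = False                        # lab only; validate certificates in production

# Tier-1 gateway attached to the use-case Tier-0
s.patch(f"{NSX}/policy/api/v1/infra/tier-1s/app01-t1",
        json={"tier0_path": T0_PATH,
              "route_advertisement_types": ["TIER1_CONNECTED"]})

# Edge cluster assignment for the Tier-1 (needed only when stateful services are required)
s.patch(f"{NSX}/policy/api/v1/infra/tier-1s/app01-t1/locale-services/default",
        json={"edge_cluster_path": EDGE_CLUSTER_PATH})

# Overlay segment created in the use-case transport zone and connected to the Tier-1
s.patch(f"{NSX}/policy/api/v1/infra/segments/app01-web",
        json={"transport_zone_path": TZ_PATH,
              "connectivity_path": "/infra/tier-1s/app01-t1",
              "subnets": [{"gateway_address": "172.16.10.1/24"}]})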
It's important to emphasize that the design choice to use multiple overlay transport zones
derives from a higher-level design decision to segment the compute resources per use case.
Implementing multiple transport zones without such requirements leads to unnecessary
segmentation of resources and lower consolidation ratios, and it is in general a poor choice.
Each use case requires a dedicated NSX edge cluster because edge nodes can be part of a single
overlay transport zone only. The edge nodes can be deployed on the vSphere clusters dedicated
to the use case as depicted in FIGURE 7-73. Or they can be deployed on a shared vSphere cluster
as depicted in FIGURE 7-74.


Figure 7-73: NSX Design with Overlay Transport Zone per use case - Edges on Compute Clusters

Figure 7-74: NSX Design with Overlay Transport Zone per use case - Edges on Shared Cluster


7.6.1.4.3 Separate NSX Overlay Transport Zones Per Physical fabric


Another use case where multiple overlay transport zones can be used is that of dedicated
physical fabrics. Again, we have a requirement that dictates the inefficient separation of
physical resources, in this case, the physical network, while NSX could efficiently separate
workloads and network topologies at the logical level. In the uncommon situation when the
separation is dictated at the physical network level but not on the compute host, we can
implement the design depicted in FIGURE 7-75. We can map the different physical fabrics to
different overlay transport zones and VDSs on the compute hosts. VMs connected to segments
on the Internal VDS will use the Internal physical fabric; those on segments on the DMZ VDS will
use the DMZ physical fabric. An edge cluster per transport zone is required for physical to
virtual connectivity.

Figure 7-75: Multiple Overlay Transport Zones on the same compute cluster


Network Topologies and Logical Design Guidelines


Outlining an appropriate logical design and the corresponding network topology is dependent
on a clear understanding of the platform requirements. The logical design should be derived
from a high-level conceptual design that does not include any technological components but
presents how the workloads are logically grouped, their dependencies, and the desired level of
separation. A simple example of a conceptual design for NSX is presented in FIGURE 7-76. The
three different orgs may represent different business units within the same company, different
environments (PROD, TEST, QA), different use cases (IaaS, VDI, Cloud Native Apps), or different
external customers. Separations based on other criteria are legitimate as long as they are
rooted in the platform requirements and the desired operational model.
As we mentioned in the previous section, NSX DOMAINS DESIGN GUIDELINES, excessive levels of
separation lead to inefficiencies, lower consolidation ratios, and a more complex operational
model. The tenancy model should reflect the required level of separation without limiting the
elasticity and the flexibility of the overall solution.

Figure 7-76: Single-Tier Tenancy Model

FIGURE 7-77 presents a more complex multi-tenancy model incorporating a two-tier


architecture. This two-tier architecture is embedded in some VMware products (i.e., vRealize
Automation) and may be helpful in larger multi-environment, multi-use case deployments. For
example, the first level of tenancy (tenant/org) may represent the environment (TEST, PROD,
QA), while the second (Project) the use case (IaaS, VDI, etc.).
Note that the complexity of the conceptual design is not dependent on the size of the deployment
or the number of sites it will encompass. We do not mention larger multi-site environments at
this stage; we are still planning our design at a more abstract level. The decision to include a
disaster recovery site or multiple active data centers will depend on non-functional
requirements such as high-availability SLAs and RPO/RTO. The design may cater to a single
application with very stringent non-functional requirements. In such a case, the
conceptual/logical design can be very simple, while the complexity resides in implementing
resilient physical components across multiple sites.

Figure 7-77: Two-Tier tenancy model

Once we have our conceptual model in place, we can now map the abstract elements of the
conceptual design to actual NSX components. FIGURE 7-78 presents an example of such mapping,
where each tenant is mapped to an independent NSX Domain, projects to dedicated compute
managers (vCenter servers) and Tier-0 Gateway, and individual applications, VDI pools, and
Tanzu Namespaces to segments and/or Tier-1 Gateways.


Figure 7-78: conceptual model to logical objects mapping

Let’s consider the implications of such mapping in a multi-environment multi-use case design
where the first level of tenancy (tenant/org) is mapped to the environment while the second
(project) is to the use case. An example of such mapping is presented in FIGURE 7-79.

Figure 7-79: Two Tier Multi tenancy design with mapping to NSX constructs

Tenant level considerations:


• The management and lifecycle of each environment are independent
• Complete physical resource separation between environments


• IP ranges are dedicated to each environment.


• Environments may still be able to communicate with each other through the physical
network; a gateway or distributed firewall may be used for segmentation. Because the
inventory is not shared, IP ranges must be used in the rules providing segmentation.
• Communication between environments is possible, but the management of cross-
environment rules is cumbersome because of the lack of a shared inventory (IP-based
rules are required). If a high number of flows must be allowed between environments, it
may mean that the conceptual separation into different tenants was not appropriate.
NSX Federation may mitigate the issue.
Project level considerations:
• Each use case has dedicated north/south edge resources, limiting contention.
• Edge nodes are dedicated to each use case.
• Compute managers' and hypervisors' lifecycle is independent between use cases.
• NSX lifecycle is common across use cases
• Tier-0 gateways are created as a day one operation and do not require a frequent
lifecycle. They are commonly deployed in Active/Active, and they may be scaled out if
needed.
Application-level considerations:
• Each application is deployed on a dedicated segment. This may lead to segment sprawl if
the number of applications is high.
• Applications expected to communicate intensively with each other should be placed on
segments connected to the same gateway to leverage distributed routing (If the Tier-1
gateways are different, but no services are enabled, routing is fully distributed)
• Segments should be connected to a Tier-1 gateway with services only when they require
such services (i.e., per application NAT)
• Segments and Tier-1 Gateways may be dynamically created via automation. A cloud
automation platform may manage its lifecycle.
• Services are deployed on the Tier-1 Gateways, not on the Tier-0 Gateway. The only
service not available in this topology is L3VPN with redundant remote peering with
failover managed by BGP. Non-redundant remote VPN peers are supported on Tier-1
gateways. Dedicated Tier-0 VPN Gateways can be deployed if needed.
Because the previous section, NSX Domains Design Guidelines, already delved into the design
choices regarding NSX domains, the next few examples cover designs including a single NSX
domain.
FIGURE 7-80 presents a design with the following properties:


• A dedicated Tier-0 Gateway for each use case: IaaS and VDI. Performance, scalability, and
high availability requirements can be met independently for the two use cases.
• The IaaS Tier-0 gateway is referenced in the cloud management platform (e.g., vRealize
Automation), which dynamically creates Tier-1 Gateways and segments for each
application deployment. The dynamically created Tier-1 gateways are connected to the
IaaS Tier-0 gateway.
• VDI pools are deployed on dedicated segments rather than on a shared one. Distributed
firewall security policies can be applied to the VDIs at the deployment time based on the
network where they reside without human intervention or custom automation
(Segments can be part of groups, both via direct membership and based on tags)
• VDI does not require edge network services. VDI segments are directly connected to an
Active/Active Tier-0 gateway.

Figure 7-80: Dedicated Tier-0 Gateway per use case

FIGURE 7-81 presents an example where the same use cases are implemented, but the chosen
topology is different. The design has the following properties:
• The two use cases share the same Tier-0 gateway. North/South traffic for both goes
through the same edge nodes. The edge design must cater to the requirements of both
use cases at the same time.


• The shared Tier-0 should be deployed in Active/Active (no stateful services) to allow for
scaling out the North/South bandwidth.
• A Tier-1 Gateway is deployed for each use case, but no stateful services are required on
them. The separation is mostly logical. It has no impact on the data plane as the routing
is fully distributed across the entire deployment.
• Each application or VDI pool is deployed on a dedicated segment
• Centralized networking services are not required. Security is provided via distributed
firewall.
• If load balancer services are needed, they can be deployed in one-arm mode.

Figure 7-81: Tier-0 Gateway shared across different use cases

FIGURE 7-82 shows an example where the entire NSX Domain is dedicated to a single use case
(IaaS). Within the same NSX domain, two environments are present: Prod and Test. The design
has the following properties:
• The two environments share the same Tier-0 gateway. North/South traffic for both goes
through the same edge nodes. The edge design must cater to the requirements of both
environments at the same time.
• The shared Tier-0 is deployed in Active/Active (no stateful services) to allow for scaling
out the North/South bandwidth.


• Centralized networking services are not required. Security is provided via distributed
firewall.
• If load balancer services are needed, they can be deployed in one-arm mode.
• Each environment has a dedicated segment and corresponding IP range.
• When an environment grows beyond a single segment limit, segments can be added and
connected to the same Tier-0 gateway.
• Segmentation between environments is provided via DFW based on the IP ranges
associated with each environment. Such policies are manually created by the
administrator (a hedged API sketch is provided after this list).
• The CMP (e.g., vRA) sees the segments as external networks. It means that those
segments must be pre-created by the NSX admin, referenced in the vRA configuration,
and mapped to the two different projects. vRA will place the dynamically created VMs on
those segments based on the environment. Because of the segment placement, the VMs
will inherit the environment segmentation based on IP ranges implemented via DFW.
• The separation between applications within an environment can be achieved via
dynamically created groups and security policies managed by the CMP.
• Larger, shared, and manually configured segments facilitate the integration with disaster
recovery orchestration tools (e.g., VMware Site Recovery Manager) compared to
segments dynamically created by the CMP. Network mappings between sites can be
configured in advance and do not necessitate frequent updates.
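The following minimal Python sketch, referenced in the DFW bullet above, illustrates how such manually maintained, IP-range-based segmentation could be expressed against the NSX Policy API. The IP ranges, group names, and policy identifiers are placeholder assumptions, and payload details may vary between NSX versions.

# Hedged sketch: IP-range based groups per environment and a DFW policy blocking
# cross-environment traffic. All addresses, names, and credentials are illustrative.
import requests

NSX = "https://nsx-mgr.example.com"
s = requests.Session()
s.auth = ("admin", "VMware1!VMware1!")
s.verify = False  # lab only; validate certificates in production

def ip_group(group_id, cidrs):
    # Membership is a static list of IP ranges, not inventory-driven grouping.
    body = {"expression": [{"resource_type": "IPAddressExpression", "ip_addresses": cidrs}]}
    s.patch(f"{NSX}/policy/api/v1/infra/domains/default/groups/{group_id}", json=body)

ip_group("env-prod", ["172.16.0.0/16"])   # assumed Prod range
ip_group("env-test", ["172.17.0.0/16"])   # assumed Test range

# One drop rule per direction keeps the segmentation symmetric.
rules = [{"id": f"block-{src}-to-{dst}",
          "action": "DROP",
          "source_groups": [f"/infra/domains/default/groups/{src}"],
          "destination_groups": [f"/infra/domains/default/groups/{dst}"],
          "services": ["ANY"],
          "scope": ["ANY"]}
         for src, dst in (("env-prod", "env-test"), ("env-test", "env-prod"))]

s.patch(f"{NSX}/policy/api/v1/infra/domains/default/security-policies/env-segmentation",
        json={"category": "Environment", "rules": rules})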


Figure 7-82: Pre-created segment per environment


NSX Components Placement in a vSphere Deployment


This section covers the different possible arrangements of the NSX components across a
vSphere deployment. An NSX deployment may include multiple NSX domains, which may
follow different models regarding the components’ placement. Each NSX domain has its own
specific requirements that will drive the design decisions around component placement.

7.6.3.1 Dedicated management, edge, and compute vSphere clusters

Figure 7-83: Dedicated management, edge, compute clusters

This model is the most scalable and provides the least resource contention for the NSX components.

7.6.3.1.1 The Management Cluster


The management cluster hosts the NSX Manager appliances and all other infrastructure
management components like the vCenter for all the managed compute domains, the cloud
automation platform, and the monitoring and operations components. These components have
predictable resource requirements that should not dramatically change over time if the design
accounts for the environment's expected growth. Some of the components in the management
cluster (e.g., vCenter, NSX Manager) require management VLAN connectivity as they cannot be
placed on the overlay. For this reason, management clusters are confined to a single rack if the
physical fabric doesn't allow the stretch of VLANs across racks. Multiple management clusters
can be deployed, and the three NSX Manager appliances can be distributed across them for
extra redundancy even when layer three boundaries exist between racks. The vCenter server
deployment, even when deployed in HA, is confined to a single layer two domain allowing
vSphere HA to recover it in case of a failure. Most of the other components generally placed in
the management cluster can be deployed on overlays to allow for rack mobility but check the
documentation for each management component you are considering placing on an overlay
network.
The management cluster can be prepared for NSX if the NSX Distributed Firewall should protect
management workloads. NSX Manager itself cannot be protected via DFW and should be
protected via an external firewall (an NSX edge can be used). NSX Manager should be placed on a
dvpg managed by vCenter and not on a VLAN segment to avoid circular dependencies. When
the management cluster is prepared for NSX, overlay networks can be consumed by the
management appliances that are part of the vRealize suite, but not by vCenter and NSX Manager.

7.6.3.1.2 The Edge Cluster


The vSphere edge clusters only host NSX edge nodes, and in most cases they are not prepared
for NSX (the ESXi hosts are not transport nodes). NSX Edge nodes are purpose-built appliances with
unique resource requirements compared to other workloads in the data center. Dedicating
hardware resources to the edge nodes is highly recommended, and in particular, it is
recommended to have a specific hardware design for the ESXi hosting the NSX Edge nodes. This
kind of customization has the drawback of having a non-uniform hardware footprint in the data
center and should only be embraced based on strong and well-defined requirements for the
edge services (see EDGE SERVICES REQUIREMENTS).
Servers hosting edge nodes dedicated to North/South traffic are often constrained by their
pNIC capabilities and configuration before the CPU or memory constrains them. Using the same
generic hardware platform for both workload cluster and edge cluster (e.g., 2x25G pNICs with
512GB memory and 64 core count) will most likely lead to underutilized edge vSphere hosts
from a CPU and memory perspective (e.g., a balanced host for N/S edge services would have
4x25Gbs pNICs, 32 cores, 128 GB Memory running two Large Edge Node VMs. More cores are
required if the high performance tuning described in chapter 8 is implemented or more edges
are deployed on the host). In all practical deployments, a 4 pNIC design brings the most flexibility
and economics to start a design that may use limited edge resources initially and eventually
lead to dedicated hosts for edges.
Additionally, Edge nodes performing advanced security services are often limited by the CPU
resources of the ESXi host. If the edge VM vCPU is maxed out, we can deploy additional edge
VMs on the same host, assuming resources are available. Deploying a mix of North/South and
service edge node VMs on the same set of hosts may lead to better CPU utilization, but it is
hard to overcome pNIC and memory designs that were not planned for the specific
requirements of the edge nodes. At the same time, planning a dedicated edge cluster without
well-defined requirements for the edge services will lead to future resource constraints and a
lack of flexibility and elasticity to address the likely asymmetric growth of the workload
requirements. Requirements are the most important part of any design. When they are weak or
completely absent, a generic platform incorporating some level of sharing for the edge
resources can be the most proper course of action. One such starting point is a collapsed
edge/compute model which optimizes the available resources and yet remains open to
expansion based on the above guidelines.


Figure 7-84: vSphere Edge Cluster - Dedicated Hosts to P2V and Services edges

Figure 7-85: vSphere Edge Cluster - Shared Hosts to P2V and Services Edges

vSphere edge clusters demand specific configurations on the leaf switches to where they
connect. Specifically, they require VLANs for edge management, edge overlay, and edge routing
peering. If we stripe the vSphere edge cluster across different racks, we must extend those
VLANs as well. NSX edge clusters can be deployed across multiple vSphere clusters, and they do
not require layer two connectivity between them. In a layer 3 fabric, it is recommended to keep
the vSphere edge cluster confined to a single rack and deploy NSX Edges in different vSphere
clusters if rack availability is required. See FIGURE 7-86.

Figure 7-86: NSX Edge Cluster deployed across 2 different vSphere clusters in different racks

7.6.3.1.3 The Edge Cluster (DRS, vMotion, vSphere HA)


Edge Node VMs support vSphere vMotion, DRS, and vSphere HA. While those vSphere
functionalities provide great value to the NSX infrastructure in general, the benefit they provide
to edge node VMs depends on the use case implemented and the preferred operational model.
ECMP Tier-0 Only Edges
vSphere HA
FIGURE 7-87 presents a scenario where the edge VMs are deployed for North/South traffic only,
with a single Tier-0 Gateway deployed across them (Top of the diagram). If the ESXi hosting
Edge 1 and edge 2 fails, the two edge VMs will be recovered on the second ESXi that is already
hosting Edge 3 and Edge 4 (Bottom of the diagram). The four edge nodes VMs are now
competing for the host resources (CPU and pNICs) with potentially little or no benefit from a
performance perspective. If the hosts are designed to support four edge node VMs, wouldn’t it
make more sense to deploy four edge node VMs per host in the first place? If the design
already includes eight edge node VMs and the ESXi hosts have available spare capacity, vSphere
HA provides a benefit as we can achieve the same level of performance when the system is in a
degraded state (One ESXi fails or is in maintenance mode).
Protecting edge node VMs is generally not recommended when the appropriate resiliency level
is already in place at the edge cluster level (additional edge VMs running on different ESXi
hosts), but it provides the benefit of maintaining the same vSphere HA configuration across the
entire compute environment and a consistent failure recovery mechanism.


Note: in some scenarios, the recovery of an edge VM may cause a traffic interruption if the
dynamic routing peering to the physical network and the southbound data plane on the overlay
side are not ready simultaneously. To mitigate this risk, increasing the Tier-0 forwarding up
timer to a higher value than the default 0 seconds is possible. This will allow the edge to be fully
ready before being placed in service. This scenario may affect any operation that involves
adding an edge node to a Tier-0 already serving traffic, including edge VM vSphere HA recovery,
an edge node exiting maintenance mode, or adding an edge node to a tier-0 gateway to scale
up N/S bandwidth.
DRS and vMotion
We recommend setting DRS to fully automated and using DRS should rules to achieve
deterministic placement of the edge VMs during normal operations while allowing edge VMs to
move during ESXi maintenances.
From an operational perspective, disabling DRS (or setting it to anything other than Fully
Automated) requires manually moving or powering off the edge VMs before placing the ESXi
host in maintenance mode. The design choice is between control and predictability vs.
automation of the lifecycle management that comes with vSphere. When DRS is enabled for
edge node VMs hosting a T0 gateway, we want to limit unnecessary migrations that can cause
interruptions to the north/south traffic. Implementing VM-to-Host should rules allows
specifying on which host the edge VM will run. DRS will not migrate the edge VM unless the host is
placed in maintenance mode or extreme resource contention occurs (a hedged sketch of such rules follows Figure 7-87).

Figure 7-87: vSphere Edge cluster and vSphere HA for P2V Edges
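The sketch below illustrates, with pyVmomi, how such VM-to-Host "should" rules could be created programmatically. It is a minimal example under stated assumptions: the vCenter address, credentials, cluster, host, and edge VM names are placeholders, and error handling and task tracking are omitted for brevity.

# Hedged pyVmomi sketch: pin NSX edge VMs to specific hosts with a non-mandatory
# ("should") DRS VM-to-Host rule, so placement stays deterministic during normal
# operations while hosts can still be evacuated for maintenance.
import ssl
from pyVim.connect import SmartConnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

def find_by_name(vimtype, name):
    # Simple inventory lookup by name using a container view.
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    return next(o for o in view.view if o.name == name)

cluster = find_by_name(vim.ClusterComputeResource, "edge-cluster")          # assumed names
edge_vms = [find_by_name(vim.VirtualMachine, n) for n in ("edge-01", "edge-02")]
edge_hosts = [find_by_name(vim.HostSystem, n) for n in ("esx-edge-01.example.com",
                                                        "esx-edge-02.example.com")]

spec = vim.cluster.ConfigSpecEx(
    groupSpec=[
        vim.cluster.GroupSpec(operation="add",
                              info=vim.cluster.VmGroup(name="edge-vm-group", vm=edge_vms)),
        vim.cluster.GroupSpec(operation="add",
                              info=vim.cluster.HostGroup(name="edge-host-group", host=edge_hosts)),
    ],
    rulesSpec=[
        vim.cluster.RuleSpec(operation="add",
                             info=vim.cluster.VmHostRuleInfo(
                                 name="edge-vms-should-run-on-edge-hosts",
                                 enabled=True,
                                 mandatory=False,   # "should" rule; True would make it a MUST rule
                                 vmGroupName="edge-vm-group",
                                 affineHostGroupName="edge-host-group")),
    ])
cluster.ReconfigureComputeResource_Task(spec=spec, modify=True)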


Services Only Edges


When edge nodes are dedicated to services, the CPU is the most critical host resource. The
underlying vSphere cluster should be designed to provide enough resources to support the
required availability model (e.g., N+1, N+2,…). vSphere HA admission control can be used to
ensure that the vSphere cluster is not overprovisioned and unable to support a host failure
without experiencing degraded performance. vSphere HA should always be enabled for edges
running services, even when in conjunction with an Active/Active Tier-0 gateway, as it allows
recovering the lost capacity and redeploying the standby SRs before the standby relocation
timeout kicks in.
DRS in fully automated mode provides efficient utilization of the edge cluster resources. NSX
places the T1 Gateways with services across different edge node VMs, but it does not consider
the traffic or the services they provide. As a consequence, some edge nodes can run hotter
than others. DRS can help balance the utilization across the cluster. As mentioned, Edge node
VMs support vMotion, but it is recommended to set the DRS threshold to a conservative value
to avoid too frequent vMotion operations. The network interruption caused by vMotion, even if
it is extremely brief and hardly disruptive, will impact many workloads when affecting an edge
VM. Some applications may not be affected at all, and others may be. DRS rules, in conjunction
with NSX failure domains, should be in place so that Active/Standby Gateways do not reside on
the same ESXi host during normal operations and maintenance windows (Anti-affinity rules).

Figure 7-88: vSphere edge cluster - vSphere HA and DRS for Service Edge VMs

With a single vSphere edge cluster hosting independent NSX edge clusters dedicated to different
use cases (e.g., p2v and services), it’s possible to customize the vSphere HA and DRS settings per
edge VM to achieve different behavior per use case. When the NSX edge cluster is shared (e.g.,
the same edge node VMs provide p2v and services), the design decisions around vSphere HA
and DRS settings are more complex and should be based on which design properties are the most
important.


7.6.3.2 Collapsed Management/Edge cluster

Figure 7-89: Collapsed Edge/Management cluster

When limited compute resources are available, or the footprint of the management and edge
components is limited, collapsing those components in the same vSphere cluster is an option.
When dedicated clusters are not an option, collapsing edge and management functions is
usually the preferred option because of the predictable footprint and growth of the
management components compared to those of the application workloads.
Mixing management and edge components in smaller or simpler deployments is not an issue as
long as the requirements for both workload types are taken into consideration when designing
the hardware resources in the cluster. For example, the four-node management/edge cluster
depicted in FIGURE 7-90 can be adequate to host essential infrastructure management
components and edge nodes dedicated to North/South traffic and minimal services. Having
dedicated pNICs for the edge node VMs hosting the tier-0 Gateway is always a good idea. Host
traffic (vMotion, storage) should share the pNICs used by the management appliances rather
than those used by the edge VMs. Deploying a management/edge cluster with two pNIC hosts
is supported, but it will most likely represent a compromise in terms of north/south
performance.


Figure 7-90: Dedicating pNIC resources to edge VMs in management/edge cluster

Key considerations when deploying a collapsed management/edge cluster:


• The management/edge cluster must support management and edge VLAN requirements
• Expected mobility across racks might be different for management and edge workloads
• Resource contention should be managed by resource planning and vSphere reservations
• Separating management and edge traffic on different pNICs should be considered
• vSphere HA is most likely enabled globally for the cluster to support management
workloads. If vSphere HA for the T0 edge VMs is not desired, it must be disabled manually (in
case the recovery of the edge VM on a different host will not lead to gains in high
availability or performance). vSphere HA for T1-only service or T0/T1 mixed edges is a
good practice in most situations.
• DRS is most likely enabled for the cluster to support management workloads. VM to Host
Should rules should be implemented to enforce the desired T0 edge VM placement (and thus
the dedicated pNICs) and limit the T0 edge VM mobility to host maintenance events.
• Limit over reservation of CPU/memory resources for the management workloads as it
may lead to DRS moving the edge VMs regardless of the preferential rules.
• DRS rules, in conjunction with NSX edge failure domains, should be in place so that
Active/Standby Gateways do not reside on the same ESXi host during normal operations
and maintenance windows (Anti-affinity rules).


Refer to the dedicated sections for management components and edge node connectivity for
detailed guidelines in terms of virtual switch design and teaming policies. The diagrams below
present a detailed view of such configuration settings in two and four pNIC designs.

[Diagram: Collapsed management and edge cluster with 2 pNICs – a single VDS carrying VMK Storage, vMotion, and Mgmt port groups plus two Active/Standby trunk DVPGs; the host runs 3 x NSX Manager nodes and edge node VM EN1 with an internal N-VDS.]

Figure 7-91: Collapsed Management and Edge on VDS with 2 pNICs

[Diagram: Dedicated VDSs for management and edge resources with 4 pNICs – VDS-1 (P1/P2) with Storage, vMotion, and Mgmt port groups for the NSX Manager node; VDS-2 (P3/P4) with a Mgmt port group and two trunk DVPGs for edge node VM EN1.]

Figure 7-92: Collapsed Management and Edge VM on Separate VDSs with 4 pNICs

7.6.3.3 Collapsed Compute/Edge cluster

Figure 7-93: Collapsed Edge/Compute cluster

Some designs incorporate a collapsed edge/compute cluster design. This model may be
appealing from a resource planning perspective and allows for a simple chargeback model. It
allows the operator to dedicate ESXi resources to a use case or a tenant and use those
resources for both workloads and network services.
A dedicated vSphere edge cluster may mean that multiple use cases or tenants will share the
same compute resources. Sharing the compute hosts dedicated to edge services across multiple
use cases or tenants makes resource planning and measuring consumption more complex.
Splitting the edge services in dedicated vSphere clusters may be unfeasible from a cost
perspective. The co-placement of the edge nodes with the associated workloads may become
appealing to preserve the resource segmentation across the use cases. A collapsed
compute/edge deployment model presents the following challenges:
• Resource contention and planning. Application workload demands may vary over time
and may affect edge service performance even when appropriate capacity planning
was performed as part of the initial implementation.
• North/South edge services benefit from dedicated pNICs. Compute clusters are usually
deployed with a two pNIC model. We should not deploy more than one edge per host to
avoid excessive oversubscription.


• Compute hosts require minimal configuration on the physical fabric: four VLANs local to
each rack, management, vMotion, Overlay, and storage, even when the cluster is striped
across racks. Edge nodes require additional VLANs that must be available on each host in
the cluster. A layer 2 fabric is now required to stripe the compute cluster across racks if
the edge node VMs are co-located with compute workloads. Edge VMs running T1
Services and no T0 can run without problems on a cluster striped across racks in a layer 2
fabric.
• Edge nodes with T0s should always create routing adjacencies with switches directly
connected to the host where the edge node VMs reside. This may be unfeasible in an
edge/compute cluster spanning multiple racks. An edge VM moving between racks may
lose the BGP peering adjacencies unless the network fabric extends the peering VLANs
between racks (not recommended). A vSphere cluster hosting T0 Edge VMs should be
confined to a single rack. Rack availability can be provided via another vSphere cluster
hosting edge node VMs part of the same NSX edge cluster and T0 Gateway. In case the
cluster must be striped across racks, mobility of the T0 Gateway should be limited to a
single rack even during maintenance operations (Placing the edge node in maintenance
mode and powering it off is an option if adequate compute resources are not available in
the same rack).
• vSphere HA is most likely enabled globally for the cluster to support compute workloads.
If vSphere HA for T0 edge VM is not desired, it must be disabled manually (In case the
recovery of the edge VM on a different host will not lead to gains in high availability or
performances). vSphere HA for T1 only service or T0/T1 mixed edges is a good practice in
most situations.
• vSphere HA does not honor preferential rules (should) but respects mandatory rules. This
behavior should be considered when the placement of edge node VMs is critical (T0 Edge
VM with vSphere cluster striped across racks)
• DRS is most likely enabled for the cluster to support compute workloads. VM to Host
Should rules should be implemented to enforce the desired host to T0 edge VM
placement and limit the T0 edge VMs mobility to host maintenance events.
• Limit over reservation of CPU/memory resources for the compute workloads as it may
lead to DRS moving the T0 edge VMs regardless of the preferential rules.
• DRS rules, in conjunction with NSX edge failure domains, should be in place so that
Active/Standby Gateways do not reside on the same ESXi host during normal operations
and maintenance windows (Anti-affinity rules).
• In a two pNIC host design, the edge node VMs connect to an N-VDS or a VDS prepared for
NSX. Connectivity guidelines presented in “FIGURE 7-49: EDGE VM CONNECTED TO NSX
PREPARED VDS – HOST AND EDGE TEP ON DIFFERENT IP/VLAN” or “FIGURE 7-50: EDGE VM
CONNECTED TO NSX PREPARED VDS – HOST AND EDGE TEP ON THE SAME IP/VLAN - NSX 3.1 OR
LATER” should be followed.


7.6.3.4 Fully Collapsed Management/Compute/Edge cluster

Figure 7-94: Fully Collapsed Management/Edge/Compute cluster

In a fully collapsed cluster design, compute workloads, edge node VMs, and management
components all reside in the same vSphere cluster. The NSX EASY ADOPTION DESIGN
GUIDE presents an example of this deployment model in great detail. Please refer to the DC in A
Box section. A collapsed cluster design presents the challenges of both the compute/edge, and
the management/edge collapsed models, but those concerns are usually mitigated by the
smaller size of the deployment, usually confined to a single rack. Key considerations for a fully
collapsed cluster are:
• In a two pNIC host design, the edge node VMs connect to an N-VDS or a VDS prepared for
NSX. Connectivity guidelines presented in “FIGURE 7-49: EDGE VM CONNECTED TO NSX
PREPARED VDS – HOST AND EDGE TEP ON DIFFERENT IP/VLAN” or “FIGURE 7-50: EDGE VM
CONNECTED TO NSX PREPARED VDS – HOST AND EDGE TEP ON THE SAME IP/VLAN - NSX 3.1 OR
LATER” should be followed.
• NSX Manager should not be placed on a VLAN segment but on a vCenter managed dvpg.
The diagrams below outline virtual switch and teaming policies design guidelines for four pNICs
and two pNICs hosts.


Figure 7-95: Collapsed Edge/Management/Compute cluster - 4 pNICs

Figure 7-96: Collapsed Edge/Management/Compute cluster - 2 pNICs - separate VLANs for Host and Edge TEPs


Figure 7-97: Collapsed Edge/Management/Compute cluster - 2 pNICs - single VLAN for Host and Edge TEPs

7.6.3.5 Use case: implementing a repeatable rack design for a scalable private cloud platform
in a layer 3 fabric
We have seen that vSphere clusters dedicated to management, edge, and compute workloads
have different connectivity and resource requirements. It is possible to cater to such specific
requirements by customizing the hardware design for each class of workloads (e.g., CPU core
count and pNIC design specific to edges hosts) and dedicating racks to a specific cluster function
(e.g., racks dedicated to ESXi hosts running edges with ToR switches configured with the
appropriate VLANs). While such a design may be the most effective at leveraging the available
resources, it may lack the repeatability and consistency required in large-scale deployments.
Large-scale private clouds require a high degree of automation, from the automatic
deployment and configuration of the ToR switches and the provisioning of the ESXi hosts to the
dynamic allocation of resources to different compute pools (e.g., a vSphere cluster or a VCF
workload domain).
This section presents different options to create a repeatable rack layout for compute and
edge workloads. We assume that management workloads can be centralized and do not
need to follow the same repeatable pattern because of their predictable resource
requirements.


Figure 7-98: Example of conceptual design for a repeatable infrastructure

7.6.3.5.1 Dedicated edge cluster, one host per rack


The diagram in FIGURE 7-99 presents a three-rack layout where vSphere clusters are striped
across the different racks. Each rack represents a failure domain. If VSAN is the adopted storage
technology, the three failure domains imply a storage policy with PFTT=1. We should adopt a
five-rack deployment to achieve PFTT=2. We present a three-rack design for simplicity, but we
can apply the same design principles to the five or more rack designs common in critical
infrastructure deployments.


Figure 7-99: Resource Block - Dedicated vSphere edge cluster with one host per rack

In FIGURE 7-99, all hosts have the same hardware configuration (e.g., 2x25G pNICs, 32 cores,
512G Memory). One host in each rack is dedicated to NSX edge services. The host runs two or
more NSX edge node VMs where both a T0 Gateway in ECMP and T1 Gateways with services
are hosted. While this configuration may not be optimal for edge performances, it allows for a
repeatable deployment model that enhances the scalability and the manageability of the
overall platform. This model can be extended to a design where T0 and T1 gateways run on
different edge nodes that are part of different edge clusters. Such a model is not discussed in this section.
In FIGURE 7-99, the edge hosts are grouped in a dedicated vSphere edge cluster. The cluster is
striped across racks. We mentioned already that, in general, it’s not a good practice to stripe
vSphere edge clusters across racks, but in this case, the repeatability of the deployment model,
which dictates one edge host per rack, takes precedence. The three hosts part of the edge
cluster have access to the same VSAN Datastore, but they cannot provide VM mobility across
racks for the edge VMs because each rack has specific management, overlay, and layer 3
peering VLANs. If an edge VM migrates to a different rack, it will lose connectivity
immediately. Design guidelines and implications for the model are:
• We should map edge node VMs to NSX failure domains based on the rack where they are
deployed.
• Edge Node VMs cannot move across racks because the appropriate management,
overlay, and L3 peering networks are not available on every rack.


• If the hosts in the edge vSphere cluster are part of the same VDS and the VLAN IDs are
repeated on each rack for consistency, DRS is not aware that the available networks are
different and may move the edge VMs anyway.
• DRS MUST rules should enforce the correct rack placement for each edge node VM to
avoid any edge VM migration.
• If a single edge ESXi host is present in each rack, as depicted in FIGURE 7-99, the host
cannot be evacuated to be placed in maintenance mode. Edge node VMs must be placed
in maintenance mode and then shut down before the host can be placed in maintenance
mode. When the host exits maintenance mode and the edge VM is powered and brought
back online, it is required to manually check the status of its services (e.g., BGP neighbors
are up) before proceeding with the maintenance of another ESXi host and the
corresponding edge VMs.

7.6.3.5.2 Dedicated edge cluster, two hosts per rack


From an operations perspective, the main drawback of the previous design is the inability to
leverage DRS to evacuate the edge ESXi hosts during maintenances. The design in FIGURE 7-100
attempts to alleviate this pain point by deploying two edge ESXi hosts per cluster.

Figure 7-100: Resource Block - Dedicated vSphere edge cluster with two hosts per rack

The design in FIGURE 7-100 has the following guidelines and implications:
• Edge hosts can be evacuated by vMotion to a different host in the same rack during a
maintenance. Edge nodes are not shut down.


• Edge Node VMs cannot move across racks because the appropriate management,
overlay, and L3 peering networks are not available on every rack.
• If the hosts in the edge vSphere cluster are part of the same VDS and the VLAN IDs are
repeated on each rack for consistency, DRS is not aware that the available networks are
different and may move the edge VMs anyway.
• DRS MUST rules should enforce the correct rack placement for each edge node VM to
avoid any edge VM migration across racks.
• DRS SHOULD rules will dictate the preferred host placement for each edge VMs. Because
now multiple hosts are present in the same rack, DRS in fully automated mode could
rebalance the edge VMs placement during normal operations. The SHOULD rule will
prevent undesired migrations. A SHOULD rule will be violated when placing the ESXi host
in maintenance mode. The edge VMs will be moved to another host in the same rack. It
won’t be moved to a different rack because MUST rules cannot be violated.
• If DRS rules management is considered too much of an overhead, DRS can be disabled or
set to partially automated. In this case, edge VMs must be migrated manually during
maintenances. This choice carries the risk of human error (e.g., edge node VM migrated
to a different rack by mistake)
• Even with six hosts in the edge cluster spread across three racks, the VSAN storage policy
has a PFTT=1 when each rack is mapped to a VSAN failure domain.
• Two edge hosts per rack may represent a waste of computing resources, especially in
deployments with five racks or more.

7.6.3.5.3 Shared edge/compute cluster with dedicated resources


The design presented in FIGURE 7-101 is a compromise between the two previous designs. All
hosts are grouped in a single edge/compute cluster, but host resources are still dedicated. One
host in each rack is dedicated to the edge VMs, while the rest are dedicated to compute
workloads; during maintenances, we can move edge node VMs to compute hosts in the
same rack. A MUST and SHOULD rules combination guarantees that the edge VMs are not
migrated across racks. They run preferentially on a dedicated host but can migrate to a
different host in the same rack when the preferred host is placed in maintenance mode.


Figure 7-101: Resource Block - Collapsed vSphere edge/compute cluster

Design guidelines and implications for the design in FIGURE 7-101 are:
• A complex set of DRS rules is required.
• One VM to host MUST rule per rack is required to avoid edge VM mobility across racks.
• VM to Host SHOULD rules must be in place to avoid edge VM movements across hosts in
the same rack during normal operations.
• DRS rule (SHOULD or MUST depending on the design requirements) to avoid/discourage
the placement of computing workloads on ESXi hosts dedicated to edge node VMs. This
DRS rule has the most impactful implications in terms of day-2 operations because any
new workload deployed on the cluster needs to be added to the corresponding VM
group to avoid resource contention with the edge node VMs.


8 NSX Performance & Optimization


Typical Data Center Workloads
Workloads in the data center are typically TCP-based. Generally, 80% of the traffic flows within
the data center are east/west, that is, communication between various compute nodes. The
remaining 20% is north/south, that is, communication in and out of the data center. The
following FIGURE 8-1: DATA CENTER TRAFFIC PATTERN WITH NSX shows the typical traffic flow
distribution:

Figure 8-1: Data Center Traffic Pattern with NSX

Because of this underlying data center framework, this chapter primarily focuses on
performance in terms of throughput for TCP-based workloads. There are some niche
workloads, such as NFV, where raw packet processing performance is the priority; the enhanced
version of N-VDS, called N-VDS (E), was designed to address these requirements. Check out the last part
of this section for more details on N-VDS (E).

Next Generation Encapsulation - Geneve


Geneve, a draft RFC in the IETF standard body co-authored by VMware, Microsoft, Red Hat, and
Intel, grew out of a need for an extensible protocol for network virtualization. With Geneve as
the new framework for overlay/network virtualization, understanding Geneve is foundational to
the rest of this chapter, as the optimizations discussed below revolve around it.

With its option field length specified for each packet within the Geneve header, Geneve allows
packing arbitrary information into each packet's header. This flexibility offered by Geneve opens
doors for new use cases, as additional information may be embedded into the packet to help
track the packet's path or to support in-depth packet flow analysis.

Figure 8-2: Geneve Header

The above FIGURE 8-2: Geneve Header shows the location of the Length and Options fields within
the Geneve header and also shows the location of the TEP source and destination IPs.
For further insight into this topic, please check out the following blog post:
https://octo.vmware.com/geneve-vxlan-network-virtualization-encapsulations/
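To make the header layout concrete, the short Python sketch below packs the 8-byte Geneve base header. It is purely illustrative (the options themselves are omitted) and follows the public header layout: a 2-bit version, a 6-bit option length expressed in 4-byte multiples, the O/C flags, a 16-bit protocol type, and a 24-bit VNI.

# Illustrative sketch of the 8-byte Geneve base header (options excluded for brevity).
# The VNI and option length values below are placeholders.
import struct

def geneve_base_header(vni, opt_len_words=0, oam=False, critical=False,
                       protocol=0x6558):           # 0x6558 = Transparent Ethernet Bridging
    ver = 0
    byte0 = (ver << 6) | (opt_len_words & 0x3F)    # version (2 bits) + option length (6 bits, 4-byte units)
    byte1 = (int(oam) << 7) | (int(critical) << 6) # O and C flags, 6 reserved bits
    vni_and_rsvd = (vni & 0xFFFFFF) << 8           # 24-bit VNI followed by 8 reserved bits
    return struct.pack("!BBHI", byte0, byte1, protocol, vni_and_rsvd)

hdr = geneve_base_header(vni=69632, opt_len_words=2)   # 2 words = 8 bytes of options would follow
print(len(hdr), hdr.hex())                              # 8 bytes of base header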

Performance Considerations
In this section we will take a look at the factors that influence performance. The performance of a
workload in a virtualized environment depends on many factors within that environment: the
hardware used, drivers, features, etc. FIGURE 8-3 shows the different areas that may have an
impact on performance.


Figure 8-3: Factors that influence performance

pNIC Feature Support


Features supported by the pNIC and the associated firmware and drivers have a significant
impact on performance. For applications leveraging the standard datapath, there are three
important features to consider:
• Geneve offload
• RSS
• Geneve Rx / Tx Filters
We will take a closer look at each of these features in the following sections.

8.3.1.1 Geneve Offload


Geneve Offload is simply TCP Segmentation Offload (TSO) tuned to understand Geneve
headers. Since Geneve-encapsulated traffic is not plain TCP on the wire, NIC cards need to be
aware of the Geneve headers to perform TSO-type functionality on the encapsulated segments.
In addition, guest VMs need to enable TSO, which is the default behavior with most modern
operating systems.


TCP Segmentation Offload (TSO): TCP Segmentation Offload is a well-established, long-standing
TCP optimization that allows large segments to pass through the TCP stack instead of the
smaller packets enforced by the MTU on the physical fabric.

8.3.1.1.1 (TSO) applicability to Geneve Overlay Traffic


In the context of Geneve, NICs are aware of the Geneve headers and perform TSO taking them
into consideration. The following FIGURE 8-4: NIC Based Geneve Offload - TSO shows how the VM
transmits 64K segments, which pass through the NSX components (such as switching, routing,
and firewall) and the ESX TCP stack as 64K segments. The NIC card takes care of chopping the
segments down to MTU-sized packets before moving them on to the physical fabric.
While TSO's primary benefit is in reducing CPU cycles, this feature also helps increase
throughput, marginally. However, in cases where the NIC does not have Geneve capability, ESX
automatically falls back to software-based Geneve offload mode. While NIC-based offload is
ideal, software-based offload still helps reduce the CPU cycles spent by the NSX components.

Figure 8-4: NIC Based Geneve Offload - TSO


8.3.1.1.2 NIC Supportability with TSO for Geneve Overlay Traffic


In cases where the NIC card does not support TSO for Geneve overlay traffic, TSO is done in
software by the hypervisor just before moving the MTU-sized packets to the NIC card. Thus,
NSX components are still able to leverage TSO.

The following FIGURE 8-5: Software/CPU Based Geneve Offload - TSO shows the process where
the hypervisor divides the larger TSO segments into MTU-sized packets.

Figure 8-5: Software/CPU Based Geneve Offload - TSO
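The MTU-sized packets mentioned above must also carry the Geneve encapsulation, which is why the underlay MTU needs headroom beyond the guest MTU. The Python sketch below is a rough, illustrative calculation for an IPv4 underlay; the option length varies per deployment, so the numbers are indicative only.

# Rough Geneve overhead estimate for an IPv4 underlay; option length varies per deployment.
OUTER_IPV4, OUTER_UDP, GENEVE_BASE, INNER_ETH = 20, 8, 8, 14

def max_inner_ip_mtu(underlay_mtu, geneve_options_bytes=0):
    # The underlay MTU excludes the outer Ethernet header; the encapsulated payload
    # carries the full inner Ethernet frame.
    overhead = OUTER_IPV4 + OUTER_UDP + GENEVE_BASE + geneve_options_bytes + INNER_ETH
    return underlay_mtu - overhead

for underlay in (1600, 1700, 9000):
    print(underlay, "->", max_inner_ip_mtu(underlay), "bytes of inner IP MTU (no options)")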

8.3.1.1.3 Confirming Geneve Offload on ESXi-Based Hypervisor


On an ESXi host with a NIC card supporting Geneve-Offload in Hardware with the appropriate
supported driver, the following commands can be used to confirm Geneve-Offload is enabled
on a pNIC – in this case pNIC vmnic3:
[Host-1] vsish -e get /net/pNics/vmnic3/properties | grep
".*Activated.*Geneve"


Device Hardware Cap Activated:: 0x793c032b ->


VMNET_CAP_SG VMNET_CAP_IP4_CSUM VMNET_CAP_HIGH_DMA
VMNET_CAP_TSO VMNET_CAP_HW_TX_VLAN VMNET_CAP_HW_RX_VLAN
VMNET_CAP_SG_SPAN_PAGES VMNET_CAP_IP6_CSUM VMNET_CAP_TSO6
VMNET_CAP_TSO256k VMNET_CAP_ENCAP VMNET_CAP_Geneve_OFFLOAD
VMNET_CAP_IP6_CSUM_EXT_HDRS VMNET_CAP_TSO6_EXT_HDRS
VMNET_CAP_SCHED
CLI 1 Check Geneve Offload Support

Look for the tag “VMNET_CAP_Geneve_OFFLOAD”, highlighted in red above. This verbiage
indicates the Geneve Offload is activated on NIC card vmnic3. If the tag is missing, then it
means Geneve Offload is not enabled because either the NIC or its driver does not support it.

8.3.1.2 Receive Side Scaling (RSS) and Rx Filters


Readers familiar with software-based VXLAN deployments with NSX-V are likely familiar with
the immense performance benefits of RSS, including improving the performance of overlay
traffic by four (4) times.

8.3.1.2.1 Benefits with RSS


RSS, another long-standing NIC-level enhancement, enables the use of multiple cores on the receive side
to process incoming traffic. Without RSS, ESX by default will use only one core to process
incoming traffic. Utilizing only one core has a huge impact on the overall throughput as the
receiving node then becomes the bottleneck. RSS on the NIC creates multiple queues to
process incoming traffic and efficiently uses a core for each queue, with most NIC cards being
able to support at least 4 queues. Hence the 4x benefit of using RSS. See FIGURE 8-6: RSS for a
visual representation of how this works.

8.3.1.2.2 RSS for overlay


While RSS itself is in fairly common use today, there are NICs which still may not
support RSS for overlay. Hence, our recommendation is to confirm with the NIC vendor
whether RSS for overlay traffic is available in hardware, and then also confirm with the VMware
Compatibility Guide for IO
(https://www.vmware.com/resources/compatibility/search.php?deviceCategory=io) whether
there is an RSS-certified driver.


Figure 8-6: RSS

8.3.1.2.3 Enabling RSS for Overlay


Every vendor has their own unique mechanism to enable RSS for overlay traffic. There are also
cases where the setting used to change RSS is different based on the driver version. Please
refer to the concerned vendor documentation for details on enabling RSS for specific NICs.

8.3.1.2.4 Checking whether RSS is enabled


Use the “vsish” command to check whether RSS is enabled. The following example shows how
to check whether RSS (marked in blue) is enabled on NIC vmnic0 (marked in red).

[Host-1] # vsish
/> get /net/pNics/vmnic0/rxqueues/info
rx queues info {
# queues supported:5
# filters supported:126


# active filters:0
Rx Queue features:features: 0x1a0 -> Dynamic RSS Dynamic
Preemptible
}
/>
CLI 2 Check RSS

8.3.1.3 Geneve Rx Filters


RSS uses the outer headers to hash flows to different queues. Using outer headers of Geneve
overlay, especially between two hypervisors, may not be optimal as the only varying parameter
is the source port.
The following image shows the fields used by RSS (circled in red) to hash flows across various
CPU cores. Since all the fields are from the outer headers, there is little variability, and in the
worst-case scenario the only variable may be the source port.

Figure 8-7: RSS: Fields Used for Hashing

To overcome this limitation, the latest NICs (see COMPATIBILITY GUIDE) support an advanced
feature, known as Rx Filters, which looks at the inner packet headers for hashing flows to
different queues on the receive side. In the following FIGURE 8-8: Rx Filters: Fields Used for
Hashing, the fields used by Rx Filters are circled in red.


Figure 8-8: Rx Filters: Fields Used for Hashing

Simply put, Rx Filters look at the inner packet headers for queuing decisions. As driven by NSX,
the queuing decision itself is based on flows and bandwidth utilization. Hence, Rx Filters
provide optimal queuing compared to RSS, which is akin to a hardware-based brute force
method.
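As a simplified illustration of why inner-header awareness matters, the toy Python model below compares a worst case where the hash sees only the outer IP pair of encapsulated traffic with an inner-header (Rx Filter-like) hash. Python's built-in hash() stands in for the real NIC hash function, so this is not representative of the actual queue selection logic.

# Toy model: worst case where only the outer IP pair of encapsulated traffic is hashed,
# versus an inner-header hash. Purely illustrative; not the real NIC hashing algorithm.
import random
from collections import Counter

QUEUES = 8
random.seed(1)

TEP_PAIR = ("10.10.10.1", "10.10.10.2")      # two hypervisors exchanging overlay traffic
inner_flows = [(f"172.16.0.{random.randint(1, 254)}",
                f"172.16.1.{random.randint(1, 254)}",
                random.randint(1024, 65535), 443) for _ in range(10000)]

outer_only = Counter(hash(TEP_PAIR) % QUEUES for _ in inner_flows)
inner_aware = Counter(hash(flow) % QUEUES for flow in inner_flows)

print("outer-IP-only hashing:", dict(outer_only))              # everything lands in one queue
print("inner-header hashing:", sorted(inner_aware.values()))   # spread across all queues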

8.3.1.3.1 Checking whether Rx Filters are enabled


Rx Filters are enabled by default on a NIC that supports Rx Filters in hardware and has a driver
to use it. Please use VMware's COMPATIBILITY GUIDE FOR IO, discussed earlier in this chapter,
to confirm whether Rx Filters are available. On the VCG for I/O page, select “Geneve-RxFilter”, and
make sure the right driver is installed on the ESXi host.
On the ESXi host, use the “vsish” command to check whether Rx Filters are enabled. The
following example shows how to check whether the NIC vmnic5 (marked in red) has Rx Filters
enabled for Geneve (marked in blue).
Check Whether Rx / Tx Filters are Enabled:
[Host-1] vsish
/> cat /net/pNics/vmnic5/rxqueues/info
rx queues info {
# queues supported:8
# filters supported:512
# active filters:0
# filters moved by load balancer:254
# of Geneve OAM filters:2


RX filter classes:Rx filter class: 0x1c -> VLAN_MAC


VXLAN Geneve GenericEncap
Rx Queue features:features: 0x82 -> Pair Dynamic
}
/>
CLI 3 Check RxFilters Support

8.3.1.3.2 RSS vs Rx Filters for Edge VM


In the case of Edge VMs, the hypervisor does not encapsulate/decapsulate the overlay packets.
Instead, the packets are sent along with the overlay headers to the Edge VM’s vNIC interfaces.
In this case, RSS is the best mechanism available today to hash packets to separate queues. See
more on RSS for Edges in the Edge sections.

8.3.1.4 NIC Feature Support


To confirm the features supported by a NIC, use VMware’s I/O compatibility guide, which is a
publicly accessible online tool and the single source of truth:

https://www.vmware.com/resources/compatibility/search.php?deviceCategory=io

The following section shows the steps to check whether a NIC, an Intel E810 in this case, supports
one of the key features, Geneve offload.
1. Access the online tool:
https://www.vmware.com/resources/compatibility/search.php?deviceCategory=io


Figure 8-9: VMware Compatibility Guide for I/O

2. Specify the following:
a. Version of ESXi
b. Vendor of the NIC card
c. Model, if available
d. Select Network as the IO Device Type
e. Select Geneve Offload and Geneve Rx Filters (more on that in the upcoming
section) in the Features box
f. Select Native
g. Click “Update and View Results”


Figure 8-10: Steps to Verify Compatibility – Part 1

3. From the results, click on the ESX version for the concerned card. In this example, ESXi
version 7.0 for Intel E810-C with QSFP ports:

Figure 8-11: Steps to Verify Compatibility - Part 2


4. Click on the [+] symbol to expand and check the features supported.

Figure 8-12: Steps to Verify Compatibility - Part 3

5. Make sure the concerned driver actually lists “Geneve-Offload” and “Geneve-RxFilters”
as supported features.

Figure 8-13: Steps to Verify Compatibility - Part 4

Follow the above procedure to ensure Geneve offload and Geneve Rx Filters are available on
any NIC card you are planning to deploy for use with NSX. As mentioned earlier, not having
Geneve offload in hardware will impact performance, as additional CPU cycles are spent
performing the equivalent work in software.
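Once a card is installed, the driver and firmware actually running on the host can be cross-checked against the VCG listing with standard esxcli commands (vmnic0 is used here purely as an example interface name):
[Host-1] esxcli network nic list
[Host-1] esxcli network nic get -n vmnic0
The second command reports the driver name, driver version, and firmware version for the selected vmnic, which should match an entry certified for Geneve-Offload and Geneve-RxFilter in the compatibility guide.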


8.3.1.5 NIC Port Speed


The port speed of the NIC, in combination with all the above features, also has a direct impact
on the achieved throughput. The following graph shows throughput achieved on an Intel®
E810C 100Gbps NIC using 8800 MTU. Thanks to its x16 lane design, this NIC’s performance is
close to line rate at 100Gbps @ 8800 MTU.

(Chart data: TCP throughput in Gbps at 8800 MTU, measured on an Intel® Xeon® Platinum 8352Y @ 2.20 GHz with an Intel® E810-C NIC, across Logical Switch, Tier-1 Router, and Tier-0 Router paths.)

Figure 8-14: Throughput with Intel® E810-C 100Gbps NIC

8.3.1.6 PCIe Gen 4


PCIe Gen 4 doubles the bandwidth compared to PCIe Gen 3, which translates into higher
throughput even when using a standard 1500 MTU. The following graph shows the throughput on a
PCIe Gen 4 platform based on a single-socket AMD EPYC™ 7F72 @ 3.2 GHz, using a single-port
PCIe Gen 4 Mellanox ConnectX-6 NIC with a standard 1500 MTU.


Figure 8-15: Throughput with AMD EPYC™ 7F72 – MTU 1500 Bytes

8.3.1.7 Oversubscription
Oversubscription is a key consideration from a performance perspective. It should be taken
into account both in the host hardware design and in the underlying physical network.

8.3.1.7.1 Host hardware level Oversubscription


Hardware-level details, such as any oversubscription built into the hardware, should be
considered. For example, a PCIe Gen 3 dual-port 40G NIC may not provide the full 80G of
bidirectional throughput if it only has 8 lanes: the theoretical maximum throughput of such a
NIC is limited to the number of PCIe lanes times the maximum throughput of each lane. Similarly,
a chassis may have additional components, such as I/O modules, that could limit the maximum
theoretical throughput based on the type and number of units installed.
Such hardware-level details must be considered, as they restrict the maximum achievable
throughput even though the host may be presented with a NIC that has faster port speeds.
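As a rough sanity check (assuming PCIe Gen 3 runs at 8 GT/s per lane with 128b/130b encoding, i.e., roughly 7.9 Gb/s of usable bandwidth per lane):

PCIe Gen 3 x8:      8 lanes x ~7.9 Gb/s ≈ ~63 Gb/s usable
Dual-port 40G NIC:  80 Gb/s of port capacity > ~63 Gb/s of PCIe capacity

This is consistent with the ~64 Gbps ceiling on x8 cards noted later in this chapter for the Intel® XL710.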

8.3.1.7.2 Network Oversubscription


Network oversubscription, going from Top-of-rack (leaf) to the spine layer and above must also
be considered. While the hosts may be presented with high speed ports, the actual throughput


could potentially be bottlenecked at the physical network layer based on the oversubscription
ratio.
This is especially true in cases where traffic flows go beyond the top-of-rack layer to the
next rack, or north out of the NSX domain through an NSX Edge. In these cases, the
oversubscription ratio should be considered to ensure throughput is not throttled.

pNIC Queues
As discussed in the section on RSS, queuing has a considerable impact on performance for all
types of workloads, and especially for VM Edge-type workloads. For a given NIC, the number of
RSS engines and the queues available for each engine may differ based on the driver and
firmware versions. For general-purpose workloads, the defaults should work fine. For workloads
like the VM Edge, the following settings help improve performance at the pNIC level.

8.3.2.1 Receive side queueing


Default queue RSS (DRSS) provides queuing that is available by default to all VMs. The following
is an example of how to set up Broadcom NICs with 8 DRSS queues:
#esxcli system module parameters set -m bnxtnet -p 'DRSS=8'
Similarly, for Mellanox NICs:
#esxcli system module parameters set -m nmlx5_core -p "DRSS=8 RSS=0"
ESX, by default, is designed to prevent a single VM from consuming all of the queuing resources
and starving other applications. It does this by moving a VM that crosses a certain threshold
(2 Kpps) out of DRSS. In the case of an Edge VM, this would result in poor performance. To avoid
that, increase the threshold at which ESX intervenes with the help of the following two
NIC-independent commands:
#esxcli system settings advanced set -o /Net/NetNetqLoadAvgPktCountShift -i 30
#esxcli system settings advanced set -o /Net/NetNetqLoadAvgByteCountShift -i 50

8.3.2.2 Transmit Side Queuing


To increase the transmit side queuing to 4 queues, use the following commands:
#esxcli system settings advanced set -i 1 -o /Net/NetSchedHClkMQ
#esxcli system settings advanced set -i 4 -o /Net/NetSchedHClkMaxHwQueue
#esxcli system settings advanced set -i 1 -o /Net/NetSchedHClkVnicMQ

The receive side is often the bottleneck, much more so than the transmit side. Hence, when
tuning, it is best to focus on a higher number of queues on the receive side rather than on the
transmit side. The following table captures a few ways to tune and includes the vCPU count for
each tuning model.


             Default   Mid-level tuning   Higher tuning
Rx Queue     4         8                  12
Tx Queue     2         4                  4
# of vCPUs   6         12                 16

Number of pNICs
Adding multiple NICs to an ESXi host helps scale up the network performance of that host.
Multiple NICs help by overcoming the port-speed limitation of a single NIC.
In the case of normal data center workloads that can leverage offloads, multiple pNICs help
overcome pNIC hardware level speed limitation. As noted in the RSS section, each queue is
serviced by a core or a thread. While these cores are not dedicated, meaning they are available
for other activities based on need, at peak network utilization consider having enough cores to
service all the used queues.
The following example shows the benefit of having two pNICs compared to a single pNIC, using a
single TEP vs. dual TEPs. Dual TEPs help achieve twice the throughput of a single TEP. The
following image compares throughput with a single TEP vs. dual TEPs using Intel®
XL710 NICs on servers running an Intel® Xeon® Gold 6252 CPU @ 2.10GHz.


Figure 8-14: Throughput with Single TEP vs Dual TEP with Different MTUs

Note: Intel® XL710s are PCIe Gen 3 x8 lane NICs. On x8 lane NICs, the max throughput is limited
to 64Gbps. To achieve the above near-line rate of 80Gbps, 2 x Intel® XL710s must be used.

vNIC Queues
Once the lower-level details are tuned for optimal performance, the next consideration moves
closer to the workload. At this point the question is whether the workload is able to consume
packets at the rate at which ESXi is tuned to deliver them. Note that this is generally not a
concern when a single ESXi host is hosting many VMs; optimally, the VM count should be at least
twice the total number of queues provided by the pNICs. That is, if the total queue count across
all pNICs is 4, that ESXi host should have at least 8 VMs. In this typical data center use case,
while each individual VM may not be able to consume packets at the ESXi host’s packet processing
rate, many VMs together may reach that balance.
Queuing at the vNIC only becomes relevant for instances where the ESXi host may be running 1 or 2
VMs and the expectation for the workload is pure packet processing, as in the case of the Edge VM.
In fact, enabling multiple queues is the recommended and default option for Edge VMs.
Similar to the RSS and queuing configuration at the pNIC level, the vNIC also allows configuration
of multiple queues instead of the default single queue. The Edge VM section has more details on
how to set this up.


While vNIC queues can be set for any VM, the workload running on the VM must be able to
leverage the multiple queues in order to see a benefit. The NSX VM Edge is designed to
take advantage of multiple queues and should be configured this way for optimal performance.
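As an illustration only (assuming a Linux guest where the VMXNET3 vNIC appears as ens192; the interface name will differ per deployment), the queues exposed to the guest can be inspected and adjusted from inside the VM with ethtool:
# Show the Rx/Tx queue channels currently exposed by the vNIC
ethtool -l ens192
# Request 4 combined channels (the vNIC, driver, and guest kernel must support multiqueue)
ethtool -L ens192 combined 4
Whether the guest actually benefits still depends on the application being able to spread its traffic across those queues.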

vCPU Count
As discussed in the pNIC queues and vNIC queues topics, each queue needs a core/thread to
service it. However, these cores are not dedicated to just servicing the queues but are engaged
on demand when needed. That is, they may be used for workloads when there is no network
traffic. Hence, a system should not only be tuned for max performance via multiple queues,
but care should be taken that the system actually has access to the required number of cores to
service those queues.
While the pNIC queues depend on the system cores (that is, physical cores) for packet
processing, vNIC queues depend on the vCPU cores. This vCPU count should be considered
when allocating vCPUs for the workload. However, as called out in the vNIC queuing section,
this is only true for pure packet-processing-focused workloads such as the NSX Edge VM. For most
common data center TCP workloads leveraging offloads such as Geneve offload, this is not a
concern.
The following table lists the vCPU count for the different VM Edge form factors:

Edge Size     Data Path   Services   # of vCPUs
Small         1           1          2
Medium        2           2          4
Large         4           4          8
Extra Large   8           8          16

Jumbo MTU for Higher Throughput


Maximum Transmission Unit (MTU) denotes the maximum packet size that can be sent on the
physical fabric. When setting this configuration on the ESX hosts and the physical fabric,
Geneve header size has to be taken into consideration. Our general recommendation is to
allow for at least 200 bytes buffer for Geneve headers in order to accommodate the option field
for use cases such as service-insertion. As an example, if the VM’s MTU is set to 1500 bytes,


the pNIC and the physical fabric should be set to 1700 or more. See Figure 8-15 for an example
MTU configuration.

Figure 8-15: MTU Configuration

Why should you care about MTU values? MTU is a key factor for driving high throughput, and
this is true for any NIC that supports a 9K MTU, also known as jumbo frames. The following
graph (Figure 8-16) shows the throughput achieved with the MTU set to 1500 and to 8800:


Figure 8-16: MTU and Throughput

Our recommendation for optimal throughput is to set the underlying fabric and ESX host’s
pNICs to 9000 and the VM vNIC MTU to 8800.
Notes for Figure 8-16:
• The above graph represents a single pair of VMs running iPerf with 4 sessions.
• For both VM MTU cases of 1500 and 8800, the MTU on the host was 9000, with
demonstrated performance improvements.

Checking MTU on an ESXi host


Use the esxcfg-nics command to check the MTU:
[Host-1] esxcfg-nics -l | grep vmnic5
vmnic5 0000:84:00.0 i40en Up 40000Mbps Full
3c:fd:fe:9d:2b:d0 9000 Intel Corporation Ethernet
Controller XL710 for 40GbE QSFP+
CLI 4 Check MTU on ESXi Host


For the VM, use the commands specific to the operating system for checking MTU. For
example, “ifconfig” is one of the commands in Linux to check MTU.
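On a modern Linux guest, the iproute2 tools can be used instead (eth0 is used here purely as an example interface name):
# Check the current MTU of the vNIC
ip link show eth0
# Set the vNIC MTU to 8800 (non-persistent; use the distribution's network configuration for a permanent change)
ip link set dev eth0 mtu 8800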

Performance Factors for NSX Edges

Data Plane Development Kit (DPDK)


Data Plane Development Kit (DPDK) is a set of libraries and drivers to enable fast packet
processing on a wide variety of CPU architectures including x86 platforms. DPDK is applicable
for both the VM and bare metal Edge form factors, with the VM edge capable of delivering over
20Gbps throughput for standard TCP based DC workloads. Bare metal edge delivers over
35Gbps throughput for the same TCP-based DC workloads. The bare metal Edge form factor
also excels at processing small packet sizes (~78 bytes) at near line rate on a 40Gbps port such as
the Intel® XL710, which is useful for NFV-style workloads. The following graph (Figure 8-17)
shows the performance of the bare metal Edge with a standard RFC 2544 test with IXIA.

Figure 8-17: Bare Metal Edge Performance Test (RFC2544) with IXIA


Note: For the above test, the overlay lines in blue are calculated by adding throughput
reported by IXIA and the Geneve Overlay header size.

SSL Offload
VMware NSX bare metal Edges also support SSL offload. This configuration reduces
the CPU cycles spent on SSL processing and has a direct impact on the throughput achieved.
In the following image, Intel® QAT 8960s are used to showcase the throughput achieved with
SSL offload.

Figure 8-18: Throughput with SSL Offload

Key Performance Factors


One key to achieving better performance for compute, Edge VM, and bare metal Edge is to
ensure the right set of capabilities is in place. While DPDK plays a key role, it is not the only
factor affecting performance. The combination of DPDK with NIC driver capabilities and
hypervisor enhancements helps achieve optimal performance.


Compute Node Performance Factors


For compute clusters, our recommendation is to ensure the following two features are
available on your NIC of choice:
• Geneve Offload
• Geneve Rx Filters
Geneve Offload helps decrease CPU usage by offloading the task of dividing TSO segments
into MTU-sized packets to the NIC card, while also helping to increase throughput.
Geneve Rx Filters help increase the number of cores used to process incoming traffic, which
in turn increases performance by up to 4x, based on the number of hardware queues
available on the NIC. For older cards that do not support Geneve Rx Filters, check whether
they at minimum have RSS capability. The following graph shows the throughput achieved with
and without Geneve Rx Filters: 10Gbps vs. near line rate.

Figure 8-19: Throughput Improvement with Rx Filters

Note for Figure 8-19: This test was run with LRO enabled, which is software-supported starting with
ESX version 6.5 and higher on the latest NICs that support Geneve Rx Filters. Thus, along
with Rx Filters, LRO contributes to the higher throughput depicted here.

VM Edge Node Performance Factors


In the case of edge clusters, DPDK is the key factor for performance. In the case of VM Edge,
RSS-enabled NICs are best for optimal throughput.

RSS at pNIC
To achieve the best throughput, use an RSS-enabled NIC on the host running the Edge
VM, and ensure an appropriate driver that supports RSS is also being used. Use the VMware
Compatibility Guide for I/O to confirm driver support.
RSS on VM Edge
For best results, enable RSS for Edge VMs. The following is the process to enable RSS on Edge VMs:
1. Shut down the Edge VM
2. Find the “.vmx” file associated with the Edge VM (https://kb.vmware.com/s/article/1714)
3. Change the following two parameters, for each Ethernet device in use, inside the .vmx
file (see the example .vmx snippet after this list)
a. ethernet3.ctxPerDev = "3"
b. ethernet3.pnicFeatures = "4"
4. Start the Edge VM
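For illustration, assuming the Edge VM’s datapath interfaces are ethernet1 through ethernet3 (the actual interface numbering varies per deployment), the resulting .vmx entries would look like this:
ethernet1.ctxPerDev = "3"
ethernet1.pnicFeatures = "4"
ethernet2.ctxPerDev = "3"
ethernet2.pnicFeatures = "4"
ethernet3.ctxPerDev = "3"
ethernet3.pnicFeatures = "4"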
Alternatively, use the vSphere Client to make the changes:
1. Right click on the appropriate Edge VM and click “Edit Settings”:


Figure 8-20: Change VM RSS Settings via vSphere Client – Part 1


2. Under “VM Options”, expand “Advanced” and click on “Edit Configuration”:

Figure 8-21: Change VM RSS Settings via vSphere Client – Part 2

3. Add the two configuration parameters by clicking on “Add Configuration” for each item
to add:


Figure 8-22: Change VM RSS Settings via vSphere Client – Part 3

The following graph (Figure 8-23) compares the throughput of a NIC that supports RSS with one
that does not. Note: in both cases, even where the pNIC doesn’t support RSS, RSS was enabled on
the VM Edge:


Figure 8-23: RSS for VM Edge

With an RSS-enabled NIC, a single Edge VM may be tuned to drive over 20Gbps throughput. As
the above graph shows, RSS may not be required for 10Gbps NICs as they can achieve close to
~15 Gbps throughput even without enabling RSS.

8.6.2.1 pNIC Count


When deploying VM Edges, if performance is a concern, the recommendation is to use dedicated
Edge hosts. Given that most modern systems are built with dual sockets and multi-core
processors, running a single Edge VM on such a server may leave a lot of CPU power unutilized,
which raises the question of running multiple Edge VMs on each hypervisor.
To run multiple Edge VMs on the same hypervisor, we must consider the available CPU count
and also the count of dedicated pNICs available for the VM Edges. From a performance
perspective, the recommendation is to dedicate 2 pNICs to each VM Edge. Most modern systems
have many cores to spare, and the entire host may not be utilized optimally, based on the
number of available cores.


(Figure content: a single X-Large Edge VM, with its three vNICs mapped to the TEP, TOR-left, and TOR-right uplinks, running on a 2-pNIC ESXi host; with medium-level tuning, the two pNICs together consume 20 vCPUs for queuing (2 pNICs x (8 Rx + 2 Tx)). In general, when adding Edge VMs on an Edge host, core availability is not a concern; pNICs are.)

Figure 8-24: Medium level tuning 2 pNICs host

In Figure 8-24, based on medium-level tuning of about 8 Rx queues and 2 Tx queues, the vCPU
count is 20 for both pNICs together. While the above image shows only one Edge VM, in
reality this configuration has been leveraged to run many Edge VMs with optimal
performance. However, given that modern servers have a lot of cores to spare, a better way to
utilize the available cores is to add two additional pNICs.
Figure 8-25 shows how multiple pNICs can help utilize the available cores better and help
reduce the ESXi server host footprint by increasing the Edge VM density per host.

(Figure content: two X-Large Edge VMs on a 4-pNIC ESXi host, each Edge VM served by its own pair of pNICs and consuming 20 vCPUs for queuing (2 pNICs x (8 Rx + 2 Tx)). Based on usage patterns, additional lighter-load Edge VMs may also be deployed to these Edge hosts.)

Figure 8-25: Medium level tuning 4 pNICs host


Note: For workloads that are heavy on packet processing, it is better to dedicate a pair
of pNICs to a single Edge VM.
The following table provides various configurations that can be leveraged when deploying Edge
VMs, based on the server vCPU count.

Large Edge   X-Large Edge   Total Edge    # of     # Rx Queues   # Tx Queues   Total pNIC   Total vCPU
VM Count     VM Count       vCPU Count    pNICs    / pNIC        / pNIC        vCPU         Count

0            4              64            2        12            4             32           96
0            4              64            4        8             2             40           104
2            2              48            4        8             2             40           88
4            0              32            4        8             2             40           72
4            0              32            2        12            4             32           64

In summary, increasing the VM Edge count along with increasing the dedicated pNIC count
could help leverage all the cores on a dedicated Edge host.

Bare Metal Edge Node Performance Factors


Bare metal Edge has specific requirements dependent on the type of NIC used. Please refer to
the NSX Installation Guide for details on the currently supported cards.
While both VM and bare metal Edges leverage the Data Plane Development Kit (DPDK), they differ in
deployment. The VM Edge has no restrictions on the physical NIC that can be used, because
VMXNET3 provides DPDK functionality for the VM Edge; the bare metal Edge does have strict
restrictions on the NICs that may be used. (Note, however, that bare metal Edge nodes do
support Intel® QATs for SSL offload.) Please refer to the relevant hardware requirements
section of the NSX Installation Guide for specific details on compatibility. The features required
on the NIC for a bare metal Edge are enabled by default, before the data path is enabled.
For example, RSS is enabled by default on bare metal Edges and does not require any
configuration from the user. To check the RSS status on a bare metal Edge, the following CLI may
be used:
get dataplane | find rss

8.6.3.1 Bare Metal Edge HW Recommendations


Bare metal Edge has specific requirements from a hardware perspective. These requirements
depend on the version of NSX that will be deployed. For example, the requirements for NSX 4.0
are documented at:
https://docs.vmware.com/en/VMware-NSX/4.0/installation/GUID-14C3F618-AB8D-427E-AC88-F05D1A04DE40.html
For a successful deployment, care must be taken to make sure the hardware designated for bare
metal Edge use matches these requirements.
This section focuses on recommendations to consider from a performance point of view.

8.6.3.1.1 NIC
The Network Interface Card (NIC) is one of the most critical components to consider, from a
performance perspective, for use with bare metal or virtual form factor Edges. While high-level
specs such as the port speed of the NICs may be similar, the actual hardware design must
be considered.
For example:
• Any single-port 100G NIC will need either PCIe Gen 3 x16 or PCIe Gen 4 x8 to fully
leverage the 100G speed.
• Any dual-port NIC with 40G or faster ports will need either PCIe Gen 3 x16 or PCIe Gen 4 x8
to fully leverage up to 80G of aggregate speed.
• Any quad-port NIC with 25G or faster ports will need either PCIe Gen 3 x16 or PCIe Gen 4 x8
to fully leverage up to 100G of aggregate speed.
As noted above, on NICs with only PCIe Gen 3 x8 lanes, the PCIe lanes could potentially become
the bottleneck (see the rough lane-bandwidth arithmetic after the recommendation list below).
The bare metal Edge is able to fully leverage any port speed, at line rate, as long as the
underlying hardware-level specs of the NIC card and the PCIe lanes are sufficient to support the
actual port speed. Considering that, it is not necessary, from a performance perspective, to use
multiple ports / pNICs of lower port speed, such as 4 x 25Gbps. The recommendation is to instead
leverage pNICs with higher port speed, such as 2 x 100Gbps.
Adding it all up, in general, it’s better to err on the side of caution, and choose:
1. x16 PCIe lane NICs along with
2. PCIe Gen 4, if the system supports it and
3. 2 x 100G ports instead of 4 x 25G ports
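As a rough guide (assuming ~7.9 Gb/s of usable bandwidth per PCIe Gen 3 lane and ~15.8 Gb/s per PCIe Gen 4 lane, i.e., 8 GT/s and 16 GT/s with 128b/130b encoding):

PCIe Gen 3 x16: 16 x ~7.9 Gb/s  ≈ ~126 Gb/s usable  -> enough for a 100G port
PCIe Gen 4 x8:   8 x ~15.8 Gb/s ≈ ~126 Gb/s usable  -> enough for a 100G port
PCIe Gen 3 x8:   8 x ~7.9 Gb/s  ≈ ~63 Gb/s usable   -> a bottleneck for 2 x 40G or 1 x 100G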
In the case of Cisco VICs and similar architectures, the type of fabric extenders and any
other hardware in the data path must be considered. If the fabric extenders, or any
other hardware component between the Edge and the ToR switch, is oversubscribed, it may
not be possible to leverage the full bandwidth capacity of the VIC.

8.6.3.1.2 Single Socket vs Dual Socket


On the CPU front, apart from making sure that the CPU is in the supported list, consider single
vs dual socket architecture. Intel’s single socket architecture is basically a single NUMA node.


With a single NUMA node, traffic is processed within that NUMA node, which avoids any
inter-NUMA communication bottlenecks. In a multi-NUMA architecture, such as dual-socket
Intel systems, traffic may end up crossing NUMA boundaries, as DPDK leverages only NUMA 0
for all packet processing. Traffic is forced to cross the NUMA boundary if services run on a
different NUMA node or if the pNIC is attached to a different NUMA node.
In these cases, where traffic needs to cross the NUMA boundary, the inter-NUMA
interconnect, such as Intel’s Ultra Path Interconnect (UPI) or the older QuickPath Interconnect
(QPI), could potentially become the bottleneck and may also add latency. The theoretical and
practical bandwidth capabilities of the interconnect should be considered. This is especially true
when leveraging multiple 40G / 100G NICs.
For workloads that are heavily focused on packet processing, such as telco workloads, a
single-socket architecture helps achieve optimal performance.
Note: Bare Metal edge only supports a defined max number of cores (the number depends on
the version, check the official documentation). Hence, total core count must be considered,
especially when going with dual socket platforms.

8.6.3.1.3 Additional resources


We have a dedicated paper on bare metal Edge performance considerations and test results:
https://nsx.techzone.vmware.com/resource/vmware-nsx-bare-metal-edge-performance#introduction

Summary of Performance – NSX Components to NIC Features


The following table (Table 8-1) provides an overall view of and guidance on NSX components and
NIC features per given use case. For common datacenter applications and deployments, the
standard N-VDS is the recommendation. Enhanced Data Path is only meant for NFV-style workloads
with fast packet processing requirements.

Compute Transport Nodes (Standard)
  Features that matter:
  • Geneve-Offload: to save on CPU cycles
  • Geneve-RxFilters: to increase throughput by using more cores and software-based LRO
  • RSS (if Geneve-RxFilters is not available): to increase throughput by using more cores
  Benefits:
  • High throughput for typical TCP-based DC workloads

Compute Transport Nodes (“Enhanced Data Path”)
  Features that matter:
  • Enhanced Data Path: for DPDK-like capabilities
  Benefits:
  • Maximum PPS for NFV-style workloads

ESXi nodes with VM Edges
  Features that matter:
  • RSS: to leverage multiple cores
  Benefits:
  • Maximum throughput for typical TCP-based workloads with the Edge VM
  • VM tuning + NIC with RSS support (add/edit two parameters in the Edge VM’s .vmx file and restart)

Bare Metal Edges
  Features that matter:
  • DPDK: poll mode driver with memory-related enhancements to help maximize packet processing speed
  • QATs: for high encrypt/decrypt performance with SSL offload
  Benefits:
  • Maximum PPS
  • Maximum throughput even for small packet sizes
  • Maximum encrypt/decrypt performance with SSL offload
  • Low latency
  • Maximum scale
  • Fast failover
Table 8-1 Summary of Performance Improvements with Various Capabilities

Results
The following section takes a look at the results achievable under various scenarios with
hardware designed for performance. First, here are the test bed specs and methodology:

Compute:
  • CPU: Intel® Xeon® Gold 6252 CPU @ 2.10GHz
  • RAM: 192 GB
  • Hyper Threading: Enabled
  • MTU: 1700
  • NIC: XL710
  • NIC Driver: i40e - 1.3.1-18vmw.670.0.0.8169922
  • ESXi 6.7

Virtual Machine:
  • vCPU: 2
  • RAM: 2 GB
  • Network: VMXNET3
  • MTU: 1500

Edge Bare Metal:
  • CPU: Intel® Xeon® E5-2637 v4 @ 3.5GHz
  • RAM: 256 GB
  • Hyper Threading: Enabled
  • MTU: 1700
  • NIC: XL710
  • NIC Driver: In-Box

Test Tools:
  • iPerf 2 (2.0.5)
  • 4 – 12 VM pairs
  • 4 threads per VM pair
  • 30 seconds per test
  • Average over three iterations
Table 8-2 Specific Configuration of Performance Results

NSX Components
• Segments
• T1 Gateways
• T0 Gateways
• Distributed Firewall: Enabled with default rules
• NAT: 12 rules – one per VM
• Bridging
a. Six Bridge Backed Segments
b. Six Bridge Backed Segments + Routing

The following graph shows that in every scenario above, NSX throughput performance stays
consistently close to line rate on an Intel® XL710 40Gbps NIC.


Figure 8-26: Throughput Summary for NSX Based Datacenter

Benchmarking Tools

Compute
On the compute side, our recommendation for testing the software components is to use a
benchmarking tool close to the application layer. Application-layer benchmarking tools take
advantage of many features and show the true performance characteristics of the system. While
application benchmarking tools are ideal, they may not be very easy to set up and run. In such
cases, iPerf is a great tool to quickly set up and check throughput. Netperf is another tool that
helps check both throughput and latency.
Here is a GitHub resource with an example script to run iPerf on multiple VMs simultaneously and
summarize the results: https://github.com/vmware-samples/nsx-performance-testing-scripts
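As a minimal illustration of an iPerf 2 run (192.0.2.10 is a placeholder address for the receiving VM; adjust the stream count and duration to match the methodology described in the Results section):
# On the receiving VM
iperf -s
# On the sending VM: 4 parallel streams for 30 seconds, reported in Gbits/sec
iperf -c 192.0.2.10 -P 4 -t 30 -f g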

Edges

VM Edge
As VM Edges are designed for typical DC workloads, application layer tools are best for testing
VM Edge performance.

Bare Metal Edge


With the bare metal Edges, either application-layer benchmarking tools or typical network
benchmarking and packet generation tools, such as Keysight PathWave (formerly IXIA) or Spirent
Network Emulator, may be used. One of the challenges with using a hardware benchmarking tool is
finding one that includes provision for Geneve encap/decap. The following is a topology with two
bare metal Edges, each within its own cluster. A segment, aptly called a crosslink segment,
connects them both over overlay. PathWave sends and receives only non-overlay packets; however,
the segment between the two Edge clusters forces the packets to use overlay. See the image below
for details.

Figure 8-27: Example Topology to Use Geneve Overlay with Hardware IXIA or Spirent

An alternative approach is to use a software tool such as Pktgen
(https://github.com/pktgen/Pktgen-DPDK).

Conclusion
To drive enhanced performance, NSX uses a number of features supported in hardware.
On the compute side, these are:
1. Geneve Offload, for CPU cycle reduction and marginal performance benefits
2. Geneve Rx Filters, an intelligent queuing mechanism to multiply throughput
3. RSS, an older hardware-based queuing mechanism – an alternative if Rx Filters are missing
4. Jumbo MTU, an age-old trick to enable high throughput on NICs lacking the above features
5. NIC port speed
6. Number of NICs – single vs. dual TEP
7. PCIe Gen 3 with x16 lane NICs, or multiple PCIe Gen 3 x8 NICs
8. PCIe Gen 4


For compute workloads focused on high packet processing rate for primarily small packet sizes,
Enhanced Data Path Enabled NICs provide a performance boost.

For the Edges, in the case of VM Edges:

1. Use NIC cards that support RSS, and

2. Enable RSS on both the pNIC and the Edge VM NIC (vNIC)

For the bare metal Edges, leveraging SSL offload hardware such as Intel® QAT 8960s and deploying
supported hardware from the VMware NSX Installation Guide will result in performance gains.


9 DPU-based Acceleration for NSX


The networking, security, and storage requirements of modern applications keep evolving,
and datacenters are transitioning from 10 Gbps to 25 Gbps, 100 Gbps, and beyond. The
transition to higher speeds requires significant computing resources, both for networking
traffic and to support existing and next-generation security technologies like IDS/IPS and
encryption, which are fundamental to a zero-trust posture.
DPU-based acceleration for NSX offloads and accelerates NSX networking, security, and
observability services from host CPUs to DPUs. The DPUs’ hardware accelerators, along with
programmable cores, allow DPUs/SmartNICs to deliver low-latency and high-performance
networking and security services with little to no host CPU utilization.

DPU Architecture
DPU stands for Data Processing Unit, a term used interchangeably for SmartNICs. A DPU is a
PCIe card that plugs into a server and goes beyond simple connectivity. It implements the
network traffic processing that the server previously performed. Every DPU has its own
onboard ARM processor as well as a programmable hardware accelerator that assists with
networking and security functions. These server CPU cores, previously consumed for
infrastructure support, are now freed up and can be repurposed to improve workload density
and enhance application performance.
DPUs have dual high-speed Ethernet ports with bandwidth ranging from 25 to 100 Gbps, plus
a dedicated 1 Gbps Ethernet port for management access. However, in the first phase of
DPU-based acceleration for NSX, DPUs are not independently managed; the lifecycle of the
DPU is tied to the lifecycle of the hypervisor.

Figure 9-1: DPU Architecture


Server requirements to enable offload to DPU


A commonly asked question is whether existing servers can be retrofitted with
SmartNICs/DPUs. The short answer to this question is currently no. Existing servers cannot have
their NICs simply replaced by SmartNICs/DPUs. OEMs like DELL, HPE, and Lenovo have made
architectural enhancements to enable communication between the hypervisor and the DPU,
and only these supported platforms can be used to offload vSphere and NSX services to the
DPU. A full list of up-to-date supported server platforms with compatible SmartNIC vendor
options can be found in the VMware Compatibility Guide (System Search): under the
“Features” field, select “Distributed Services Engine”.

Offloading NSX to the DPU


When ESXi is installed on a host backed by a DPU, a lightweight component of ESXi is
installed on the DPU as well. This enables the creation of a distributed virtual switch
that is offloaded to the DPU. When NSX is installed on the offloaded VDS, the NSX VIBs are also
programmed on the DPU. From this point on, all the networking and security functionality
supported by NSX is hosted on the DPU. However, the UI and API consumption of NSX
Manager remains unchanged with the introduction of DPUs.
It is important to note that powerful CPU cores on the server are NOT just being replaced with
ARM cores on the DPU. The reliance on ARM cores to support infrastructure services is reduced
by leveraging the evolving capabilities of the programmable hardware accelerator on the DPU.
Such capabilities are delivered through firmware updates by the DPU vendors and do not
require a hardware refresh. More networking and security decisions are programmed and
enabled by hardware, thereby accelerating performance while simultaneously reducing the
consumption of CPU cores on the server platforms.
Additionally, there is complete isolation of the networking and security stack of the compute
platform from that of the application. Security policies are enforced at the point of entry of the
traffic, i.e., at the NIC itself, thereby considerably improving the security posture.
To summarize, the most important takeaways about the DPU-based Acceleration for NSX are:
1. Reduced server CPU utilization for infrastructure services
2. Improved network and security performance
3. Network and Security stack isolation (Air-gap)

Evolution of the NSX Datapath


To fully understand the value of offloading NSX to the DPU, it is important to have a clear
understanding of the Datapath models that exist today.


Standard Datapath
The Standard Datapath is the default datapath mode that is enabled for vSphere Networking as
well as NSX today. In this interrupt-driven Datapath model, when a packet arrives on the wire, it
triggers a CPU interrupt. The CPU suspends whatever it is currently processing, saves the
current CPU context, and switches to processing the new packet detected on the wire. At very
high packet rates, this interrupt-driven and context switch model can lead to very high and
unpredictable latency, which makes it less favorable for applications requiring high network
performance.

Figure 9-2: Standard VDS

High-performance Datapath Models

For applications in need of high throughput and low latency, the Datapath models of choice are
as follows.

9.2.2.1 Single Root I/O Virtualization (SR-IOV)


SR-IOV is a hardware standard that allows a PCI Express device – typically a network interface
card (NIC) – to present itself as several virtual NICs to a hypervisor. The SR-IOV standard defines
the master NIC as the Physical Function (PF) and its VM-facing virtual NICs as the Virtual
Functions (VFs). The VMs map one-to-one with the Virtual Functions. Therefore there is a


complete bypass of the x86 hypervisor, which further ensures low latency and the ability to
process higher packets per second (PPS). However, the applications hosted on the VMs need to
be compatible with the NIC Vendor Poll Mode Driver (PMD) and, therefore, be aware of the
underlying hardware. Due to the software bypass and hardware dependency, none of the high
availability (HA) features like vMotion and DRS, or the networking and security functions, is
available with this Datapath model.
Unfortunately, the low latency and high PPS benefits are outweighed by the inability to
leverage the HA, Network, and Security features, and this Datapath model is minimally
adopted.

Figure 9-3: SR-IOV

9.2.2.2 Enhanced Datapath - Performance (EDP-Performance)


The Enhanced Datapath Performance Mode (EDP) is available only with NSX. A predetermined
number of CPU cores on the server (x86) are reserved for packet processing. When a packet
arrives on the wire, it is copied to a shared memory location. These shared memory locations
are constantly polled, and when a packet is detected, the first lookup is performed in the Fast
Path Flow Cache. If there is an entry in the Fast Path that is a match for the current flow, the
packet is either dropped or forwarded to the corresponding destination VM based on
Networking and Security configuration downloaded from the NSX Manager. If there is a cache
miss in the Fast Path Flow Cache, the flow is evaluated in the Slow Path, and a decision is made


on whether to drop/forward the existing packet in addition to programming the Fast Path for
faster processing of subsequent flows. The applications running on the workloads need to be
compatible with the para-virtualized VMXNET3 Poll Mode Driver and therefore do not need to
be aware of the underlying hardware. All the HA features, like vMotion and DRS, and
Networking and Security features, are available in this Datapath Mode.
However, some CPU cores are reserved at the time of host preparation solely for the purpose
of packet processing. Irrespective of the traffic conditions (lower or higher packet rates), these
cores are completely reserved for packet processing and cannot be utilized by the workloads.
Because they poll continuously, CPU cores reserved for EDP-Performance always show 100%
utilization; the useful packet-processing work varies with traffic conditions, but the cores remain
unavailable for other functions even in low-traffic conditions.

Figure 9-4: EDP Performance

9.2.2.3 Enhanced Datapath -Standard (EDP-Standard)


The Enhanced Datapath Standard Mode is also available with NSX. Unlike its counterpart, EDP-
Performance, this datapath mode is interrupt-driven. That means no cores need to be allocated
for packet processing at the time of preparing NSX on the transport node; instead, cores are
consumed on demand, based on traffic conditions and packet rates. This datapath mode benefits
from the enhanced performance provided by the software enhancements leveraging the
Fast Path Flow Cache.


For NSX Offload to the DPUs, Enhanced Datapath is a requirement (either Performance or
Standard). The recommendation is to go with EDP-Standard. This is because with the
introduction of DPUs we rely less on the software slow/fast path, and offload decision making
to the HW acceleration when possible. EDP-Performance would reserve CPU resources from
the DPU/SmartNIC and could lead to underutilization.

Figure 9-5: EDP-Standard

VM-modes available with the introduction of DPUs


Once the installation of ESXi on a transport node/hypervisor backed by a DPU is
complete, an offloaded VDS is created. After this step, when NSX is installed on the VDS, the
datapath mode used for the NSX preparation of DPU-backed hypervisors must be Enhanced
Datapath (either Enhanced Datapath-Standard or Enhanced Datapath-Performance). From this
point on, the entire networking and security stack is hosted on the DPU and consumes computing
resources from the DPU (both ARM cores and the programmable accelerator, as
supported/required).
The APIs for creating networking artifacts like T1 Gateways, Segments, and security policies
remain unchanged with DPU-based acceleration for NSX.


When a workload is attached to a segment associated with an offloaded VDS (an offloaded VDS
is defined as a distributed virtual switch with uplinks that are mapped to the ports of a DPU),
every individual vNIC can be deployed in one of two modes.

9.2.3.1 UPTv2 Mode (Uniform PassThrough)


This VM mode marries the advantages of both the SRIOV Datapath Mode (low latency and x86
bypass) and the Enhanced Datapath (availability of HA and Networking and Security
functionality). By hosting the Enhanced Networking Stack (ENS) on the DPU and leveraging the
ARM cores and programmable accelerator, the CPU cores on the server platform previously
consumed by ENS are now freed up.
Additionally, DPU physical ports are able to present themselves as several virtual NICs/virtual
functions to a hypervisor so that the UPTv2 VMs can map to them one-to-one. In this scenario,
the x86 is completely bypassed, but software functions for HA and Networking/Security are still
made available by leveraging the compute on the DPU.
The number of UPTv2 VMs on a hypervisor is limited to the number of VFs the DPU ports
support. They also require a complete upfront reservation of Guest Memory. Only VMs hosting
applications requiring low network latency should be configured in this mode.
Additionally, the vNICs in UPTv2 mode need version 7 of the VMXNET3 paravirtualized driver.
This is available by default with Linux kernel version 6 and is also included in some Linux-based
distributions outlined in this KB article:
https://kb.vmware.com/s/article/90548
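To verify which VMXNET3 driver a guest is actually running (assuming a Linux guest where the vNIC appears as ens192; the interface name is illustrative), the driver name and version can be checked from inside the VM:
# Show the driver bound to the vNIC and its version string
ethtool -i ens192
Confirm against the KB article above whether the reported vmxnet3 driver version supports UPTv2.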

9.2.3.2 Emulated Mode


This VM mode continues to benefit from offloading NSX to the DPU and leveraging the hardware
accelerator, while overcoming the limitations of UPTv2, where the number of UPTv2-mode VMs on
a hypervisor is limited by the number of virtual functions the DPU can expose. In Emulated/MUX
mode, the VMs are backward compatible (they do not require the VMXNET3 v7 driver) and do not
require a complete guest memory reservation. However, a small CPU overhead is incurred by the
multiplexer switch that is implemented in order to increase the number of VMs that can be hosted
in emulated mode. This CPU footprint is, however, smaller than the CPU that would be consumed
if NSX were running on the hypervisor.


Figure 9-6: UPTv2 vs Emulated Mode
