Troubleshooting Guide
Modified March 2019
VMware NSX-T Data Center 2.3
NSX-T Data Center Troubleshooting Guide
You can find the most up-to-date technical documentation on the VMware website at:
https://docs.vmware.com/
If you have comments about this documentation, submit your feedback to
[email protected]
VMware, Inc.
3401 Hillview Ave.
Palo Alto, CA 94304
www.vmware.com
Copyright © 2017 – 2019 VMware, Inc. All rights reserved. Copyright and trademark information.
Contents
1 Logs and Services
2 Troubleshooting Layer 2 Connectivity
3 Troubleshooting Installation
4 Troubleshooting Routing
5 Troubleshooting the Central Control Plane
6 Troubleshooting Firewall
Determining Firewall Rules that Apply on an ESXi Host
Determining Firewall Rules that Apply on a KVM Host
7 Other Troubleshooting Scenarios
The NSX-T Data Center Troubleshooting Guide provides information on how to troubleshoot issues that
might occur in an NSX-T Data Center environment.
Intended Audience
This guide is for system administrators of NSX-T Data Center. Familiarity with virtualization, networking,
and data center operations is assumed.
Logs and Services 1
Logs can be helpful in many troubleshooting scenarios. Checking the status of services is also important.
n Checking Services
Log Messages
Log messages from all NSX-T Data Center components, including those running on ESXi hosts, conform
to the syslog format as specified in RFC 5424. Log messages from KVM hosts are in the RFC 3164
format. The log files are in the directory /var/log.
On NSX-T Data Center appliances, you can run the following NSX-T Data Center CLI command to view
the logs:
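For example, a command typically available for this purpose (shown as an illustrative sketch; the set of log file names depends on the node type) is:
nsx> get log-file syslog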
On hypervisors, you can use Linux commands such as tac, tail, grep, and more to view the logs. You
can also use these commands on NSX-T Data Center appliances.
For more information about RFC 5424, see https://tools.ietf.org/html/rfc5424. For more information about
RFC 3164, see https://tools.ietf.org/html/rfc3164.
<facility * 8 + severity> version UTC-TZ hostname APP-NAME procid MSGID [structured-data] msg
Every message has the component (comp) and sub-component (subcomp) information to help identify the
source of the message.
NSX-T Data Center produces regular logs (facility local6, which has a numerical value of 22) and audit
logs (facility local7, which has a numerical value of 23). All API calls trigger an audit log.
An audit log that is associated with an API call has the following information:
n An external request ID parameter ereqId if the API call contains the header X-NSX-
EREQID:<string>.
n An external user parameter euser if the API call contains the header X-NSX-EUSER:<string>.
All logs with a severity of emergency, alert, critical, or error contain a unique error code in the structured
data portion of the log message. The error code consists of a string and a decimal number. The string
represents a specific module.
The MSGID field identifies the type of message. For a list of the message IDs, see Log Message IDs.
Remote logging is supported on NSX Manager, NSX Controller, NSX Edge, and hypervisors. You must
configure remote logging on each node individually.
On a KVM host, the NSX-T Data Center installation package automatically configures the rsyslog
daemon by putting configuration files in the /etc/rsyslog.d directory.
Prerequisites
Procedure
a Run the following command to configure a log server and the types of messages to send to the
log server. Multiple facilities or message IDs can be specified as a comma delimited list, without
spaces.
For more information about this command, see the NSX-T CLI Reference. You can run the
command multiple times to add multiple logging server configurations. For example:
nsx> set logging-server 192.168.110.60 proto udp level info facility syslog messageid
SYSTEM,FABRIC
nsx> set logging-server 192.168.110.60 proto udp level info facility auth,user
b You can view the logging configuration with the get logging-server command. For example,
a Run the following commands to configure syslog and send a test message:
*.* @<ip>:514;RFC5424fmt
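As an illustrative sketch for a KVM host that uses rsyslog (file names and service commands can differ by distribution), add the line above to a file such as /etc/rsyslog.d/40-nsx-remote.conf, then restart rsyslog and send a test message:
service rsyslog restart
logger "NSX-T remote logging test"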
Examples of message IDs and the objects they cover:
GROUPING: IP sets, MAC sets, NSGroups, NSServices, NSService groups, VNI pool, IP pool
MONITORING: SNMP, port connection, Traceflow
n If the protocol is TLS, set the protocol to UDP to see if there is a certificate mismatch.
n If the protocol is TLS, verify that port 6514 is open on both ends.
n Remove the message ID filter and see if logs are received by the server.
n Restart the rsyslog service with the command restart service rsyslogd.
Checking Services
Services that stop running or fail to start can cause problems. It is important to make sure that all services
are running normally.
In the example above, the http service is stopped. You can start the http service with the following
command:
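A minimal sketch of the NSX-T CLI commands typically used to check and start the service (verify the exact names against the CLI reference for your version):
nsx> get service http
nsx> start service http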
SSH Service
If the SSH service was not enabled when deploying the appliance, you can log in to the appliance as
admin and enable it with the following command:
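For example, the command typically used is:
nsx> start service ssh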
You can configure SSH to start when the host starts with the following command:
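For example (shown as a typical form of the command; confirm against your version's CLI reference):
nsx> set service ssh start-on-boot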
To enable SSH root login, you can log in to the appliance as root, edit the file /etc/ssh/sshd_config and
replace the line
PermitRootLogin prohibit-password
with
PermitRootLogin yes
and then restart the SSH service:
/etc/init.d/ssh restart
Alternatively, you can enable the SSH service and enable SSH root access by powering off the appliance
and modifying its vApp properties.
If you choose to download the bundles to your machine, you get a single archive file consisting of a
manifest file and support bundles for each node. If you choose to upload the bundles to a file server, the
manifest file and the individual bundles are uploaded to the file server separately.
NSX Cloud Note If you want to collect the support bundle for CSM, log in to CSM, go to System >
Utilities > Support Bundle and click on Download. The support bundle for PCG is available from
NSX Manager using the following instructions. The support bundle for PCG also contains logs for all the
workload VMs.
Procedure
1 From your browser, log in with admin privileges to NSX Manager at https://nsx-manager-ip-address.
The available types of nodes are Management Nodes, Controller Nodes, Edges, Hosts, and Public
Cloud Gateways.
5 (Optional) Specify log age in days to exclude logs that are older than the specified number of days.
6 (Optional) Toggle the switch that indicates whether to include or exclude core files and audit logs.
Note Core files and audit logs might contain sensitive information such as passwords or encryption
keys.
Depending on how many log files exist, collecting the support bundle for each node might take several minutes.
The status field shows the percentage of nodes that completed support bundle collection.
10 Click Download to download the bundle if the option to send the bundle to a file server was not set.
Troubleshooting Layer 2 Connectivity 2
If there is a communication failure between two virtual interfaces (VIFs) that are connected to the same
logical switch, for example, you cannot ping one VM from another, you can follow the steps in this section
to troubleshoot the failure.
Before you start, make sure that there is no firewall rule blocking traffic between the two logical ports. It is
recommended that you follow the order of the topics in this section to troubleshoot the connectivity issue.
n Troubleshoot Packet Loss for a VLAN logical Switch or When ARP Is Resolved
Procedure
1 Run the following CLI command on the NSX Manager to make sure the status is stable.
2 Run the following CLI command on an NSX Controller to make sure the status is active.
3 Run the following CLI command on an NSX Controller to make sure it is connected to the
NSX Manager.
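As a sketch, the CLI commands commonly used for these three checks are shown below; command names and output vary by release:
On the NSX Manager (step 1):
nsx-manager> get management-cluster status
On an NSX Controller (steps 2 and 3):
nsx-controller> get control-cluster status
nsx-controller> get managers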
Procedure
1 From the NSX Manager GUI, get the UUIDs of the logical ports.
2 Make the following API call for each logical port to make sure the logical ports are on the same logical
switch.
GET https://<nsx-mgr>/api/v1/logical-ports/<logical-port-uuid>
3 Make the following API call for each logical port to make sure the status is up.
GET https://<nsx-mgr>/api/v1/logical-ports/<logical-port-uuid>/status
Procedure
u Make the following API call to get the state of the transport node.
GET https://<nsx-mgr>/api/v1/transport-nodes/<transport-node-ID>/state
If the call returns the error RPC timeout, perform the following troubleshooting steps:
n To see if nsx-mpa is connected to the NSX Manager, check the nsx-mpa heartbeat logs.
n To see if opsAgent is connected to the NSX Manager, check the nsx-opsAgent log. You will see
the following message if opsAgent is connected to the NSX Manager.
n To see if opsAgent is stuck processing HostConfigMsg, check the nsx-opsAgent log. If so, you
will see an RMQ request message but the reply is not sent or sent after a long delay.
n To see if the RMQ messages are taking a long time to be delivered to the host, compare the
timestamps of log messages on the NSX Manager and the host.
If the call returns the error partial_success, there are many possible causes. Start by looking at the
nsx-opsAgent logs. On the ESXi host, check hostd.log and vmkernel.log. On KVM, syslog
holds all the logs.
Procedure
u Make the following API call to get the state of the logical switch.
GET https://<nsx-mgr>/api/v1/logical-switches/<logical-switch-ID>/state
If the call returns the error partial_success, the reply will contain a list of transport nodes where the
NSX Manager failed to push the logical switch configuration or did not get a reply. The
troubleshooting steps are similar to those for the transport node. Check the following:
n Grep the logical switch ID in nsxa.log and nsxaVim.log to see if the logical switch configuration
was received by the transport node.
n Check the nsxa and nsx-mpa uptime. Find out when nsxa was started and stopped by grepping
nsxa log messages in the syslog file.
n Find out nsxa's connection time to the switching vertical. If the logical switch configuration is sent
to the host when nsxa is not connected to the switching vertical, the configuration might not be
delivered to the host.
On KVM, no logical switch configuration is pushed to the host. Therefore, most of the logical switch
issues are likely to be in the management plane.
On ESXi, an opaque network is mapped to the logical switch. To use the logical switch, users connect
VMs to the opaque network using vCenter Server or vSphere API.
Procedure
u Run the following CLI command on an NSX Controller to make sure that the logical switch is present.
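A command typically used for this check (shown as a sketch) is:
nsx-controller> get logical-switches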
Note This CLI command does not list VLAN-backed logical switches.
Prerequisites
Find the controller that the logical switch is on. See Check the CCP for the Logical Switch.
Procedure
2 Run the following command and verify that the controller shows the hypervisors that are connected to
this VNI.
You should see a Config session on one of the CCP nodes in the CCP cluster. For every overlay
logical switch, you should see an L2 session to one of the CCP nodes in the CCP cluster. For VLAN
logical switches, there are no CCP connections.
Procedure
1 Make the following API call to see if MPA is connected to the NSX Manager.
GET https://<nsx-mgr>/api/v1/logical-ports/<logical-port-uuid>
4 Run the following command to see if the NSX Manager pushed the CCP information to the host.
cat /etc/vmware/nsx/config-by-vsm.xml
5 If config-by-vsm.xml has CCP information, check if a transport node is configured on the hypervisor.
The NSX Manager sends the host certificate for the hypervisor in the transport node creation step.
The CCP must have the host certificate before it accepts connections from the host.
The certificate must be the same as the one that the NSX Manager has for the host.
On ESXi:
/etc/init.d/netcpad status
On KVM:
/etc/init.d/nsx-agent status
/etc/init.d/netcpad start
/etc/init.d/netcpad restart
/etc/init.d/nsx-agent start
/etc/init.d/nsx-agent restart
9 If the config session is still not up, collect the technical support bundles and contact VMware support.
Procedure
2 Run the following command to see if the logical switch is present on the host.
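On ESXi, the net-dvs command (also referenced later in this section) produces a port listing like the example below:
net-dvs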
port 63eadf53-ff92-4a0e-9496-4200e99709ff:
com.vmware.port.extraConfig.opaqueNetwork.id = … <- this should match the logical switch UUID
com.vmware.port.opaque.network.id = …. <- this should match the logical switch UUID
com.vmware.port.opaque.network.type = nsx.LogicalSwitch , propType = RUNTIME
com.vmware.common.port.block = false, ... <- Make sure the value is false.
com.vmware.vswitch.port.vxlan = …
com.vmware.common.port.volatile.status = inUse ... <- make sure the value is inUse.
If the logical port ends up in the blocked state, collect the technical support bundles and contact
VMware support. In the meantime, run the following command to get the DVS name:
On KVM, run ovs-vsctl list interface and verify that the interface with the corresponding VIF
UUID is present and admin_state is up. You can see the VIF UUID in OVSDB in external-ids:iface-id.
If the VMs are on the same hypervisor, go to Troubleshoot ARP Issues for an Overlay Logical Switch.
Procedure
1 Run the following command on the controller that has the logical switch to see if CCP has the correct
VTEP list:
2 On each hypervisor, run the following NSX CLI command to see if it has the correct VTEP list:
On ESXi:
Alternatively, you can run the following shell command for the VTEP information:
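A shell command commonly used on ESXi to list the overlay configuration, including the local and remote VTEPs (a sketch; output format varies by version), is:
net-vdl2 -l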
On KVM:
3 Check to see if the VTEPs on the hypervisors can ping each other.
a Make sure the transport VLAN specified when creating the transport node matches what the
underlay expects. If you are using access ports in the underlay, the transport VLAN should be set
to 0. If you are specifying a transport VLAN, the underlay switch ports that the hypervisors
connect to should be configured to accept this VLAN in trunk mode.
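As a sketch of a VTEP-to-VTEP reachability test on ESXi (the vmkernel interface name vmk10 and the vxlan netstack name are assumptions that depend on your deployment):
vmkping ++netstack=vxlan -I vmk10 <remote-VTEP-IP>
Adding -d -s <size> to the command also verifies that the underlay MTU can accommodate the overlay encapsulation overhead.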
On ESXi, run net-vdl2 -M bfd and look at the response. For example,
BFD count: 1
===========================
Local IP: 192.168.48.35, Remote IP: 192.168.197.243, Local State: up, Remote State: up, Local
Diag: No Diagnostic, Remote Diag: No Diagnostic, minRx: 1000000, isDisabled: 0
If you don’t know the interface name, run ovs-vsctl find Interface type=geneve to return all
tunnel interfaces. Look for BFD information.
If you cannot find a GENEVE interface to the remote VTEP, check that nsx-agent is running and that the
OVS integration bridge is connected to nsx-agent.
If the VMs are on the same hypervisor and all the configuration and runtime states are normal, go to
Troubleshoot ARP Issues for an Overlay Logical Switch.
Procedure
u Check that the underlay is configured for the VLAN for the logical switch in trunk mode.
On ESXi, verify VLAN is configured on the logical port by running net-dvs and looking for the logical
port. For example:
port 63eadf53-ff92-4a0e-9496-4200e99709ff:
com.vmware.common.port.volatile.vlan = VLAN 1000 propType = RUNTIME VOLATILE
On KVM, the VLAN logical switch is configured as an OpenFlow rule on the integration bridge: traffic
received from the VIF is tagged with VLAN X and forwarded on the patch port to the PIF bridge.
Run ovs-vsctl list interface and verify the presence of the patch port between the
NSX-managed bridge and the NSX-switch bridge.
For a VLAN-backed logical switch, go to Troubleshoot Packet Loss for a VLAN logical Switch or When
ARP Is Resolved.
Before performing the following troubleshooting steps, run the command arp -n on each VM. If ARP is
successfully resolved on both VMs, you do not need to perform the steps in this section. Instead, go to
the next section Troubleshoot Packet Loss for a VLAN logical Switch or When ARP Is Resolved.
Procedure
u If both endpoints are ESXi and ARP proxy is enabled on the logical switch (only supported for overlay
logical switches), check the ARP table on the CCP and the hypervisor.
On the CCP:
On the hypervisor, start NSX CLI and run the following command:
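The commands typically used (a sketch; substitute the logical switch UUID or VNI for your environment) are:
On the CCP:
get logical-switch <logical-switch-uuid> arp-table
On the hypervisor NSX CLI:
get logical-switch <vni> arp-table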
Fetching the ARP table only tells us whether we have the correct ARP proxy state. If the ARP
response is not received via proxy, or if the host is KVM and does not support ARP proxy, the
datapath should broadcast the ARP request. There might be a problem with BUM traffic forwarding.
Try the following steps:
n If the replication mode for the logical switch is MTEP, change the replication mode to SOURCE for
the logical switch from the NSX Manager GUI. This might fix the issue and ping will start working.
n Add static ARP entries and see if the rest of the datapath works.
To run the traceflow tool, from the NSX Manager GUI, navigate to Tools > Traceflow. For more
information, see the NSX-T Administration Guide.
Procedure
On ESXi, run net-stats -l to get the switchport ID of the VIFs. If the source and destination VIFs
are on the same hypervisor, run the following commands:
If the source and destination VIFs are on different hypervisors, on the hypervisor hosting the source
VIF, run the following commands:
On the hypervisor hosting the destination VIF, run the following commands:
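These captures are typically taken with pktcap-uw; the following is an illustrative sketch (the switchport ID comes from net-stats -l, and the uplink name is an assumption):
pktcap-uw --switchport 50331655 --dir 0 -o /tmp/vif-rx.pcap
pktcap-uw --switchport 50331655 --dir 1 -o /tmp/vif-tx.pcap
When the VIFs are on different hypervisors, you can also capture on the physical uplink, for example:
pktcap-uw --uplink vmnic0 --dir 1 -o /tmp/uplink-tx.pcap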
On KVM, if the source and destination VIFs are on the same hypervisor, run the following command:
ovs-dpctl dump-flows
Troubleshooting Installation 3
This section provides information about troubleshooting installation issues.
n DNS
Make sure that a firewall is not blocking traffic between NSX-T components and hypervisors. Make sure
that the required ports are open between the components.
To flush the DNS cache on the NSX Manager, SSH as root to the manager and run the following
command:
^C
<truncated output>
root@kvm-01:/home/vmware#
n Go to Fabric > Nodes > Hosts, edit the host and remove all IP addresses except the management
one.
You can run the command df -h to check available storage. If the /boot directory is at 100%, you can do
the following:
n Run sudo dpkg --list 'linux-image*' | grep ^ii to see all the kernels installed.
n Run uname -r to see your currently running kernel. Do not remove this kernel (linux-image).
n Use apt-get purge to remove images you don't need anymore. For example, run sudo apt-get
purge linux-image-3.13.0-32-generic linux-image-3.13.0-33-generic.
Restarting the edge datapath service and then the VM should resolve the issue.
NSX Manager does not validate whether any active VMs are running on the host. You
are responsible for deleting the N-VDS and VIBs. If the node was added through a Compute Manager,
delete the Compute Manager first and then delete the node. The transport node is deleted as well.
Troubleshooting Routing 4
NSX-T has built-in tools to troubleshoot routing issues.
Traceflow
You can use Traceflow to inspect the flow of packets. You can see delivered, dropped, received, and
forwarded packets. If a packet is dropped, a reason is displayed. For example, a packet can be dropped
because of a firewall rule.
edge01> vrf 1
edge01(tier0_sr)> get route
interface : 6af81d72-4d32-5f66-b7ae-403e617290e5
ifuid : 270
mode : blackhole
interface : 015e709d-6079-5c19-9556-8be2e956f775
ifuid : 269
mode : cpu
interface : 3f40f838-eb8a-4f35-854c-ea8bb872dc47
ifuid : 272
name : bp-sr0-port
mode : lif
IP/Mask : 169.254.0.2/28
MAC : 02:50:56:56:53:00
VNI : 25489
LS port : 770a208d-27fa-4f8d-afad-a9c41ca6295b
urpf-mode : NONE
admin : up
MTU : 1500
interface : 00003300-0000-0000-0000-00000000000a
ifuid : 263
mode : loopback
IP/Mask : 127.0.0.1/8
Advertising T1 Routes
You must advertise T1 routes so that they are visible on the T0 router and upwards. There are different types
of routes that you can advertise: NSX Connected, NAT, Static, LB VIP, and LB SNAT.
Troubleshooting the Central Control Plane 5
This section contains information about troubleshooting central control plane (CCP) issues.
Cause
n The API call has invalid input, for example, an incorrect node ID.
Solution
Cause
n The controller nodes become temporarily unavailable during the process of removing a controller from
its cluster.
Solution
n From the NSX Manager UI, verify that the status of the controller cluster is healthy. If the controller to
be deleted is healthy and the controller cluster is healthy, delete the controller again.
n If either the controller to be deleted or the controller cluster is not healthy, take steps to make sure
that both are healthy. Then delete the controller again.
Cause
n The host where the controller VM is running might not have enough memory to power on the VM.
Solution
u Log in to vCenter Server and investigate why the controller VM does not power on. If there is
insufficient memory, delete the VM, free up memory on the host and redeploy the controller VM.
Alternatively, redeploy the VM on a different host.
Cause
Solution
n If there is no option in the NSX Manager UI to delete the controller, call the API
POST /api/v1/cluster/nodes/deployments/<node-id>?action=delete to delete the controller.
n Check network connectivity between the controller and the NSX Manager, such as IP address, subnet,
and firewall settings.
Cause
Solution
Cause
n The controller cluster becomes unstable or unreachable during the clustering operation.
n The shared secret provided to the controller does not match the shared secret used by the controller
cluster.
Solution
n From the NSX Manager UI, check the cluster status of the nodes in the controller cluster.
n Check that the shared secret used by the new controller is the same as the shared secret used by the
controller cluster.
Cause
Solution
n From the NSX Manager UI, check that the cluster status of the nodes in the controller cluster is Up.
Troubleshooting Firewall 6
This section provides information about troubleshooting firewall issues.
[root@esxi-01:~] summarize-dvfilter
<TRUNCATED OUTPUT>
world 70181 vmm0:app-01a vcUuid:'50 35 9c 70 18 8e 99 1d-3c f9 8e cc 6b 27 4c 6f'
port 50331655 app-01a.eth0
vNic slot 2
name: nic-70181-eth0-vmware-sfw.2
agentName: vmware-sfw
state: IOChain Attached
vmState: Detached
failurePolicy: failClosed
slowPathID: none
filter source: Dynamic Filter Creation
world 70179 vmm0:web-02a vcUuid:'50 35 2b f3 4a 4b 10 83-54 72 50 f7 25 10 d8 64'
port 50331656 web-02a.eth0
vNic slot 2
name: nic-70179-eth0-vmware-sfw.2
agentName: vmware-sfw
state: IOChain Attached
vmState: Detached
failurePolicy: failClosed
slowPathID: none
filter source: Dynamic Filter Creation
Determine firewall rules that apply to a specific dvfilter (in this example, nic-70227-eth0-vmware-sfw.2
is the dvfilter name):
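The command typically used on the ESXi host is vsipioctl, shown here as a sketch:
vsipioctl getrules -f nic-70227-eth0-vmware-sfw.2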
ruleset mainrs_L2 {
rule 1 at 1 inout ethertype any stateless from any to any accept;
}
}
addrset b695c8df-9894-4068-a5e7-5504fe48d459 {
ip 172.16.30.11,
mac 52:54:00:64:0e:4f,
}
addrset rdst3076 {
ip 172.16.10.13,
ip 172.16.30.11,
mac 52:54:00:42:4d:38,
mac 52:54:00:64:0e:4f,
}
Get the list of VIFs that are subject to firewall rules on the KVM host:
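On KVM this is typically queried through the NSX agent control socket with ovs-appctl; the socket path and subcommand below are assumptions and can differ between releases:
ovs-appctl -t /var/run/vmware/nsxa-ctl dfw/vif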
If the output is empty, look for connectivity issues between the node and the controllers.
Get the list of rules applied to a specific VIF (in this example, da95fc1e-65fd-461f-814d-
d92970029bf0 is the VIF ID):
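A sketch of the corresponding rules query (same caveat about the control socket path and subcommand):
ovs-appctl -t /var/run/vmware/nsxa-ctl dfw/rules da95fc1e-65fd-461f-814d-d92970029bf0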
Vif ID : da95fc1e-65fd-461f-814d-d92970029bf0
ruleset d035308b-cb0d-4e7e-aae5-a428b461db46 {
rule 3072 inout protocol tcp from any to addrset 48822ec3-2670-497b-82f9-524618c16877 port 443 accept
with log;
rule 3072 inout protocol tcp from any to addrset 48822ec3-2670-497b-82f9-524618c16877 port 80 accept
with log;
rule 3074 inout protocol tcp from addrset 48822ec3-2670-497b-82f9-524618c16877 to addrset 8b9e75e7-
bc62-4d7f-9a58-a872f393448e port 8443 accept with log;
rule 3074 inout protocol tcp from addrset 48822ec3-2670-497b-82f9-524618c16877 to addrset 8b9e75e7-
bc62-4d7f-9a58-a872f393448e port 22 accept with log;
rule 3075 inout protocol tcp from addrset 8b9e75e7-bc62-4d7f-9a58-a872f393448e to addrset
b695c8df-9894-4068-a5e7-5504fe48d459 port 3306 accept with log;
}
ruleset 3027fed3-60b1-483e-aa17-c28719275704 {
rule 3076 inout protocol tcp from 192.168.110.10 to addrset b695c8df-9894-4068-a5e7-5504fe48d459 port
443 accept with log;
rule 3076 inout protocol icmp type 8 code 0 from 192.168.110.10 to addrset b695c8df-9894-4068-
a5e7-5504fe48d459 accept with log;
rule 3076 inout protocol tcp from 192.168.110.10 to addrset b695c8df-9894-4068-a5e7-5504fe48d459 port
22 accept with log;
rule 3076 inout protocol tcp from 192.168.110.10 to addrset b695c8df-9894-4068-a5e7-5504fe48d459 port
80 accept with log;
rule 3076 inout protocol tcp from 192.168.110.10 to addrset 8b9e75e7-bc62-4d7f-9a58-a872f393448e port
443 accept with log;
rule 3076 inout protocol icmp type 8 code 0 from 192.168.110.10 to addrset 8b9e75e7-bc62-4d7f-9a58-
a872f393448e accept with log;
rule 3076 inout protocol tcp from 192.168.110.10 to addrset 8b9e75e7-bc62-4d7f-9a58-a872f393448e port
22 accept with log;
rule 3076 inout protocol tcp from 192.168.110.10 to addrset 8b9e75e7-bc62-4d7f-9a58-a872f393448e port
80 accept with log;
rule 3076 inout protocol tcp from 192.168.110.10 to addrset 48822ec3-2670-497b-82f9-524618c16877 port
443 accept with log;
rule 3076 inout protocol icmp type 8 code 0 from 192.168.110.10 to addrset
48822ec3-2670-497b-82f9-524618c16877 accept with log;
rule 3076 inout protocol tcp from 192.168.110.10 to addrset 48822ec3-2670-497b-82f9-524618c16877 port
22 accept with log;
rule 3076 inout protocol tcp from 192.168.110.10 to addrset 48822ec3-2670-497b-82f9-524618c16877 port
80 accept with log;
}
ruleset 5e9bdcb3-adba-4f67-a680-5e6ed5b8f40a {
rule 2 inout protocol any from any to any accept with log;
}
ruleset ddf93011-4078-4006-b8f8-73f979d7a717 {
rule 1 inout ethertype any stateless from any to any accept;
}
8b9e75e7-bc62-4d7f-9a58-a872f393448e {
}
b695c8df-9894-4068-a5e7-5504fe48d459 {
mac 52:54:00:64:0e:4f,
ip 172.16.30.11,
}
Check connections through the Linux Conntrack module. In this example, we look for flows between two
specific IP addresses.
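As a sketch using the conntrack tool (the addresses are placeholders for the two endpoints of interest):
conntrack -L -s <source-ip> -d <destination-ip>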
The log file is /var/log/dfwpktlogs.log for both ESXi and KVM hosts.
# tail -f /var/log/dfwpktlogs.log
2018-03-27T10:23:35.196Z INET TERM 3072 IN TCP FIN 100.64.80.1/60688->172.16.10.11/80 8/7 373/5451
2018-03-27T10:23:35.196Z INET TERM 3074 OUT TCP FIN 172.16.10.11/46108->172.16.20.11/8443 8/9 1178/7366
2018-03-27T10:23:35.196Z INET TERM 3072 IN TCP RST 100.64.80.1/60692->172.16.10.11/80 9/6 413/5411
2018-03-27T10:23:35.196Z INET TERM 3074 OUT TCP RST 172.16.10.11/46109->172.16.20.11/8443 9/7 1218/7262
2018-03-27T10:23:37.442Z 71d32787 INET match PASS 3074 IN 60 TCP 172.16.10.12/35770->172.16.20.11/8443
S
2018-03-27T10:23:38.492Z INET match PASS 2 OUT 1500 TCP 172.16.10.11/80->100.64.80.1/60660 A
2018-03-27T10:23:39.934Z INET match PASS 3072 IN 52 TCP 100.64.80.1/60720->172.16.10.11/80 S
2018-03-27T10:23:39.944Z INET match PASS 3074 OUT 60 TCP 172.16.10.11/46114->172.16.20.11/8443 S
2018-03-27T10:23:39.944Z 71d32787 INET match PASS 3074 IN 60 TCP 172.16.10.11/46114->172.16.20.11/8443
S
2018-03-27T10:23:42.449Z 71d32787 INET match PASS 3074 IN 60 TCP 172.16.10.12/35771->172.16.20.11/8443
S
2018-03-27T10:23:44.712Z INET TERM 3074 IN TCP RST 172.16.10.11/46109->172.16.20.11/8443 9/7 1218/7262
2018-03-27T10:23:44.712Z INET TERM 3074 IN TCP FIN 172.16.10.12/35766->172.16.20.11/8443 9/10 1233/7418
2018-03-27T10:23:44.712Z INET TERM 3074 IN TCP FIN 172.16.10.11/46110->172.16.20.11/8443 9/9 1230/7366
2018-03-27T10:23:44.712Z INET TERM 3074 IN TCP FIN 172.16.10.12/35767->172.16.20.11/8443 9/10 1233/7418
2018-03-27T10:23:44.939Z INET match PASS 3072 IN 52 TCP 100.64.80.1/60726->172.16.10.11/80 S
2018-03-27T10:23:44.957Z INET match PASS 3074 OUT 60 TCP 172.16.10.11/46115->172.16.20.11/8443 S
2018-03-27T10:23:44.957Z 71d32787 INET match PASS 3074 IN 60 TCP 172.16.10.11/46115->172.16.20.11/8443
S
2018-03-27T10:23:45.480Z INET TERM 2 OUT TCP TIMEOUT 172.16.10.11/80->100.64.80.1/60528 1/1 1500/56
Problem
After configuring a layer-2 firewall rule with one MAC set as source and another MAC set as destination,
the getrules command on the host shows the destination MAC set as
01:00:00:00:00:00/01:00:00:00:00:00. For example,
ruleset mainrs_L2 {
# generation number: 0
The internal OUT rule with the address 01:00:00:00:00:00/01:00:00:00:00:00 is created by design to
handle outbound broadcasting packets and does not indicate a problem.
Solution
u No action is required.
Problem
A loopback or hairpin is created when a Tier-0 router has multiple uplinks, and traffic ingresses on one of
the uplinks and egresses on another uplink. When this occurs, firewall rules and NAT are only processed
while the packet ingresses on the original uplink. This causes the reply returning on the second uplink to
not match the original session, and the packet may be dropped.
Cause
Services are processed once during the hairpinning process, and not on both interfaces. This causes the
reply to be considered another flow, rather than part of the original flow, because the direction of the
packet for both the initial and the reply is IN.
Solution
u If no destination NAT rules are present on the SR, add one. A destination NAT rule causes the
reply to be matched against the original session rather than being treated as a new session,
and the packet is not dropped. See Configure Source and Destination NAT on a Tier-0 Router in
the NSX-T Data Center Administration Guide.
Other Troubleshooting Scenarios 7
This section describes how to troubleshoot various error scenarios.
Problem
2 The host is removed as a transport node. However, transport node deletion fails. The state of the
transport node is Orphaned.
5 The host is added as a transport node with a new transport zone and switch. This step results in the
error Failed/Partial Success.
Cause
In step 2, if you wait for a few minutes, the transport node deletion will succeed because NSX Manager
will retry the deletion. When you delete the fabric node immediately, NSX Manager cannot retry because
the host is removed from NSX-T Data Center. This results in incomplete cleanup of the host, with the
switch configuration still present, which causes step 5 to fail.
Solution
1 Delete all vmknics from vCenter Server on the host that are connected to the NSX-T Data Center
switch.
2 Get the switch name using the esxcfg-vswitch -l CLI command. For example:
esxcfg-vswitch -l
Switch Name Num Ports Used Ports Configured Ports MTU Uplinks
vSwitch0 1536 4 128 1500 vmnic0
3 Delete the switch name using the esxcfg-vswitch -d <switch-name> --dvswitch CLI
command. For example:
Problem
An ESXi transport node is normally connected to a specific controller in a controller cluster. You can find
the connected controller with the CLI command get controllers. If the connected controller goes
down, it takes about 5 minutes for the transport node to be connected to another controller.
Cause
The transport node attempts to re-connect to the controller that is down for a certain amount of time
before giving up and connecting to another controller. The whole process takes about 5 minutes. This is
expected behavior.
Problem
Other CLI commands might also return an error. The get support-bundle command indicates that
the /tmp directory has become read-only. For example,
Nov 17 07:26:48 no kernel: NMI watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [qemu-kvm:4386]
Cause
One or more file systems on the NSX Manager appliance were corrupted. Some possible causes are
documented in https://access.redhat.com/solutions/22621.
To resolve the issue, you can repair the corrupt file systems or perform a restore from a backup.
Solution
1 Option 1: Repair the corrupt file systems. The following steps are specifically for NSX Manager
running on a KVM host.
a Run the virsh destroy command to stop the NSX Manager VM.
b Run the virt-rescue command in write mode on the qcow2 image. For example,
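For example (the image path is a placeholder; --rw opens the image in write mode):
virt-rescue --rw -a /var/lib/libvirt/images/nsx-unified-appliance.qcow2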
c In the virt-rescue command prompt run the e2fsck command to fix the tmp file system. For
example,
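For example, using the tmp logical volume path referenced in the next step:
><rescue> e2fsck /dev/nsx/tmp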
d If necessary, run the e2fsck /dev/nsx/tmp again until there are no more errors.
Problem
Some operations fail, such as when a VM vNIC tries to attach to a logical switch.
The /var/run/log/nsx-opsagent.log file has messages such as:
Cause
In a large-scale environment, some operations might take longer than usual and fail because the default
timeout values are exceeded.
Solution
a On the ESXi host, stop the NSX opsAgent with the following command:
/etc/init.d/nsx-opsagent stop
"mp" : {
/* timeout for VIF operation */
"vifOperationTimeout" : 25,
Note This timeout value must be less than the hostd timeout value that you set in step 2.
/etc/init.d/nsx-opsagent start
a On the ESXi host, stop the hostd agent with the following command:
/etc/init.d/hostd stop
<opaqueNetwork>
<!-- maximum message size allowed in opaque network manager IPC, in bytes. -->
<!-- <maxMsgSize> 65536 </maxMsgSize> -->
<!-- maximum wait time for opaque network response -->
<!-- <taskTimeout> 30 </taskTimeout> -->
/etc/init.d/hostd start
Problem
From the NSX Manager GUI, adding an ESXi host fails with the error "File path of ... is claimed
by multiple non-overlay VIBs". The log file shows messages such as the following:
Cause
Some VIBs from a previous install are still on the host, probably because a clean uninstall did not occur.
Solution
1 From the error message, get the names of VIBs that are causing the failure.
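A sketch of the follow-up cleanup on the ESXi host (the VIB name is a placeholder; use the names reported in the error message):
esxcli software vib list | grep -i nsx
esxcli software vib remove -n <vib-name>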
Problem
After a controller is powered off and on a number of times, the other controllers report that it is inactive
when it is up and running.
Cause
An internal error involving the ZooKeeper module sometimes occurs when a controller is powered off and
on and causes a communication failure between this controller and the other controllers in the cluster.
Solution
u Remove the controller node that is reported to be inactive from the cluster, remove the cluster
configuration from the node and rejoin the node to the cluster. For more information, see the section
"Replace a Member of the NSX Controller Cluster" in the NSX-T Administration Guide.
Problem
When you enable IPFIX for multiple VMs on the same host and set the sampling rate to 100%,
there can be a large amount of IPFIX traffic. This can impact management traffic, causing the
management IPs to be intermittently unreachable, even if the production traffic and management traffic go
through different OVSes.
Cause
The workload is too stressful for the host and the VMs.
Solution
u Reduce the load on the host by reducing the number of VMs with IPFIX enabled or reducing the
sampling rate.
Problem
During the upgrade process, events such as the host entering maintenance mode, the host rebooting, and
NSX-T Data Center services starting on the host might fail because they do not complete within a
specific period of time. The Upgrade Coordinator reports a timeout error for the event and the upgrade
fails.
Solution
n For the maintenance mode issue, log in to vCenter Server and check the status of tasks related to the
host. Take actions to resolve any issues.
n For the host reboot issue, check the host to see why it failed to reboot.
n For the NSX service issue, log in to the NSX Manager UI, go to the Fabric > Nodes page and see if
the host has an installation error. If so, you can resolve it from the NSX Manager UI. If the error
cannot be resolved, you can refer to the upgrade logs to determine the cause of the failure.
Problem
An NSX Edge transport node has a status that is degraded, but the data plane is functioning normally.
The Edge transport node degraded state also triggers the corresponding transport zone status to be
degraded. The Edge transport node has one or more network interfaces in a Down state.
Cause
The NSX-T Data Center management plane declares an Edge transport node to be in a degraded state if
any interface is down, regardless of whether that interface is used or configured. If the Edge is a virtual
machine, the vNIC may be disconnected. If you have a bare metal Edge, the NIC port may not be
connected, or may have a link state down.
If the interface that is Down is not used, then there is no functional impact to the Edge.
Solution
u To avoid a degraded state, ensure all Edge interfaces are connected and Up, regardless of whether
they are in use or not.