Catalyst 3850 Series Switch High CPU Usage Troubleshoot - Cisco
Contents
Introduction
Background Information
Step 2: Determine the CPU Queue that Causes the High CPU Usage Condition
Sample Embedded Event Manager (EEM) Script for the Cisco Catalyst 3850 Series Switch
Related Information
Introduction
This document describes how to troubleshoot CPU usage concerns, primarily due to interrupts, on
the new Cisco IOS®-XE platform. Additionally, the document introduces several new commands on
this platform that are integral in order to troubleshoot such problems.
Background Information
It is important to understand how Cisco IOS-XE is built. With Cisco IOS-XE, Cisco has moved to a
Linux kernel and all of the subsystems have been broken down into processes. All of the
subsystems that were inside Cisco IOS before - such as the modules drivers, High Availability
(HA), and so on - now run as software processes within the Linux Operating System (OS). Cisco
IOS itself runs as a daemon within the Linux OS (IOSd). Cisco IOS-XE retains not only the same
look and feel of the classic Cisco IOS, but also its operation, support, and management.
Here are some useful definitions:
Forwarding Engine Driver (FED): This is the heart of the Cisco Catalyst 3850 Series Switch and
is responsible for all hardware programming/forwarding.
IOSd: This is the Cisco IOS daemon that runs on the Linux kernel. It is run as a software process
within the kernel.
Packet Delivery System (PDS): This is the architecture and process of how packets are
delivered to and from various subsystems. As an example, it controls how packets are delivered
from the FED to the IOSd and vice versa.
https://www.cisco.com/c/en/us/support/docs/switches/catalyst-3850-series-switches/117594-technote-hicpu3850-00.html 1/11
7/1/2018 Catalyst 3850 Series Switch High CPU Usage Troubleshoot - Cisco
From the output, it is clear that the Cisco IOS daemon consumes a major portion of the CPU along
with the FED, which is the heart of this box. When CPU usage is high due to interrupts, you see
that IOSd and FED use a major portion of the CPU, and these subprocesses (or a subset of these)
use the CPU:
FED Punject TX
FED Punject RX
FED Punject replenish
FED Punject TX complete
You can zoom into any of these processes with the show process cpu detailed <process>
command. Since IOSd is responsible for the majority of the CPU usage, here is a closer look into
that.
The output (IOSd CPU output) shows that ARP Snoop, IP Host Track Process, and ARP Input are
high. This is commonly seen when the CPU is interrupted due to ARP packets.
Step 2: Determine the CPU Queue that Causes the High CPU Usage Condition
The Cisco Catalyst 3850 Series Switch has a number of queues that cater to different types of
packets (the FED maintains 32 RX CPU queues, which are queues that go directly to the CPU). It is
important to monitor these queues in order to discover which packets are punted to the CPU and
which are processed by the IOSd. These queues are per ASIC.
Note: There are two ASICs: 0 and 1. Ports 1 through 24 belong to ASIC 0.
In order to look at the queues, enter the show platform punt statistics port-asic <port-asic>
cpuq <queue> direction <rx|tx> command.
In the show platform punt statistics port-asic 0 cpuq -1 direction rx command, the -1 argument
lists all of the queues. Therefore, this command lists all receive queues for Port-ASIC 0.
Now, you must identify which queue pushes a large number of packets at a high rate. In this
example, an examination of the queues revealed this culprit:
<snip>
RX (ASIC2CPU) Stats (asic 0 qn 16 lqn 16):
RXQ 16: CPU_Q_PROTO_SNOOPING
----------------------------------------
Packets received from ASIC : 79099152
Send to IOSd total attempts : 79099152
Send to IOSd failed count : 1240331
RX suspend count : 1240331
RX unsuspend count : 1240330
RX unsuspend send count : 1240330
RX unsuspend send failed count : 0
RX dropped count : 0
RX conversion failure dropped : 0
RX pkt_hdr allocation failure : 0
RX INTACK count : 0
RX packets dq'd after intack : 0
Active RxQ event : 9906280
RX spurious interrupt : 0
<snip>
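When the full output for all queues is saved to a file, the busiest queue can be picked out programmatically. The following is a minimal Python sketch, assuming the output keeps the "RXQ <n>: <name>" header and "<counter> : <value>" layout shown above. The function names (parse_punt_stats, busiest_queue) are hypothetical helpers, not part of any Cisco tool, and the sample text is a shortened copy of the output above.

```python
import re

# Shortened copy of the queue 16 output shown above.
SAMPLE = """\
RX (ASIC2CPU) Stats (asic 0 qn 16 lqn 16):
RXQ 16: CPU_Q_PROTO_SNOOPING
----------------------------------------
Packets received from ASIC : 79099152
Send to IOSd total attempts : 79099152
Send to IOSd failed count : 1240331
RX suspend count : 1240331
RX dropped count : 0
"""


def parse_punt_stats(text):
    """Parse punt statistics text into a list of per-queue dictionaries."""
    queues = []
    current = None
    for line in text.splitlines():
        header = re.match(r"RXQ\s+(\d+):\s+(\S+)", line)
        if header:
            # A new "RXQ <n>: <name>" header starts a new queue record.
            current = {
                "queue": int(header.group(1)),
                "name": header.group(2),
                "counters": {},
            }
            queues.append(current)
            continue
        counter = re.match(r"(.+?)\s*:\s*(\d+)\s*$", line)
        if counter and current is not None:
            current["counters"][counter.group(1).strip()] = int(counter.group(2))
    return queues


def busiest_queue(queues):
    """Return the queue record with the most packets received from the ASIC."""
    return max(
        queues,
        key=lambda q: q["counters"].get("Packets received from ASIC", 0),
    )


if __name__ == "__main__":
    top = busiest_queue(parse_punt_stats(SAMPLE))
    print(top["queue"], top["name"], top["counters"]["Packets received from ASIC"])
```

The counters are cumulative totals, so a single snapshot does not show a rate. Parsing two snapshots taken a known interval apart and subtracting the counters gives a better indication of which queue is growing quickly.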
Determine the tag for which the most packets have been allocated. In this example, it is 65561.
Then, enter this command:
In the results of the show pds tag all command, notice a handle, 7296672, is reported next to the
Punt Rx Proto Snoop.
Use this handle in the show pds client <handle> packet last sink command. Note that you must
enable debug pds pktbuf-last before you use the command. Otherwise, you encounter this error:
This command dumps the last packet received by the sink, which is IOSd in this example. The
output includes the packet header, which can be decoded with Terminal-based Wireshark (TShark).
The Meta-data is for internal use by the system, and the Data output provides the actual packet
information. The Meta-data, however, remains extremely useful.
Notice the line that starts with 0070. Use the first 16 bits after that as shown here:
Slot : 2
Unit : 20
Slot Unit : 20
Acitve : Y
SNMP IF Index : 22
GPN : 84
EC Channel : 0
EC Index : 0
ASIC : 0
ASIC Port : 14
Port LE Handle : 0x514cd990
Non Zero Feature Ref Counts
FID : 48(AL_FID_L2_PM), Ref Count : 1
FID : 77(AL_FID_STATS), Ref Count : 1
FID : 51(AL_FID_L2_MATM), Ref Count : 1
FID : 13(AL_FID_SC), Ref Count : 1
FID : 26(AL_FID_QOS), Ref Count : 1
Sub block information
FID : 48(AL_FID_L2_PM), Private Data : 0x54072618
FID : 26(AL_FID_QOS), Private Data : 0x514d31b8
The culprit interface is identified here. Gig2/0/20 is where a traffic generator pumps ARP
traffic. If you shut this interface down, it resolves the problem and minimizes the CPU usage.
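The mapping from the port details to the interface name can be sketched in Python. This is a minimal illustration, assuming the "<field> : <value>" layout shown above and assuming the interface is named <slot>/0/<unit>, which matches the Slot 2, Unit 20, Gig2/0/20 relationship in this example. The helper names (parse_fields, interface_name) are hypothetical.

```python
# A few lines copied from the port details shown above.
SAMPLE = """\
Slot : 2
Unit : 20
ASIC : 0
ASIC Port : 14
"""


def parse_fields(text):
    """Collect '<field> : <value>' lines into a dictionary of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def interface_name(fields, prefix="Gig"):
    """Build an interface name from the Slot and Unit fields.

    Assumption: the name follows <prefix><slot>/0/<unit>, as in this example.
    """
    return f"{prefix}{int(fields['Slot'])}/0/{int(fields['Unit'])}"


if __name__ == "__main__":
    print(interface_name(parse_fields(SAMPLE)))
```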
1. Enable detail tracing. By default, event tracing is on. You must enable detail tracing in order
to capture the actual packets:
2. Fine-tune the capture buffer. Determine how deep your buffers are for detail tracing and
increase as needed.
3. Add capture filters. You now need to add various filters for the capture. You can add different
filters and either choose to match all or match any of those for your capture.
Now you must link things together. Remember the culprit queue that was identified in Step 2 of
this troubleshoot process? Since queue 16 is the queue that pushes a large number of packets
towards the CPU, it makes sense to trace this queue and see what packets are punted to the
CPU by it.
You must choose either a match all or a match any for your filters and then enable the trace:
4. Display filtered packets. You can display the packets captured with the show mgmt-infra
trace messages fed-punject-detail command.
=========
[11/25/13 07:05:53.814 UTC 2eb0cd 5661]
[11/25/13 07:05:53.814 UTC 2eb0ce 5661] PUNT PATH (fed_punject_rx_process
830):RX: Q: 16, Tag: 65561
[11/25/13 07:05:53.814 UTC 2eb0cf 5661] PUNT PATH (fed_punject_get_physic
579):RX: Physical IIF-id 0x104d88000000033
[11/25/13 07:05:53.814 UTC 2eb0d0 5661] PUNT PATH (fed_punject_get_src_l3
434):RX: L3 IIF-id 0x101b6800000004f
[11/25/13 07:05:53.814 UTC 2eb0d1 5661] PUNT PATH (fed_punject_fd_2_pds_m
RX: l2_logical_if = 0x0
[11/25/13 07:05:53.814 UTC 2eb0d2 5661] PUNT PATH (fed_punject_get_source
RX: Source Cos 0
[11/25/13 07:05:53.814 UTC 2eb0d3 5661] PUNT PATH (fed_punject_get_vrf_id
RX: VRF-id 0
[11/25/13 07:05:53.814 UTC 2eb0d4 5661] PUNT PATH (fed_punject_get_src_zo
RX: Zone-id 0
[11/25/13 07:05:53.814 UTC 2eb0d5 5661] PUNT PATH (fed_punject_fd_2_pds_m
RX: get_src_zoneid failed
[11/25/13 07:05:53.814 UTC 2eb0d6 5661] PUNT PATH (fed_punject_get_acl_lo
695): RX: : Invalid CMI2
[11/25/13 07:05:53.814 UTC 2eb0d7 5661] PUNT PATH (fed_punject_fd_2_pds_m
get_acl_log_direction failed
[11/25/13 07:05:53.814 UTC 2eb0d8 5661] PUNT PATH (fed_punject_get_acl_fu
724):RX: DI 0x513b ACL Full Direction 1
[11/25/13 07:05:53.814 UTC 2eb0d9 5661] PUNT PATH (fed_punject_get_source
RX: Source SGT 0
[11/25/13 07:05:53.814 UTC 2eb0da 5661] PUNT PATH (fed_punject_get_first_
RX: FirstHeaderType 0
[11/25/13 07:05:53.814 UTC 2eb0db 5661] PUNT PATH (fed_punject_rx_process
RX: fed_punject_pds_send packet 0x1f00 to IOSd with tag 65561
[11/25/13 07:05:53.814 UTC 2eb0dc 5661] PUNT PATH (fed_punject_rx_process
RX: **** RX packet 0x2360 on qn 16, len 128 ****
[11/25/13 07:05:53.814 UTC 2eb0dd 5661]
buf_no 0 buf_len 128
<snip>
This output provides plenty of information and should typically be enough to discover where the
packets come from and what is contained in them.
The first part of the header dump is again the Meta-data that is used by the system. The
second part is the actual packet.
You can choose to trace this source MAC address in order to discover the culprit port (once you
have identified that this is the majority of the packets that are being punted from queue 16; this
output only shows one instance of the packet and the other output/packets are clipped).
However, there is a better way. Notice the logs that are present after the header information:
The first log clearly tells you from which queue and tag this packet comes. If you were not aware
of the queue earlier, this is an easy way to identify which queue it was.
The second log is even more useful, because it provides the physical Interface ID Factory (IIF)-
ID for the source interface. The hex value is a handle that can be used in order to dump
information about that port:
You have once again identified the source interface and culprit.
Tracing is a powerful tool that is critical in order to troubleshoot high CPU usage problems and
provides plenty of information in order to successfully resolve such a situation.
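When the trace contains many packets, the two log lines discussed above (queue and tag, and physical IIF-id) can be tallied from a saved copy of the output. The following is a minimal Python sketch, assuming the "RX: Q: <n>, Tag: <n>" and "Physical IIF-id 0x..." strings appear as in the sample trace. The function name summarize_trace is hypothetical.

```python
import re
from collections import Counter

# Two log entries copied from the trace output shown earlier.
SAMPLE = """\
[11/25/13 07:05:53.814 UTC 2eb0ce 5661] PUNT PATH (fed_punject_rx_process
830):RX: Q: 16, Tag: 65561
[11/25/13 07:05:53.814 UTC 2eb0cf 5661] PUNT PATH (fed_punject_get_physic
579):RX: Physical IIF-id 0x104d88000000033
"""

QUEUE_TAG = re.compile(r"RX: Q: (\d+), Tag: (\d+)")
PHYS_IIF = re.compile(r"Physical IIF-id (0x[0-9a-fA-F]+)")


def summarize_trace(text):
    """Count (queue, tag) pairs and physical IIF-ids seen in trace text."""
    queue_tags = Counter((int(q), int(t)) for q, t in QUEUE_TAG.findall(text))
    iif_ids = Counter(PHYS_IIF.findall(text))
    return queue_tags, iif_ids


if __name__ == "__main__":
    queue_tags, iif_ids = summarize_trace(SAMPLE)
    print(queue_tags.most_common(1))
    print(iif_ids.most_common(1))
```

The most frequent physical IIF-id is the handle to look up first, since it most likely belongs to the port that sources the majority of the punted packets.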
Sample Embedded Event Manager (EEM) Script for the Cisco Catalyst 3850 Series Switch
Use this command in order to trigger a log to be generated at a specific threshold:
process cpu threshold type total rising <CPU %> interval <interval in secon
switch <switch number>
The total CPU utilization at the time of the trigger. This is identified by Total CPU
Utilzation(total/Intr) :50/0 in this example.
Top processes - these are listed in the format of PID/CPU%. So in this example, these are:
8622/25 - 8622 is PID for IOSd and 25 implies that this process is us
5753/12 - 5753 is PID for FED and 12 implies that this process is usi
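The two fields described above can be extracted from a collected log with a short script. The following is a minimal Python sketch that works only on the fragments quoted in this section: the total/interrupt value and the PID/CPU% pairs. The layout of the rest of the log message is not assumed, and the helper names (parse_total, parse_top) are hypothetical.

```python
import re

# Fragments quoted in this section, not a complete log message.
SAMPLE_TOTAL = "Total CPU Utilzation(total/Intr) :50/0"
SAMPLE_TOP = "8622/25, 5753/12"

TOTAL_INTR = re.compile(r"\(total/Intr\)\s*:\s*(\d+)/(\d+)")
PID_CPU = re.compile(r"(\d+)/(\d+)")


def parse_total(fragment):
    """Return (total %, interrupt %) from the total/Intr fragment, or None."""
    match = TOTAL_INTR.search(fragment)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def parse_top(fragment):
    """Return a list of (PID, CPU %) tuples from a list of PID/CPU% pairs."""
    return [(int(pid), int(cpu)) for pid, cpu in PID_CPU.findall(fragment)]


if __name__ == "__main__":
    print(parse_total(SAMPLE_TOTAL))
    print(parse_top(SAMPLE_TOP))
```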
Note: The process cpu threshold command does not currently work in the 3.2.X train.
Another point to remember is that this command looks at the average CPU utilization among
the four cores and generates a log when this average reaches the percentage that has been
defined in the command.
Related Information
What is Cisco IOS XE?
Cisco Catalyst 3850 Switches - Data Sheets and Literature
Technical Support & Documentation - Cisco Systems