Troubleshooting Common VMware ESX Host Server Problems PDF
Panicking at the onset of a high-impact technical problem can lead to impulsive
decisions that make the problem worse. Before you troubleshoot any
problem, pause and relax so you approach the task with a clear mind, then
address each symptom, possible cause, and resolution in turn.
In this series, I offer solutions for many common problems that arise
with VMware ESX host servers, VirtualCenter, and virtual machines in
general. Let's begin by addressing common issues with VMware ESX host
servers.
Windows server administrators have long been familiar with the dreaded Blue
Screen of Death (BSOD), which signifies a complete halt by the server.
VMware ESX has a similar state called the purple screen of death (PSOD),
which is typically caused by hardware problems or a bug in the VMware code.
Unfortunately, other than recording the information on the screen, your only option
when experiencing a PSOD is to power the server off and back on. Once the server
reboots, you should find a vmkernel-zdump-* file in the server's /root directory. This
file is valuable for determining the cause. You can use the vmkdump utility to
extract the vmkernel log from it (vmkdump -l followed by the dump file name) and
examine it for clues as to what caused the PSOD. VMware support will usually want
this file as well. One common cause of PSODs is defective server memory; the dump
file can help identify which memory module caused the problem so it can be replaced.
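As a rough sketch of the step above, the helper below picks the most recent dump in a directory so it can be handed to vmkdump. The function name latest_zdump is my own and purely illustrative; the file name pattern and the -l option come from the text above, and the sketch assumes the dumps sit directly in /root with no spaces in their names.

```shell
#!/bin/sh
# Hypothetical helper: print the newest vmkernel-zdump-* file in a directory.
# Assumes dump file names contain no spaces or newlines.
latest_zdump() {
    dir=${1:-/root}
    # ls -t sorts by modification time, newest first; keep only the first entry.
    ls -t "$dir"/vmkernel-zdump-* 2>/dev/null | head -n 1
}

# Example use on the service console:
#   dump=$(latest_zdump /root)
#   [ -n "$dump" ] && vmkdump -l "$dump"
```

If no dump is present, the function prints nothing, so the example skips the vmkdump call instead of running it on an empty name.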
While RAM check is running, it logs all activity and any errors to the
/var/log/vmware directory in files called ramcheck.log and ramcheck-err.log. One
drawback, however, is that it is hard to test all of your RAM with this utility if you
have virtual machines (VMs) running, because it only tests RAM that is unused in the
ESX system. A more thorough method of testing your server's RAM is to shut down ESX,
boot from a CD, and run Memtest86+.
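A quick way to look at the result is to check whether the error log has anything in it. The sketch below does only that. The function name ramcheck_has_errors is hypothetical, and it assumes that an empty or missing ramcheck-err.log means nothing was reported, which you should confirm on your own host.

```shell
#!/bin/sh
# Hypothetical helper: succeed if ramcheck-err.log exists and is not empty.
# Assumption: an empty or missing error log means no errors were logged.
ramcheck_has_errors() {
    log_dir=${1:-/var/log/vmware}
    # -s is true only when the file exists and is larger than zero bytes.
    [ -s "$log_dir/ramcheck-err.log" ]
}

# Example use:
#   if ramcheck_has_errors; then cat /var/log/vmware/ramcheck-err.log; fi
```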
Alternatively, you can generate the same file by using the VMware Infrastructure
Client (VI Client). Select Administration, then Export Diagnostic Data, and select
your host (VirtualCenter data optional) and a directory on your local PC to store the
file that will be created.
As part of the troubleshooting process, you will often need to find out the versions
of various ESX components and which patches are applied. Below are some
commands you can run from the service console to do this:
Type vmware -v to check the ESX Server version, e.g., VMware ESX Server 3.0.1
build-32039.
Type vpxa -v to check the ESX Server management version, e.g., VMware
VirtualCenter Agent Daemon 2.0.1 build-40644.
Type rpm -qa | grep VMware-esx-tools to check the installed ESX Server VMware
Tools version, e.g., VMware-esx-tools-3.0.1-32039.
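When you collect these strings from several hosts, it can help to reduce each banner to just a version and a build number. The sketch below is one way to do that with sed. The function name parse_esx_version is my own, and the pattern assumes the "X.Y.Z build-NNNNN" shape shown in the examples above; the rpm package name uses a different shape and is deliberately not matched.

```shell
#!/bin/sh
# Hypothetical helper: read a version banner on stdin and print
# "<version> <build>". Prints nothing if the line does not contain
# the " X.Y.Z build-NNNNN" shape.
parse_esx_version() {
    sed -n 's/^.* \([0-9][0-9.]*\) build-\([0-9][0-9]*\).*$/\1 \2/p'
}

# Example use on the service console:
#   vmware -v | parse_esx_version
#   vpxa -v | parse_esx_version
```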
NOTE: ESX 3.0.1 contained a bug that would restart all your VMs if your ESX server
was configured to use auto-startups for your VMs. This bug was fixed in a patch for
3.0.1 and also in 3.0.2, but appeared again in ESX 3.5 with another patch released to
fix it. It's best to temporarily disable auto-startups before you run this command.
In some cases restarting the vmware-vpxa service when you restart the host agent will
fix problems that occur between ESX and both the VI Client and VirtualCenter. This
service is the management agent that handles all communication between ESX and its
clients. To restart it, log into the ESX host and type service vmware-vpxa restart. It
is important to note that restarting either of these services will not impact the
operation of your virtual machines (with the exception of the bug noted above).
If you are able to shut down or move your VMs, you can try rebooting the server
by issuing the reboot command through the VI Client or an alternate console. If not,
cold-booting the server is your only option.
The Purple Screen of Death, commonly known as PSOD, is something we often run into
when we run an ESXi host.
Usually when we experience a PSOD, we reboot the host (which is a must), then gather
the logs and upload them to VMware support for analysis (where I spend a good amount
of time going through them).
Step 1:
I am going to simulate a PSOD on my ESXi host. You need to be logged in to the host
over SSH. The command is
When you then open the DCUI of the ESXi host, you can see the PSOD.
Step 2:
Sometimes we miss the screenshot of the PSOD. That's all right! If a core dump is
configured for the ESXi host, we can extract the dump files to gather the crash logs.
Reboot the host if it is still at the PSOD screen. Once the host is back up, log in to
the host over SSH (for example with PuTTY) and go to the core directory. This is the
location where the PSOD dumps are written.
# cd /var/core
Then list out the files here:
# ls -lh
Here you can see the vmkernel dump file, and the file is in the zdump format.
Step 3:
How do we extract it?
There is a handy extraction script that does the whole job: vmkdump_extract. Run it
against the zdump.1 file, like this:
# vmkdump_extract vmkernel-zdump.1
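If you do this often, a small wrapper can check that the dump is really there before calling the extractor. The sketch below is only an illustration. The function name extract_core and its two optional arguments are hypothetical; the directory, the dump file name, and the vmkdump_extract call are the ones used in the steps above.

```shell
#!/bin/sh
# Hypothetical wrapper around the extraction step shown above.
# $1: core directory (default /var/core)
# $2: dump file name (default vmkernel-zdump.1)
extract_core() {
    core_dir=${1:-/var/core}
    dump=${2:-vmkernel-zdump.1}
    if [ ! -f "$core_dir/$dump" ]; then
        echo "no $dump in $core_dir" >&2
        return 1
    fi
    # Run in a subshell so the caller's working directory is unchanged.
    ( cd "$core_dir" && vmkdump_extract "$dump" )
}

# Example use:
#   extract_core && less /var/core/vmkernel-log.1
```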
Step 4:
Open the vmkernel-log.1 file using the command below:
# less vmkernel-log.1
Skip to the end of the file by pressing Shift+G. Now slowly work back toward the top
by pressing PageUp.
You will come across a line that says @BlueScreen: <event>, followed by a backtrace:
2015-12-17T20:34:03.603Z cpu3:47209)0x412461a5dc10:[0x41802128d249]PanicvPanicInt@vmkernel#nover+0x575 stack: 0x726f632000000008
2015-12-17T20:34:03.603Z cpu3:47209)0x412461a5dc70:[0x41802128d48d]Panic_NoSave@vmkernel#nover+0x49 stack: 0x412461a5dcd0
2015-12-17T20:34:03.604Z cpu3:47209)0x412461a5dd60:[0x41802157a63b]CrashMeCurrentCore@vmkernel#nover+0x553 stack: 0x100000278
2015-12-17T20:34:03.604Z cpu3:47209)0x412461a5dda0:[0x41802157a8ca]CrashMe_VsiCommandSet@vmkernel#nover+0x13e stack: 0x0
2015-12-17T20:34:03.604Z cpu3:47209)0x412461a5de30:[0x41802160c3c7]VSI_SetInfo@vmkernel#nover+0x2fb stack: 0x41109d630330
2015-12-17T20:34:03.604Z cpu3:47209)0x412461a5dec0:[0x4180217bd7a7]UWVMKSyscallUnpackVSI_Set@<none>#<none>+0xef stack: 0x412461a67000
2015-12-17T20:34:03.604Z cpu3:47209)0x412461a5df00:[0x418021783a47]User_UWVMKSyscallHandler@<none>#<none>+0x243 stack: 0x412461a5df20
2015-12-17T20:34:03.604Z cpu3:47209)0x412461a5df10:[0x4180212aa90d]User_UWVMKSyscallHandler@vmkernel#nover+0x1d stack: 0xffbc0bb8
2015-12-17T20:34:03.604Z cpu3:47209)0x412461a5df20:[0x4180212f2064]gate_entry@vmkernel#nover+0x64 stack: 0x0
The first line, @BlueScreen:, gives the crash exception, such as Exception 13 or 14;
in my case it is CrashMe, which indicates a manual crash.
The VMKuptime line tells you how long the kernel was up before the crash.
The logging after that is what we need to look at: it points to the cause of
the crash.
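Rather than paging through the whole log, you can pull out the interesting lines directly. The two helpers below are a minimal sketch. The function names are my own, and the sed pattern assumes each backtrace frame has the "[0xADDRESS]Symbol@module" shape seen in my log; other builds may format frames differently.

```shell
#!/bin/sh
# Hypothetical helper: print every @BlueScreen line, prefixed with its
# line number in the log, so you can jump straight to it in less.
bluescreen_lines() {
    grep -n '@BlueScreen' "$1"
}

# Hypothetical helper: print the function name of each backtrace frame,
# in the order the frames appear in the log.
# Assumes frames look like "...[0xADDRESS]Symbol@module#version+0xOFFSET".
backtrace_symbols() {
    sed -n 's/.*\[0x[0-9a-f]*\]\([A-Za-z_][A-Za-z0-9_]*\)@.*/\1/p' "$1"
}

# Example use:
#   bluescreen_lines vmkernel-log.1
#   backtrace_symbols vmkernel-log.1
```

For a manual crash like mine, the symbol list includes the CrashMe functions, which matches what the @BlueScreen line reports.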
The crash dump differs for every crash. The causes can range from hardware
errors and driver issues to problems with the ESXi build, and a lot more.
Each dump analysis is different, but the basics are the same.
So you can try analyzing the dumps yourself. However, if you are entitled to VMware
support, I will do the job for you.
Cheers!