OneFS Upgrade Failures: EMC Isilon Customer Troubleshooting Guide
Abstract
This guide helps you troubleshoot OneFS upgrade failures and error
messages received during upgrades.
September 15, 2015
1. Follow these steps.
2. Perform troubleshooting steps in order.
   Start Troubleshooting: Page 5
   Simultaneous Upgrade: Page 11
   Rolling Upgrade: Page 12
3. Appendices
   Appendix A: If you need further assistance
   Appendix B: How to use this flow chart
CAUTION!
If the node, subnet, or pool you are working on goes down during the course of
troubleshooting and you do not have any other way to connect to the cluster, you could
experience data unavailability.
Therefore, make sure you have more than one way to connect to the cluster before you
start this troubleshooting process. The best method is to have a serial cable available.
That way, if you are unable to connect through the network, you will still be able to
connect to the cluster physically.
For specific requirements and instructions for making a physical connection to the
cluster, see article 16744 on the EMC Online Support site.
Before you begin troubleshooting, confirm that you can either connect through another
subnet or pool, or that you have physical access to the cluster.
Most upgrade problems occur during rolling upgrades that are initiated
from the OneFS web administration interface.
For best results, do the following:
3. Run the following command to capture all input and output of the session:
screen -L
This creates a file called screenlog.0 that is appended to during your session.
4. Perform troubleshooting.
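The session logging described in this section can be double-checked with a small helper. This is a minimal sketch, not part of OneFS: check_screenlog is a hypothetical function name, and the default path assumes that screen -L was started from the current directory. Note that screen may buffer log writes for a few seconds, so an empty file shortly after starting is not necessarily a problem.

```shell
# Sketch: confirm that "screen -L" is writing a session log.
# check_screenlog is a hypothetical helper name, not an Isilon command.
# ASSUMPTION: screenlog.0 is in the current directory unless a path is given.
check_screenlog() {
    cs_log="${1:-screenlog.0}"
    if [ ! -f "$cs_log" ]; then
        echo "missing: $cs_log"
        return 1
    fi
    if [ ! -s "$cs_log" ]; then
        echo "empty: $cs_log"
        return 1
    fi
    # wc -c pads its output on some systems, so strip the spaces.
    echo "ok: $cs_log ($(wc -c < "$cs_log" | tr -d ' ') bytes)"
}
```

Run check_screenlog with no arguments from the directory where you started screen, or pass the full path to the log file.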
Troubleshooting
Analysis
Introduction
Start troubleshooting here. If you need
help understanding the flow chart
conventions used in this guide, see
Appendix B: How to use this flow chart.
Note
Most upgrade problems occur during rolling upgrades that are initiated from the OneFS
web administration interface. Therefore, we will use the command-line interface
exclusively to troubleshoot your issue and get your upgrade restarted. For more
information, see "Best practices and useful information" on page 4.
Start
Did the upgrade fail with a specific error displayed on the screen?
No
Go to Page 6
No
Go to Page 6
Yes
Can the upgrade be completed successfully now?
Yes
End troubleshooting
Troubleshooting, continued
Analysis, continued
You could have arrived here from:
Page 5 - Analysis
Page 8 - Nodes did not all come back online
Page 6
Run the following command to see which nodes were successfully upgraded:
isi_for_array -s "uname -a"
After running the command, do you see this error?
ERROR Client connected from an unprivileged port number 50230. Refusing the connection
[Errno 54] RPC session disconnected
Yes
No
Go to Page 7
Analysis, continued
You could have arrived here from:
Page 7
Do any nodes report as down? A down node is one that failed to join the cluster
following the upgrade.
Yes
Go to Page 8
No
Go to Page 9
No
Using the output of the isi_for_array -s "uname -a" command from Page 6, are all the
nodes running the new version of OneFS?
Yes
End troubleshooting
Nodes did not all come back online
You could have arrived here from:
Has it been at least 15 minutes since the nodes rebooted as part of the upgrade?
No
Yes
Wait 15 minutes
Go back to Page 6
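The 15-minute check on this page is simple arithmetic on timestamps. The following is a minimal sketch under stated assumptions: minutes_since and waited_long_enough are hypothetical helper names, and you supply the reboot time yourself as epoch seconds (this guide does not say how to obtain it).

```shell
# Sketch: has it been at least 15 minutes since a given time?
# Both function names are hypothetical helpers, not Isilon commands.
# ASSUMPTION: the reboot time is passed in as epoch seconds.

# Prints whole minutes elapsed between $1 and $2 (default: now).
minutes_since() {
    ms_start="$1"
    ms_now="${2:-$(date +%s)}"
    echo $(( (ms_now - ms_start) / 60 ))
}

# Succeeds when at least 15 minutes have elapsed.
waited_long_enough() {
    [ "$(minutes_since "$@")" -ge 15 ]
}
```

For example, record the output of date +%s when the nodes reboot, then run waited_long_enough with that value before returning to Page 6.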
Analysis, continued
You could have arrived here from:
Did you follow the steps in the "Planning an Upgrade" and "Completing pre-upgrade tasks"
sections of the OneFS Upgrade Planning and Process Guide before beginning the upgrade?
Yes
No
Go to Page 10
Analysis, continued
You could have arrived here from:
Page 9 - Analysis, continued
Page 10
Did you perform a simultaneous upgrade or a rolling upgrade?
Simultaneous
Go to Page 11
Rolling
Go to Page 12
Simultaneous upgrade
Page 11
Yes
Do any nodes report as down? A down node is one that failed to join the cluster
following the upgrade.
No
Go to Page 14
Rolling upgrade
You could have arrived here from:
Page 12
Page 10 - Analysis, continued
Run the following command to determine which nodes did not get upgraded:
isi_for_array -s "uname -a"
The output provides a list of all the nodes and indicates which version of
OneFS each is running. For an example of the output, see Appendix C.
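Picking out the nodes that are still on the old version can be sketched as a filter over the command output. This is an illustration only: node_versions and nodes_not_on are hypothetical helper names, and the parsing assumes that each output line starts with the node name followed by a colon and contains a version token such as v7.1.1.2. Compare that assumption with your real output (see Appendix C) before relying on it.

```shell
# Sketch: list nodes that are not yet running the expected OneFS version.
# node_versions and nodes_not_on are hypothetical helpers, not Isilon commands.
# ASSUMPTION: every input line looks like "<node name>: <uname -a text>" and
# the uname text contains a token of the form v<digits>.<digits>...

# Reads the command output on stdin; prints "<node> <version>" per line.
node_versions() {
    awk -F': ' '{
        ver = "unknown"
        n = split($0, words, " ")
        for (i = 1; i <= n; i++) {
            if (words[i] ~ /^v[0-9]+(\.[0-9]+)+$/) { ver = words[i]; break }
        }
        print $1, ver
    }'
}

# Reads the command output on stdin; prints nodes whose version is not $1.
nodes_not_on() {
    nn_expected="$1"
    node_versions | awk -v want="$nn_expected" '$2 != want { print $1 }'
}
```

A possible use is to pipe the output of the isi_for_array command above into nodes_not_on followed by the target version, written in the same form as it appears in the output.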
For each node that did not get upgraded, run the following command to check
that node's /var/log/messages file to see if there are errors with a timestamp
that occurred during the upgrade. In the command, replace <YYYY-MM-DD>
with the date of the upgrade:
grep '^<YYYY-MM-DD>' /var/log/update_engine*
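To avoid typing the date pattern by hand, the search above can be wrapped in a small function. This is a minimal sketch: grep_upgrade_day is a hypothetical helper name, and it only checks that the date has the YYYY-MM-DD shape, not that it is a real calendar date. When no files are given it falls back to the path used in the command above.

```shell
# Sketch: search log files for lines that start with a given date.
# grep_upgrade_day is a hypothetical wrapper, not an Isilon command.
grep_upgrade_day() {
    gd_day="$1"
    shift
    case "$gd_day" in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
        *)
            echo "usage: grep_upgrade_day YYYY-MM-DD [file ...]" >&2
            return 2
            ;;
    esac
    # With no files given, use the path from this guide.
    if [ "$#" -eq 0 ]; then
        set -- /var/log/update_engine*
    fi
    grep "^$gd_day" "$@"
}
```

Like grep itself, the function returns a non-zero status when no lines match, which can be used to tell a clean log from one with entries on the upgrade date.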
Are there errors on a node that did not get upgraded?
No
Go to Page 14
Yes
Is the following error present?
Unable to claim upgrade daemon on one or more nodes.
Yes
Go to Page 13
No
Go to Page 14
Rolling upgrade, continued
Page 13
Page 12 - Rolling upgrade, continued
Is the service enabled or disabled?
Disabled
Enabled
Go to Page 14
Restart the upgrade
You could have arrived here from:
Page 14
Page 11 - Simultaneous upgrade
Page 12 - Rolling upgrade, continued
Page 13 - Rolling upgrade, continued
Open an SSH connection to the
highest-numbered node in the cluster,
and log in using the root account.
Open a screen session by running the following command, where <session name> is a name that
you provide. Record the name in case you need to use it later. The screen session enables you to
easily reconnect to the upgrade process if the session gets disconnected during the upgrade.
screen -S <session name>
If you get disconnected, you can use the following command to reconnect:
screen -x <session name>
Note: If you are running OneFS 7.1.1.2 or 7.1.0.6, skip this step. The screen session feature
does not work in OneFS 7.1.1.2 or 7.1.0.6.
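The two screen commands above can be combined so that one call either reconnects to an existing session or starts a new one. This is a sketch only: attach_or_create is a hypothetical helper name, and it assumes that screen -ls lists sessions in the usual "<pid>.<session name>" form. It does not apply to the OneFS versions named in the note above, where the screen session feature does not work.

```shell
# Sketch: reconnect to a named screen session if it exists, else start it.
# attach_or_create is a hypothetical helper, not an Isilon command.
# ASSUMPTION: "screen -ls" prints each session as "<pid>.<name>" followed
# by whitespace.
attach_or_create() {
    ac_name="$1"
    if screen -ls | grep -q "[.]$ac_name[[:space:]]"; then
        screen -x "$ac_name"
    else
        screen -S "$ac_name"
    fi
}
```

Calling attach_or_create with the session name you recorded behaves like screen -S the first time and like screen -x after a disconnect.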
Yes
Did the upgrade restart?
No
Go to Page 15
Restart the upgrade, continued
You could have arrived here from:
Page 15
Yes
Have all of the nodes been upgraded?
Go to Page 16
No
Post-upgrade checks
You could have arrived here from:
Page 16
Page 15 - Restart the upgrade, continued
Yes
Do any nodes report as down? A down node is one that failed to join the cluster
following the upgrade.
Go to Page 17
No
End troubleshooting
Nodes did not all join the cluster
Page 17
isi status -q
In the output, look at the Health DASR column to see if any nodes
report -D- (Down). For an example of the output, see Appendix D.
Yes
Do any nodes report as down? A down node is one that failed to join the cluster
following the reboot.
No
End troubleshooting
Upload node log files and the screen log file to EMC Isilon Technical Support
1. When troubleshooting is complete, type exit to end your screen session.
2. Gather and upload the node log set and include the SSH screen log file by using the command appropriate for your
method of uploading files. If you are not sure which method to use, then use FTP.
ESRS:
isi_gather_info --esrs --local-only -f /ifs/data/Isilon_Support/screenlog.0
FTP:
isi_gather_info --ftp --local-only -f /ifs/data/Isilon_Support/screenlog.0
HTTP:
isi_gather_info --http --local-only -f /ifs/data/Isilon_Support/screenlog.0
SMTP:
isi_gather_info --email --local-only -f /ifs/data/Isilon_Support/screenlog.0
SupportIQ:
Copy and paste the following command.
Note: When you copy and paste the command into the command-line interface, it will appear on multiple lines (exactly
as it appears on the page), but when you press Enter the command will run as it should.
isi_gather_info --local-only -f /ifs/data/Isilon_Support/screenlog.0 --noupload \
--symlink /var/crash/SupportIQ/upload/ftp
3. If you receive a message that the upload was unsuccessful, refer to article 16759 on
the EMC Online Support site for directions for uploading files over FTP.
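The choice of upload command in step 2 can be summarized as a lookup from method to command line. The sketch below only prints the command; it does not run it. gather_command is a hypothetical helper name, the lowercase method labels are chosen for this sketch, and the flags and paths are copied from the commands listed in step 2.

```shell
# Sketch: print (do not run) the isi_gather_info command for an upload method.
# gather_command and its method labels are hypothetical; the flags and paths
# are copied from step 2 of this section.
gather_command() {
    gc_log="/ifs/data/Isilon_Support/screenlog.0"
    case "${1:-ftp}" in          # FTP is the default when you are not sure
        esrs) gc_flag="--esrs" ;;
        ftp)  gc_flag="--ftp" ;;
        http) gc_flag="--http" ;;
        smtp) gc_flag="--email" ;;
        supportiq)
            echo "isi_gather_info --local-only -f $gc_log --noupload --symlink /var/crash/SupportIQ/upload/ftp"
            return 0
            ;;
        *)
            echo "unknown upload method: $1" >&2
            return 1
            ;;
    esac
    echo "isi_gather_info $gc_flag --local-only -f $gc_log"
}
```

Printing the command first gives you a chance to review it, and it is then captured in the screen log along with the rest of the session.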
Note
Page
#
Yes
No
Decision diamond
CAUTION!
Caution boxes warn that a particular step needs to be performed with great care, to
prevent serious consequences.
Go to Page #
End point
Document Shape
Calls out supporting documentation for a process step. When possible, these shapes
contain links to the reference document. Sometimes linked to a process step with a
colored dot.