Cluster Installation Troubleshooting
• Installing XtremApp
• Support Documentation
XtremApp 4.0
XtremApp Installation
XMS
...
Image locations: root:/var/lib/xms/images, xmsupload:/images
Install menu
-------------------------------------
1. Configuration
2. Check configuration
3. Display configuration
4. Display installed XtremApp Version
5. Perform XMS installation only
...
99. Exit
> 5
Enter Installation image filename:
> upgrade-to-4.0.2-xx.tgz
Installing XMS
...
...
Install menu
-------------------------------------
1. Configure XMS
2. Check XMS configuration
3. Display XMS configuration
4. Display XMS version
5. Install XMS only
6. Install storage controllers only
7. Configure ESRS IP Client
8. Upgrade SC firmware
...
15. Restore XMS data
16. Disable root ssh access
99. Exit install menu
> 6
Please enter management Storage Controller
> <X1-SC1-MGMT-IP-ADDR>
Please enter expected number of bricks:
> 2
Please enter installation image filename:
> upgrade-to-4.0.2-xx.tgz
Running: /xtremapp/utils/fresh_install.py ...
Root Cause:
• The XMS connects to port 11111 on the management IP to discover all SC IPs
• If the process on the management SC is not running, the installation fails because it
cannot connect to that port
Resolution:
• SSH to each SC and run "xtremapp-restart" to restart the processes
• Or power cycle the SC
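Before retrying the installation, the port check described in the root cause above can be scripted from the XMS. This is a minimal sketch, assuming TCP and that the host argument is the management SC IP; the function name and timeout are illustrative, not part of the product:

```python
import socket

def can_reach_discovery_port(host, port=11111, timeout=5):
    """Return True if a TCP connection to the SC discovery port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns False for the management SC, restart the XtremApp processes as above before rerunning the installer.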
Root Cause:
• The XMS discovers all SC IPs over IB, then compares the count with the brick number; it
matches when the number of SC IPs equals bricks*2 (there are 2 SCs per brick)
• If the SCs are at different versions or IB has a connection issue, the number of
discovered IPs may be less than expected
Resolution:
• Make sure you input the correct brick number
• Make sure the IB connections are in place
• Make sure all SCs are at least on the 4.0-49 base image
• Manually input the SC IPs if they are not discovered by the script
This is an enhancement in 4.0; in previous releases you had to reimage all SCs to the base
image
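The bricks*2 check described above is simple to reproduce when triaging discovery output; this sketch (function name and return shape are illustrative) counts unique discovered SC IPs against the expected total:

```python
def check_discovered_scs(discovered_ips, expected_bricks):
    """Each brick has 2 Storage Controllers, so bricks*2 unique SC IPs are expected.

    Returns (ok, number_of_missing_scs).
    """
    expected = expected_bricks * 2
    found = len(set(discovered_ips))
    return found >= expected, max(expected - found, 0)
```

A shortfall points at an IB connection issue, a version mismatch, or a wrong brick number being entered.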
Root Cause:
• This is a protection mechanism introduced in 4.0
• The XMS reads the cluster name from the existing management SC; if it can get this
information, the XMS assumes this is a working cluster with data
• The XMS then asks you to input the PSNT to make sure you DO NOT ERASE a production
cluster
Resolution:
Double-check the PSNT you have and input the correct value. If it matches, the installation
continues; otherwise it aborts.
Root Cause:
This is because the XMS is not able to detect the cluster status
Resolution:
1. Reset the cluster status to the default with the following steps:
• SSH to all SCs
• #xtremapp-reformat
• #xtremapp-restart
2. Or reimage the SCs and reconfigure the information
Root Cause:
• When the XMS tries to find all SCs, it connects to the management SC (X1-SC1) and then
discovers the other SCs over the IB network
• The XMS then checks the free space on the SCs by SSHing to each SC over Ethernet
• If it cannot SSH to any SC, it fails with this error
Resolution:
Try SSHing to all SCs from the XMS to see whether they are reachable, then troubleshoot
with the customer's network admin
Resolution:
• Connect to the XMS and all SCs over SSH, then run "ethtool eth0" and make sure the
speed is 1G and the duplex mode is Full; if not:
a. Check the settings on the switch
b. Replace the cable
• SCP the installation file from the XMS to each SC to find which one is slow
#scp /var/lib/xms/images/upgrade-to-4.0.2-80.tar [email protected]:/var/common
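To find the slow SC systematically, the scp test above can be timed per host. A sketch, assuming key-based SSH and that the image path and destination match the example command; the function names are illustrative:

```python
import subprocess
import time

def time_scp(image, host, dest="/var/common", user="root"):
    """Time one scp of the installation image to a single SC."""
    start = time.monotonic()
    subprocess.run(["scp", image, f"{user}@{host}:{dest}"], check=True)
    return time.monotonic() - start

def slowest_host(timings):
    """Given {host: seconds}, return the host with the longest transfer."""
    return max(timings, key=timings.get)
```

An SC whose transfer takes far longer than its peers usually has the bad cable or switch port.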
Root Cause:
Since the XMS SCPs installation files to the SCs, this can happen in these scenarios:
• DNS is configured on the XMS, and there is a server named "none" in the customer's
domain that supports SSH
• There is an existing server on the customer's network with the same IP address as the SC
Resolution:
• Disable the DNS setting on the XMS server
• SSH to the SC IP with the default username and password; if it fails, check whether there
is an IP conflict within the customer's network
Root Cause:
• Once the XMS has transferred the installation file to all SCs, it extracts the file on each
SC and runs the scripts from the package
• During the installation this file could be corrupted, so it may fail to call some scripts
Resolution:
• Log in to the XMS and check the md5 of the installation file, e.g.:
#md5sum /var/lib/xms/images/upgrade-to-4.0.2-80.tar
302c797636385eedc91237b74f00c98b /var/lib/xms/images/upgrade-to-4.0.2-80.tar
• Make sure it is the same value as on download.emc.com
Root Cause:
As in the XtremApp installation, the XMS connects to the management SC IP; the failure may
be reported if the SC is
• Not reachable from the XMS
• Or the XtremApp process is not running
Resolution:
• Verify that you can SSH from the management IP to each SC
• Run "xtremapp-restart" or reboot the SC to restart the XtremApp processes
Root Cause:
• The installation discovers all hardware components, verifies the connections, and then
adds them to the SYM database; if anything is unexpected, the XMS reports the errors
• The detailed error message is included in xms.log
Resolution:
Follow the error message and run the following commands on all SCs accordingly
• Check BBUs
#upsc eaton1550 |grep ser
In a single brick, the serial numbers should be different and match the two BBUs; if not,
identify the faulty part (cable/BBU/SC) by swapping and retesting
In multiple bricks, verify that both SCs in the same brick report the same serial number;
if not, check the COM cable connection; if nothing is reported, check for a wrong part
number
• Check IB ports: the rate should be 40G for every port, so you should see 2 lines
• Check the fibre channel card
#python -O /xtremapp/utils/qla_wwn.pyo -g
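The BBU serial comparison above can be automated on the `upsc` output. This is a sketch assuming NUT-style output where the serial appears on a line like `ups.serial: G119D31108` (the exact variable name and serial values here are illustrative):

```python
def extract_serials(upsc_output):
    """Pull serial-number values out of `upsc` output lines ending in 'serial'."""
    return [line.split(":", 1)[1].strip()
            for line in upsc_output.splitlines()
            if ":" in line and line.split(":", 1)[0].strip().endswith("serial")]

def same_brick_serials_match(out_sc1, out_sc2):
    """In a multi-brick cluster, both SCs of one brick should report the same BBU serial."""
    return extract_serials(out_sc1) == extract_serials(out_sc2) != []
```

A mismatch between the two SCs of one brick points at the COM cable; an empty result points at a wrong or unreported part.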
Root Cause:
• This is a general message for a HW failure; here is an example for an SSD
• The detailed error message is included in xms.log, as follows:
Wrong: 2016-04-29 18:59:56,197 - XmsLogger - INFO - mom::check_property_change_events:813 - txda1xio02: txda1xio02:
Threshold crossing of SSD very high utilization; value changed to healthy, There are 0 KB remaining.
Resolution:
1. Run "lsscsi |wc -l" to check whether the number of SSDs is as expected
2. Ask the CE to verify that all SSD LCC lights are good
3. Ask the CE to power cycle the DAE if the SC cannot see all disks
4. Create the cluster again
Root Cause:
• This is a general message for a HW failure during initialization due to a memory issue
• The detailed error message is included in xms.log
Resolution:
1. Run the "free" command to check the SC memory: GEN2 has 256G and GEN3 has 512G
2. This issue always happens on GEN3; on GEN2, if the memory does not equal 256G, the
SC cannot boot up
3. Replace the SC if it is not at 512G
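The memory check above can be done on the plain `free` output, whose "Mem:" row reports KiB by default. A sketch; the 5% margin for firmware/kernel-reserved memory and the function names are assumptions, not product values:

```python
def total_mem_gib(free_output):
    """Parse the 'Mem:' row of plain `free` output (KiB units) into GiB."""
    for line in free_output.splitlines():
        if line.startswith("Mem:"):
            return int(line.split()[1]) / (1024 ** 2)
    raise ValueError("no Mem: row found")

def sc_memory_ok(free_output, generation):
    """Check the SC total memory against the expected size per HW generation."""
    expected_gib = {"GEN2": 256, "GEN3": 512}[generation]
    # Allow a small margin for memory reserved by firmware/kernel
    return total_mem_gib(free_output) >= expected_gib * 0.95
```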
Root Cause:
• This is a general message for a HW failure due to a SAS cable
• The detailed error message is included in xms.log
Wrong: 2015-12-29 21:23:52,431 - XmsLogger - WARNING - executor::<lambda>:9576 - 172.22.185.127 reformat error:
xtremapp-reformat[229417]: found 50 drives
Resolution:
1. Run "/xtremapp/utils/mpt_status |grep "link speed"" to check the SAS connection, e.g.:
[root@XtremIO_91-x1-n1 utils]# /xtremapp/utils/mpt_status |grep "link speed"
Negotiated link speed: aa
Negotiated link speed: aa
Negotiated link speed: aa
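The mpt_status output can be checked programmatically for link-speed mismatches, which typically point at a bad SAS cable. A sketch; the output above redacts the actual speed values, so the "6.0 Gbps" strings in the test are purely illustrative:

```python
def link_speeds(mpt_status_output):
    """Collect the 'Negotiated link speed' values from mpt_status output."""
    return [line.split(":", 1)[1].strip()
            for line in mpt_status_output.splitlines()
            if "Negotiated link speed" in line]

def links_uniform(mpt_status_output):
    """All SAS links should negotiate the same speed; a mismatch suggests a bad cable."""
    speeds = link_speeds(mpt_status_output)
    return bool(speeds) and len(set(speeds)) == 1
```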
Root Cause:
• The XMS brings eth0 down and then back up during cluster creation
• For a physical XMS, this can take longer, so the connection to the SC is lost
2015-07-07 13:58:03,686 - XmsLogger - INFO - executor::_expand_cluster:11157 - Added slot info for slot 24
2015-07-07 13:58:10,546 - XmsLogger - ERROR - executor::poll_nodes_for_keepalive_in_thread:1342 - System error: [Errno
113] No route to host
2015-07-07 13:58:10,547 - XmsLogger - ERROR - executor::poll_nodes_for_keepalive_in_thread:1345 - Keep Alive Error for
Node X1-SC1 [1]: [Errno 113] No route to host
Resolution:
1. Use a virtual XMS
2. Or comment out the following line in the installation script
/xtremapp/bin/network_config.py on the XMS:
#ifdown_poor_simulator("eth0") # Disable the DHCP brought interface
system("ip -f %s addr add %s/%s dev eth0 ; ip link set eth0 up" % (family, ip, cidr))
system("ip -f %s route del default" % family)
Root Cause:
From 4.0, in a multiple-cluster environment, a higher-version XMS (4.0.2) can be used to
manage a lower-version XtremIO (4.0.1), but creating a cluster in such a combination is
not supported
Resolution:
1. Deploy a new XMS with the same version as the SCs
2. Or upgrade the SCs to the new version
Root Cause:
When creating an expanded cluster from existing SCs plus new SCs, for some reason the
PSNT of the new SCs is not the same as that of the existing SCs
Resolution:
• Log in to each SC and verify the PSNT, e.g.:
#/xtremapp/utils/get_psnt.sh
CKM00XXXXXXXXX
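The per-SC PSNT check can be collected and compared from the XMS. A sketch, assuming key-based SSH as root and the get_psnt.sh path shown above; the wrapper names are illustrative:

```python
import subprocess

def get_psnt(host, user="root"):
    """Run the get_psnt.sh helper on one SC over ssh and return its output."""
    result = subprocess.run(
        ["ssh", f"{user}@{host}", "/xtremapp/utils/get_psnt.sh"],
        capture_output=True, text=True, check=True)
    return result.stdout.strip()

def psnts_consistent(psnts):
    """All SCs going into one cluster must report the same PSNT."""
    return len(set(psnts)) == 1
```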
Root Cause:
This happens when reinstalling a 4.0.x XtremIO over a 3.0.x XtremIO. The older code did
not validate the SC's part number, but 4.0.x does. The part number should be 900-586-xxx,
but for some old SCs it is 100-586-xxx, and it needs to be modified manually.
Resolution:
• Log in to the SC and verify the PN, e.g.:
#/xtremapp/utils/get_psnt.sh
100-586-007-00
• Burn the correct PN to the SCs:
#/xtremapp/utils/burn_psnt.sh -n 900-586-002
• Verify the PSNT again with get_psnt
Correct PNs:
900-586-002 XtremIO HW Gen2 400GB
900-586-003 XtremIO HW Gen2 800GB Encrypt Capbl
900-586-004 XtremIO HW Gen2 400GB Encrypt Capbl
900-586-005 XtremIO HW Gen2 400GB Expandable
900-586-006 XtremIO HW Gen3 40TB
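The part-number validation described above boils down to a prefix check (900-586 good, 100-586 needs a manual fix). A sketch with an illustrative function name; mapping an old PN to the right 900-586 entry is a manual step against the table above:

```python
VALID_PN_PREFIX = "900-586-"

def pn_valid(part_number):
    """4.0.x validates that the SC part number starts with 900-586."""
    return part_number.startswith(VALID_PN_PREFIX)
```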
Root Cause:
• This is a protection mechanism introduced in 4.0.x
• The XMS checks whether the SC belongs to any cluster; for example, you will see such
errors if you previously tried to create this cluster and it failed
• The XMS asks you to input the cluster PSNT; please DO NOT ERASE a production cluster
Resolution:
Double-check the PSNT, then run the command:
#create-cluster expected-number-of-bricks=1 sc-mgr-host="xx.xx.xx.xx" cluster-
name="xxxxx" cluster-psnt="xxxxxx"