IBM Spectrum Scale
Brian Yaeger
[email protected]
March 2017
Spectrum Scale Software Support
Agenda
Support Managers
Follow the sun support – Aligning support staff to customer time zone
• Spectrum Scale Support is growing to better meet customer needs.
• Beginning late 2016 we substantially grew the support team in Beijing, China,
with experienced Spectrum Scale staff.
• Improved response time on severity 1 production outages, reducing the time customers
wait before L2 is engaged as well as the time to resolution.
• Positive impact on timely L2 communication with clients for severity 2, 3, and 4
PMRs within the customer's time zone.
• Full Beijing L2 team integration in follow the sun queue coverage scheduling
starting in May.
• Additional improvements in queue coverage during customer time zone
expected in 2017.
Spectrum Scale Software Support
Support Executive
Andrew Giblon: [email protected]; 1-905-316-2582
Spectrum Scale Software Support
7
Spectrum Scale Field Issues
8
Spectrum Scale Field Issues
9
Scenario 1:
Spectrum Scale isn’t starting after an upgrade.
Review the FAQ: check the relevant IBM Spectrum Scale operating system support tables,
as well as the Linux kernel support table (if appropriate), to confirm that the kernel version
has been tested by IBM.
https://www.ibm.com/support/knowledgecenter/STXKQY/gpfsclustersfaq.html
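A minimal sketch of commands for gathering the levels to check against those tables (assuming a Linux node with Spectrum Scale installed under /usr/lpp/mmfs):
uname -r # running kernel level to look up in the FAQ kernel support table
cat /etc/*release # operating system distribution and release level
/usr/lpp/mmfs/bin/mmdiag --version # installed Spectrum Scale (GPFS) level
If the kernel was updated, the portability layer usually needs to be rebuilt as well (for example with mmbuildgpl on levels that provide it) before the daemon will start.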
10
Scenario 2:
Spectrum Scale cannot find its NSDs.
Spectrum Scale cannot find any disks following a firmware update, operating system upgrade,
storage upgrade, or storage configuration change.
Common causes:
1) Don’t risk your disks! It may be overkill, but if you’re unsure, the safest approach during
firmware and operating system updates is to isolate the machine from the disk LUNs, if
possible, before performing the action.
LUN isolation can typically be performed at either the SAN or controller level through
zoning. To verify that zoning was performed correctly, get the network addresses of the
system HBA(s) used for zoning (using systool on Linux or lscfg on AIX) and cross-reference
them (see the command sketch below).
Prior to the introduction of the NSD v2 format in GPFS 4.1, it was not hard to accidentally
overwrite GPFS NSDs during a Linux operating system upgrade. The NSD v2 format
introduces a GUID Partition Table (GPT), which allows system utilities to recognize that
the disk is in use by GPFS.
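For the zoning cross-reference mentioned above, a hedged sketch of how the HBA addresses can be gathered (adapter names are illustrative):
systool -c fc_host -v | grep -i port_name # Linux: WWPNs of the FC HBAs
lscfg -vl fcs0 | grep "Network Address" # AIX: WWPN of adapter fcs0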
11
Scenario 2:
Spectrum Scale cannot find its NSDs.
Spectrum Scale cannot find any disks following a firmware update, operating system upgrade,
storage upgrade, or storage configuration change.
Even after you’ve migrated all nodes to GPFS 4.1.0.0 or higher, the NSD v2 format GPT does
not apply unless the minReleaseLevel (mmlsconfig) AND the current file system version
(mmlsfs) are updated. These steps of the migration process are important and very often
forgotten (sometimes across multiple release upgrades).
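A minimal sketch of the commonly forgotten steps (the file system name is illustrative; check the current values first):
mmlsconfig minReleaseLevel # current cluster-wide release level
mmchconfig release=LATEST # raise minReleaseLevel once all nodes are upgraded
mmlsfs gpfs0 -V # current and maximum file system format version
mmchfs gpfs0 -V full # enable all features of the new format version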
12
Scenario 2:
Spectrum Scale cannot find its NSDs.
Spectrum Scale cannot find any disks following a firmware update, operating system upgrade,
storage upgrade, or storage configuration change.
2) You’ve changed the disk device type (e.g. generic to powerpath) and “mmchconfig
updateNsdType” needs to be run.
3) The user exit /var/mmfs/etc/nsddevices is in place and overrides the built-in device
discovery, but it no longer lists the devices in question (see the sketch after this list).
4) Ensure monitoring software is disabled during maintenance periods to avoid running
commands that require an internal file system mount.
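For cause 3, a hedged sketch of a /var/mmfs/etc/nsddevices user exit, based on the shipped sample in /usr/lpp/mmfs/samples/nsddevices.sample (device names, the device type, and the return-code convention are assumptions; verify them against the sample at your level):
#!/bin/ksh
# Emit "deviceName deviceType" pairs for the disks GPFS should consider.
for dev in /dev/emcpower*
do
  [ -e "$dev" ] && echo "$(basename $dev) powerpath"
done
# Per the sample: return 0 to use only the list above, return 1 to also run
# the built-in device discovery.
return 0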
13
Scenario 2:
Spectrum Scale cannot find its NSDs.
Helpful commands when troubleshooting a missing NSD:
mmfsadm test readdescraw /dev/sdx #allows you to get information from a GPFS disk
descriptor written when an NSD is created or modified.
tspreparedisk -s # list all logical disks (both physical and virtual) with a valid
PVID (may be impacted by the nsddevices exit script)
tspreparedisk -S # list locally attached disks with a valid PVID; the input list is derived
from "mmcommon getDiskDevices", which on AIX requires disks to show up in the output of
"/usr/sbin/getlvodm -F"
multipath -ll # display multipath device IDs and information about device names
sg_inq -p 0x83 /dev/sdk # Linux - can be used to get the WWN/WWID directly from a device
lscfg -vl fcs0 # AIX
systool -c fc_host -v # Linux (not typically installed by default)
systool -c fc_transport -v # Linux (not typically installed by default)
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/mmfsadm
14
Scenario 2:
Spectrum Scale cannot find its NSDs.
Typical recovery steps:
1. mmnsddiscover -a -N all
2. mmlsnsd -X to check whether there are still any "device not found" problems
3. If all the disks can be found, try mmchdisk start -a;
otherwise, confirm whether the missing disks can be repaired,
and if not, run mmchdisk start -d "all_unrecovered/down_disks" excluding the disks that were
not found.
Note: This last step is an important difference. mmchdisk start -a is not the same as running with
a specific, reduced list of disks. In a replicated file system, mmchdisk start -a can still fail due to
a missing disk, whereas mmchdisk start -d may be able to succeed and restore file system
access.
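A hedged, concrete version of these steps (the file system and NSD names are illustrative):
mmnsddiscover -a -N all # re-discover NSD paths on all nodes
mmlsnsd -X # confirm which NSDs now have a device and which are still not found
mmchdisk gpfs0 start -d "nsd1;nsd2" # start only the disks that were found, leaving out the missing one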
15
Scenario 3:
The cluster is expelling nodes and lost quorum
Unexpected expels are often reported after a quorum node with a hardware failure (e.g.
motherboard or OS disk) is repaired and re-introduced to the cluster. Customers will often
restore the Spectrum Scale configuration files (mmsdrfs, etc.) using mmsdrrestore, but the
operating system configuration is not always what it should be.
Common causes:
1) Mismatched MTU size: jumbo frames enabled on some or all nodes but not on the
network switch. Result: dropped packets and expels.
2) Firewalls running or misconfigured. The RHEL iptables firewall will block ephemeral ports by
default. Nodes in this state may be able to join the cluster, but as soon as a client attempts
to mount the file system, expels will occur.
3) Old adapter firmware levels and/or old OFED software are in use.
4) OS-specific (TCP/IP, memory) tuning has not been re-applied.
5) The high speed InfiniBand network isn’t being used (RDMA failed to start).
See the sketch below for quick checks covering these causes.
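A hedged set of quick checks on the rejoining node (interface names are illustrative; adapt to your environment):
ip link show eth0 | grep mtu # 1) MTU must match the other nodes and the switch
iptables -S | grep -E '1191|DROP|REJECT' # 2) firewall rules affecting the daemon and ephemeral ports
ofed_info -s # 3) installed OFED level, if applicable
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem # 4) confirm TCP tuning was re-applied
mmlsconfig verbsRdma # 5) confirm RDMA is enabled in the configuration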
16
Scenario 3:
The cluster is expelling nodes and lost quorum
The cluster manager node receives a request to expel a node and must decide what action
to take. (Expel the requestor or the requestee?)
Assuming we (the cluster manager) have evidence that both nodes are still up, give
preference to
1. quorum nodes over non-quorum nodes
2. local nodes over remote nodes
3. manager-capable nodes over non-manager-capable nodes
4. nodes managing more FSs over nodes managing fewer FSs
5. NSD server over non-NSD server
Otherwise, expel whichever node joined the cluster more recently.
After all these criteria are applied, a user exit script is also given a chance
to reverse the decision.
17
Scenario 3:
The cluster is expelling nodes and lost quorum
When reintroducing nodes back into the cluster, first verify that two-way communication
succeeds between the node and all other nodes. This means more than just checking that SSH
works. Use mmnetverify (new in 4.2.2, but it also requires a minReleaseLevel update) or
system commands such as nmap, or even a rudimentary telnet (if other tools cannot be
used), to ensure that port 1191 is reachable and that ephemeral ports are not blocked.
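If mmnetverify is not available at your level, a rudimentary sketch using system tools (node names are illustrative; repeat the checks in both directions):
nmap -p 1191 nodeB # the GPFS daemon port should be reported open
telnet nodeB 1191 # fallback if nmap is not installed; escape with Ctrl-] then "quit"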
18
Scenario 3:
The cluster is expelling nodes and lost quorum
Add the node back in as a client node first. Quorum and manager nodes are given priority
in the expel logic. After you bring the node into the cluster with mmsdrrestore, reduce the
chances of a problem by changing the node designation with mmchnode --nonquorum
--nomanager, if possible, before any mmstartup is done. Deleting the node from the cluster
and adding it back in as a client first is also an option. If the node is simply a client
when it’s added back into the cluster, it’s much less likely to cause any impact if trouble
arises. Tip: you might want to save the mmlsconfig output in case you had applied unique
configuration options to this node and need to re-apply them.
If the node’s GPFS configuration hasn’t been restored, deleting the node from the cluster
with mmdelnode will still succeed as long as it’s not ping-able. If you need to delete a node
that is still ping-able, contact support to verify that it’s safe to use the undocumented force flag.
Once it has been verified that the newly joined node is accessing the file system, mmchnode
can be used to add the quorum responsibility back online without an outage.
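A hedged sketch of that reintroduction sequence (node names are illustrative):
mmsdrrestore -p primaryConfigNode # run on the repaired node to pull the GPFS configuration
mmchnode --nonquorum --nomanager -N repairedNode # reduce its role before starting the daemon
mmstartup -N repairedNode
mmchnode --quorum --manager -N repairedNode # later, once file system access is verified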
19
Scenario 3:
The cluster is expelling nodes and lost quorum
Network adapters are often configured with ring buffers smaller than the supported maximums.
Increase the buffer sizes to help avoid frame loss and overruns.
Ring buffers on the NIC are important to handle bursts of incoming packets especially if
there is some delay when the hardware interrupt handler schedules the packet receiving
software interrupt (softirq). NIC ring buffer sizes vary per NIC vendor and NIC grade. By
increasing the Rx/Tx ring buffer size as shown below, you can decrease the probability of
discarding packets in the NIC during a scheduling delay. The Linux command used to
change ring buffer settings is ethtool.
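A hedged example with ethtool (the interface name and sizes are illustrative; the hardware maximums come from the -g output):
ethtool -g eth0 # show preset maximums and current ring buffer sizes
ethtool -G eth0 rx 4096 tx 4096 # raise the Rx/Tx rings toward the hardware maximum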
These settings will be lost after a reboot. To persist these changes across reboots
reference the NIC vendor documentation for the ring buffer parameter(s) to set in the NIC
device driver kernel module.
20
Scenario 3:
The cluster is expelling nodes and lost quorum
Network adapters are configured with less than the supported maximums.
In general these can be set as high as 2048 or 4096 entries but often default to only 256.
Additional reading:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-network-common-queue-issues.html
21
Scenario 3:
The cluster is expelling nodes and lost quorum
Make sure to review the mmfs logs of the cluster manager node and the newly joined node.
Without high speed network communication via RDMA, Spectrum Scale will
fall back to using the default daemon IP interface (typically just 1 Gbit), often resulting in
network overload issues and sometimes triggering false positives in deadlock data capture,
or even expels.
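A hedged way to confirm whether RDMA actually came up (the log path is the default location):
grep -i "verbs rdma" /var/adm/ras/mmfs.log.latest # look for messages indicating RDMA started or failed
mmlsconfig verbsRdma # confirm RDMA is enabled in the configuration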
22
Scenario 3:
The cluster is expelling nodes and lost quorum
Look for signs of problems in the mmfs logs, such as evidence that the system was
struggling to keep up with lease renewals, indicated by “6027-2725 Node xxxx lease renewal is
overdue. Pinging to check if it is alive” messages. Consider collecting system performance
data such as AIX PerfPMR or IBM’s lpcpu.
Linux lpcpu:
http://ibm.co/download-lpcpu
AIX PerfPMR:
http://www-01.ibm.com/support/docview.wss?uid=aixtools-42612263
Tuning the ping timers can also allow more time for latency. You can adjust the
MissedPingTimeout values to cover short network glitches, such as a central
network switch failover that takes longer than leaseRecoveryWait. This may prevent
false node-down conditions, but it will extend the time for node recovery to finish, which may
block progress on other nodes if the failing node held tokens for many shared files.
23
Scenario 3:
The cluster is expelling nodes and lost quorum
So if you believe these network or system problems are only temporary, and you do not
need fast failure detection, you can also consider increasing leaseRecoveryWait to
120 seconds. This will increase the time it takes for a failed node to reconnect to the cluster,
because it cannot reconnect until recovery is finished. Making this value smaller increases the risk
that there may be I/O still in flight from the failing node to the disk/controller when recovery
starts running. This may result in out-of-order I/Os between the FS manager and the dying
node.
Example commands:
mmchconfig minMissedPingTimeout=120 (default is 3)
mmchconfig maxMissedPingTimeout=120 (default is 60)
mmchconfig leaseRecoveryWait=120 (default is 35)
The mmfsd daemon needs to be refreshed for the changes to take effect. You can make
the change on one node, then use "mmchmgr -c" to force the cluster manager role to another node,
and make the change on the former cluster manager.
24
Scenario 3:
The cluster is expelling nodes and lost quorum
Ensure you’ve given some consideration in TCP/IP tuning for Spectrum Scale.
AFM recommendations:
https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adm_tuningbothnfsclientnfsserver.htm
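A hedged Linux sysctl sketch of the kind of settings those guides cover (the values are illustrative; size them per the linked documentation):
sysctl -w net.core.rmem_max=16777216 # maximum socket receive buffer
sysctl -w net.core.wmem_max=16777216 # maximum socket send buffer
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216" # TCP receive buffer min/default/max
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216" # TCP send buffer min/default/max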
25
Scenario 3:
The cluster is expelling nodes and lost quorum
https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.ins.doc/bl1ins_suse.htm?cp=STXKQY
Note: Some customers have been able to get away with as little as 1 to 2% depending on
the configuration and workload.
26
Scenario 4:
Performance delays
– The workerThreads parameter controls an integrated group of variables that tune the file
system performance in environments that are capable of high sequential and random
read and write workloads and small file activity.
– This variable controls both internal and external variables. The internal variables include
maximum settings for concurrent file operations, for concurrent threads that flush dirty
data and metadata, and for concurrent threads that prefetch data and metadata.
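A hedged example (the value is illustrative and should be sized to the workload and node resources; the setting typically takes effect after the daemon is restarted on the affected nodes):
mmchconfig workerThreads=512 # illustrative value; applies cluster-wide
mmlsconfig workerThreads # verify the setting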
27
Scenario 5:
Data capture Best practices
28
Scenario 5:
Data capture Best practices
2) Limit the nodes on which data is collected using the '-N' flag to gpfs.snap. By default data
will be collected on all nodes, with additional master data (cluster-aware commands) being
collected from the initiating node.
For a problem such as a failure on a given node (this could be a transient
condition, e.g. a node being temporarily expelled from the cluster), a good starting
point might be to collect data on just the failing node. If we had a failure on two nodes, say
three days ago, we might limit data collection to the two failing nodes and only collect data from the
last three days, e.g.:
gpfs.snap -N service5,service6 --limit-large-files 3
Note: Please avoid using the -z flag on gpfs.snap unless you are supplementing an existing master
snap or are unable to run a master snap.
29
Scenario 5:
Data capture Best practices
3) To clean up old data over time, it's recommended that gpfs.snap be run occasionally with
the '--purge-files' flag to clean up 'large debug files' that are over the specified number of days
old.
gpfs.snap --purge-files KeepNumberOfDaysBack
Specifies that large debug files will be deleted from the cluster nodes based on the
KeepNumberOfDaysBack value. If 0 is specified, all of the large debug files will be deleted. If
a value greater than 0 is specified, large debug files that are older than the number of days
specified will be deleted. For example, if the value 2 is specified, the previous two days of
large debug files are retained.
This option is not compatible with many of the gpfs.snap options because it only removes
files and does not collect any gpfs.snap data.
The 'debug files' referred to above are typically stored in the /tmp/mmfs directory, but this
location can be changed by setting the GPFS 'dataStructureDump' configuration
parameter, e.g.:
mmchconfig dataStructureDump=/name_of_some_other_big_file_system
30
Scenario 5:
Data capture Best practices
Note that this state information (possibly large amounts of data in the form of GPFS dumps
and traces) can be dumped automatically as part of GPFS's first failure data capture
mechanisms, and can accumulate in the (default /tmp/mmfs) directory defined by the
dataStructureDump configuration parameter. It is recommended that a cron job (such as
/etc/cron.daily/tmpwatch) be used to remove dataStructureDump directory data that is older
than two weeks, and that such data be collected (e.g. via gpfs.snap) within two weeks of
encountering any problem requiring investigation.
This cleaning up of debug data can also be accomplished by gpfs.snap with the '--purge-
files' flag. For example, once a week, the following cron job could be used to clean up debug
files that are older than one week:
/usr/lpp/mmfs/bin/gpfs.snap --purge-files 7
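For example, a root crontab entry along these lines (the schedule is illustrative) would run the cleanup every Sunday at 03:00:
0 3 * * 0 /usr/lpp/mmfs/bin/gpfs.snap --purge-files 7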
31
Scenario 6:
mmccr internals (at your own risk)
This will rebuild the configuration files on all nodes to match the
CCR repository.
TIP: Don’t disable CCR on a cluster with Protocols enabled unless you
are prepared to re-configure.
Additional files typically stored in CCR include, but are not limited to:
gpfs.install.clusterdefinition.txt, cesiplist, smb.ctdb.nodes,
gpfs.ganesha.main.conf, gpfs.ganesha.nfsd.conf, gpfs.ganesha.log.conf,
gpfs.ganesha.exports.conf, gpfs.ganesha.statdargs.conf, idmapd.conf,
authccr, KRB5_CONF, _callhomeconfig, clusterEvents, protocolTraceList,
gui, gui_jobs
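To see what is actually stored in your cluster's CCR, a hedged, read-only example using the undocumented mmccr command (at your own risk, per the slide title):
mmccr flist # list the files committed to the CCR repository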
33
Spectrum Scale Announce forums
Monitor the Announce forums for news on the latest problems fixed, technotes, security
bulletins and Flash advisories.
https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000001606&ps=25
34
Additional Resources
Tuning parameters change history:
https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adm_changehistory.htm?cp=STXKQY
ESS best practices:
https://www.ibm.com/support/knowledgecenter/en/SSYSP8_3.5.0/com.ibm.spectrum.scale.raid.v4r11.adm.doc/bl1adv_planning.htm
Tuning Parameters:
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/Tuning%20Parameters
Share Nothing Environment Tuning Parameters:
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20%28GPFS%29/page/IBM%20Spectrum%20Scale%20Tuning%20Recommendations%20for%20Shared%20Nothing%20Environments
Further Linux System Tuning:
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Welcome%20to%20High%20Performance%20Computing%20(HPC)%20Central/page/Linux%20System%20Tuning%20Recommendations
35
THANK YOU!
Brian Yaeger
Email: [email protected]
March 2017