HACMP Troubleshooting Guide
Version 5.4.1
SC23-5177-04
Contents
Chapter 2: Using Cluster Log Files
    Viewing HACMP Cluster Log Files
    Tracking Resource Group Parallel and Serial Processing in the hacmp.out File
        Serial Processing Order Reflected in Event Summaries
        Parallel Processing Order Reflected in Event Summaries
        Job Types: Parallel Resource Group Processing
        Disk Fencing with Serial or Parallel Processing
        Processing in Clusters with Dependent Resource Groups or Sites
    Managing a Node's HACMP Log File Parameters
    Logging for clcomd
    Redirecting HACMP Cluster Log Files
Chapter 3: Investigating System Components and Solving Common Problems
    Checking HACMP Components
    Checking for Cluster Configuration Problems
    Checking a Cluster Snapshot File
    Checking the Logical Volume Manager
    Checking Volume Group Definitions
    Checking the Varyon State of a Volume Group
    Checking Physical Volumes
    Checking File Systems
    Checking Mount Points, Permissions, and File System Information
    Checking the TCP/IP Subsystem
    Checking Point-to-Point Connectivity
    Checking the IP Address and Netmask
    Checking Heartbeating over IP Aliases
    Checking ATM Classic IP Hardware Addresses
    Checking the AIX Operating System
    Checking Physical Networks
    Checking Disks, Disk Adapters, and Disk Heartbeating Networks
    Recovering from PCI Hot Plug NIC Failure
    Checking Disk Heartbeating Networks
    Checking the Cluster Communications Daemon
    Checking System Hardware
    HACMP Installation Issues
    Cannot Find File System at Boot Time
    cl_convert Does Not Run Due to Failed Installation
    Configuration Files Could Not Be Merged During Installation
    HACMP Startup Issues
    ODMPATH Environment Variable Not Set Correctly
    clinfo Daemon Exits after Starting
    Node Powers Down; Cluster Manager Will Not Start
    configchk Command Returns an Unknown Host Message
    Cluster Manager Hangs during Reconfiguration
    clcomdES and clstrmgrES Fail to Start on Newly installed AIX Nodes
    Pre- or Post-Event Does Not Exist on a Node after Upgrade
    Node Fails During Configuration with 869 LED Display
    Node Cannot Rejoin Cluster after Being Dynamically Removed
    Resource Group Migration Is Not Persistent after Cluster Startup
    SP Cluster Does Not Startup after Upgrade to HACMP 5.4.1
    Verification Problems When Nodes Have Different Fileset Levels
    Disk and File System Issues
    AIX Volume Group Commands Cause System Error Reports
    Verification Fails on Clusters with Disk Heartbeating Networks
    varyonvg Command Fails on a Volume Group
    cl_nfskill Command Fails
    cl_scdiskreset Command Fails
    fsck Command Fails at Boot Time
    System Cannot Mount Specified File Systems
    Cluster Disk Replacement Process Fails
    Automatic Error Notification Fails with Subsystem Device Driver
    File System Change Not Recognized by Lazy Update
    Network and Switch Issues
    Unexpected Network Interface Failure in Switched Networks
    Cluster Nodes Cannot Communicate
    Distributed SMIT Causes Unpredictable Results
    Token-Ring Network Thrashes
    System Crashes Reconnecting MAU Cables after a Network Failure
    TMSCSI Will Not Properly Reintegrate when Reconnecting Bus
    Recovering from PCI Hot Plug NIC Failure
    Unusual Cluster Events Occur in Non-Switched Environments
    Cannot Communicate on ATM Classic IP Network
    Cannot Communicate on ATM LAN Emulation Network
    IP Label for HACMP Disconnected from AIX Interface
    TTY Baud Rate Setting Wrong
    First Node Up Gives Network Error Message in hacmp.out
    Network Interface Card and Network ODMs Out of Sync with Each Other
    Non-IP Network, Network Adapter or Node Failures
    Networking Problems Following HACMP Fallover
    Packets Lost during Data Transmission
    Verification Fails when Geo Networks Uninstalled
    Missing Entries in the /etc/hosts for the netmon.cf File May Prevent RSCT from Monitoring Networks
    Cluster Communications Issues
    Message Encryption Fails
    Cluster Nodes Do Not Communicate with Each Other
    HACMP Takeover Issues
    varyonvg Command Fails During Takeover
    Highly Available Applications Fail
    Node Failure Detection Takes Too Long
    HACMP Selective Fallover Is Not Triggered by a Volume Group Loss of Quorum Error in AIX
    Group Services Sends GS_DOM_MERGE_ER Message
    cfgmgr Command Causes Unwanted Behavior in Cluster
    Releasing Large Amounts of TCP Traffic Causes DMS Timeout
    Deadman Switch Causes a Node Failure
    Deadman Switch Time to Trigger
    A "device busy" Message Appears after node_up_local Fails
    Network Interfaces Swap Fails Due to an rmdev "device busy" Error
    MAC Address Is Not Communicated to the Ethernet Switch
    Client Issues
    Network Interface Swap Causes Client Connectivity Problem
    Clients Cannot Access Applications
    Clients Cannot Find Clusters
    Clinfo Does Not Appear to Be Running
    Clinfo Does Not Report That a Node Is Down
    Miscellaneous Issues
    Limited Output when Running the tail -f Command on /var/hacmp/log/hacmp.out
    CDE Hangs after IPAT on HACMP Startup
    Cluster Verification Gives Unnecessary Message
    config_too_long Message Appears
    Console Displays SNMP Messages
    Device LEDs Flash "888" (System Panic)
    Unplanned System Reboots Cause Fallover Attempt to Fail
    Deleted or Extraneous Objects Appear in NetView Map
    F1 Does not Display Help in SMIT Panels
    /usr/es/sbin/cluster/cl_event_summary.txt File (Event Summaries Display) Grows Too Large
    View Event Summaries Does Not Display Resource Group Information as Expected
    Application Monitor Problems
    Cluster Disk Replacement Process Fails
    Resource Group Unexpectedly Processed Serially
    rg_move Event Processes Several Resource Groups at Once
    File System Fails to Unmount
    Dynamic Reconfiguration Sets a Lock
    WebSMIT Does Not See the Cluster
    Problems with WPAR-Enabled Resource Group
Version 5.4.1: Troubleshooting Guide
Version 5.4: Troubleshooting Guide
Version 5.3 (last update 7/2006): Troubleshooting Guide
Version 5.3 (update 8/2005): Troubleshooting Guide
Version 5.3: Troubleshooting Guide
Version 5.2 (last update 10/2005): Administration and Troubleshooting Guide
- IBM System p system components (including disk devices, cabling, and network adapters)
- The AIX operating system, including the Logical Volume Manager subsystem
- The System Management Interface Tool (SMIT)
- Communications, including the TCP/IP subsystem.
Highlighting
This guide uses the following highlighting conventions:
Italic: Identifies new terms or concepts, or indicates emphasis.
Bold: Identifies routines, commands, keywords, files, directories, menu items, and other items whose actual names are predefined by the system.
Monospace: Identifies examples of specific data values, examples of text similar to what you might see displayed, examples of program code similar to what you might write as a programmer, messages from the system, or information that you should actually type.
ISO 9000
ISO 9000 registered quality systems were used in the development and manufacturing of this product.
HACMP Publications
The HACMP software comes with the following publications:
- HACMP for AIX Release Notes in /usr/es/sbin/cluster/release_notes describe issues relevant to HACMP on the AIX platform: latest hardware and software requirements, last-minute information on installation, product usage, and known issues.
- HACMP on Linux Release Notes in /usr/es/sbin/cluster/release_notes.linux/ describe issues relevant to HACMP on the Linux platform: latest hardware and software requirements, last-minute information on installation, product usage, and known issues.
- HACMP for AIX: Administration Guide, SC23-4862
- HACMP for AIX: Concepts and Facilities Guide, SC23-4864
- HACMP for AIX: Installation Guide, SC23-5209
- HACMP for AIX: Master Glossary, SC23-4867
- HACMP for AIX: Planning Guide, SC23-4861
- HACMP for AIX: Programming Client Applications, SC23-4865
- HACMP for AIX: Troubleshooting Guide, SC23-5177
- HACMP on Linux: Installation and Administration Guide, SC23-5211
- HACMP for AIX: Smart Assist Developer's Guide, SC23-5210
- IBM International Program License Agreement.
HACMP/XD Publications
The HACMP Extended Distance (HACMP/XD) software solutions for disaster recovery, added to the base HACMP software, enable a cluster to operate over extended distances at two sites. HACMP/XD publications include the following:
- HACMP/XD for Geographic LVM (GLVM): Planning and Administration Guide, SA23-1338
- HACMP/XD for HAGEO Technology: Concepts and Facilities Guide, SC23-1922
- HACMP/XD for HAGEO Technology: Planning and Administration Guide, SC23-1886
- HACMP/XD for Metro Mirror: Planning and Administration Guide, SC23-4863.
HACMP Smart Assist Publications
- HACMP Smart Assist for DB2 User's Guide, SC23-5179
- HACMP Smart Assist for Oracle User's Guide, SC23-5178
- HACMP Smart Assist for WebSphere User's Guide, SC23-4877
- HACMP for AIX 5L: Smart Assist Developer's Guide, SC23-5210
- HACMP Smart Assist Release Notes.
- RS/6000 SP High Availability Infrastructure, SG24-4838
- IBM AIX v.5.3 Security Guide, SC23-4907
- IBM Reliable Scalable Cluster Technology for AIX and Linux: Group Services Programming Guide and Reference, SA22-7888
- IBM Reliable Scalable Cluster Technology for AIX and Linux: Administration Guide, SA22-7889
- IBM Reliable Scalable Cluster Technology for AIX: Technical Reference, SA22-7890
- IBM Reliable Scalable Cluster Technology for AIX: Messages, GA22-7891.
Accessing Publications
Use the following Internet URLs to access online libraries of documentation:
- AIX, IBM eServer pSeries, and related products: http://www.ibm.com/servers/aix/library
- AIX v.5.3 publications: http://www.ibm.com/servers/eserver/pseries/library/
- WebSphere Application Server publications: search the IBM website to access the WebSphere Application Server Library
- DB2 Universal Database Enterprise Server Edition publications: http://www.ibm.com/cgi-bin/db2www/data/db2/udb/winos2unix/support/v8pubs.d2w/en_main#V8PDF
- Tivoli Directory Server publications: http://publib.boulder.ibm.com/tividd/td/IBMDirectoryServer5.1.html
- Title and order number of this book
- Page number or topic related to your comment.
When you send information to IBM, you grant IBM a nonexclusive right to use or distribute the information in any way it believes appropriate without incurring any obligation to you.
Trademarks
The following terms are trademarks of International Business Machines Corporation in the United States or other countries:
AFS, AIX, DFS, eServer Cluster 1600, Enterprise Storage Server, HACMP, IBM, NetView, RS/6000, Scalable POWERParallel Systems, Series p, Series x, Shark, SP, and WebSphere.
Red Hat Enterprise Linux (RHEL), SUSE Linux Enterprise Server, RPM Package Manager for Linux, and other Linux trademarks.
UNIX is a registered trademark in the United States and other countries and is licensed exclusively through The Open Group. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Other company, product, and service names may be trademarks or service marks of others.
Chapter 1:
This chapter presents the recommended troubleshooting strategy for an HACMP cluster. It describes the problem determination tools available from the HACMP main SMIT menu. This guide also includes information on tuning the cluster for best performance, which can help you avoid some common problems. For details on how to use the various log files to troubleshoot the cluster, see Chapter 2: Using Cluster Log Files. For hints on how to check system components if using the log files does not help with the problem, and for a list of solutions to common problems that may occur in an HACMP environment, see Chapter 3: Investigating System Components and Solving Common Problems. For information specific to RSCT daemons and diagnosing RSCT problems, see the following IBM publications:
- IBM Reliable Scalable Cluster Technology for AIX and Linux: Group Services Programming Guide and Reference, SA22-7888
- IBM Reliable Scalable Cluster Technology for AIX and Linux: Administration Guide, SA22-7889
- IBM Reliable Scalable Cluster Technology for AIX: Technical Reference, SA22-7890
- IBM Reliable Scalable Cluster Technology for AIX: Messages, GA22-7891.
Note: This chapter presents the default locations of log files. If you redirected any logs, check the appropriate location. For additional information, see Chapter 2: Using Cluster Log Files. The main sections of this chapter include:
- Troubleshooting an HACMP Cluster Overview
- Using the Problem Determination Tools
- Configuring Cluster Performance Tuning
- Resetting HACMP Tunable Values
- Sample Custom Scripts.
Troubleshooting an HACMP cluster typically involves the following steps:
- Becoming aware that a problem exists
- Determining the source of the problem
- Correcting the problem.
You usually become aware of a cluster problem in one of the following ways:
- End user complaints, because they are not able to access an application running on a cluster node
- One or more error messages displayed on the system console or in another monitoring program.
There are other ways you can be notified of a cluster problem: through mail notification, pager notification, or text messaging.
- Mail Notification. Although HACMP standard components do not send mail to the system administrator when a problem occurs, you can create a mail notification method as a pre- or post-event to run before or after an event script executes. In an HACMP cluster environment, mail notification is effective and highly recommended. See the Planning Guide for more information.
- Remote Notification. You can also define a notification method (numeric or alphanumeric page, or a text-messaging notification to any address, including a cell phone) through the SMIT interface to issue a customized response to a cluster event. For more information, see the chapter on customizing cluster events in the Planning Guide.
- Pager Notification. You can send messages to a pager number on a given event. You can send textual information to pagers that support text display (alphanumeric page), and numerical messages to pagers that only display numbers.
- Text Messaging. You can send cell phone text messages using a standard data modem and telephone land line through the standard Telocator Alphanumeric Protocol (TAP); your provider must support this service. You can also issue a text message using a Falcom-compatible GSM modem to transmit SMS (Short Message Service) text-message notifications wirelessly. SMS messaging requires an account with an SMS service provider. GSM modems take TAP modem protocol as input through an RS232 line or USB line, and send the message wirelessly to the provider's cell phone tower. The provider forwards the message to the addressed cell phone. Each provider has a Short Message Service Center (SMSC).
For each person, define remote notification methods that contain all the events and nodes so you can switch the notification methods as a unit when responders change.
Note: Manually distribute each message file to each node. HACMP does not automatically distribute the file to other nodes during synchronization unless the File Collections utility is set up specifically to do so. See the Managing HACMP File Collections section in Chapter 7: Verifying and Synchronizing a Cluster Configuration of the Administration Guide.
If all else fails, stop the HACMP cluster services on all cluster nodes. Then, manually start the application that the HACMP cluster event scripts were attempting to start and run the application without the HACMP software. This may require varying on volume groups, mounting file systems, and enabling IP addresses. With the HACMP cluster services stopped on all cluster nodes, correct the conditions that caused the initial problem.
- tar archive of directory /var/hacmp
- /usr/sbin/rsct/bin/phoenix.snap
- tar archives of directories /etc/es/objrepos and /usr/es/sbin/cluster/etc/objrepos/active
- snap -cgGLt
For more information on the snap command, see the AIX Version 6.1 Commands Reference, Volume 5.
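If you are collecting this material manually before contacting support, the following ksh commands show one way to do it. This is only a sketch: the staging directory /tmp/hacmp.support and the archive names are arbitrary choices, not HACMP conventions.

    mkdir -p /tmp/hacmp.support
    tar -cvf /tmp/hacmp.support/var_hacmp.tar /var/hacmp         # tar archive of directory /var/hacmp
    /usr/sbin/rsct/bin/phoenix.snap                              # collect RSCT diagnostic data
    tar -cvf /tmp/hacmp.support/objrepos.tar \
        /etc/es/objrepos /usr/es/sbin/cluster/etc/objrepos/active   # HACMP configuration databases
    snap -cgGLt                                                  # AIX system snapshot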
- Local HACMP cluster running HACMP 5.2 or greater
- Cluster worksheets file created from SMIT or from Online Planning Worksheets.
You can use a worksheets file to view information for a cluster configuration and to troubleshoot cluster problems. The Online Planning Worksheets application lets you review definition details on the screen in an easy-to-read format and lets you create a printable formatted report. WARNING: Although you can import a cluster definition and save it, some of the data is informational only. Making changes to informational components does not change the actual configuration on the system if the worksheets file is exported. For information about informational components in a worksheets file, see the section Entering Data in Chapter 9: Using Online Planning Worksheets in the Planning Guide.
Note: Cluster definition files and their manipulation in the Online Planning Worksheets application supplement, but do not replace, cluster snapshots. For more information about using cluster definition files in the Online Planning Worksheets application, see Chapter 9: Using Online Planning Worksheets in the Planning Guide.
- clRGinfo provides information about resource groups, which is useful for troubleshooting purposes. For more information, see Chapter 10: Monitoring an HACMP Cluster in the Administration Guide.
- clstat reports the status of key cluster components: the cluster itself, the nodes in the cluster, the network interfaces connected to the nodes, the service labels, and the resource groups on each node. For more information, see Chapter 10: Monitoring an HACMP Cluster in the Administration Guide.
- clsnapshot allows you to save in a file a record of all the data that defines a particular cluster configuration. For more information, see the sections Using the Cluster Snapshot Utility to Check Cluster Configuration and Creating (Adding) a Cluster Snapshot in Chapter 18: Saving and Restoring Cluster Configurations in the Administration Guide.
- The cldisp utility displays resource groups and their startup, fallover, and fallback policies. For more information, see Chapter 10: Monitoring an HACMP Cluster in the Administration Guide.
- SMIT Problem Determination Tools; for information, see the section Using the Problem Determination Tools in this chapter.
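For a quick first look from the command line, these utilities can be run directly. The example assumes the HACMP binaries (typically installed under /usr/es/sbin/cluster) are in your PATH; no flags are shown because the available options vary by release.

    clRGinfo    # resource group states and current locations
    clstat      # status of the cluster, nodes, network interfaces, and resource groups
    cldisp      # resource groups and their startup, fallover, and fallback policies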
The Problem Determination Tools menu includes the following options:
- HACMP Verification
- Viewing Current State
- HACMP Log Viewing and Management
- Recovering from HACMP Script Failure
- Restoring HACMP Configuration Database from an Active Configuration
- Release Locks Set by Dynamic Reconfiguration
- Clear SSA Disk Fence Registers
- HACMP Cluster Test Tool
- HACMP Trace Facility
- HACMP Event Emulation
- HACMP Error Notification
- Opening a SMIT Session on a Node.
HACMP Verification
Select this option from the Problem Determination Tools menu to verify that the configuration on all nodes is synchronized, set up a custom verification method, or set up automatic cluster verification.
Verify HACMP Configuration: Select this option to verify cluster topology and resources.
Configure Custom Verification Method: Use this option to add, show, and remove custom verification methods.
Automatic Cluster Configuration Monitoring: Select this option to automatically verify the cluster every twenty-four hours and report results throughout the cluster.
Output file: Enter the name of an output file in which to store verification output. By default, verification output is also stored in the /usr/es/sbin/cluster/wsm/logs/wsm_smit.log file.
Verify changes only: Select no to run all verification checks that apply to the current cluster configuration. Select yes to run only the checks related to parts of the HACMP configuration that have changed. The yes mode has no effect on an inactive cluster. Note: The yes option only relates to cluster Configuration Databases. If you have made changes to the AIX configuration on your cluster nodes, select no. Only select yes if you have made no changes to the AIX configuration.
Logging
Selecting on displays all output to the console that normally goes to /var/hacmp/clverify/clverify.log. The default is off.
- Name of the node where verification was run
- Date and time of the last verification
- Results of the verification.
This information is stored on every available cluster node in the HACMP log file /var/hacmp/log/clutils.log. If the selected node became unavailable or could not complete cluster verification, you can detect this by the lack of a report in the /var/hacmp/log/clutils.log file. If cluster verification completes and detects some configuration errors, you are notified about the following potential problems:
- The exit status of cluster verification is communicated across the cluster along with the information about cluster verification process completion.
- Broadcast messages are sent across the cluster and displayed on stdout. These messages inform you about detected configuration errors.
- A cluster_notify event runs on the cluster and is logged in hacmp.out (if cluster services is running).
More detailed information is available on the node that completes cluster verification in /var/hacmp/clverify/clverify.log. If a failure occurs during processing, error messages and warnings clearly indicate the node and reasons for the verification failure.

Configuring Automatic Verification and Monitoring of Cluster Configuration
Make sure the /var filesystem on the node has enough space for the /var/hacmp/log/clutils.log file. For additional information, see the section The Size of the /var Filesystem May Need to be Increased in Chapter 10: Monitoring an HACMP Cluster in the Administration Guide.
To configure the node and specify the time when cluster verification runs automatically:
1. Enter smit hacmp
2. In SMIT, select Problem Determination Tools > HACMP Verification > Automatic Cluster Configuration Monitoring.
3. Enter field values as follows:
   * Automatic cluster configuration verification: Enabled is the default.
   Node name: Select one of the cluster nodes from the list. By default, the first node in alphabetical order will verify the cluster configuration. This node will be determined dynamically every time the automatic verification occurs.
   Hour: Midnight (00) is the default. Verification runs automatically once every 24 hours at the selected hour.
4. Press Enter.
5. The changes take effect when the cluster is synchronized.
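To confirm that automatic verification actually ran on a given day, you can inspect the clutils.log file described above. This is an illustrative check only; the exact timestamp format in the log may differ on your system.

    tail -50 /var/hacmp/log/clutils.log                    # most recent verification activity on this node
    grep "$(date +%m/%d/%Y)" /var/hacmp/log/clutils.log    # entries written today (adjust the date format as needed)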
The HACMP Log Viewing and Management menu includes the following options:
- View, save or delete Event summaries
- View detailed HACMP log files
- Change or show HACMP log file parameters
- Change or show Cluster Manager log file parameters
- Change or show a cluster log file directory
- Change all Cluster Logs directory
- Collect cluster log files for problem reporting.
See Chapter 2: Using Cluster Log Files and Chapter 8: Testing an HACMP Cluster in the Administration Guide for complete information.
- Run only one instance of the event emulator at a time. If you attempt to start a new emulation in a cluster while an emulation is already running, the integrity of the results cannot be guaranteed. Each emulation is a stand-alone process; one emulation cannot be based on the results of a previous emulation.
- clinfoES must be running on all nodes.
- Add a cluster snapshot before running an emulation, just in case uncontrolled cluster events happen during emulation. Instructions for adding cluster snapshots are in the chapter on Saving and Restoring Cluster Configurations in the Administration Guide.
- The Event Emulator can run only event scripts that comply with the currently active configuration. For example:
- The Emulator expects to see the same environmental arguments used by the Cluster Manager; if you define arbitrary arguments, the event scripts will run, but error reports will result.
- In the case of swap_adapter, you must enter the ip_label supplied for service and non-service interfaces in the correct order, as specified in the usage statement. Both interfaces must be located on the same node at emulation time. Both must be configured as part of the same HACMP logical network.
For other events, the same types of restrictions apply. If errors occur during emulation, recheck your configuration to ensure that the cluster state supports the event to be emulated.
The Event Emulator runs customized scripts (pre- and post-event scripts) associated with an event, but does not run commands within these scripts. Therefore, if these customized scripts change the cluster configuration when actually run, the outcome may differ from the outcome of an emulation. When emulating an event that contains a customized script, the Event Emulator uses the ksh flags -n and -v. The -n flag reads commands and checks them for syntax errors, but does not execute them. The -v flag indicates verbose mode. When writing customized scripts that may be accessed during an emulation, be aware that the other ksh flags may not be compatible with the -n flag and may cause unpredictable results during the emulation. See the ksh man page for flag descriptions.
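You can run the same kind of syntax check yourself before starting an emulation, to see what the Event Emulator will and will not execute. The script path below is hypothetical.

    ksh -n -v /usr/local/cluster/events/my_post_event.sh   # report syntax errors without running any commands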
Emulating a Node Down Event
To emulate a Node Down event:
1. Select Node Down Event from the HACMP Event Emulation panel. SMIT displays the panel.
2. Enter field data as follows:
   Node Name: Enter the node to use in the emulation.
   Node Down Mode: Indicate the type of shutdown to emulate:
   - Bring Resource Groups Offline. The node that is shutting down releases its resources. The other nodes do not take over the resources of the stopped node.
   - Move Resource Groups. The node that is shutting down releases its resources. The other nodes do take over the resources of the stopped node.
   - Unmanage Resource Groups. HACMP shuts down immediately. The node that is shutting down retains control of all its resources. Applications that do not require HACMP daemons continue to run. Typically, you use the UNMANAGE option so that stopping the Cluster Manager does not interrupt users and clients. Note that enhanced concurrent volume groups do not accept the UNMANAGE option if they are online.
3. Press Enter to start the emulation.

Emulating a Network Up Event
To emulate a Network Up event:
1. From the HACMP Event Emulation panel, select Network Up Event. SMIT displays the panel.
2. Enter field data as follows:
   Network Name: Enter the network to use in the emulation.
   Node Name: (Optional) Enter the node to use in the emulation.
3. Press Enter to start the emulation.

Emulating a Network Down Event
To emulate a Network Down event:
1. From the HACMP Event Emulation panel, select Network Down Event. SMIT displays the panel.
2. Enter field data as follows:
   Network Name: Enter the network to use in the emulation.
   Node Name: (Optional) Enter the node to use in the emulation.
3. Press Enter to start the emulation.
Emulating a Fail Standby Event
To emulate a Fail Standby event:
1. Select Fail Standby Event from the HACMP Event Emulation panel. SMIT displays the Fail Standby Event panel.
2. Enter field data as follows:
   Node Name: Enter the node to use in the emulation.
   IP Label: Enter the IP label to use in the emulation.
3. Press Enter to start the emulation.
The following messages are displayed on all active cluster nodes when emulating the Fail Standby and Join Standby events:
Adapter $ADDR is no longer available for use as a standby, due to either a standby adapter failure or IP address takeover.
Standby adapter $ADDR is now available.
Emulating a Join Standby Event
To emulate a Join Standby event:
1. From the HACMP Event Emulation panel, select Join Standby Event. SMIT displays the Join Standby Event panel.
2. Enter field data as follows:
   Node Name: Enter the node to use in the emulation.
   IP Label: Enter the IP label to use in the emulation.
3. Press Enter to start the emulation.

Emulating a Swap Adapter Event
To emulate a Swap Adapter event:
1. From the HACMP Event Emulation panel, select Swap Adapter Event. SMIT displays the Swap Adapter Event panel.
2. Enter field data as follows:
   Node Name: Enter the node to use in the emulation.
   Network Name: Enter the network to use in the emulation.
   Boot Time IP Label of available Network Interface: The name of the IP label to swap. The Boot-time IP Label must be available on the node on which the emulation is taking place.
   Service Label to Move: The name of the Service IP label to swap. It must be on, and available to, the same node as the Boot-time IP Label.
3. Press Enter to start the emulation.
- AIX high and low watermarks for I/O pacing
- AIX syncd frequency rate.
Set the two AIX parameters on each cluster node. You may also set the following HACMP network tuning parameters for each type of network:
You can configure these related parameters directly from HACMP SMIT. Network module settings are propagated to all nodes when you set them on one node and then synchronize the cluster topology.
Although the most efficient high- and low-water marks vary from system to system, an initial high-water mark of 33 and a low-water mark of 24 provide a good starting point. These settings only slightly reduce write times and consistently generate correct fallover behavior from the HACMP software. See the AIX Performance Monitoring & Tuning Guide for more information on I/O pacing.
To change the I/O pacing settings, do the following on each node:
1. Enter smit hacmp
2. In SMIT, select Extended Configuration > Extended Performance Tuning Parameters Configuration > Change/Show I/O Pacing and press Enter.
3. Configure the field values with the recommended HIGH and LOW watermarks:
   HIGH water mark for pending write I/Os per file: 33 is recommended for most clusters. Possible values are 0 to 32767.
   LOW water mark for pending write I/Os per file: 24 is recommended for most clusters. Possible values are 0 to 32766.
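As a point of reference, the same values can be checked (and, if you choose, set) with standard AIX commands against the sys0 device; this is shown only as a sketch, since the SMIT path above is the HACMP-documented way to change them.

    lsattr -El sys0 -a maxpout -a minpout       # display the current high and low water marks
    chdev -l sys0 -a maxpout=33 -a minpout=24   # set the recommended starting values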
Changing the Failure Detection Rate of a Network Module after the Initial Configuration
If you want to change the failure detection rate of a network module, either change the tuning parameters of a network module to predefined values of Fast, Normal and Slow, or set these attributes to custom values. Also, use the custom tuning parameters to change the baud rate for TTYs if you are using RS232 networks that might not handle the default baud rate of 38400. For more information, see the Changing the Configuration of a Network Module section in the chapter on Managing the Cluster Topology in the Administration Guide.
Before resetting HACMP tunable values, HACMP takes a cluster snapshot. After the values have been reset to defaults, if you want to go back to your customized cluster settings, you can restore them with the cluster snapshot. HACMP saves snapshots of the last ten configurations in the default cluster snapshot directory, /usr/es/sbin/cluster/snapshots, with the name active.x.odm, where x is a digit between 0 and 9, with 0 being the most recent. Stop cluster services on all nodes before resetting tunable values. HACMP prevents you from resetting tunable values in a running cluster.
In some cases, HACMP cannot differentiate between user-configured information and discovered information, and does not reset such values. For example, you may enter a service label and HACMP automatically discovers the IP address that corresponds to that label. In this case, HACMP does not reset the service label or the IP address. The cluster verification utility detects if these values do not match. The clsnapshot.log file in the snapshot directory contains log messages for this utility. If any of the following scenarios are run, then HACMP cannot revert to the previous configuration:
- cl_convert is run automatically
- cl_convert is run manually
- clconvert_snapshot is run manually. The clconvert_snapshot utility is not run automatically, and must be run from the command line to upgrade cluster snapshots when migrating from HACMP (HAS) to HACMP 5.1 or greater.
User-supplied information:
- Network module tuning parameters such as failure detection rate, grace period, and heartbeat rate. HACMP resets these parameters to their installation-time default values.
- Cluster event customizations, such as all changes to cluster events. Note that resetting changes to cluster events does not remove any files or scripts that the customization used; it only removes the knowledge HACMP has of pre- and post-event scripts.
- Cluster event rule changes made to the event rules database are reset to the installation-time default values.
- HACMP command customizations made to the default set of HACMP commands are reset to the installation-time defaults.
Automatically generated and discovered information. Generally, users cannot see this information. HACMP rediscovers or regenerates this information when cluster services are restarted or during the next cluster synchronization. HACMP resets the following:
- Local node names stored in the cluster definition database
- Netmasks for all cluster networks
- Netmasks, interface names, and aliases for disk heartbeating (if configured) for all cluster interfaces
- SP switch information generated during the latest node_up event (this information is regenerated at the next node_up event)
- Instance numbers and default log sizes for the RSCT subsystem.
- Making cron jobs highly available
- Making print queues highly available.
This ensures that the cron table for root has only the no-resource entries at system startup.
4. You can use either of two methods to activate the root.resource cron table. The first method is the simpler of the two.
- Run crontab root.resource as the last line of the application start script. In the application stop script, the first line should then be crontab root.noresource. By executing these commands in the application start and stop scripts, you ensure that the correct cron table is activated and deactivated on the proper node at the proper time.
- Run the crontab commands as a post-event to node_up_complete and node_down_complete.
Upon node_up_complete on the primary node, run crontab root.resource. On node_down_complete, run crontab root.noresource.
The takeover node must also use the event handlers to execute the correct cron table. Logic must be written into the node_down_complete event to determine if a takeover has occurred and to run the crontab root.resource command. On reintegration, a pre-event to node_up must determine if the primary node is coming back into the cluster and then run a crontab root.noresource command.
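As an illustration of the first method, the application server start and stop scripts can simply install the appropriate cron table. The file locations below are assumptions for this sketch; keep the two crontab files wherever suits your environment.

    # Last line of the application start script:
    crontab /usr/local/cluster/crontabs/root.resource     # activate the cron jobs that need the application

    # First line of the application stop script:
    crontab /usr/local/cluster/crontabs/root.noresource   # revert to the minimal cron table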
When HACMP Does Not Save the AIX Environment for node_up_complete events
If your scripts to start or stop application servers depend on any information in /etc/environment, you should explicitly define that information in the scripts. PATH and NLSPATH are two commonly needed variables that are not set to the values contained in /etc/environment during the execution of application start and stop scripts. For example, add this line to the application scripts:
export PATH=/usr/bin:/bin:/sbin:/usr/sbin:/usr/local/bin
- Stop the print queues
- Stop the print queue daemon
- Mount /prtjobs over /var/spool/lpd/qdir
- Mount /prtdata over /var/spool/qdaemon
- Restart the print queue daemon
- Restart the print queues.

In the event of a fallover, the surviving node will need to do the following:
- Stop the print queues
- Stop the print queue daemon
- Move the contents of /prtjobs into /var/spool/lpd/qdir
- Move the contents of /prtdata into /var/spool/qdaemon
- Restart the print queue daemon
- Restart the print queues.

To do this, write a script called as a post-event to node_down_complete on the takeover node. The script needs to determine if the node_down is from the primary node.
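One possible shape for that script follows. It is only a sketch: the primary node name is a placeholder, the assumption that the failed node's name is passed as the first argument must be checked against the event script conventions of your HACMP release, and stopping and re-enabling the individual print queues is omitted.

    #!/bin/ksh
    # Hypothetical post-event script for node_down_complete on the takeover node.
    PRIMARY=nodeA                      # node that normally runs the print queues
    FAILED_NODE=$1                     # assumed: name of the node that went down
    if [ "$FAILED_NODE" = "$PRIMARY" ]; then
        stopsrc -s qdaemon                              # stop the print queue daemon
        mv /prtjobs/* /var/spool/lpd/qdir 2>/dev/null   # pick up the queued job descriptions
        mv /prtdata/* /var/spool/qdaemon 2>/dev/null    # pick up the queued job data
        startsrc -s qdaemon                             # restart the print queue daemon
    fi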
Chapter 2: Using Cluster Log Files
This chapter explains how to use the HACMP cluster log files to troubleshoot the cluster. It also includes sections on managing parameters for some of the logs. Major sections of the chapter include:
- Viewing HACMP Cluster Log Files
- Tracking Resource Group Parallel and Serial Processing in the hacmp.out File
- Managing a Node's HACMP Log File Parameters
- Logging for clcomd
- Redirecting HACMP Cluster Log Files.
System error log
Contains time-stamped, formatted messages from all AIX subsystems, including scripts and daemons. For information about viewing this log file and interpreting the messages it contains, see the section Understanding the System Error Log. Recommended Use: Because the system error log contains time-stamped messages from many other system components, it is a good place to correlate cluster events with system events.
/tmp/clconvert.log
Contains a record of the conversion progress when upgrading to a recent HACMP release. The installation process runs the cl_convert utility and creates the /tmp/clconvert.log file. Recommended Use: View the clconvert.log to gauge conversion success when running cl_convert from the command line. For detailed information on the cl_convert utility see the chapter on Upgrading an HACMP Cluster, in the Installation Guide.
/usr/es/sbin/cluster/snapshots/clsnapshot.log
Contains logging information from the snapshot utility of HACMP, and information about errors found and/or actions taken by HACMP for resetting cluster tunable values.

/usr/es/sbin/cluster/wsm/logs/wsm_smit.log
All operations of the WebSMIT interface are logged to the wsm_smit.log file and are equivalent to the logging done with smitty -v. Script commands are also captured in the wsm_smit.script log file. wsm_smit log files are created by the CGI scripts using a relative path of <../logs>. If you copy the CGI scripts to the default location for the IBM HTTP Server, the final path to the logs is /usr/IBMIHS/logs. The location of the WebSMIT log files cannot be modified. Like the smit.log and smit.script log files, new logging entries are appended to the end of the file, and you need to control their size and backup. There is no default logging of the cluster status display, although logging can be enabled through the wsm_clstat.com configuration file.
/var/ha/log/grpglsm
Contains time-stamped messages in ASCII format. These track the execution of internal activities of the RSCT Group Services Globalized Switch Membership daemon. IBM support personnel use this information for troubleshooting. The file gets trimmed regularly. Therefore, please save it promptly if there is a chance you may need it.
/var/ha/log/grpsvcs
Contains time-stamped messages in ASCII format. These track the execution of internal activities of the RSCT Group Services daemon. IBM support personnel use this information for troubleshooting. The file gets trimmed regularly. Therefore, please save it promptly if there is a chance you may need it.

/var/ha/log/topsvcs.<filename>
Contains time-stamped messages in ASCII format. These track the execution of internal activities of the RSCT Topology Services daemon. IBM support personnel use this information for troubleshooting. The file gets trimmed regularly. Therefore, please save it promptly if there is a chance you may need it.

/var/hacmp/adm/cluster.log
Contains time-stamped, formatted messages generated by HACMP scripts and daemons. For information about viewing this log file and interpreting its messages, see the following section Understanding the cluster.log File. Recommended Use: Because this log file provides a high-level view of current cluster status, check this file first when diagnosing a cluster problem.

/var/hacmp/adm/history/cluster.mmddyyyy
Contains time-stamped, formatted messages generated by HACMP scripts. The system creates a cluster history file every day, identifying each file by its file name extension, where mm indicates the month, dd indicates the day, and yyyy the year. For information about viewing this log file and interpreting its messages, see the section Understanding the Cluster History Log File. Recommended Use: Use the cluster history log files to get an extended view of cluster behavior over time. Note that this log is not a good tool for tracking resource groups processed in parallel. In parallel processing, certain steps formerly run as separate events are now processed differently and these steps will not be evident in the cluster history log. Use the hacmp.out file to track parallel processing activity.
/var/hacmp/clcomd/clcomddiag.log
Contains time-stamped, formatted, diagnostic messages generated by clcomd. Recommended Use: Information in this file is for IBM Support personnel.
/var/hacmp/log/autoverify.log
/var/hacmp/log/clavan.log
Contains the state transitions of applications managed by HACMP: for example, when each application managed by HACMP is started or stopped, and when the node on which an application is running stops. Each node has its own instance of the file. Each record in the clavan.log file consists of a single line; each line contains a fixed portion and a variable portion. Recommended Use: By collecting the records in the clavan.log file from every node in the cluster, a utility program can determine how long each application has been up, as well as compute other statistics describing application availability time.
The clinfo.log file records the output generated by the event scripts as they run. This information supplements and expands upon the information in the /var/hacmp/log/hacmp.out file. You can install Client Information (Clinfo) services on both client and server systems; client systems (cluster.es.client) will not have any HACMP ODMs (for example, HACMPlogs) or utilities (for example, clcycle), therefore the Clinfo logging will not take advantage of cycling or redirection. The default debug level is 0 or off. You can enable logging using command line flags. Use the clinfo -l flag to change the log file name.
/var/hacmp/log/clstrmgr.debug
/var/hacmp/log/clstrmgr.debug.n, n=1,..,7
Contains time-stamped, formatted messages generated by the clstrmgrES daemon. The default messages are verbose and are typically adequate for troubleshooting most problems; however, IBM support may direct you to enable additional debugging. Recommended Use: Information in this file is for IBM Support personnel.

/var/hacmp/log/clstrmgr.debug.long
/var/hacmp/log/clstrmgr.debug.long.n, n=1,..,7
Contains high-level logging of cluster manager activity, in particular its interaction with other components of HACMP and with RSCT, which event is currently being run, and information about resource groups (for example, their state and actions to be performed, such as acquiring or releasing them during an event).

/var/hacmp/log/cspoc.log
Contains time-stamped, formatted messages generated by HACMP C-SPOC commands. The cspoc.log file resides on the node that invokes the C-SPOC command. Recommended Use: Use the C-SPOC log file when tracing a C-SPOC command's execution on cluster nodes.
/var/hacmp/log/cspoc.log.long
Contains a high level of logging for the C-SPOC utility commands and utilities that have been invoked by C-SPOC on specified nodes, and their return status.

Contains logging of the execution of C-SPOC commands on remote nodes with ksh option xtrace enabled (set -x).

Contains time-stamped, formatted messages generated by the HACMP Event Emulator. The messages are collected from output files on each node of the cluster, and cataloged together into the emuhacmp.out log file. In verbose mode (recommended), this log file contains a line-by-line record of every event emulated. Customized scripts within the event are displayed, but commands within those scripts are not executed.
/var/hacmp/log/hacmp.out
Contains time-stamped, formatted messages generated by HACMP scripts on the current day. In verbose mode (recommended), this log file contains a line-by-line record of every command executed by scripts, including the values of all arguments to each command. An event summary of each high-level event is included at the end of each event's details. For information about viewing this log and interpreting its messages, see the section Understanding the hacmp.out Log File. Recommended Use: Because the information in this log file supplements and expands upon the information in the /var/hacmp/adm/cluster.log file, it is the primary source of information when investigating a problem. Note: With recent changes in the way resource groups are handled and prioritized in fallover circumstances, the hacmp.out file and its event summaries have become even more important in tracking the activity and resulting location of your resource groups. In HACMP releases prior to 5.2, non-recoverable event script failures result in the event_error event being run on the cluster node where the failure occurred. The remaining cluster nodes do not indicate the failure. With HACMP 5.2 and up, all cluster nodes run the event_error event if any node has a fatal error. All nodes log the error and call out the failing node name in the hacmp.out log file.
/var/hacmp/log/oraclesa.log
Contains logging of the Smart Assist for Oracle facility.

/var/hacmp/log/sa.log
Contains logging of the Smart Assist facility.
/var/hacmp/clcomd/clcomd.log
Contains time-stamped, formatted messages generated by Cluster Communications daemon (clcomd) activity. The log shows information about incoming and outgoing connections, both successful and unsuccessful. It also displays a warning if the file permissions for /usr/es/sbin/cluster/etc/rhosts are not set correctly; users on the system should not be able to write to the file (a quick way to check this is shown after these log file descriptions). Recommended Use: Use information in this file to troubleshoot inter-node communications, and to obtain information about attempted connections to the daemon (and therefore to HACMP).
/var/hacmp/log/clconfigassist.log
Contains debugging information for the Two-Node Cluster Configuration Assistant. The Assistant stores up to ten copies of the numbered log files to assist with troubleshooting activities.
/var/hacmp/clverify/clverify.log
The clverify.log file contains the verbose messages output by the cluster verification utility. The messages indicate the node(s), devices, command, etc. in which any verification error occurred. For complete information, see Chapter 7: Verifying and Synchronizing a Cluster Configuration in the Administration Guide.

/var/hacmp/log/clutils.log
Contains information about the date, time, results, and which node performed an automatic cluster configuration verification. It also contains information for the file collection utility, the two-node cluster configuration assistant, the cluster test tool and the OLPW conversion tool.

/var/hacmp/log/cl_testtool.log
Includes excerpts from the hacmp.out file. The Cluster Test Tool saves up to three log files and numbers them so that you can compare the results of different cluster tests. The tool also rotates the files, with the oldest file being overwritten.

/var/hacmp/log/migration.log
Contains a high level of logging of cluster activity while the cluster manager on the local node operates in a migration state. All actions pertaining to the cluster manager follow the internal migration protocol.
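As noted for /var/hacmp/clcomd/clcomd.log above, clcomd warns when /usr/es/sbin/cluster/etc/rhosts is writable by other users. A quick manual check and correction looks like this (the exact mode your installation uses may differ; the point is simply that only root should be able to write to the file):

    ls -l /usr/es/sbin/cluster/etc/rhosts        # confirm the current owner and permissions
    chmod go-w /usr/es/sbin/cluster/etc/rhosts   # remove group and other write permission if present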
Each entry has the following information:
Date and Time stamp: The day and time on which the event occurred.
Node: The node on which the event occurred.
Subsystem: The HACMP subsystem that generated the event. The subsystems are identified by the following abbreviations:
- clstrmgrES: the Cluster Manager daemon
- clinfoES: the Cluster Information Program daemon
- HACMP for AIX: startup and reconfiguration scripts.
PID: The process ID of the daemon generating the message (not included for messages output by scripts).
Message: The message text.
The entry in the previous example indicates that the Cluster Information program (clinfoES) stopped running on the node named nodeA at 5:25 P.M. on March 3. Because the /var/hacmp/adm/cluster.log file is a standard ASCII text file, you can view it using standard AIX file commands, such as the more or tail commands. However, you can also use the SMIT interface. The following sections describe each of the options.
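For example, from the command line (the node name in the grep is only a placeholder):

    tail -f /var/hacmp/adm/cluster.log      # watch new entries as they are logged
    more /var/hacmp/adm/cluster.log         # page through the whole file
    grep nodeA /var/hacmp/adm/cluster.log   # show only the entries for one node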
Event Preambles
When an event processes resource groups with dependencies or with HACMP/XD replicated resources, an event preamble is included in the hacmp.out file. This preamble shows you the logic the Cluster Manager will use to process the event in question. See the sample below.
HACMP Event Preamble
------------------------------------------------------------------
Node Down Completion Event has been enqueued.
------------------------------------------------------------------

HACMP Event Preamble
Action:        Resource:
------------------------------------------------------------------
Enqueued rg_move acquire event for resource group rg3.
Enqueued rg_move release event for resource group rg3.
Enqueued rg_move secondary acquire event for resource group 'rg1'.
Node Up Completion Event has been enqueued.
------------------------------------------------------------------
Event Summaries
Event summaries that appear at the end of each event's details make it easier to check the hacmp.out file for errors. The event summaries contain pointers back to the corresponding event, which allow you to easily locate the output for any event. See the section Verbose Output Example with Event Summary for an example of the output. You can also view a compilation of only the event summary sections pulled from current and past hacmp.out files. The option for this display is found on the Problem Determination Tools > HACMP Log Viewing and Management > View/Save/Remove Event Summaries > View Event Summaries SMIT panel. For more detail, see the section Viewing Compiled hacmp.out Event Summaries later in this chapter.
Each event summary includes:
- The start and stop times for the event
- Which resource groups were affected (acquired or released) as a result of the event
- In the case of a failed event, an indication of which resource action failed.
You can track the path the Cluster Manager takes as it tries to keep resources available. In addition, the automatically configured AIX Error Notification method that runs in the case of a volume group failure writes the following information in the hacmp.out log file:
- The AIX error label and ID for which the method was launched
- The name of the affected resource group
- The name of the node on which the error occurred.
simply indicates that a network interface joins the cluster. Similarly, when a network interface failure occurs, the actual event that is run is called fail_interface. This is also reflected in the hacmp.out file. Remember that the event being run in this case simply indicates that a network interface on the given network has failed.
- Resource group name
- Script name
- Name of the command that is being executed.
In cases where an event script does not process a specific resource group, for instance, in the beginning of a node_up event, a resource group's name cannot be obtained. In this case, the resource group's name part of the tag is blank. For example, the hacmp.out file may contain either of the following lines:
cas2:node_up_local[199] set_resource_status ACQUIRING
:node_up[233] cl_ssa_fence up stan
In addition, references to the individual resources in the event summaries in the hacmp.out file contain reference tags to the associated resource groups. For instance:
Mon.Sep.10.14:54:49.EDT 2003.cl_swap_IP_address.192.168.1.1.cas2.ref
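These tags make it easy to pull every line that refers to a particular resource group out of a large hacmp.out file, for example:

    grep cas2 /var/hacmp/log/hacmp.out    # all lines tagged with resource group cas2 (name taken from the example above)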
- The first five config_too_long messages appear in the hacmp.out file at 30-second intervals
- The next set of five messages appears at an interval that is double the previous interval until the interval reaches one hour
- These messages are logged every hour until the event completes or is terminated on that node.
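For illustration, with these defaults an event that never completes produces warnings on roughly this schedule: five messages 30 seconds apart, then five messages 60 seconds apart, then five messages 120 seconds apart, and so on, with the spacing doubling for each set of five until it reaches one hour; from then on one message is logged every hour until the event completes or is terminated.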
For more information on customizing the event duration time before receiving a config_too_long warning message, see the chapter on Planning for Cluster Events in the Planning Guide.
Verbose Output
In verbose mode, the hacmp.out file also includes the values of arguments and flag settings passed to the scripts and commands.

Verbose Output Example with Event Summary
Some events (those initiated by the Cluster Manager) are followed by event summaries, as shown in these excerpts:
....
Mar 25 15:20:30 EVENT COMPLETED: network_up alcuin tmssanet_alcuin_bede

HACMP Event Summary
Event: network_up alcuin tmssanet_alcuin_bede
Start time: Tue Mar 25 15:20:30 2003
End time: Tue Mar 25 15:20:30 2003

Action:             Resource:             Script Name:
------------------------------------------------------------------------
No resources changed as a result of this event
------------------------------------------------------------------------
Event Summary for the Settling Time

CustomRG has a settling time configured. A lower priority node joins the cluster:
Mar 25 15:20:30 EVENT COMPLETED: node_up alcuin

HACMP Event Summary
Event: node_up alcuin
Start time: Tue Mar 25 15:20:30 2003
End time: Tue Mar 25 15:20:30 2003

Action:             Resource:             Script Name:
----------------------------------------------------------------
No action taken on resource group 'CustomRG'. The Resource Group 'CustomRG' has been configured to use 20 Seconds Settling Time. This group will be processed when the timer expires.
----------------------------------------------------------------------
Event Summary for the Fallback Timer

CustomRG has a daily fallback timer configured to fall back at 22:10. The resource group is on a lower priority node (bede); therefore, the timer is ticking when the higher priority node (alcuin) joins the cluster:
The message on bede:

... Mar 25 15:20:30 EVENT COMPLETED: node_up alcuin

HACMP Event Summary
Event: node_up alcuin
Start time: Tue Mar 25 15:20:30 2003
End time: Tue Mar 25 15:20:30 2003

Action:             Resource:             Script Name:
----------------------------------------------------------------
No action taken on resource group 'CustomRG'. The Resource Group 'CustomRG' has been configured to fallback on Mon Mar 25 22:10:00 2003
----------------------------------------------------------------------

The message on alcuin:

... Mar 25 15:20:30 EVENT COMPLETED: node_up alcuin

HACMP Event Summary
Event: node_up alcuin
Start time: Tue Mar 25 15:20:30 2003
End time: Tue Mar 25 15:20:30 2003

Action:             Resource:             Script Name:
----------------------------------------------------------------
The Resource Group 'CustomRG' has been configured to fallback using daily1 Timer Policy
----------------------------------------------------------------------
Setting the Level and Format of Information Recorded in the hacmp.out File
Note: These preferences take effect as soon as you set them.

To set the level of information recorded in the /var/hacmp/log/hacmp.out file:
1. Enter smit hacmp.
2. In SMIT, select Problem Determination Tools > HACMP Log Viewing and Management > Change/Show HACMP Log File Parameters. SMIT prompts you to specify the name of the cluster node you want to modify. Runtime parameters are configured on a per-node basis.
3. Type the node name and press Enter. SMIT displays the HACMP Log File Parameters panel.
4. To obtain verbose output, set the value of the Debug Level field to high.
5. To change the hacmp.out display format, select Formatting options for hacmp.out. Select a node and set the formatting to HTML (Low), HTML (High), Default (None), or Standard.
   Note: If you set your formatting options for hacmp.out to Default (None), no event summaries will be generated. For information about event summaries, see the section Viewing Compiled hacmp.out Event Summaries.
6. To change the level of debug information, set the value of the New Cluster Manager debug level field to either Low or High.
The event summaries display is a good way to get a quick overview of what has happened in the cluster recently. If the event summaries reveal a problem event, examine the source hacmp.out file to see the full details of what happened.

Note: If you have set your formatting options for hacmp.out to Default (None), no event summaries are generated, and the View Event Summaries command yields no results.
In long format, a page of formatted information is displayed for each error. Unlike the HACMP log files, the system error log is not a text file.
Note: This log reports specific events. When resource groups are processed in parallel, certain steps that were previously run as separate events are processed differently and therefore do not show up as events in the cluster history log file. Use the hacmp.out file, which contains greater detail on resource group activity and location, to track parallel processing activity.
- Using standard AIX file commands, such as the more or tail commands
- Using the SMIT interface.
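For the first method, a minimal example of paging through the C-SPOC log and watching it in real time with standard AIX commands (the log path is the one used later in this section):

more /var/hacmp/log/cspoc.log.long
tail -f /var/hacmp/log/cspoc.log.long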
Using the SMIT Interface to View the cspoc.log.long File

To view the /var/hacmp/log/cspoc.log.long file using SMIT:
1. Enter smit hacmp.
2. In SMIT, select Problem Determination Tools > HACMP Log Viewing and Management > View Detailed HACMP Log Files > Scan the C-SPOC System Log File.

Note: You can either scan the contents of the cspoc.log.long file as it exists, or watch an active log file as new events are appended to it in real time. Typically, you scan the file to find a problem that has already occurred; you watch the file while duplicating a problem to help determine its cause, or as you test a solution to determine the results.
+ echo /usr/es/sbin/cluster/events/utils/cl_ssa_fence down buzzcut graceful\n
/usr/es/sbin/cluster/events/utils/cl_ssa_fence down buzzcut graceful
****************END OF EMULATION FOR NODE BUZZCUT *********************
The output of emulated events is presented as in the /var/hacmp/log/hacmp.out file described earlier in this chapter. The /var/hacmp/log/emuhacmp.out file also contains the following information:

Header: Each node's output begins with a header that signifies the start of the emulation and the node from which the output is received.

Notice: The Notice field identifies the name and path of commands or scripts that are echoed only. If the command being echoed is a customized script, such as a pre- or post-event script, the contents of the script are displayed. Syntax errors in the script are also listed.

ERROR: The error field contains a statement indicating the type of error and the name of the script in which the error was discovered.

Footer: Each node's output ends with a footer that signifies the end of the emulation and the node from which the output is received.
Nodes to Collect Data from: Enter or select the nodes from which the data will be collected. Separate node names with a comma. The default is All nodes.

Debug: The default is No. Use this option if IBM Support requests that you turn on debugging.

Collect RSCT Log Files: The default is Yes. Select No to skip collection of RSCT data.
Tracking Resource Group Parallel and Serial Processing in the hacmp.out File
Output to the hacmp.out file lets you isolate details related to a specific resource group and its resources. Based on the content of the hacmp.out event summaries, you can determine whether the resource groups are being processed in the expected order.

Depending on whether resource groups are processed serially or in parallel, you will see different output in the event summaries and in the log files. In HACMP, parallel processing is the default method. If you migrated the cluster from an earlier version of HACMP, serial processing is maintained.

Note: If you configured dependent resource groups and specified a serial order of processing, the rules for processing dependent resource groups override the serial order. To avoid conflicts, do not specify a serial order of processing that contradicts the configured dependencies between resource groups.

This section contains detailed information on the following:
- Serial Processing Order Reflected in Event Summaries
- Parallel Processing Order Reflected in Event Summaries
- Job Types: Parallel Resource Group Processing
- Processing in Clusters with Dependent Resource Groups or Sites
- Disk Fencing with Serial or Parallel Processing.
As shown here, each resource group appears with all of its associated resources listed below it.
- Each line in the hacmp.out file flow includes the name of the resource group to which it applies
- The event summary information includes details about all resource types
- Each line in the event summary indicates the related resource group.
The following example shows an event summary for resource groups named cascrg1 and cascrg2 that are processed in parallel:
HACMP Event Summary
Event: node_up electron
Start time: Wed May 8 11:06:30 2002
End time: Wed May 8 11:07:49 2002

Action:                     Resource:           Script Name:
-------------------------------------------------------------
Acquiring resource group:   cascrg1             process_resources
Search on: Wed.May.8.11:06:33.EDT.2002.process_resources.cascrg1.ref
Acquiring resource group:   cascrg2             process_resources
Search on: Wed.May.8.11:06:34.EDT.2002.process_resources.cascrg2.ref
Acquiring resource:         192.168.41.30       cl_swap_IP_address
Search on: Wed.May.8.11:06:36.EDT.2002.cl_swap_IP_address.192.168.41.30
Acquiring resource:         hdisk1              cl_disk_available
Search on: Wed.May.8.11:06:40.EDT.2002.cl_disk_available.hdisk1.cascrg1
Acquiring resource:         hdisk2              cl_disk_available
Search on: Wed.May.8.11:06:40.EDT.2002.cl_disk_available.hdisk2.cascrg2
Resource online:            hdisk1              cl_disk_available
Search on: Wed.May.8.11:06:42.EDT.2002.cl_disk_available.hdisk1.cascrg1
Resource online:            hdisk2              cl_disk_available
Search on: Wed.May.8.11:06:43.EDT.2002.cl_disk_available.hdisk2.cascrg2
As shown here, all processed resource groups are listed first, followed by the individual resources that are being processed.
There is one job type for each resource type: DISKS, FILESYSTEMS, TAKEOVER_LABELS, TAPE_RESOURCES, AIX_FAST_CONNECTIONS, APPLICATIONS, COMMUNICATION_LINKS, CONCURRENT_VOLUME_GROUPS, EXPORT_FILESYSTEMS, and MOUNT_FILESYSTEMS. There are also a number of job types that are used to help capitalize on the benefits of parallel processing: SETPRKEY, TELINIT, SYNC_VGS, LOGREDO, and UPDATESTATD. The related operations are now run once per event, rather than once per resource group. This is one of the primary areas of benefit from parallel resource group processing, especially for small clusters.
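To see at a glance which job types ran during an event, one simple, hedged approach is to search the log for the JOB_TYPE tag (this assumes the default log location used throughout this chapter):

grep "JOB_TYPE=" /var/hacmp/log/hacmp.out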
The following sections describe some of the most common job types in more detail and provide abstracts from the events in the hacmp.out log file which include these job types.
JOB_TYPE=ONLINE
In the complete phase of an acquisition event, after all resources for all resource groups have been successfully acquired, the ONLINE job type is run. This job ensures that all successfully acquired resource groups are set to the online state. The RESOURCE_GROUPS variable contains the list of all groups that were acquired.
:process_resources[1476] clRGPA
:clRGPA[48] [[ high = high ]]
:clRGPA[48] version=1.16
:clRGPA[50] usingVer=clrgpa
:clRGPA[55] clrgpa
:clRGPA[56] exit 0
:process_resources[1476] eval JOB_TYPE=ONLINE RESOURCE_GROUPS="cascrg1 cascrg2 conc_rg1"
:process_resources[1476] JOB_TYPE=ONLINE RESOURCE_GROUPS=cascrg1 cascrg2 conc_rg1
:process_resources[1478] RC=0
:process_resources[1479] set +a
:process_resources[1481] [ 0 -ne 0 ]
:process_resources[1700] set_resource_group_state UP
JOB_TYPE=OFFLINE
In the complete phase of a release event, after all resources for all resource groups have been successfully released, the OFFLINE job type is run. This job ensures that all successfully released resource groups are set to the offline state. The RESOURCE_GROUPS variable contains the list of all groups that were released.
conc_rg1:process_resources[1476] clRGPA
conc_rg1:clRGPA[48] [[ high = high ]]
conc_rg1:clRGPA[48] version=1.16
conc_rg1:clRGPA[50] usingVer=clrgpa
conc_rg1:clRGPA[55] clrgpa
conc_rg1:clRGPA[56] exit 0
conc_rg1:process_resources[1476] eval JOB_TYPE=OFFLINE RESOURCE_GROUPS="cascrg2 conc_rg1"
conc_rg1:process_resources[1476] JOB_TYPE=OFFLINE RESOURCE_GROUPS=cascrg2 conc_rg1
conc_rg1:process_resources[1478] RC=0
conc_rg1:process_resources[1479] set +a
conc_rg1:process_resources[1481] [ 0 -ne 0 ]
conc_rg1:process_resources[1704] set_resource_group_state DOWN
JOB_TYPE=ERROR
If an error occurred during the acquisition or release of any resource, the ERROR job type is run. The variable RESOURCE_GROUPS contains the list of all groups where acquisition or release failed during the current event. These resource groups are moved into the error state. When this job is run during an acquisition event, HACMP uses the Recovery from Resource Group Acquisition Failure feature and launches an rg_move event for each resource group in the error state. For more information, see the Handling of Resource Group Acquisition Failures section in Appendix B: Resource Group Behavior During Cluster Events in the Administration Guide.
conc_rg1:process_resources[1476] clRGPA
conc_rg1:clRGPA[50] usingVer=clrgpa
conc_rg1:clRGPA[55] clrgpa
conc_rg1:clRGPA[56] exit 0
conc_rg1:process_resources[1476] eval JOB_TYPE=ERROR RESOURCE_GROUPS="cascrg1"
conc_rg1:process_resources[1476] JOB_TYPE=ERROR RESOURCE_GROUPS=cascrg1
conc_rg1:process_resources[1478] RC=0
conc_rg1:process_resources[1479] set +a
conc_rg1:process_resources[1481] [ 0 -ne 0 ]
conc_rg1:process_resources[1712] set_resource_group_state ERROR
JOB_TYPE=NONE
After all processing is complete for the current process_resources script, the final job type of NONE is used to indicate that processing is complete and the script can return. When exiting after receiving this job, the process_resources script always returns 0 for success.
conc_rg1:process_resources[1476] clRGPA
conc_rg1:clRGPA[48] [[ high = high ]]
conc_rg1:clRGPA[48] version=1.16
conc_rg1:clRGPA[50] usingVer=clrgpa
conc_rg1:clRGPA[55] clrgpa
conc_rg1:clRGPA[56] exit 0
conc_rg1:process_resources[1476] eval JOB_TYPE=NONE
conc_rg1:process_resources[1476] JOB_TYPE=NONE
conc_rg1:process_resources[1478] RC=0
conc_rg1:process_resources[1479] set +a
conc_rg1:process_resources[1481] [ 0 -ne 0 ]
conc_rg1:process_resources[1721] break
conc_rg1:process_resources[1731] exit 0
JOB_TYPE=ACQUIRE
The ACQUIRE job type occurs at the beginning of any resource group acquisition event. Search hacmp.out for JOB_TYPE=ACQUIRE and view the value of the RESOURCE_GROUPS variable to see a list of the resource groups that are being acquired in parallel during the event.
:process_resources[1476] clRGPA
:clRGPA[48] [[ high = high ]]
:clRGPA[48] version=1.16
:clRGPA[50] usingVer=clrgpa
:clRGPA[55] clrgpa
:clRGPA[56] exit 0
:process_resources[1476] eval JOB_TYPE=ACQUIRE RESOURCE_GROUPS="cascrg1 cascrg2"
:process_resources[1476] JOB_TYPE=ACQUIRE RESOURCE_GROUPS=cascrg1 cascrg2
:process_resources[1478] RC=0
:process_resources[1479] set +a
:process_resources[1481] [ 0 -ne 0 ]
:process_resources[1687] set_resource_group_state ACQUIRING
JOB_TYPE=RELEASE
The RELEASE job type occurs at the beginning of any resource group release event. Search hacmp.out for JOB_TYPE=RELEASE and view the value of the RESOURCE_GROUPS variable to see a list of the resource groups that are being released in parallel during the event.
:process_resources[1476] clRGPA
:clRGPA[48] [[ high = high ]]
:clRGPA[48] version=1.16
:clRGPA[50] usingVer=clrgpa
:clRGPA[55] clrgpa
:clRGPA[56] exit 0
:process_resources[1476] eval JOB_TYPE=RELEASE RESOURCE_GROUPS="cascrg1 cascrg2"
:process_resources[1476] JOB_TYPE=RELEASE RESOURCE_GROUPS=cascrg1 cascrg2
:process_resources[1478] RC=0
:process_resources[1479] set +a
:process_resources[1481] [ 0 -ne 0 ]
:process_resources[1691] set_resource_group_state RELEASING
JOB_TYPE=SSA_FENCE
The SSA_FENCE job type is used to handle fencing and unfencing of SSA disks. The variable ACTION indicates what should be done to the disks listed in the HDISKS variable. All resource groups (both parallel and serial) use this method for disk fencing.
:process_resources[1476] clRGPA FENCE
:clRGPA[48] [[ high = high ]]
:clRGPA[55] clrgpa FENCE
:clRGPA[56] exit 0
:process_resources[1476] eval JOB_TYPE=SSA_FENCE ACTION=ACQUIRE HDISKS="hdisk6" RESOURCE_GROUPS="conc_rg1 " HOSTS="electron"
:process_resources[1476] JOB_TYPE=SSA_FENCE ACTION=ACQUIRE HDISKS=hdisk6 RESOURCE_GROUPS=conc_rg1 HOSTS=electron
:process_resources[1478] RC=0
:process_resources[1479] set +a
:process_resources[1481] [ 0 -ne 0 ]
:process_resources[1675] export GROUPNAME=conc_rg1
conc_rg1:process_resources[1676] process_ssa_fence ACQUIRE
Note: Because disk fencing uses the process_resources script, it may appear that resource processing is taking place when, in fact, only disk fencing is taking place. If disk fencing is enabled, you will see in the hacmp.out file that the disk fencing operation occurs before any resource group processing. Although the process_resources script handles SSA disk fencing, the resource groups are processed serially: cl_ssa_fence is called once for each resource group that requires disk fencing. The hacmp.out content indicates which resource group is being processed.
conc_rg1:process_resources[8] export GROUPNAME
conc_rg1:process_resources[10] get_list_head hdisk6
conc_rg1:process_resources[10] read LIST_OF_HDISKS_FOR_RG
conc_rg1:process_resources[11] read HDISKS
conc_rg1:process_resources[11] get_list_tail hdisk6
conc_rg1:process_resources[13] get_list_head electron
conc_rg1:process_resources[13] read HOST_FOR_RG
conc_rg1:process_resources[14] get_list_tail electron
conc_rg1:process_resources[14] read HOSTS
conc_rg1:process_resources[18] cl_ssa_fence ACQUIRE electron hdisk6
conc_rg1:cl_ssa_fence[43] version=1.9.1.2
conc_rg1:cl_ssa_fence[44] STATUS=0
conc_rg1:cl_ssa_fence[44] (( 3 < 3
conc_rg1:cl_ssa_fence[46] OPERATION=ACQUIRE
conc_rg1:cl_ssa_fence[48]
conc_rg1:cl_ssa_fence[56]
JOB_TYPE=SERVICE_LABELS
The SERVICE_LABELS job type handles the acquisition or release of service labels. The variable ACTION indicates what should be done to the service IP labels listed in the IP_LABELS variable.
conc_rg1:process_resources[1476] clRGPA
conc_rg1:clRGPA[55] clrgpa
conc_rg1:clRGPA[56] exit 0
conc_rg1:process_resources[1476] eval JOB_TYPE=SERVICE_LABELS ACTION=ACQUIRE IP_LABELS="elect_svc0:shared_svc1,shared_svc2" RESOURCE_GROUPS="cascrg1 rotrg1" COMMUNICATION_LINKS=":commlink1"
conc_rg1:process_resources[1476] JOB_TYPE=SERVICE_LABELS ACTION=ACQUIRE IP_LABELS=elect_svc0:shared_svc1,shared_svc2 RESOURCE_GROUPS=cascrg1 rotrg1 COMMUNICATION_LINKS=:commlink1
conc_rg1:process_resources[1478] RC=0
conc_rg1:process_resources[1479] set +a
conc_rg1:process_resources[1481] [ 0 -ne 0 ]
conc_rg1:process_resources[1492] export GROUPNAME=cascrg1
This job type launches an acquire_service_addr event. Within the event, each individual service label is acquired. The content of the hacmp.out file indicates which resource group is being processed. Within each resource group, the event flow is the same as it is under serial processing.
cascrg1:acquire_service_addr[251] export GROUPNAME
cascrg1:acquire_service_addr[251] [[ true = true ]]
cascrg1:acquire_service_addr[254] read SERVICELABELS
cascrg1:acquire_service_addr[254] get_list_head electron_svc0
cascrg1:acquire_service_addr[255] get_list_tail electron_svc0
cascrg1:acquire_service_addr[255] read IP_LABELS
cascrg1:acquire_service_addr[257] get_list_head
cascrg1:acquire_service_addr[257] read SNA_CONNECTIONS
export SNA_CONNECTIONS
get_list_tail
read _SNA_CONNECTIONS
clgetif -a electron_svc0
JOB_TYPE=VGS
The VGS job type handles the acquisition or release of volume groups. The variable ACTION indicates what should be done to the volume groups being processed, and the names of the volume groups are listed in the VOLUME_GROUPS and CONCURRENT_VOLUME_GROUPS variables.
conc_rg1:process_resources[1476] clRGPA
conc_rg1:clRGPA[55] clrgpa
conc_rg1:clRGPA[56] exit 0
conc_rg1:process_resources[1476] eval JOB_TYPE=VGS ACTION=ACQUIRE CONCURRENT_VOLUME_GROUP="con_vg6" VOLUME_GROUPS="casc_vg1:casc_vg2" RESOURCE_GROUPS="cascrg1 cascrg2 " EXPORT_FILESYSTEM=""
conc_rg1:process_resources[1476] JOB_TYPE=VGS ACTION=ACQUIRE CONCURRENT_VOLUME_GROUP=con_vg6 VOLUME_GROUPS=casc_vg1:casc_vg2 RESOURCE_GROUPS=cascrg1 cascrg2 EXPORT_FILESYSTEM=""
conc_rg1:process_resources[1478] RC=0
conc_rg1:process_resources[1481] [ 0 -ne 0 ]
conc_rg1:process_resources[1529] export GROUPNAME=cascrg1 cascrg2
This job type runs the cl_activate_vgs event utility script, which acquires each individual volume group. The content of the hacmp.out file indicates which resource group is being processed, and within each resource group, the script flow is the same as it is under serial processing.
cascrg1 cascrg2:cl_activate_vgs[256] 1> /usr/es/sbin/cluster/etc/lsvg.out.21266 2> /tmp/lsvg.err
cascrg1:cl_activate_vgs[260] export GROUPNAME
cascrg1:cl_activate_vgs[262] get_list_head casc_vg1:casc_vg2
cascrg1:cl_activate_vgs[262] read LIST_OF_VOLUME_GROUPS_FOR_RG
cascrg1:cl_activate_vgs[263] get_list_tail casc_vg1:casc_vg2
cascrg1:cl_activate_vgs[263] read VOLUME_GROUPS
cascrg1:cl_activate_vgs[265] LIST_OF_VOLUME_GROUPS_FOR_RG=
cascrg1:cl_activate_vgs[270] fgrep -s -x casc_vg1 /usr/es/sbin/cluster/etc/lsvg.out.21266
cascrg1:cl_activate_vgs[275] LIST_OF_VOLUME_GROUPS_FOR_RG=casc_vg1
cascrg1:cl_activate_vgs[275] [[ casc_vg1 = ]]
resource group's state change information. These sets of variables provide a picture of resource group actions on the peer site during the course of the local event during the acquire phase.

For JOB_TYPE=RELEASE, the following variables are used (both in node_down and rg_move release):
- SIBLING_GROUPS
- SIBLING_NODES_BY_GROUP
- SIBLING_RELEASE_GROUPS
- SIBLING_RELEASE_NODES_BY_GROUP

On a per resource group basis, the following variables are tracked:
- SIBLING_NODES
- SIBLING_NON_OWNER_NODES
- SIBLING_ACQUIRING_GROUPS or SIBLING_RELEASING_GROUPS
- SIBLING_ACQUIRING_NODES_BY_GROUP or SIBLING_RELEASING_GROUPS_BY_NODE

Sample Event with Siblings Output to hacmp.out
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Mar 28 09:40:42 EVENT START: rg_move a2 1ACQUIRE
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
:process_resources[1952] eval JOB_TYPE=ACQUIRE RESOURCE_GROUPS="rg3" SIBLING_GROUPS="rg1 rg3" SIBLING_NODES_BY_GROUP="b2 : b2" SIBLING_ACQUIRING_GROUPS="" SIBLING_ACQUIRING_NODES_BY_GROUP="" PRINCIPAL_ACTION="ACQUIRE" AUXILLIARY_ACTION="NONE"
:process_resources[1952] JOB_TYPE=ACQUIRE RESOURCE_GROUPS=rg3 SIBLING_GROUPS=rg1 rg3 SIBLING_NODES_BY_GROUP=b2 : b2 SIBLING_ACQUIRING_GROUPS= SIBLING_ACQUIRING_NODES_BY_GROUP= PRINCIPAL_ACTION=ACQUIRE AUXILLIARY_ACTION=NONE
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
:rg_move_complete[157] eval FORCEDOWN_GROUPS="" RESOURCE_GROUPS="" HOMELESS_GROUPS="" ERRSTATE_GROUPS="" PRINCIPAL_ACTIONS="" ASSOCIATE_ACTIONS="" AUXILLIARY_ACTIONS="" SIBLING_GROUPS="rg1 rg3" SIBLING_NODES_BY_GROUP="b2 : b2" SIBLING_ACQUIRING_GROUPS="" SIBLING_ACQUIRING_NODES_BY_GROUP="" SIBLING_RELEASING_GROUPS="" SIBLING_RELEASING_NODES_BY_GROUP=""
:rg_move_complete[157] FORCEDOWN_GROUPS= RESOURCE_GROUPS= HOMELESS_GROUPS= ERRSTATE_GROUPS= PRINCIPAL_ACTIONS= ASSOCIATE_ACTIONS= AUXILLIARY_ACTIONS= SIBLING_GROUPS=rg1 rg3 SIBLING_NODES_BY_GROUP=b2 : b2 SIBLING_ACQUIRING_GROUPS= SIBLING_ACQUIRING_NODES_BY_GROUP= SIBLING_RELEASING_GROUPS= SIBLING_RELEASING_NODES_BY_GROUP=
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
:process_resources[1952] eval JOB_TYPE=SYNC_VGS ACTION=ACQUIRE VOLUME_GROUPS="vg3,vg3sm" RESOURCE_GROUPS="rg3 "
:process_resources[1952] JOB_TYPE=SYNC_VGS ACTION=ACQUIRE VOLUME_GROUPS=vg3,vg3sm RESOURCE_GROUPS=rg3
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
rg3:process_resources[1952] eval JOB_TYPE=ONLINE RESOURCE_GROUPS="rg3"
rg3:process_resources[1952] JOB_TYPE=ONLINE RESOURCE_GROUPS=rg3
rg3:process_resources[1954] RC=0
rg3:process_resources[1955] set +a
rg3:process_resources[1957] [ 0 -ne 0 ]
rg3:process_resources[2207] set_resource_group_state UP
rg3:process_resources[3] STAT=0
rg3:process_resources[6] export GROUPNAME
rg3:process_resources[7] [ UP != DOWN ]
rg3:process_resources[9] [ REAL = EMUL ]
rg3:process_resources[14] clchdaemons -d clstrmgr_scripts -t resource_locator -n a1 -o rg3 -v UP
rg3:process_resources[15] [ 0 -ne 0 ]
rg3:process_resources[26] [ UP = ACQUIRING ]
rg3:process_resources[31] [ UP = RELEASING ]
rg3:process_resources[36] [ UP = UP ]
rg3:process_resources[38] cl_RMupdate rg_up rg3 process_resources
Reference string: Sun.Mar.27.18:02:09.EST.2005.process_resources.rg3.ref
rg3:process_resources[39] continue
rg3:process_resources[80] return 0
rg3:process_resources[1947] true
rg3:process_resources[1949] set -a
rg3:process_resources[1952] clRGPA
rg3:clRGPA[33] [[ high = high ]]
rg3:clRGPA[33] version=1.16
rg3:clRGPA[35] usingVer=clrgpa
rg3:clRGPA[40] clrgpa
rg3:clRGPA[41] exit 0
rg3:process_resources[1952] eval JOB_TYPE=NONE
rg3:process_resources[1952] JOB_TYPE=NONE
rg3:process_resources[1954] RC=0
rg3:process_resources[1955] set +a
rg3:process_resources[1957] [ 0 -ne 0 ]
rg3:process_resources[2256] break
rg3:process_resources[2267] [[ FALSE = TRUE ]]
rg3:process_resources[2273] exit 0
:rg_move_complete[346] STATUS=0
:rg_move_complete[348] exit 0
Mar 27 18:02:10 EVENT COMPLETED: rg_move_complete a1 2 0
Managing a Node's HACMP Log File Parameters
For each cluster node, you can:
- Set the level of debug information output by the HACMP scripts. By default, HACMP sets the debug information parameter to high, which produces detailed output from script execution.
- Set the output format for the hacmp.out log file.
To change the log file parameters for a node:
1. Enter the fastpath smit hacmp.
2. In SMIT, select Problem Determination Tools > HACMP Log Viewing and Management > Change/Show HACMP Log File Parameters and press Enter.
3. Select a node from the list.
4. Enter field values as follows:

Debug Level: Cluster event scripts have two levels of logging. The low level only logs events and errors encountered while the script executes. The high (default) level logs all commands performed by the script and is strongly recommended; it provides the level of script tracing needed to resolve many cluster problems.

Formatting options for hacmp.out: Select one of these: Default (None) (no special format), Standard (include search strings), HTML (Low) (limited HTML formatting), or HTML (High) (full HTML format).
5. Press Enter to add the values into the HACMP for AIX Configuration Database.
6. Return to the main HACMP menu. Select Extended Configuration > Extended Verification and Synchronization. The software checks whether cluster services are running on any cluster node. If so, there will be no option to skip verification.
7. Select the options you want to use for verification and press Enter to synchronize the cluster configuration and node environment across the cluster.

See Chapter 7: Verifying and Synchronizing a Cluster Configuration in the Administration Guide for complete information on this operation.
You can view the content of the clcomd.log or clcomddiag.log file by using the AIX vi or more commands. You can turn off logging to clcomddiag.log temporarily (until the next reboot, or until you enable logging for this component again) by using the AIX tracesoff command. To permanently stop logging to clcomddiag.log, start the daemon from SRC without the -d flag by using the following command:
chssys -s clcomdES -a ""
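For the temporary approach mentioned above, the SRC tracing command takes the subsystem name; and for the permanent chssys change to take effect, the daemon typically needs to be restarted under SRC control. A hedged example of both (restart clcomdES only when your change procedures allow it):

tracesoff -s clcomdES
stopsrc -s clcomdES
startsrc -s clcomdES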
When you redirect a cluster log file, HACMP:
- Checks the location of the target directory to determine whether it is part of a local or remote file system.
- Performs a check to determine whether the target directory is managed by HACMP. If it is, any attempt to redirect a log file will fail.
- Checks to ensure that the target directory is specified using an absolute path (such as /mylogdir) as opposed to a relative path (such as mylogdir).
These checks decrease the possibility that the chosen file system may become unexpectedly unavailable. Note: The target directory must have read-write access.
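Before redirecting a log, a quick hedged sanity check of the target directory (the /mylogdir path is the example used above) confirms which file system it lives in, how much free space it has, and its permissions:

df -k /mylogdir
ls -ld /mylogdir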
Chapter 3: Investigating System Components and Solving Common Problems
This chapter guides you through the steps to investigate system components, identifies problems that you may encounter as you use HACMP, and offers possible solutions.
Overview
If no error messages are displayed on the console and if examining the log files proves fruitless, you next investigate each component of your HACMP environment and eliminate it as the cause of the problem. The first section of this chapter reviews methods for investigating system components, including the RSCT subsystem. It includes these sections:
- Investigating System Components
- Checking Highly Available Applications
- Checking the HACMP Layer
- Checking the Logical Volume Manager
- Checking the TCP/IP Subsystem
- Checking the AIX Operating System
- Checking Physical Networks
- Checking Disks, Disk Adapters, and Disk Heartbeating Networks
- Checking the Cluster Communications Daemon
- Checking System Hardware

The second section provides recommendations for investigating the following areas:

- HACMP Installation Issues
- HACMP Startup Issues
- Disk and File System Issues
- Network and Switch Issues
- Cluster Communications Issues
- HACMP Takeover Issues
- Client Issues
- Miscellaneous Issues
- Do some simple tests; for example, for a database application, try to add and delete a record.
- Use the ps command to check that the necessary processes are running, or to verify that the processes were stopped properly.
- Check the resources that the application expects to be present (for example, the file systems and volume groups) to ensure that they are available.
The following sections describe how to investigate these problems. Note: These steps assume that you have checked the log files and that they do not point to the problem.
- The Cluster Manager (clstrmgrES) daemon
- The Cluster Communications (clcomdES) daemon
- The Cluster Information Program (clinfoES) daemon.
When these components are not responding normally, determine if the daemons are active on a cluster node. Use either the options on the SMIT System Management (C-SPOC) > Manage HACMP Services > Show Cluster Services panel or the lssrc command. For example, to check on the status of all daemons under the control of the SRC, enter:
lssrc -a | grep active
syslogd          ras         290990   active
sendmail         mail        270484   active
portmap          portmap     286868   active
inetd            tcpip       295106   active
snmpd            tcpip       303260   active
dpid2            tcpip       299162   active
hostmibd         tcpip       282812   active
aixmibd          tcpip       278670   active
biod             nfs         192646   active
rpc.statd        nfs         254122   active
rpc.lockd        nfs         274584   active
qdaemon          spooler     196720   active
writesrv         spooler     250020   active
ctrmc            rsct        98392    active
clcomdES         clcomdES    204920   active
IBM.CSMAgentRM   rsct_rm     90268    active
IBM.ServiceRM    rsct_rm     229510   active
To check on the status of all cluster daemons under the control of the SRC, enter:
lssrc -g cluster
Note: When you use the -g flag with the lssrc command, the status information does not include the status of subsystems that are inactive. If you need this information, use the -a flag instead. For more information on the lssrc command, see its man page.

To view additional information on the status of a daemon, run the clcheck_server command. The clcheck_server command makes additional checks and retries beyond what is done by the lssrc command. For more information, see the clcheck_server man page.

To determine whether the Cluster Manager is running, or whether processes started by the Cluster Manager are currently running on a node, use the ps command. For example, to determine whether the clstrmgrES daemon is running, enter:
ps -ef | grep clstrmgrES
root 18363  3346 3 11:02:05      -  10:20 /usr/es/sbin/cluster/clstrmgrES
root 19028 19559 2 16:20:04 pts/10   0:00 grep clstrmgrES
See the ps man page for more information on using this command.
- Verify that all cluster nodes contain the same cluster topology information
- Check that all network interface cards and tty lines are properly configured, and that shared disks are accessible to all nodes that can own them
- Check each cluster node to determine whether multiple RS232 non-IP networks exist on the same tty device
- Check for agreement among all nodes on the ownership of defined resources, such as file systems, log files, volume groups, disks, and application servers
- Check for invalid characters in cluster names, node names, network names, network interface names, and resource group names
- Verify takeover information.

The verification utility will also print out diagnostic information about the following:

- Custom snapshot methods
- Custom verification methods
- Custom pre- or post-events
- Cluster log file redirection.

If you have configured Kerberos on your system, the verification utility also determines that:

- All IP labels listed in the configuration have the appropriate service principals in the .klogin file on each node in the cluster
- All nodes have the proper service principals
- Kerberos is installed on all nodes in the cluster
- All nodes have the same security mode setting.
From the main HACMP SMIT panel, select Problem Determination Tools > HACMP Verification > Verify HACMP Configuration. If you find a configuration problem, correct it, then resynchronize the cluster.

Note: Some errors require that you make changes on each cluster node. For example, a missing application start script or a volume group with autovaryon=TRUE requires a correction on each affected node. Some of these issues can be taken care of by using HACMP File Collections. For more information about using the cluster verification utility and HACMP File Collections, see Chapter 7: Verifying and Synchronizing a Cluster Configuration in the Administration Guide.

Run the /usr/es/sbin/cluster/utilities/cltopinfo command to see a complete listing of the cluster topology. In addition to running the HACMP verification process, check for recent modifications to the node configuration files. The command ls -lt /etc lists all the files in the /etc directory and shows the most recently modified files that are important to configuring AIX.
It is also very important to check the resource group configuration for any errors that may not be flagged by the verification process. For example, make sure the file systems required by the application servers are included in the resource group with the application.
Check that the nodes in each resource group are the ones intended, and that the nodes are listed in the proper order. To view the cluster resource configuration information from the main HACMP SMIT panel, select Extended Configuration > Extended Resource Configuration > HACMP Extended Resource Group Configuration > Show All Resources by Node or Resource Group. You can also run the /usr/es/sbin/cluster/utilities/clRGinfo command to see the resource group information, as shown in the example after the following note.

Note: If cluster configuration problems arise after running the cluster verification utility, do not run C-SPOC commands in this environment, as they may fail to execute on cluster nodes.
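A minimal example of running the clRGinfo utility directly; the exact output format varies by HACMP level, so treat this as a sketch rather than a definitive transcript:

/usr/es/sbin/cluster/utilities/clRGinfo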
In addition to this Configuration Database data, a cluster snapshot also includes output generated by various HACMP and standard AIX commands and utilities. This data includes the current state of the cluster, node, network, and network interfaces as viewed by each cluster node, as well as the state of any running HACMP daemons. The cluster snapshot includes output from the following commands:

cllscf, cllsnw, cllsif, clshowres, df, exportfs, ifconfig, ls, lsfs, lslpp, lslv, lsvg, netstat, no, clchsyncd, cltopinfo
In HACMP 5.1 and up, by default, HACMP no longer collects cluster log files when you create the cluster snapshot, although you can still specify to do so in SMIT. Skipping the logs collection reduces the size of the snapshot and speeds up running the snapshot utility. You can use SMIT to collect cluster log files for problem reporting. This option is available under the Problem Determination Tools > HACMP Log Viewing and Management > Collect Cluster log files for Problem Reporting SMIT menu. It is recommended to use this option only if requested by the IBM support personnel. If you want to add commands to obtain site-specific information, create custom snapshot methods as described in the chapter on Saving and Restoring Cluster Configurations in the Administration Guide. Note that you can also use the AIX snap -e command to collect HACMP cluster data, including the hacmp.out and clstrmgr.debug log files.
This section contains the HACMP Configuration Database object classes in generic AIX ODM stanza format. The characters <ODM identify the start of this section; the characters </ODM identify the end of this section.
The following is an excerpt from a sample cluster snapshot Configuration Database data file showing some of the ODM stanzas that are saved:
<VER
1.0
</VER
<DSC
My Cluster Snapshot
</DSC
<ODM
HACMPcluster:
    id = 1106245917
    name = "HA52_TestCluster"
    nodename = "mynode"
    sec_level = Standard
    sec_level_msg =
    sec_encryption =
    sec_persistent =
    last_node_ids =
    highest_node_id = 0
    last_network_ids =
    highest_network_id = 0
    last_site_ides =
    highest_site_id = 0
    handle = 1
    cluster_version = 7
    reserved1 = 0
    reserved2 = 0
    wlm_subdir =
    settling_time = o
    rg_distribution_policy = node
    noautoverification = 0
    clvernodename =
    clverhour = 0

HACMPnode:
    name = mynode
    object = VERBOSE_LOGGING
    value = high
.
.
</ODM
Cluster State Information File (.info)

This file contains the output from standard AIX and HACMP system management commands. This file is given the same user-defined basename with the .info file extension. If you defined custom snapshot methods, the output from them is appended to this file. The Cluster State Information file contains three sections:

Version section: This section identifies the version of the cluster snapshot. The characters <VER identify the start of this section; the characters </VER identify the end of this section. The cluster snapshot software sets this section.

Description section: This section contains user-defined text that describes the cluster snapshot. You can specify up to 255 characters of descriptive text. The characters <DSC identify the start of this section; the characters </DSC identify the end of this section.

The third section contains the output generated by AIX and HACMP ODM commands. This section lists the commands executed and their associated output. This section is not delimited in any way.
To list only the active (varied on) volume groups in the system, use the lsvg -o command as follows:
lsvg -o
To list all logical volumes in the volume group, and to check the volume group status and attributes, use the lsvg -l command and specify the volume group name as shown in the following example:
lsvg -l rootvg
Note: The volume group must be varied on to use the lsvg -l command.

You can also use HACMP SMIT to check for inconsistencies: the System Management (C-SPOC) > HACMP Logical Volume Management > Shared Volume Groups option displays information about shared volume groups in your cluster.
vg state could be active (if it is active varyon), or passive only (if it is passive varyon). vg mode could be concurrent or enhanced concurrent.
The first column of the display shows the logical name of the disk. The second column lists the physical volume identifier of the disk. The third column lists the volume group (if any) to which it belongs. Note that on each cluster node, AIX can assign different names (hdisk numbers) to the same physical volume. To tell which names correspond to the same physical volume, compare the physical volume identifiers listed on each node.
If you specify the logical device name of a physical volume (hdiskx) as an argument to the lspv command, it displays information about the physical volume, including whether it is active (varied on). For example:
lspv hdisk2
PHYSICAL VOLUME:    hdisk2                   VOLUME GROUP:     abalonevg
PV IDENTIFIER:      0000301919439ba5         VG IDENTIFIER:    00003019460f63c7
PV STATE:           active                   VG STATE:         active/complete
STALE PARTITIONS:   0                        ALLOCATABLE:      yes
PP SIZE:            4 megabyte(s)            LOGICAL VOLUMES:  2
TOTAL PPs:          203 (812 megabytes)      VG DESCRIPTORS:   2
FREE PPs:           192 (768 megabytes)
USED PPs:           11 (44 megabytes)
FREE DISTRIBUTION:  41..30..40..40..41
USED DISTRIBUTION:  00..11..00..00..00
If a physical volume is inactive (not varied on, as indicated by question marks in the PV STATE field), use the appropriate command for your configuration to vary on the volume group containing the physical volume. Before doing so, however, you may want to check the system error report to determine whether a disk problem exists. Enter the following command to check the system error report:
errpt -a|more
You can also use the lsdev command to check the availability or status of all physical volumes known to the system.
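For example, to list all disk devices known to the system together with their status (Available or Defined):

lsdev -Cc disk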
Use the lslv logicalvolume command to display information about the state (opened or closed) of a specific logical volume, as indicated in the LV STATE field. For example:
lslv nodeAlv
LOGICAL VOLUME:     nodeAlv                  VOLUME GROUP:   nodeAvg
LV IDENTIFIER:      00003019460f63c7.1       PERMISSION:     read/write
VG STATE:           active/complete          LV STATE:       opened/syncd
TYPE:               jfs                      WRITE VERIFY:   off
MAX LPs:            128                      PP SIZE:        4 megabyte(s)
COPIES:             1                        SCHED POLICY:   parallel
LPs:                10                       PPs:            10
STALE PPs:          0                        BB POLICY:      relocatable
INTER-POLICY:       minimum                  RELOCATABLE:    yes
INTRA-POLICY:       middle                   UPPER BOUND:    32
MOUNT POINT:        /nodeAfs                 LABEL:          /nodeAfs
MIRROR WRITE CONSISTENCY: on
EACH LP COPY ON A SEPARATE PV ?: yes
If a logical volume state is inactive (or closed, as indicated in the LV STATE field), use the appropriate command for your configuration to vary on the volume group containing the logical volume.
Use the cl_lsfs command to list file system information when running the C-SPOC utility.
Determine whether and where the file system is mounted, then compare this information against the HACMP definitions to note any differences.
Check the %used column for file systems that are using more than 90% of their available space. Then check the free column to determine the exact amount of free space left.
Important: For file systems to be NFS exported, be sure to verify that logical volume names for these file systems are consistent throughout the cluster.
For file systems controlled by HACMP, this error message typically does not indicate a problem. The file system check fails because the volume group on which the file system is defined is not varied on at boot time. To avoid generating this message, edit the /etc/filesystems file to ensure that the stanzas for the shared file systems do not include the check=true attribute.
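A hedged illustration of a shared file system stanza in /etc/filesystems with the check attribute disabled; the file system, logical volume, and log device names are placeholders, and the vfs type depends on how the file system was created:

/sharedfs:
        dev             = /dev/sharedlv
        vfs             = jfs2
        log             = /dev/sharedloglv
        mount           = false
        check           = false
        account         = false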
- Use the netstat command to make sure that the network interfaces are initialized and that a communication path exists between the local node and the target node.
- Use the ping command to check the point-to-point connectivity between nodes.
- Use the ifconfig command on all network interfaces to detect bad IP addresses, incorrect subnet masks, and improper broadcast addresses.
- Scan the /var/hacmp/log/hacmp.out file to confirm that the /etc/rc.net script has run successfully. Look for a zero exit status.
- If IP address takeover is enabled, confirm that the /etc/rc.net script has run and that the service interface is on its service address and not on its base (boot) address.
- Use the lssrc -g tcpip command to make sure that the inetd daemon is running.
- Use the lssrc -g portmap command to make sure that the portmapper daemon is running.
- Use the arp command to make sure that the cluster nodes are not using the same IP or hardware address.
- Use the netstat command to:
  - Show the status of the network interfaces defined for a node
  - Determine whether a route from the local node to the target node is defined.

The netstat -in command displays a list of all initialized interfaces for the node, along with the network to which that interface connects and its IP address. You can use this command to determine whether the service and standby interfaces are on separate subnets. The subnets are displayed in the Network column.
netstat -in
Name  Mtu   Network       Address          Ipkts    Ierrs  Opkts  Oerrs  Coll
lo0   1536  <Link>                         18406    0      18406  0      0
lo0   1536  127           127.0.0.1        18406    0      18406  0      0
en1   1500  <Link>                         1111626  0      58643  0      0
en1   1500  100.100.86.   100.100.86.136   1111626  0      58643  0      0
en0   1500  <Link>                         943656   0      52208  0      0
en0   1500  100.100.83.   100.100.83.136   943656   0      52208  0      0
tr1   1492  <Link>                         1879     0      1656   0      0
tr1   1492  100.100.84.   100.100.84.136   1879     0      1656   0      0
Look at the first, third, and fourth columns of the output. The Name column lists all the interfaces defined and available on this node. Note that an asterisk preceding a name indicates the interface is down (not ready for use). The Network column identifies the network to which the interface is connected (its subnet). The Address column identifies the IP address assigned to the node. The netstat -rn command indicates whether a route to the target node is defined. To see all the defined routes, enter:
netstat -rn
Tree for Protocol Family 2:
127.0.0.1        U    3   1436    lo0
127.0.0.1        UH   0   456     lo0
100.100.83.136   U    6   18243   en0
100.100.84.136   U    1   1718    tr1
100.100.85.136   U    2   1721    tr0
100.100.86.136   U    8   21648   en1
100.100.100.136  U    0   39      en0
Tree for Protocol Family 6:
The same test, run on a system that does not have this route in its routing table, returns no response. If the service and standby interfaces are separated by a bridge, router, or hub and you experience problems communicating with network devices, the devices may not be set to handle two network segments as one physical network. Try testing the devices independent of the configuration, or contact your system administrator for assistance. Note that if you have only one interface active on a network, the Cluster Manager will not generate a failure event for that interface. For more information, see the section on network interface events in the Planning Guide. See the netstat man page for more information on using this command.
Type Control-C to end the display of packets. The following statistics appear:
----testcluster.nodeA.com PING Statistics----
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 1/1/2 ms
The ping command sends packets to the specified node, requesting a response. If a correct response arrives, the ping command prints a message similar to the output shown above indicating no lost packets, which indicates a valid connection between the nodes. If the ping command hangs, there is no valid path between the node issuing the ping command and the node you are trying to reach; it could also indicate that required TCP/IP daemons are not running. Check the physical connection between the two nodes. Use the ifconfig and netstat commands to check the configuration. A bad value message indicates problems with the IP addresses or subnet definitions.

Note that if DUP! appears at the end of the ping response, the ping command has received multiple responses for the same address. This typically occurs when network interfaces have been misconfigured, or when a cluster event fails during IP address takeover. Check the configuration of all interfaces on the subnet to verify that there is only one interface per address. For more information, see the ping man page.

In addition, you can assign a persistent node IP label to a cluster network on a node. When, for administrative purposes, you want to reach a specific node in the cluster using the ping or telnet commands without worrying whether a service IP label you are using belongs to any of the resource groups present on that node, it is convenient to use a persistent node IP label defined on that node. For more information on how to assign persistent node IP labels on the networks in your cluster, see the Planning Guide and Chapter 4: Configuring HACMP Cluster Topology and Resources (Extended) in the Administration Guide.
The ifconfig command displays two lines of output. The first line shows the interface's name and characteristics. Check for these characteristics:

UP: The interface is ready for use. If the interface is down, use the ifconfig command to initialize it. For example:

ifconfig en0 up

If the interface does not come up, replace the interface cable and try again. If it still fails, use the diag command to check the device.

RUNNING: The interface is working. If the interface is not running, the driver for this interface may not be properly installed, or the interface is not properly configured. Review all the steps necessary to install this interface, looking for errors or missed steps.
The second line of output shows the IP address and the subnet mask (written in hexadecimal). Check these fields to make sure the network interface is properly configured. See the ifconfig man page for more information.
This output shows what the host node currently believes to be the IP and MAC addresses for nodes flounder, cod, seahorse, and pollock. (If IP address takeover occurs without Hardware Address Takeover, the MAC address associated with the IP address in the host's arp cache may become outdated. You can correct this situation by refreshing the host's arp cache.) See the arp man page for more information.
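One hedged way to refresh a stale entry is to delete it from the cache and let the next ping repopulate it (flounder is one of the example hosts above):

arp -d flounder
ping -c 1 flounder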
- netstat -n shows the aliases
- clstrmgr.debug shows an IP Alias address when it is mapped to an interface.
The ATM devices atm1 and atm2 have connected to the ATM switch and retrieved its address, 39.99.99.99.99.99.99.0.0.99.99.1.1. This address appears in the first 13 bytes of the two clients, at0 and at2. The clients have successfully registered with their corresponding Classic IP server: server_10_50_111 for at0 and server_10_50_110 for at2. The two clients are able to communicate with other clients on the same subnet. (The clients for at0, for example, are stby_1A and stby_1C.)

Example 2

If the connection between an ATM device and the switch is not functional on the ATM layer, the output of the arp command looks as follows:
arp -t atm -a
SVC - at0 on device atm2
==========================
at0(10.50.111.4) 8.0.5a.99.a6.9b.0.0.0.0.0.0.0.0.0.0.0.0.0.0
Here the MAC address of ATM device atm2, 8.0.5a.99.a6.9b, appears as the first six bytes of the ATM address for interface at0. The ATM device atm2 has not registered with the switch, since the switch address does not appear as the first part of the ATM address of at0.
Check the serial line between each pair of nodes. If you are using Ethernet:
- Use the diag command to verify that the network interface card is good.
- Ethernet adapters for the IBM System p can be used either with the transceiver that is on the card or with an external transceiver. There is a jumper on the NIC to specify which you are using. Verify that your jumper is set correctly.
- Make sure that hub lights are on for every connected cable.

If you are using Token-Ring:
- Use the diag command to verify that the NIC and cables are good.
- Make sure that all the nodes in the cluster are on the same ring.
- Make sure that the ringspeed is set to the same value for all NICs.
To review HACMP network requirements, see Chapter 3: Planning Cluster Network Connectivity in the Planning Guide.
For example, if the standard SCSI adapters use IDs 5 and 6, assign values from 0 through 4 to the other devices on the bus. You may want to set the SCSI IDs of the adapters to 5 and 6 to avoid a possible conflict when booting one of the systems in service mode from a mksysb tape or other boot devices, since this will always use an ID of 7 as the default. If the SCSI adapters use IDs of 14 and 15, assign values from 3 through 13 to the other devices on the bus. Refer to your worksheet for the values previously assigned to the adapters.

You can check the SCSI IDs of adapters and disks using either the lsattr or lsdev command. For example, to determine the SCSI ID of the adapter scsi1 (SCSI-3), use the following lsattr command and specify the logical name of the adapter as an argument:
lsattr -E -l scsi1 | grep id
Do not use wildcard characters or full pathnames on the command line for the device name designation. Important: If you restore a backup of your cluster configuration onto an existing system, be sure to recheck or reset the SCSI IDs to avoid possible SCSI ID conflicts on the shared bus. Restoring a system backup causes adapter SCSI IDs to be reset to the default SCSI ID of 7. If you note a SCSI ID conflict, see the Planning Guide for information about setting the SCSI IDs on disks and disk adapters. To determine the SCSI ID of a disk, enter:
lsdev -Cc disk -H
dhb_read tests connectivity for a disk heartbeating network. For information about dhb_read, see the RSCT Command for Testing Disk Heartbeating section in Appendix C: HACMP for AIX Commands in the Administration Guide.
- clip_config provides information about devices discovered for disk heartbeating.
- lssrc -ls topsvcs shows network activity.
If a device that is expected to appear in a picklist does not, view the clip_config file to see what information was discovered.
$ cat /usr/es/sbin/cluster/etc/config/clip_config | grep diskhb
nodeA:15#Serial#(none)#0#/0#0##0#0.0.0.0#hdisk1#hdisk1#DE:AD:BE:EF#(none)##diskhb#public#0#0002409f07346b43
nodeB:15#Serial#(none)#0#/0#0##0#0.0.0.0#hdisk1#hdisk1#DE:AD:BE:EF#(none)##diskhb#public#0#0002409f07346b43
For more information see the section Decreasing Node Fallover Time in Chapter 3: Planning Cluster Network Connectivity in the Planning Guide.
The following potential HACMP installation issues are described here:
Cannot Find File System at Boot Time
cl_convert Does Not Run Due to Failed Installation
Configuration Files Could Not Be Merged During Installation
Solution
For file systems controlled by HACMP, this error typically does not indicate a problem. The file system check failed because the volume group on which the file system is defined is not varied on at boot time. To prevent the generation of this message, edit the /etc/filesystems file to ensure that the stanzas for the shared file systems do not include the check=true attribute.
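For reference, a shared file system stanza in /etc/filesystems might look similar to the following after the change; the file system, logical volume, and log device names shown here are hypothetical:
/sharedfs:
        dev             = /dev/sharedlv
        vfs             = jfs2
        log             = /dev/loglv01
        mount           = false
        check           = false
        options         = rw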
Root user privilege is required to run cl_convert. WARNING: Before converting to HACMP 5.4.1, be sure that your ODMDIR environment variable is set to /etc/es/objrepos. For information on cl_convert flags, refer to the cl_convert man page.
Solution
As part of the HACMP installation process, copies of HACMP files that could potentially contain site-specific modifications are saved in the /usr/lpp/save.config directory before they are overwritten. As the message states, you must merge site-specific configuration information into the newly installed files.
HACMP Startup Issues

The following potential HACMP startup issues are described here:
ODMPATH Environment Variable Not Set Correctly
clinfo Daemon Exits after Starting
Node Powers Down; Cluster Manager Will Not Start
configchk Command Returns an Unknown Host Message
Cluster Manager Hangs during Reconfiguration
clcomdES and clstrmgrES Fail to Start on Newly Installed AIX Nodes
Pre- or Post-Event Does Not Exist on a Node after Upgrade
Node Fails During Configuration with 869 LED Display
Node Cannot Rejoin Cluster after Being Dynamically Removed
Resource Group Migration Is Not Persistent after Cluster Startup
SP Cluster Does Not Startup after Upgrade to HACMP 5.4.1
Solution
HACMP has a dependency on the location of certain ODM repositories to store configuration data. The ODMPATH environment variable allows ODM commands and subroutines to query locations other than the default location if the queried object does not reside in the default location. You can set this variable, but it must include the default location, /etc/objrepos, or the integrity of configuration information may be lost.
This indicates that the /etc/hosts file on node x does not contain an entry for your node.
Solution
Before starting the HACMP software, ensure that the /etc/hosts file on each node includes the service and boot IP labels of each cluster node.
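For example, a minimal set of entries on a two-node cluster might look like the following; all labels and addresses shown here are hypothetical:
192.168.10.1    nodea_boot
192.168.10.2    nodeb_boot
192.168.20.10   app_svc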
Problem
An event script has failed.
Solution
Determine why the script failed by examining the /var/hacmp/log/hacmp.out file to see what process exited with a non-zero status. The error messages in the /var/hacmp/adm/cluster.log file may also be helpful. Fix the problem identified in the log file. Then run the clruncmd command, either at the command line or by using the SMIT Problem Determination Tools > Recover From HACMP Script Failure panel. The clruncmd command signals the Cluster Manager to resume cluster processing.
WARNING: You must stop cluster services on the node before removing it from the cluster. The -R flag removes the HACMP entry in the /etc/inittab file, preventing cluster services from being automatically started when the node is rebooted.
2. Remove the HACMP entry from the rc.net file using the following command:
clchipat false
3. Remove the cluster definition from the node's Configuration Database using the following command:
clrmclstr
You can also perform this task by selecting Extended Configuration > Extended Topology Configuration > Configure an HACMP Cluster > Remove an HACMP Cluster from the SMIT panel.
Ownership. All HACMP ODM files are owned by user root and group hacmp. In addition, all HACMP binaries that are intended for use by non-root users are also owned by user root and group hacmp.

Permissions. All HACMP ODM files, except for the hacmpdisksubsystem file with 600 permissions, are set with 640 permissions (readable by user root and group hacmp, writable by user root). All HACMP binaries that are intended for use by non-root users are installed with 2555 permissions (readable and executable by all users, with the setgid bit turned on so that the program runs as group hacmp).
During the installation, HACMP creates the group hacmp on all nodes if it does not already exist. By default, group hacmp has permission to read the HACMP ODMs, but does not have any other special authority. For security reasons, it is recommended not to expand the authority of group hacmp.
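To spot-check the ownership and permissions described above, assuming the HACMP ODM files reside in the default /etc/es/objrepos repository, a command such as the following can be used:
ls -l /etc/es/objrepos/HACMP*
The files should show owner root, group hacmp, and 640 (rw-r-----) permissions, except for hacmpdisksubsystem, which should show 600 (rw-------).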
If you use programs that access the HACMP ODMs directly, you may need to rewrite them if they are intended to be run by non-root users:
All access to the ODM data by non-root users should be handled via the provided HACMP utilities.

In addition, if you are using the PSSP File Collections facility to maintain the consistency of /etc/group, the new group hacmp that is created at installation time on the individual cluster nodes may be lost when the next file synchronization occurs. There are two possible solutions to this problem. Take one of the following actions before installing HACMP 5.4.1:
a. Turn off PSSP File Collections synchronization of /etc/group, or
b. Ensure that group hacmp is included in the master /etc/group file and ensure that the change is propagated to all cluster nodes.
Disk and File System Issues

The following potential disk and file system issues are described here:
AIX Volume Group Commands Cause System Error Reports
Verification Fails on Clusters with Disk Heartbeating Networks
varyonvg Command Fails on a Volume Group
cl_nfskill Command Fails
cl_scdiskreset Command Fails
fsck Command Fails at Boot Time
System Cannot Mount Specified File Systems
Cluster Disk Replacement Process Fails
Automatic Error Notification Fails with Subsystem Device Driver
Solution 1
Ensure that the volume group is not set to autovaryon on any node and that the volume group (unless it is in concurrent access mode) is not already varied on by another node. The lsvg -o command can be used to determine whether the shared volume group is active. Enter:
lsvg volume_group_name
on the node that has the volume group activated, and check the AUTO ON field to determine whether the volume group is automatically set to be on. If AUTO ON is set to yes, correct this by entering
chvg -an volume_group_name
Problem 2
The volume group information on disk differs from that in the Device Configuration Data Base.
Solution 2
Correct the Device Configuration Data Base on the nodes that have incorrect information:
1. Use the smit exportvg fastpath to export the volume group information. This step removes the volume group information from the Device Configuration Data Base.
2. Use the smit importvg fastpath to import the volume group. This step creates a new Device Configuration Data Base entry directly from the information on disk. After importing, be sure to change the volume group to not autovaryon at the next system boot.
3. Use the SMIT Problem Determination Tools > Recover From HACMP Script Failure panel to issue the clruncmd command to signal the Cluster Manager to resume cluster processing.
Problem 3
The HACMP software indicates that the varyonvg command failed because the volume group could not be found.
Solution 3
The volume group is not defined to the system. If the volume group has been newly created and exported, or if a mksysb system backup has been restored, you must import the volume group. Follow the steps described in Problem 2 to verify that the correct volume group name is being referenced.
Problem 4
The HACMP software indicates that the varyonvg command failed because the logical volume <name> is incomplete.
Solution 4
This indicates that the forced varyon attribute is configured for the volume group in SMIT, and that when attempting a forced varyon operation, HACMP did not find a single complete copy of the specified logical volume for this volume group. It is also possible that you requested a forced varyon operation but did not specify the super strict allocation policy for the mirrored logical volumes. In this case, the success of the varyon command is not guaranteed. For more information on the forced varyon functionality, see the
chapter Planning Shared LVM Components in the Planning Guide and the Forcing a Varyon of Volume Groups section in the chapter on Configuring HACMP Resource Groups (Extended) in the Administration Guide.
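The SMIT fastpaths in Solution 2 correspond to AIX commands that can also be run directly. The following is a sketch that assumes a shared volume group named sharedvg whose disks include hdisk3; both names are hypothetical:
exportvg sharedvg
importvg -y sharedvg hdisk3
chvg -an sharedvg
varyoffvg sharedvg
The importvg command varies the volume group on; the final chvg -an and varyoffvg leave it set to not autovaryon and offline so that HACMP can manage it.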
Solution
For file systems controlled by HACMP, this message typically does not indicate a problem. The file system check fails because the volume group defining the file system is not varied on. The boot procedure does not automatically vary on HACMP-controlled volume groups. To prevent this message, make sure that the file systems under HACMP control do not have the check=true attribute in their /etc/filesystems stanzas. To delete this attribute or change it to check=false, edit the /etc/filesystems file.
When trying to mount the file systems, the HACMP software tries to get the required information about the logical volume name from the old log name. Because this information has not been updated, the file systems cannot be mounted.
Solution
Be sure to update the /etc/filesystems file after making changes to logical volume names.
Network and Switch Issues

The following potential network and switch issues are described here:
Unexpected Network Interface Failure in Switched Networks
Cluster Nodes Cannot Communicate
Distributed SMIT Causes Unpredictable Results
Token-Ring Network Thrashes
System Crashes Reconnecting MAU Cables after a Network Failure
TMSCSI Will Not Properly Reintegrate when Reconnecting Bus
Recovering from PCI Hot Plug NIC Failure
Unusual Cluster Events Occur in Non-Switched Environments
Cannot Communicate on ATM Classic IP Network
Cannot Communicate on ATM LAN Emulation Network
IP Label for HACMP Disconnected from AIX Interface
TTY Baud Rate Setting Wrong
First Node Up Gives Network Error Message in hacmp.out
Network Interface Card and Network ODMs Out of Sync with Each Other
Non-IP Network, Network Adapter or Node Failures
Networking Problems Following HACMP Fallover
Packets Lost during Data Transmission
Verification Fails when Geo Networks Uninstalled
Missing Entries in the /etc/hosts for the netmon.cf File May Prevent RSCT from Monitoring Networks
Troubleshooting VLANs
Problem
Interface failures in Virtual LAN (VLAN) networks.
Solution
To troubleshoot VLAN interfaces defined to HACMP and detect an interface failure, consider these interfaces as interfaces defined on single adapter networks.
For information on single adapter networks and the use of the netmon.cf file, see Missing Entries in the /etc/hosts for the netmon.cf File May Prevent RSCT from Monitoring Networks. In particular, list the network interfaces that belong to a VLAN in the ping_client_list variable in the /usr/es/sbin/cluster/etc/clinfo.rc script and run clinfo. This way, whenever a cluster event occurs, clinfo monitors and detects a failure of the listed network interfaces. Due to the nature of Virtual Local Area Networks, other mechanisms to detect the failure of network interfaces are not effective.
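As a sketch only, the list is set inside /usr/es/sbin/cluster/etc/clinfo.rc, for example as follows; the interface labels are hypothetical, and the exact variable name (it appears as PING_CLIENT_LIST in some versions of the script) should be verified in your copy:
PING_CLIENT_LIST="vlan_if_label1 vlan_if_label2"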
where x identifies the particular tmscsi device. If the status is not Available, run the cfgmgr command and check again.
Cluster unable to form, either all or some of the time
swap_adapter pairs
swap_adapter, immediately followed by a join_standby
fail_standby and join_standby pairs
These events occur when ARP packets are delayed or dropped. This is correct and expected HACMP behavior, as HACMP is designed to depend on core protocols strictly adhering to their related RFCs. For a review of basic HACMP network requirements, see the Planning Guide.
Solution
The following implementations may reduce or circumvent these events:
Increase the Failure Detection Rate (FDR) so that it exceeds the ARP retransmit time of 15 seconds. Typical values have been calculated as follows:
FDR = (2+ x 15 seconds) + >5 seconds = 35+ seconds (usually 45-60 seconds)
where 2+ is a number greater than one, to allow multiple ARP requests to be generated. This is required so that at least one ARP response will be generated and received before the FDR time expires and the network interface is temporarily marked down, then immediately marked back up. Keep in mind, however, that the true fallover is delayed for the value of the FDR.

Increase the ARP queue depth. If you increase the queue, requests that are dropped or delayed will be masked until network congestion or network quiescence (inactivity) makes this problem evident. (See the example after the note below.)

Use a dedicated switch, with all protocol optimizations turned off. Segregate it into a physical LAN segment and bridge it back into the enterprise network.

Use permanent ARP entries (IP address to MAC address bindings) for all network interfaces. These values should be set at boot time; since none of the ROM MAC addresses are used, replacing network interface cards will be invisible to HACMP.
Note: The above four items simply describe how some customers have customized their unique enterprise network topology to provide the classic protocol environment (strict adherence to RFCs) that HACMP requires. IBM cannot guarantee HACMP will work as expected in these approaches, since none addresses the root cause of the problem. If your network topology requires consideration of any of these approaches, please contact the IBM Consult Line for assistance.
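To inspect or raise the ARP tunables mentioned in the second approach above, the AIX no command can be used; the value shown here is purely illustrative and is not necessarily preserved across a reboot:
no -a | grep arp
no -o arpqsize=12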
In the example above, the client at0 is operational. It has registered with its server, server_10_50_111. The client at1 is not operational, since it could not resolve the address of its Classic IP server, which has the hardware address 39.99.99.99.99.99.99.0.0.99.99.1.1.88.88.88.88.a0.11.0. However, the ATM layer is functional, since the 20-byte ATM address that has been constructed for the client at1 is correct. The first 13 bytes are the switch address, 39.99.99.99.99.99.99.0.0.99.99.1.1. For client at3, the connection between the underlying device atm2 and the ATM switch is not functional, as indicated by the failure to construct the 20-byte ATM address of at3. The first 13 bytes do not correspond to the switch address, but contain the MAC address of the ATM device corresponding to atm2 instead.
In the example above, the client is operational, as indicated by the Running flag. If the client had failed to register with its configured LAN Emulation Server, the Running flag would not appear; instead, the Limbo flag would be set. If the connection of the underlying device atm# were not functional on the ATM layer, the first 13 bytes of the local ATM address would not contain the address of the ATM switch.
3. Switch-specific configuration limitations: Some ATM switches do not allow more than one client belonging to the same ELAN and configured over the same ATM device to register with the LAN Emulation Server at the same time. If this limitation holds and two clients are configured, the following are typical symptoms.
Cyclic occurrence of events indicating network interface failures, such as fail_standby, join_standby, and swap_adapter. This is a typical symptom if two such clients are configured as cluster network interfaces. The client which first succeeds in registering with the LES will hold the connection for a specified, configuration-dependent duration. After it times out, the other client succeeds in establishing a connection with the server; hence the cluster network interface configured on it will be detected as alive, and the former as down.
Sporadic events indicating a network interface failure (fail_standby, join_standby, and swap_adapter). If one client is configured as a cluster network interface and the other outside, this configuration error may go unnoticed if the client on which the cluster network interface is configured manages to register with the switch and the other client remains inactive. The second client may succeed at registering with the server at a later moment, and a failure would then be detected for the cluster network interface configured over the first client.
Whether the network is functional or not, the RSCT topology services heartbeat interval expires, resulting in the logging of the above error message. This message is only relevant to non-IP networks (such as RS232, TMSCSI, TMSSA). This behavior does not occur for disk heartbeating networks (for which network_down events are not logged in general). Solution Ignore the message and let the cluster services continue to function. You should see this error message corrected in a healthy cluster as functional network communication is eventually established between other nodes in the cluster. A network_up event will be run after the second node that has an interface on this network joins the cluster. If cluster communication is not established after this error message, then the problem should be diagnosed in other sections of this guide that discuss network issues.
Network Interface Card and Network ODMs Out of Sync with Each Other
Problem
In some situations, it is possible for the HACMPadapter or the HACMPnetwork ODMs to become out of sync with the AIX ODMs. For example, HACMP may refer to an Ethernet network interface card while AIX refers to a Token-Ring network interface card.
Note: This type of out-of-sync condition only occurs as a result of the following situations:
The hardware settings have been adjusted after the HACMP cluster has been successfully configured and synchronized, or
The wrong values were selected when configuring predefined communication interfaces to HACMP.
Solution
Run cluster verification to detect and report the following network and network interface card type incompatibilities:
The network interface card configured in HACMP is the correct one for the node's hardware
The network interface cards configured in HACMP and AIX match each other.
If verification returns an error, examine and adjust the selections made on the Extended Configuration > Extended Topology Configuration > Configuring HACMP Communication Interfaces/Devices > Change/Show Communication Interfaces/Devices SMIT panel. For more information on this screen, see Chapter 4: Configuring HACMP Cluster Topology and Resources (Extended) of the Administration Guide.
Solution
Run the cluster verification utility to ensure that all of the network interface cards on all cluster nodes on the same network have the same setting for MTU size. If the MTU size is inconsistent across the network, an error is displayed, which enables you to determine which nodes to adjust.
Note: You can change the MTU size by using the following command:
chdev -l en0 -a mtu=<new_value>
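To verify the current setting on an interface before and after the change, the device attribute can be queried directly; en0 is an assumed interface name:
lsattr -El en0 -a mtu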
Missing Entries in the /etc/hosts for the netmon.cf File May Prevent RSCT from Monitoring Networks
Problem
Missing entries in the /etc/hosts file for addresses listed in the netmon.cf file may prevent your networks from being properly monitored by the netmon utility of RSCT Topology Services.
Solution
Make sure that each IP address and its corresponding label listed in the netmon.cf file also has an entry in the /etc/hosts file. If the entries are missing, the NIM process of RSCT may be blocked while RSCT attempts to determine the state of the local adapters. In general, we recommend creating the netmon.cf file for cluster configurations with networks that under certain conditions can become single adapter networks. In such networks, it can be difficult for HACMP to accurately determine adapter failure, because RSCT Topology Services cannot force packet traffic over the single adapter to verify its operation. The creation of the netmon.cf file allows RSCT to accurately determine adapter failure. For more information on creating the netmon.cf file, see the Planning Guide.
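As a sketch, the netmon.cf file (by default /usr/es/sbin/cluster/netmon.cf) lists one target IP address or IP label per line, and each entry should also resolve through /etc/hosts. The address and label below are hypothetical.
In netmon.cf:
192.168.1.254
router_a
In /etc/hosts:
192.168.1.254   router_a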
Cluster Communications Issues

The following potential cluster communications issues are described here:
Message Encryption Fails
Cluster Nodes Do Not Communicate with Each Other
For data encryption with DES message authentication: rsct.crypt.des
For data encryption with Triple DES message authentication: rsct.crypt.3des
For data encryption with Advanced Encryption Standard (AES) message authentication: rsct.crypt.aes256
If needed, install these filesets from the AIX Expansion Pack CD-ROM. If the filesets are installed after HACMP is already running, start and stop the HACMP Cluster Communications daemon to enable HACMP to use these filesets. To restart the Cluster Communications daemon:
stopsrc -s clcomdES
startsrc -s clcomdES
If the filesets are present, and you get an encryption error, the encryption filesets may have been installed, or reinstalled, after HACMP was running. In this case, restart the Cluster Communications daemon as described above.
Message authentication, or message authentication and encryption, enabled
Use of persistent IP labels for VPN tunnels.
Solution
Make sure that the network is operational; see the section Network and Switch Issues. Check whether the cluster has persistent IP labels. If it does, make sure that they are configured correctly and that you can ping the IP label.
Make sure that each cluster node has the same setting for message authentication mode. If the modes are different, on each node set the message authentication mode to None and configure message authentication again.
Make sure that each node has the same type of encryption key in the /usr/es/sbin/cluster/etc directory. Encryption keys cannot reside in other directories.
If you have configured use of persistent IP labels for a VPN:
1. Change Use Persistent Labels to No.
2. Synchronize the cluster configuration.
3. Change Use Persistent Labels to Yes.
HACMP Takeover Issues

The following potential HACMP takeover issues are described here:
varyonvg Command Fails During Takeover
Highly Available Applications Fail
Node Failure Detection Takes Too Long
HACMP Selective Fallover Is Not Triggered by a Volume Group Loss of Quorum Error in AIX
Group Services Sends GS_DOM_MERGE_ER Message
cfgmgr Command Causes Unwanted Behavior in Cluster
Releasing Large Amounts of TCP Traffic Causes DMS Timeout
Deadman Switch Causes a Node Failure
Deadman Switch Time to Trigger
A device busy Message Appears after node_up_local Fails
Network Interfaces Swap Fails Due to an rmdev device busy Error
MAC Address Is Not Communicated to the Ethernet Switch
Solution
Check the /var/hacmp/log/hacmp.out file to find the error associated with the varyonvg failure.
List all the volume groups known to the system using the lsvg command; then check that the volume group names used in the HACMPresource Configuration Database object class are correct.
To change a volume group name in the Configuration Database, from the main HACMP SMIT panel select Initialization and Standard Configuration > Configure HACMP Resource Groups > Change/Show Resource Groups, and select the resource group where you want the volume group to be included. Use the Volume Groups or Concurrent Volume Groups fields on the Change/Show Resources and Attributes for a Resource Group panel to set the volume group names.
After you correct the problem, use the SMIT Problem Determination Tools > Recover From HACMP Script Failure panel to issue the clruncmd command to signal the Cluster Manager to resume cluster processing.
Run the cluster verification utility to verify cluster resources.
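One way to cross-check the volume group names, assuming your HACMP version stores them in the HACMPresource object class mentioned above, is:
lsvg
odmget HACMPresource
Compare the volume groups reported by lsvg with the volume group values that appear in the HACMPresource output.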
where nnn is the hostname of the machine the fallover node is masquerading as.
Problem 2
An application that a user manually stopped after stopping cluster services with resource groups placed in an UNMANAGED state does not restart when the node reintegrates.
Solution 2
Check that the relevant application entry in the /usr/es/sbin/cluster/server.status file has been removed prior to node reintegration. Because the /usr/es/sbin/cluster/server.status file lists all applications already running on the node, HACMP will not restart applications that have entries in the server.status file. Deleting the relevant application entry from server.status before reintegration allows HACMP to recognize that the highly available application is not running and that it must be restarted on the node.
HACMP Selective Fallover Is Not Triggered by a Volume Group Loss of Quorum Error in AIX
Problem
HACMP fails to selectively move the affected resource group to another cluster node when a volume group quorum loss occurs.
Solution
If quorum is lost for a volume group that belongs to a resource group on a cluster node, the system checks whether the LVM_SA_QUORCLOSE error appeared in the node's AIX error log file and informs the Cluster Manager to selectively move the affected resource group. HACMP uses this error notification method only for mirrored volume groups with quorum enabled. If fallover does not occur, check that the LVM_SA_QUORCLOSE error appeared in the AIX error log. When the AIX error log buffer is full, new entries are discarded until buffer space becomes available, and an error log entry informs you of this problem. To resolve this issue, increase the size of the AIX error log internal buffer for the device driver. For information about increasing the size of the error log buffer, see the AIX documentation listed in About This Guide.
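To confirm whether the error was logged, and to display or enlarge the error log device driver buffer, commands along the following lines can be used; the buffer size shown is only an example:
errpt -J LVM_SA_QUORCLOSE
/usr/lib/errdemon -l
/usr/lib/errdemon -B 65536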
In clusters consisting of more than two nodes, the decision is based on which partition has the most nodes left in it, and that partition stays up. With an equal number of nodes in each partition (as is always the case in a two-node cluster), the node that remains up is determined by the node number (the lowest node number in the cluster remains up), which is also generally the first in alphabetical order. Group Services domain merge messages indicate that a node isolation problem was handled to keep the resources as highly available as possible, giving you time to later investigate the problem and its cause. When a domain merge occurs, Group Services and the Cluster Manager exit. The clstrmgr.debug file will contain the following error:
"announcementCb: GRPSVCS announcement code=n; exiting" "CHECK FOR FAILURE OF RSCT SUBSYSTEMS (topsvcs or grpsvcs)"
still being marketed, regardless of memory size. Without these changes, the chances of a DMS timeout can be high in these specific environments, especially those with the minimum memory size.

For database environments, these suggestions should be modified. If JFS files are being used for database tables, then watching minfree still applies, but maxfree could be just minfree + (8 x the number of memory pools). If raw logical volumes are being used, the concerns about minfree/maxfree do not apply, but the following suggestion about maxperm is relevant.

In any environment (HA or otherwise) that is seeing non-zero paging rates, it is recommended that maxperm be set lower than the default of ~80%. Use the avm column of vmstat (observed at full load) as an estimate of the number of working storage pages in use, or the number of valid memory pages, and compare it to the system's real memory size, as shown by vmtune, to determine the percentage of real memory occupied by working storage pages. For example, if avm shows as 70% of real memory size, then maxperm should be set to 25% (vmtune -P 25). The basic formula used here is maxperm = 95 - (avm / memory size in pages). If avm is greater than or equal to 95% of memory, then the system is memory constrained. The options at that point are to set maxperm to 5% and incur some paging activity, add additional memory to the system, or reduce the total workload run simultaneously on the system so that avm is lowered.
Solution
Check to see if sysinfod, the SMUX peer daemon, or another process is keeping the device open. If it is sysinfod, restart it using the -H option.

Solution
Check to see whether the following applications are being run on the system. These applications may keep the device busy:
Netview / Netmon
Ensure that the sysmond daemon has been started with a -H flag. This results in opening and closing the network interface each time SM/6000 goes out to read the status, and allows the cl_swap_HW_address script to succeed when executing the rmdev command after the ifconfig detach, before swapping the hardware address. Use the following command to stop all Netview daemons:
/usr/OV/bin/nv6000_smit stopdaemons
Use the following commands to stop NetBIOS and unload NetBIOS streams:
mcsadm stop; mcs0 unload
Some customer applications will keep a device busy. Ensure that the shared applications have been stopped properly.
2. Include on this line the names or IP addresses of at least one client on each subnet on the switched Ethernet. 3. Run clinfoES on all nodes in the HACMP cluster that are attached to the switched Ethernet. If you normally start HACMP cluster services using the /usr/es/sbin/cluster/etc/rc.cluster shell script, specify the -i option. If you normally start HACMP cluster services through SMIT, specify yes in the Start Cluster Information Daemon? field.
Client Issues
The following potential HACMP client issues are described here:
Network Interface Swap Causes Client Connectivity Problem
Clients Cannot Access Applications
Clients Cannot Find Clusters
Clinfo Does Not Appear to Be Running
Clinfo Does Not Report That a Node Is Down
Solution
Issue a ping command to the client from a cluster node to update the client's ARP cache. Be sure to include the client name as the argument to this command. The ping command will update a client's ARP cache even if the client is not running clinfoES. You may need to add a call to the ping command in your application's pre- or post-event processing scripts to automate this update on specific clients. Also consider using hardware address swapping, since it maintains the configured hardware-to-IP address mapping within your cluster.
Solution
Create an updated client-based clhosts file by running verification with automatic corrective actions enabled. This produces a clhosts.client file on the server nodes. Copy this file to the /usr/es/sbin/cluster/etc/ directory on the clients, renaming it clhosts. Then run the clstat command.
Miscellaneous Issues
The following non-categorized HACMP issues are described here:
Limited Output when Running the tail -f Command on /var/hacmp/log/hacmp.out
CDE Hangs after IPAT on HACMP Startup
Cluster Verification Gives Unnecessary Message
config_too_long Message Appears
Console Displays SNMP Messages
Device LEDs Flash 888 (System Panic)
Unplanned System Reboots Cause Fallover Attempt to Fail
Deleted or Extraneous Objects Appear in NetView Map
F1 Does not Display Help in SMIT Panels
/usr/es/sbin/cluster/cl_event_summary.txt File (Event Summaries Display) Grows Too Large
View Event Summaries Does Not Display Resource Group Information as Expected
Application Monitor Problems
Cluster Disk Replacement Process Fails
Resource Group Unexpectedly Processed Serially
rg_move Event Processes Several Resource Groups at Once
File System Fails to Unmount
Dynamic Reconfiguration Sets a Lock
WebSMIT Does Not See the Cluster
Problems with WPAR-Enabled Resource Group
Note that if you are investigating resource group movement in HACMP (for instance, investigating why an rg_move event has occurred), always check the /var/hacmp/log/hacmp.out file. In general, given the recent changes in the way resource groups are handled and prioritized in fallover circumstances, the hacmp.out file and its event summaries have become even more important in tracking the activity and resulting location of your resource groups. In addition, with parallel processing of resource groups, the hacmp.out file reports details that will not be seen in the cluster history log or the clstrmgr.debug file. Always check this log early on when investigating resource group movement after takeover activity.
The output of hostname and uname -n must be the same. If the output is different, use uname -S hostname to make the uname output match the output from hostname.
Define an alias for the hostname on the loopback address. This can be done by editing /etc/hosts to include an entry such as:
127.0.0.1 loopback localhost hostname
where hostname is the name of your host. If name serving is being used on the system, edit the /etc/netsvc.conf file so that the local file is checked first when resolving names.
Ensure that the hostname and the service IP label resolve to different addresses. This can be determined by viewing the output of the /bin/host command for both the hostname and the service IP label.
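For example, assuming a hostname of nodea and a service IP label of app_svc (both hypothetical), the checks described above could look like:
/bin/host nodea
/bin/host app_svc
grep hosts /etc/netsvc.conf
The two host commands should return different addresses, and a netsvc.conf entry such as hosts = local, bind causes /etc/hosts to be consulted first.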
Solution
Ignore this message if you have not configured Auto Error Notification.
$event_name is the reconfig event that failed
$argument is the parameter(s) used by the event
$sec is the number of seconds before the message was sent out
In versions prior to HACMP 4.5, config_too_long messages continued to be appended to the hacmp.out file every 30 seconds until action was taken. Starting with version 4.5, for each cluster event that does not complete within the specified event duration time, config_too_long messages are logged in the hacmp.out file and sent to the console according to the following pattern:
The first five config_too_long messages appear in the hacmp.out file at 30-second intervals.
The next set of five messages appears at an interval that is double the previous interval, until the interval reaches one hour.
These messages are logged every hour until the event completes or is terminated on that node.
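As an illustration of this pattern with the default 30-second starting interval: five messages appear 30 seconds apart, then five messages 60 seconds apart, then five at 120-second intervals, and so on (240 seconds, 480 seconds, ...), until the interval reaches one hour, after which one message is logged per hour until the event completes.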
This message could appear in response to the following problems:
Problem
Activities that the script is performing take longer than the specified time to complete; for example, this could happen with events involving many disks or complex scripts.
Solution
Determine what is taking so long to execute, and correct or streamline that process if possible.
Increase the time to wait before calling config_too_long. You can customize Event Duration Time using the Change/Show Time Until Warning panel in SMIT. Access this panel through the Extended Configuration > Extended Event Configuration SMIT panel.
For complete information on tuning event duration time, see the Tuning Event Duration Time Until Warning section in the chapter on Configuring Cluster Events in the Administration Guide.
Problem
A command is hung and an event script is waiting for it before resuming execution. If so, you can probably see the command in the AIX process table (ps -ef). It is most likely the last command in the /var/hacmp/log/hacmp.out file before the config_too_long script output.
Solution
You may need to kill the hung command. See also Dynamic Reconfiguration Sets a Lock.
Both measures prevent a system from rebooting if the shutdown command is issued inadvertently. Without one of these measures in place, if an unplanned reboot occurs the activity against the disks on the rebooting node can prevent other nodes from successfully acquiring the disks.
Since the LANG environment variable determines the active locale, if LANG=en_US, the locale is en_US.
View Event Summaries Does Not Display Resource Group Information as Expected
Problem
In HACMP, event summaries are pulled from the hacmp.out file and can be viewed using the Problem Determination Tools > HACMP Log Viewing and Management > View/Save/Delete Event Summaries > View Event Summaries option in SMIT. This display includes resource group status and location information at the end. The resource group information is gathered by clRGinfo, and may take extra time to display if the cluster is not running when you select the View Event Summaries option.
Solution
clRGinfo displays resource group information more quickly when the cluster is running. If the cluster is not running, wait a few minutes; the resource group information will eventually appear.
This command produces a long line of verbose output if the application is being monitored. If there is no output, the application is not being monitored.
Solution 1
If the application monitor is not running, there may be a number of reasons, including:
No monitor has been configured for the application server
The monitor has not started yet because the stabilization interval has not completed
The monitor is in a suspended state
The monitor was not configured properly
An error has occurred.
Check to see that a monitor has been configured, the stabilization interval has passed, and the monitor has not been placed in a suspended state before concluding that something is wrong. If something is clearly wrong, reexamine the original configuration of the monitor in SMIT and reconfigure as needed.
Problem 2
Application Monitor Does Not Perform Specified Failure Action. The specified failure action does not occur even when an application has clearly failed.
Solution 2
Check the Restart Interval. If it is set too short, the Restart Counter may be reset to zero too quickly, resulting in an endless series of restart attempts and no other action taken.
2. On the specified node, verify that there is a WPAR with the same name as the WPAR-enabled resource group. Use the lswpar <resource group name> command to check this. If there is no WPAR with the specified name, create it using the mkwpar command. After creating a WPAR, make sure that all the user-defined scripts associated with the WPAR-enabled resource group are accessible within the WPAR.
3. Ensure that the file systems on the node are not full. If they are, free up some disk space by moving some files to external storage.
4. Verify that the rsh service is enabled in the corresponding WPAR. This can be done as follows:
Check that the inetd service is running in the WPAR by issuing the following command in the WPAR:
lssrc -s inetd
If the inetd service is not active, then start the service using the startsrc command.
Make sure that rsh is listed as a known service in the /etc/inetd.conf file in the WPAR.
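For example, from within the WPAR (the service name for rshd in /etc/inetd.conf is normally shell; verify this on your system):
lssrc -s inetd
grep -w shell /etc/inetd.conf
If lssrc reports the subsystem as inoperative, start it with startsrc -s inetd.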
Highlighting
The following highlighting conventions are used in this appendix:
Bold         Identifies command words, keywords, files, directories, and other items whose actual names are predefined by the system.
Italics      Identifies parameters whose actual names or values are supplied by the user.
Monospace    Identifies examples of specific data values, examples of text similar to what you may see displayed, examples of program code similar to what you may write as a programmer, messages from the system, or information you should actually type.
Note: Flags listed in syntax diagrams throughout this appendix are those recommended for use with the HACMP for AIX software. Flags used internally by SMIT are not listed.
Utilities
The script utilities are stored in the /usr/es/sbin/cluster/events/utils directory. The utilities described in this chapter are grouped in the following categories:
Disk Utilities
RS/6000 SP Utilities
File System and Volume Group Utilities
Logging Utilities
Network Utilities
Resource Group Move Utilities
Emulation Utilities
Security Utilities
Start and Stop Tape Resource Utilities
Cluster Resource Group Information Commands
Disk Utilities
cl_disk_available
Syntax
cl_disk_available diskname ...
Description
Checks to see if a disk named as an argument is currently available to the system and, if not, makes the disk available.
Parameters
diskname    List of one or more disks to be made available; for example, hdisk1.
Return Values
0    Successfully made the specified disk available.
1    Failed to make the specified disk available.
2    Incorrect or bad arguments were used.
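For example, to make two hypothetical disks available:
cl_disk_available hdisk2 hdisk3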
cl_fs2disk
Syntax
cl_fs2disk -g volume_group
or
cl_fs2disk [-lvip] mount_point
where -l identifies and returns the logical volume, -v returns the volume group, -i returns the physical volume ID, -p returns the physical volume, and -g is the mount point of a file system (given a volume group).
Description
Checks the ODM for the specified logical volume, volume group, physical volume ID, and physical volume information.
Parameters
mount point     Mount point of the file system to check.
volume group    Volume group to check.
Return Values
0    Successfully retrieved file system information.
1    Failed to retrieve file system information.
cl_get_disk_vg_fs_pvids
Syntax
cl_get_disk_vg_fs_pvids [filesystem_list volumegroup_list]
Description
Given file systems and/or volume groups, the function returns a list of the associated PVIDs.
Parameters
filesystem_list     The file systems to check.
volumegroup_list    The volume groups to check.
Return Values
0    Success.
1    Failure.
2    Invalid arguments.
cl_is_array
Syntax
cl_is_array diskname
Description
Checks to see if a disk is a READI disk array.
Parameters
diskname    Single disk to test; for example, hdisk1.
Return Values
0    Disk is a READI disk array.
1    Disk is not a READI disk array.
2    An error occurred.
cl_is_scsidisk
Syntax
cl_is_scsidisk diskname
Description
Determines if a disk is a SCSI disk.
Parameters
diskname    Single disk to test; for example, hdisk1.
Return Values
0    Disk is a SCSI disk.
1    Disk is not a SCSI disk.
2    An error occurred.
cl_raid_vg
Syntax
cl_raid_vg volume_group
Description
Checks to see if the volume group is comprised of RAID disk arrays.
Parameters
volume_group    Single volume group to check.
Return Values
0    Successfully identified a RAID volume group.
1    Could not identify a RAID volume group; volume group must be SSA.
2    An error occurred. Mixed volume group identified.
cl_scdiskreset
Syntax
cl_scdiskreset /dev/diskname ...
Description
Issues a reset (SCSI ioctl) to each SCSI disk named as an argument.
Parameters
/dev/diskname    List of one or more SCSI disks.
Return Values
0     All specified disks have been reset.
-1    No disks have been reset.
n     Number of disks successfully reset.
cl_scdiskrsrv
Syntax
cl_scsidiskrsrv /dev/diskname ...
Description
Reserves the specified SCSI disk.
Parameters
/dev/diskname    List of one or more SCSI disks.
Return Values
0     All specified disks have been reserved.
-1    No disks have been reserved.
n     Number of disks successfully reserved.
cl_sync_vgs
Syntax
cl_sync_vgs -b|f volume_group ...
Description
Attempts to synchronize a volume group by calling syncvg for the specified volume group.
Parameters
volume_group    Volume group list.
-b              Background sync.
-f              Foreground sync.
Return Values
0    Successfully started syncvg for all specified volume groups.
1    The syncvg of at least one of the specified volume groups failed.
2    No arguments were passed.
scdiskutil
Syntax
scdiskutil -t /dev/diskname
Description
Tests and clears any pending SCSI disk status.
Parameters
-t               Tests to see if a unit is ready.
/dev/diskname    Single SCSI disk.
Return Values
-1    An error occurred or no arguments were passed.
0     The disk is not reserved.
>0    The disk is reserved.
ssa_fence
Syntax
ssa_fence -e event pvid
Description Fences a node in or out. Additionally, this command also relies on environment variables; the first node up fences out all other nodes of the cluster regardless of their participation in the resource group. If it is not the first node up, then the remote nodes fence in the node coming up. The node joining the cluster will not do anything. If it is a node_down event, the remote nodes will fence out the node that is leaving. The node leaving the cluster will not do anything. The last node going down clears the fence register.
Environment Variables
POST_EVENT_MEMBERSHIP    Set by the Cluster Manager.
EVENT_ON_NODE            Set by the calling script.
Parameters
-e event    1=up; 2=down.
pvid        Physical volume ID on which fencing will occur.
Return Values
0    Success.
1    Failure. A problem occurred during execution. A message describing the problem is written to stderr and to the cluster log file.
2    Failure. Invalid number of arguments. A message describing the problem is written to stderr and to the cluster log file.
ssa_clear
Syntax
ssa_clear -x | -d pvid
Description
Clears or displays the contents of the fence register. If -d is used, a list of fenced-out nodes is displayed. If -x is used, the fence register is cleared.
Note: This command exposes the data integrity of a disk by unconditionally clearing its fencing register. It requires adequate operator controls and warnings, and should not be included within any takeover script.
Return Values
0    Success.
1    Failure. A problem occurred during execution. A message describing the problem is written to stderr and to the cluster log file.
2    Failure. Invalid number of arguments. A message describing the problem is written to stderr and to the cluster log file.
ssa_clear_all
Syntax
ssa_clear_all pvid1, pvid2 ...
Description
Clears the fence register on multiple physical volumes.
Return Values
0    Success.
1    Failure. A problem occurred during execution. A message describing the problem is written to stderr and to the cluster log file.
2    Failure. Invalid number of arguments. A message describing the problem is written to stderr and to the cluster log file.
ssa_configure
Syntax
ssa_configure
Description
Assigns unique node IDs to all the nodes of the cluster. Then it configures and unconfigures all SSA pdisks and hdisks on all nodes, thus activating SSA fencing. This command is called from the SMIT panel during the sync of a node environment. If this command fails for any reason, that node should be rebooted.
Return Values
0    Success.
1    Failure. A problem occurred during execution. A message describing the problem is written to stderr and to the cluster log file.
RS/6000 SP Utilities
cl_swap_HPS_IP_address
Syntax
cl_swap_HPS_IP_address [cascading rotating] [action] interface address old_address netmask
Description
This script is used to specify an alias address to an SP Switch interface, or remove an alias address, during IP address takeover. Note that adapter swapping does not make sense for the SP Switch, since all addresses are alias addresses on the same network interface.
Parameters
action               acquire or release.
IP label behavior    rotating/cascading. Select rotating if an IP label should be placed on a boot interface; select cascading if an IP label should be placed on a backup interface on a takeover node.
interface            The name of the interface.
address              New alias IP address.
old_address          Alias IP address you want to change.
netmask              Netmask.
Return Values
0    Success.
1    The network interface could not be configured (using the ifconfig command) at the specified address.
2    Invalid syntax.
File System and Volume Group Utilities
cl_activate_fs
Syntax
cl_activate_fs /filesystem_mount_point ...
Description
Mounts the file systems passed as arguments.
Parameters
/filesystem_mount_point    A list of one or more file systems to mount.
Return Values
0    All file systems named as arguments were either already mounted or were successfully mounted.
1    One or more file systems failed to mount.
2    No arguments were passed.
cl_activate_vgs
Syntax
cl_activate_vgs [-n] volume_group_to_activate ...
Description
Initiates a varyonvg of the volume groups passed as arguments.
Parameters
-n                          Do not sync the volume group when varyon is called.
volume_group_to_activate    List of one or more volume groups to activate.
Return Values
0    All of the volume groups are successfully varied on.
1    The varyonvg of at least one volume group failed.
2    No arguments were passed.
cl_deactivate_fs
Syntax
cl_deactivate_fs /filesystem_mount_point ...
Description
Attempts to unmount any file system passed as an argument that is currently mounted.
Parameters
/filesystem_mount_point    List of one or more file systems to unmount.
Return Values
0    All file systems were successfully unmounted.
1    One or more file systems failed to unmount.
2    No arguments were passed.
cl_deactivate_vgs
Syntax
cl_deactivate_vgs volume_group_to_deactivate ...
Description
Initiates a varyoffvg of any volume group that is currently varied on and that was passed as an argument.
Parameters
volume_group_to_deactivate    List of one or more volume groups to vary off.
Return Values
0    All of the volume groups are successfully varied off.
1    The varyoffvg of at least one volume group failed.
2    No arguments were passed.
cl_nfskill
Syntax
cl_nfskill [-k] [-t] [-u] directory ...
Description
Lists the process numbers of local processes using the specified NFS directory, and finds and kills processes that are executables fetched from the NFS-mounted file system. Only the root user can kill a process of another user.
If you specify the -t flag, all processes that have certain NFS module names within their stack will be killed.
WARNING: When using the -t flag it is not possible to tell which NFS file system the process is related to. This could result in killing processes that belong to NFS-mounted file systems other than those that are cross-mounted from another HACMP node and under HACMP control. It could also mean that the processes found are related to file systems under HACMP control but not part of the current resources being taken. This flag should therefore be used with caution and only if you know you have a specific problem with unmounting the NFS file systems. To help control this, the cl_deactivate_nfs script contains the normal calls to cl_nfskill with the -k and -u flags and commented calls using the -t flag as well. If you use the -t flag, you should uncomment those calls and comment out the original calls.
Parameters
-k           Sends the SIGKILL signal to each local process.
-u           Provides the login name for local processes in parentheses after the process number.
-t           Finds and kills processes that are just opening on NFS file systems.
directory    List of one or more NFS directories to check.
Return Values
None.
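For example, to kill the local processes using a hypothetical cross-mounted directory and report the owning users:
cl_nfskill -k -u /mnt/shared_nfs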
Logging Utilities
cl_log
Syntax
cl_log message_id default_message variables
Description
Logs messages to syslog and standard error.
Parameters
message_id         Message ID for the messages to be logged.
default_message    Default message to be logged.
variables          List of one or more variables to be logged.
Return Values
0    Successfully logged messages to syslog and standard error.
2    No arguments were passed.
cl_echo
Syntax
cl_echo message_id default_message variables
Description
Logs messages to standard error.
Parameters
message_id         Message ID for the messages to be displayed.
default_message    Default message to be displayed.
variables          List of one or more variables to be displayed.
Return Values
0    Successfully displayed messages to stdout.
2    No arguments were passed.
Network Utilities
cl_swap_HW_address
Syntax
cl_swap_HW_address address interface
Description
Checks to see if an alternate hardware address is specified for the address passed as the first argument. If so, it assigns the hardware address specified to the network interface.
Parameters
address      Interface address or IP label.
interface    Interface name (for example, en0 or tr0).
Return Values
0    Successfully assigned the specified hardware address to a network interface.
1    Could not assign the specified hardware address to a network interface.
2    Wrong number of arguments were passed.
cl_swap_IP_address
Syntax
cl_swap_IP_address cascading/rotating acquire/release interface new_address old_address netmask
cl_swap_IP_address swap_adapter swap interface1 address1 interface2 address2 netmask
Description This routine is used during adapter_swap and IP address takeover. In the first form, the routine sets the specified interface to the specified address:
cl_swap_IP_address rotating acquire en0 1.1.1.1 255.255.255.128
In the second form, the routine sets two interfaces in a single call. An example where this is required is the case of swapping two interfaces.
cl_swap_IP_address swap_adapter swap en0 1.1.1.1 en1 2.2.2.2 255.255.255.128
Parameters
interface            Interface name.
address              IP address.
IP label behavior    rotating/cascading. Select rotating if an IP label should be placed on a boot interface; select cascading if an IP label should be placed on a backup interface on a takeover node.
netmask              Network mask. Must be in decimal format.
This utility is used for swapping the IP address of either a standby network interface with a local service network interface (called adapter swapping), or a standby network interface with a remote service network interface (called masquerading). For masquerading, the cl_swap_IP_address routine should sometimes be called before processes are stopped, and sometimes after processes are stopped. This is application dependent. Some applications respond better if they shutdown before the network connection is broken, and some respond better if the network connection is closed first.
cl_unswap_HW_address
Syntax
cl_unswap_HW_address interface
Description
Script used during adapter_swap and IP address takeover. It restores a network interface to its boot address.
Parameters
interface    Interface name (for example, en0 or tr0).
Return Values
0    Success.
1    Failure.
2    Invalid parameters.
Resource Group Move Utilities
Description
This utility communicates with the Cluster Manager to queue an rg_move event to bring a specified resource group offline or online, or to move a resource group to a different node. This utility provides the command line interface to the Resource Group Migration functionality, which can be accessed through the SMIT System Management (C-SPOC) panel. To move specific resource groups to the specified location or state, use the System Management (C-SPOC) > HACMP Resource Group and Application Management > Move a Resource Group to another Node/Site SMIT menu, or the clRGmove command. See the man page for the clRGmove command. You can also use this command from the command line, or include it in pre- and post-event scripts.
Parameters
-a    Use this flag for concurrent resource groups only. This flag is interpreted as all nodes in the resource group when bringing the concurrent resource group offline or online. To bring a concurrent resource group online or offline on a single node, use the -n flag.
-d    Use this flag to bring the resource group offline. Cannot be used with the -m or -u flags.
-g <groupname>    The name of the resource group to move.
-i    Displays the locations and states of all resource groups in the cluster after the migration has completed, by calling the clRGinfo command.
-m    Use this flag to move one or more resource groups from their current node to a specified destination node. Cannot be used with the -d or -u flags. Example: clRGmove -g rgA, rgB, rgC -n nodeB -m
-n <nodename>     The name of the node to which the resource group will be moved. For a non-concurrent resource group, this flag can only be used when bringing a resource group online or moving a resource group to another node. For a concurrent resource group, this flag can be used to bring a resource group online or offline on a single node. Cannot be used with the -r or -a flags.
-p                Use this flag to show the temporal changes in resource group behavior that occurred because of the resource group migration utility.
-r                This flag can only be used when bringing a non-concurrent resource group online or moving a non-concurrent resource group to another node. If this flag is specified, the command uses the highest priority node that is available as the destination node to which the group will be moved. Cannot be used with the -n or -a flags.
-s true | false   Use this flag to specify actions on the primary or secondary instance of a resource group (if sites are defined). With this flag, you can take the primary or the secondary instance of the resource group offline or online, or move it to another node within the same site. -s true specifies actions on the secondary instance of a resource group; -s false specifies actions on the primary instance. Use this flag with the -r, -d, -u, and -m flags.
-u                Use this flag to bring the resource group online. Cannot be used with the -m or -d flags.
Repeat this syntax on the command line for each resource group you want to migrate.

Return Values
0    Success.
1    Failure.
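For illustration, a few typical invocations built from the flags described above; the resource group and node names are hypothetical:

clRGmove -g rgA -n nodeB -m     # move resource group rgA to node nodeB
clRGmove -g rgA -d              # bring rgA offline
clRGmove -g rgA -n nodeA -u     # bring rgA online on node nodeA
clRGmove -g rgA -s true -d      # bring the secondary instance of rgA offline (sites defined)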
Emulation Utilities
Emulation utilities are found in the /usr/es/sbin/cluster/events/emulate/driver directory.
cl_emulate
Syntax
cl_emulate -e node_up -n nodename
cl_emulate -e node_down -n nodename {f|g|t}
cl_emulate -e network_up -w networkname -n nodename
cl_emulate -e network_down -w networkname -n nodename
cl_emulate -e join_standby -n nodename -a ip_label
cl_emulate -e fail_standby -n nodename -a ip_label
cl_emulate -e swap_adapter -n nodename -w networkname -a ip_label -d ip_label
Description
Emulates a specific cluster event and outputs the result of the emulation. The output is shown on the screen as the emulation runs, and is saved to an output file on the node from which the emulation was executed.

The Event Emulation utility does not run customized scripts such as pre- and post-event scripts. In the output file the script is echoed and the syntax is checked, so you can predict possible errors in the script. However, if customized scripts exist, the outcome of running the actual event may differ from the outcome of the emulation.

When emulating an event that contains a customized script, the Event Emulator uses the ksh flags -n and -v. The -n flag reads commands and checks them for syntax errors, but does not execute them. The -v flag indicates verbose mode. When writing customized scripts that may be accessed during an emulation, be aware that other ksh flags may not be compatible with the -n flag and may cause unpredictable results during the emulation. See the ksh man page for flag descriptions.

You can run only one instance of an event emulation at a time. If you attempt to start an emulation while an emulation is already running on a cluster, the integrity of the output cannot be guaranteed.
Parameters
-e eventname      The name of the event to emulate: node_up, node_down, network_up, network_down, join_standby, fail_standby, swap_adapter.
-n nodename       The node name used in the emulation.
f                 Emulates stopping cluster services with the option to place resource groups in an UNMANAGED state. Cluster daemons terminate without running any local procedures.
g                 Emulates stopping cluster services with the option to bring resource groups offline.
t                 Emulates stopping cluster services with the option to move resource groups to another node.
-w networkname    The network name used in the emulation.
-a ip_label       The standby network interface address with which to switch.
-d ip_label       The service network interface to fail.

Return Values
0    Success.
1    Failure.
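For illustration, two emulation runs built from the syntax above; the node, network, and IP label names are hypothetical:

cl_emulate -e node_up -n nodeA
cl_emulate -e swap_adapter -n nodeA -w ether_net1 -a nodeA_stdby -d nodeA_svc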
Note: The cldare command also provides an emulation feature for dynamic reconfiguration events.
Security Utilities
HACMP security utilities include kerberos setup utilities.
cl_setup_kerberos extracts the HACMP network interface labels from an already configured node and creates a file, cl_krb_service, that contains all of the HACMP network interface labels and additional format information required by the add_principal Kerberos setup utility. It also creates the cl_adapters file, which contains a list of the network interfaces required to extract the service principals from the authentication database.

cl_ext_krb prompts the user to enter the Kerberos password to be used for the new principals, and uses this password to update the cl_krb_service file. It checks for a valid .k file and alerts the user if one does not exist. Once a valid .k file is found, the cl_ext_krb script runs the add_principal utility to add all the network interface labels from the cl_krb_service file into the authentication database; extracts the service principals and places them in a new Kerberos services file, cl_krb-srvtab; creates the cl_klogin file that contains additional entries required by the .klogin file; updates the .klogin file on the control workstation and all nodes in the cluster; and concatenates the cl_krb-srvtab file to each node's /etc/krb-srvtab file.
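The descriptions above imply the following run order; a minimal sketch, assuming both utilities are invoked without arguments from the directory in which they are installed:

cl_setup_kerberos    # builds cl_krb_service and cl_adapters from the configured node
cl_ext_krb           # prompts for the Kerberos password, adds the principals, and updates .klogin and /etc/krb-srvtab cluster-wide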
tape_resource_start_example
Syntax
tape_resource_start_example
Description
Rewinds a highly available tape resource.

Parameters
None.

Return Values
0    Successfully started the tape resource.
1    Failure.
2    Usage error.
tape_resource_stop_example
Syntax
tape_resource_stop_example
Description
Rewinds the highly available tape resource.

Parameters
None.

Return Values
0    Successfully stopped the tape resource.
1    Failure.
2    Usage error.
clRMupdate and clRGinfo
HACMP event scripts use the clRMupdate command to notify the Cluster Manager that it should process an event. It is not documented for end users; it should only be used in consultation with IBM support personnel. Users or scripts can execute the clRGinfo command to get information about resource group status and location.
clRGinfo
Syntax
clRGinfo [-a] [-h] [-v] [-s | -c] [-p] [-t] [-d] [groupname1] [groupname2] ...
Description Use the clRGinfo command to display the location and state of all resource groups. If clRGinfo cannot communicate with the Cluster Manager on the local node, it attempts to find a cluster node with the Cluster Manager running, from which resource group information may be retrieved. If clRGinfo fails to find at least one node with the Cluster Manager running, HACMP displays an error message.
clRGinfo: Resource Manager daemon is unavailable
Parameters
-s or -c    Displays output in colon (shortened) format.
-a          Displays the pre-event and the post-event node locations of the resource group. (Recommended for use in pre- and post-event scripts in clusters with dependent resource groups.)
-d          Displays the name of the server that provided the information for the command.
-p          Displays the node that temporarily has the highest priority for this instance, as well as the state for the primary and secondary instances of the resource group. The command shows information about those resource groups whose locations were temporarily changed because of the resource group migration utility.
-t          Collects information only from the local node and displays the delayed fallback timer and the settling time settings for resource groups on the local node. Note: This flag can be used only if the Cluster Manager is running on the local node.
-h          Displays the usage message.
-v          Displays the verbose output with the startup, fallover, and fallback policies for resource groups.
If you run the clRGinfo command with sites configured, that information is displayed as in the following example:

$ /usr/es/sbin/cluster/utilities/clRGinfo
-----------------------------------------------------------------------
Group Name     Group State     Node             Node State
-----------------------------------------------------------------------
Colors         ONLINE          white@Site1      ONLINE
                               amber@Site1      OFFLINE
                               yellow@Site1     OFFLINE
                               navy@Site2       ONLINE_SECONDARY
                               ecru@Site2       OFFLINE
samwise        ONLINE
clRGinfo -c -p
If you run the clRGinfo -c -p command, it lists the output in a colon-separated format, with a parameter indicating the status and location of each resource group.

Possible States of a Resource Group
ONLINE
OFFLINE
OFFLINE Unmet dependencies
OFFLINE User requested
UNKNOWN
ACQUIRING
RELEASING
ERROR
TEMPORARY ERROR
ONLINE SECONDARY
ONLINE PEER
ACQUIRING SECONDARY
RELEASING SECONDARY
ACQUIRING PEER
RELEASING PEER
ERROR SECONDARY
TEMPORARY ERROR SECONDARY
clRGinfo -a
The clRGinfo -a command lets you know the pre-event location and the post-event location of a particular resource group. Because HACMP performs these calculations at event startup, this information is available in pre-event scripts (such as a pre-event script to node_up) on all nodes in the cluster, regardless of whether the node where it is run takes any action on a particular resource group.

Note: clRGinfo -a provides meaningful output only if you run it while a cluster event is being processed.
In this example, the resource group A is moving from the offline state to the online state on node B. The pre-event location is left blank, the post-event location is Node B:
:rg_move[112] /usr/es/sbin/cluster/utilities/clRGinfo -a
---------------------------------------------------------
Group Name     Resource Group Movement
---------------------------------------------------------
rgA            PRIMARY=":nodeB"
In this example, the resource group B is moving from Node B to the offline state. The pre-event location is node B, the post-event location is left blank:
:rg_move[112] /usr/es/sbin/cluster/utilities/clRGinfo -a
---------------------------------------------------------
Group Name     Resource Group Movement
---------------------------------------------------------
rgB            PRIMARY="nodeB:"
In this example, the resource group C is moving from Node A to Node B. The pre-event location is node A, the post-event location is node B:
:rg_move[112] /usr/es/sbin/cluster/utilities/clRGinfo -a
---------------------------------------------------------
Group Name     Resource Group Movement
---------------------------------------------------------
rgC            PRIMARY="nodeA:nodeB"
In this example with sites, the primary instance of resource group C is moving from Node A to Node B, and the secondary instance stays on node C:
:rg_move[112] /usr/es/sbin/cluster/utilities/clRGinfo -a
---------------------------------------------------------
Group Name     Resource Group Movement
---------------------------------------------------------
rgC            PRIMARY="nodeA:nodeB"  SECONDARY="nodeC:nodeC"
With concurrent resource groups, the output indicates each node from which a resource group is moving online or offline. In the following example, both nodes release the resource group:
:rg_move[112] /usr/es/sbin/cluster/utilities/clRGinfo -a
---------------------------------------------------------
Group Name     Resource Group Movement
---------------------------------------------------------
rgA            "nodeA:"
rgA            "nodeB:"
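As a sketch of how this output is typically consumed, a hypothetical pre-event script fragment could record the planned movement while the event is being processed (the log file path is arbitrary):

# Hypothetical pre-event script fragment
MOVES=$(/usr/es/sbin/cluster/utilities/clRGinfo -a)
print "$(date) planned resource group movement:" >> /tmp/rg_movement.log
print "$MOVES" >> /tmp/rg_movement.log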
clRGinfo -p
The clRGinfo -p command displays the node that temporarily has the highest priority for this instance, as well as the state for the primary and secondary instances of the resource group. The command shows information about those resource groups whose locations were temporarily changed because of the resource group migration utility.
$ /usr/es/sbin/cluster/utilities/clRGinfo -p
Cluster Name: TestCluster

Resource Group Name: Parent
Primary instance(s):
The following node temporarily has the highest priority for this instance:
user-requested rg_move performed on Wed Dec 31 19:00:00 1969

Node                           State
----------------------------   ---------------
node3@s2                       OFFLINE
node2@s1                       ONLINE
node1@s0                       OFFLINE

Resource Group Name: Child
Node                           State
----------------------------   ---------------
node3@s2                       ONLINE
node2@s1                       OFFLINE
node1@s0                       OFFLINE
clRGinfo -p -t
The clRGinfo -p -t command displays the node that temporarily has the highest priority for this instance and a resource group's active timers:

/usr/es/sbin/cluster/utilities/clRGinfo -p -t
Cluster Name: MyTestCluster

Resource Group Name: Parent
Primary instance(s):
The following node temporarily has the highest priority for this instance:
node4, user-requested rg_move performed on Fri Jan 27 15:01:18 2006

Node             Primary State     Secondary State      Delayed Timers
---------------  ----------------  -------------------  --------------
node1@siteA      OFFLINE           ONLINE SECONDARY
node2@siteA      OFFLINE           OFFLINE
node3@siteB      OFFLINE           OFFLINE
node4@siteB      ONLINE            OFFLINE

Resource Group Name: Child
Node             State             Delayed Timers
---------------  ----------------  -------------------
node2            ONLINE
node1            OFFLINE
node4            OFFLINE
node3            OFFLINE
clRGinfo -s
$ /usr/es/sbin/cluster/utilities/clRGinfo -s
Group1:ONLINE:merry:OHN:FNPN:FBHPN:ignore: : :
Group1:OFFLINE:samwise:OHN:FNPN:FBHPN:ignore: : :
Group2:ONLINE:merry:OAAN:BO:NFB:ignore: : :
Group2:ONLINE:samwise:OAAN:BO:NFB:ignore: : :
where the resource group startup, fallover, and fallback preferences are abbreviated as follows:

OHN: Online On Home Node Only
OFAN: Online On First Available Node
OUDP: Online Using Distribution Policy
OAAN: Online On All Available Nodes
FNPN: Fallover To Next Priority Node In The List
FUDNP: Fallover Using Dynamic Node Priority
BO: Bring Offline (On Error Node Only)
FHPN: Fallback To Higher Priority Node In The List
NFB: Never Fallback
ignore: ignore
OES: Online On Either Site
OBS: Online Both Sites
PPS: Prefer Primary Site
If an attribute is not available for a resource group, the command displays a colon and a blank instead of the attribute.

clRGinfo -v
$ /usr/es/sbin/cluster/utilities/clRGinfo -v
Cluster Name: myCLuster

Resource Group Name: Group1
Startup Policy: Online On Home Node Only
Fallover Policy: Fallover To Next Priority Node In The List
Fallback Policy: Fallback To Higher Priority Node In The List
Site Policy: ignore
Node             State
---------------  ---------------
merry            ONLINE
samwise          OFFLINE

Resource Group Name: Group2
Startup Policy: Online On All Available Nodes
Fallover Policy: Bring Offline (On Error Node Only)
Fallback Policy: Never Fallback
Site Policy: ignore
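Because the -s output is colon-delimited, it is convenient for scripting. A minimal sketch, with the field positions inferred from the sample output shown earlier (group name, state, and node in the first three fields):

# Print the node(s) where Group1 is currently ONLINE
/usr/es/sbin/cluster/utilities/clRGinfo -s Group1 | awk -F: '$2 == "ONLINE" {print $3}'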
Overview
CEL is a programming language that lets you integrate the dsh command's distributed functionality into each C-SPOC script the CEL preprocessor (celpp) generates. When you invoke a C-SPOC script from a single cluster node to perform an administrative task, the script is automatically executed on all nodes in the cluster. Without C-SPOC's distributed functionality, you would have to execute each administrative task separately on each cluster node, which can lead to inconsistent node states within the cluster. Appendix C: HACMP for AIX Commands in the Administration Guide provides a list of all C-SPOC commands provided with the HACMP for AIX software.
For a description of option string variables used in the preceding example, refer to the cl_init.cel file in the /usr/es/sbin/cluster/samples/cspoc directory. The cl_init.cel file provides examples of functionality required in any execution plan you create; it should be included in each .cel file. The initialization and verification routines in the cl_init.cel file provide the following functionality:
Get a list of target nodes for command execution.
Determine nodes associated with any resource groups specified.
Process and implement the standard C-SPOC flags (-f, -n, and -g).
Validate option strings. This requires several environment variables to be set within the plan (_OPT_STR, _CSPOC_OPT_STR, and _USAGE).
Save log file entries upon command termination.
Perform C-SPOC verification as follows:
    Ensure dsh is available in $PATH.
    Check the HACMP version on all nodes.
The cl_path.cel file sets the PATH variable so that the C-SPOC and HACMP functions can be found at runtime. This is essential for execution plans that make use of any HACMP command, or the C-SPOC try_serial or try_parallel operations.
Thus, you should encode all command-line arguments to C-SPOC commands unless they begin with a dash. Arguments that begin with a dash are generally command-line flags and do not contain spaces or quotes. If a string that begins with a dash is passed to the clencodearg or cldecodearg program, the string is passed through without being encoded or decoded. The following script fragments show how to encode and decode a list of arguments. To encode a list of args:
ENCODED_ARGS="" for ARG in "$@" do ENCODED_ARGS="$ENCODED_ARGS $(print ARG | clencodearg)" done To decode a list of args: UNENCODED_ARGS="" for ARG in $ENCODED_ARGS do UNENCODED_ARGS="$UNENCODED_ARGS $(print $ARG | cldecodearg)" done
WARNING: You must use the clencodearg or cldecodearg program to process command-line arguments that are to be passed to any commands contained within a %try_serial or %try_parallel statement, because the C-SPOC Execution Engine (cdsh) tries to decode all command-line arguments before executing the command. See CEL Constructs for more information about %try_serial and %try_parallel statements.

WARNING: Any arguments obtained from the environment variables set by routines within cl_init.cel, such as _getopts(), will have been encoded. If a command contained within a %try_serial or %try_parallel statement includes arguments generated within the execution plan, then they must first be encoded.

For example, most execution plans pass command-line arguments as follows:
%try_parallel
    chuser $_CMD_ARGS
%end
The following CEL plan uses the clencodearg and cldecodearg programs.
#Initialize C-SPOC variables required by cl_init.cel
_CMD_NAME=$(basename $0)    # Specify the name of this script.
_CSPOC_OPT_STR="d:f"        # Specify valid C-SPOC option flags.
_OPT_STR="+2"               # Specify valid AIX command option flags.

#Specify a Usage statement for this script.
_USAGE="USAGE: cl_chuser [-cspoc -f] Attr=Value...Username"

#Initialize variables local to this script. The default return code
#is '0' for success.
RETCODE=0

#Include the required C-SPOC Initialization and Verification
#Routines
%include cl_path.cel
%include cl_init.cel

#This define makes the following exception clause more readable.
%define USER_NOT_FOUND 2

#Get the username, which is the last command line arg and perform
#some crude checks for validity. Note: This is an example of
#decoding args within a script.
user=${_CMD_ARGS##*[ ]}    # The []'s contain a tab & a space!
case $user in
    -*|"")
        print "$_USAGE"
        exit 2
        ;;
esac

#Since cl_init.cel has encoded all the args we must decode it to
#use it.
Duser=$(print $user | cldecodearg)

#Construct a Gecos field based on the 411 entry for the username.
#Note: 411 is a local script that prints the following
# tab-separated fields: username Firstname Lastname Work_phone
# Home_phone
#This plan cuts the "Firstname Lastname" to put into the Gecos
#field of /etc/passwd
GECOS=$(/usr/local/bin/411 $Duser | cut -d' ' -f2)

#Construct a new set of command line args. Note the following:
# 1) We put the 'gecos=' arg just before the username so that it
#    will supercede any that may have been specified on the
#    command line.
# 2) We must encode the 'gecos=' arg explicitly.
# 3) This is an example of encoding args specified inside a script.
NEW_CMD_ARGS=${_CMD_ARGS%[ ]*}    # []'s contain a tab & a space
NEW_CMD_ARGS="$NEW_CMD_ARGS $(print gecos="$GECOS" | clencodearg) $user"

#Perform a check that the username exists on all nodes. The check
#is not performed when the C-SPOC force flag is specified.
if [[ -z "${_SPOC_FORCE}" ]]
then
    #Check if user exists across all cluster nodes
    %try_parallel _NODE _TARGET_NODES silent_err silent_out
        lsuser $user
    %except USER_NOT_FOUND
        print "${_CMD_NAME}: User ${Duser} does" \
            "not exist on node ${_NODE}" >&2
        RETCODE=1
    %end
    %end
fi

#If the username does not exist on a node then exit immediately
if [[ ${RETCODE} -ne 0 ]]
then
    exit ${RETCODE}
fi

#Execute the chuser command in parallel on all nodes in the
#cluster.
%try_parallel _NODE _TARGET_NODES
    chuser $NEW_CMD_ARGS
%others
    # If chuser returned an error on any node
    # set the return code to '1' to indicate
    # that one or more nodes failed.
    RETCODE=1
%end
%end

#Exit with the appropriate value.
exit ${RETCODE}
CEL Constructs
You can use the following CEL constructs (statements or clauses) in a command's execution plan. All C-SPOC commands must contain the %include statement, a %try_parallel or %try_serial statement, and an %end statement. The %include statement is used to access ksh libraries within the cl_init.cel file. These libraries make C-SPOC commands cluster aware. The %except and %others clauses are typically used for error handling in response to a command's execution on one or more cluster nodes.
%define statement:

%define key value

For improved readability, the %define statement (keyword) is used to provide descriptive names for error_id values in %except clauses. The error_id given in the %define statement is inserted in place of the given ID in any subsequent %except clauses.

%include statement:

%include filename

The %include statement allows a copy of a file to be included in an execution plan, which means that common code can be shared among execution plans. The %include statement also allows C-SPOC commands with similar algorithms to share the same plans. Using %include, CEL statements can be used within library routines included in any execution plan.
%try_parallel statement:

%try_parallel node nodelist [silent_err] [silent_out]
    ksh statement
    [ %except clause ... ]
[ %end | %end_p ]

The %try_parallel statement executes the enclosed ksh statement simultaneously across all nodes in the nodelist. It waits until the command completes its execution on all cluster nodes before checking for errors. The %try_parallel statement is equivalent to the following pseudo-code:

for each node in nodelist
    execute ksh statement in the background on each node
end
wait for all background execution to complete on all nodes
for each node in nodelist
    check status for each node and execute %except clause(s)
end
%try_serial statement:

%try_serial node nodelist [silent_err] [silent_out]
    ksh statement
    [ %except clause ... ]
    [ %others clause ]
[ %end | %end_s ]

The %try_serial statement executes the enclosed ksh statement consecutively on each node specified in the nodelist. The command must complete before %try_serial continues its execution on the next node. All %try_serial statements sequentially check for errors after a command completes on each node in the list.
The %try_serial statement is equivalent to the following pseudo-code:

for each node in nodelist
    execute ksh statement on each node
end
for each node in nodelist
    check status for each node and execute %except clause(s)
end

Note: Any non-flag arguments (those not preceded by a dash) used inside a try statement must be encoded using the /usr/es/sbin/cluster/cspoc/clencodearg utility; otherwise, arguments will be decoded incorrectly.

%except clause:
%except error_id
    ksh code
[ %end | %end_e ]

The %except clause is used for error checking on a per-node basis, and for the execution of the ksh statement contained within a %try_parallel or %try_serial statement. If the ksh statement fails or times out, the %except clauses define actions to take for each return code value. If an error_id is defined using the %define statement, the return code of the ksh statement (the status code set by a command's execution on a particular node) is compared to the error_id value. If a match is found, the ksh code within the %except statement is executed. If the %others clause is used within the %except clause, the ksh code within the %except clause is executed only if the return code value did not match an error_id defined in previous %except clauses.

An %except clause can be defined within or outside a try statement. If it is defined within a try statement, the clause's scope is local to the %try_parallel or %try_serial statement. If the %except clause is defined outside a try statement, it is used in any subsequent %try_serial or %try_parallel statements that do not already provide one for the specified error_id or %others clause. In this case, the %except clause's scope is global until an %unexcept statement is found. Global %except clauses are used to simplify and standardize repetitive types of error handling.
%stop_try clause:

%except error_id
    ksh code
    [ %stop_try clause ]
    ksh code
[ %end | %end_e ]

The %stop_try clause can be used within an %except clause; it causes an exit from the enclosing try statement. This statement has the effect of a break statement in other languages. The effect of the %stop_try clause differs depending on whether it is defined within a %try_parallel or a %try_serial statement.
In a %try_serial statement, defining a %stop_try clause prevents further execution of the ksh statement on all cluster nodes; it also prevents further error checking defined by the %except clause. In a %try_parallel statement, defining a %stop_try clause prevents error checking on all cluster nodes, since execution of the try statement happens simultaneously on all nodes. The commands within the ksh statement will have completed on other nodes by the time the %except clauses are evaluated.
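For illustration, a minimal sketch (not from a shipped plan) of a %stop_try clause inside a %try_serial statement; the volume group and disk names are hypothetical, and the non-flag arguments are encoded as the note above requires:

VG=$(print sharedvg | clencodearg)
PV=$(print hdisk3 | clencodearg)
%try_serial _NODE _TARGET_NODES
    importvg -y $VG $PV
%except 0
    # importvg succeeded on this node; skip the remaining nodes
    %stop_try
%end
%end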
%others clause:

%others error_id
    ksh code
    [ %stop_try clause ]
    ksh code
[ %end | %end_o ]

The %others clause is the default error action performed if a command's return code does not match the return code of a specific %except clause. The %others clause can be used in a try statement, or globally within an execution plan.

%end statement:
%end
The %end statement is used with all CEL constructs. Ensure that your .cel file includes an %end statement with each statement or clause. A construct like %try_parallel, for example, can be ended with %end or with %end_p, where p represents parallel.
_RETCODE=0

# Default exception handling for a COULD_NOT_CONNECT error
%except 1000
    nls_msg -l $cspoc_tmp_log ${_MSET} CDSH_UNABLE_TO_CONNECT "${_CMD_NAME}: Unable to connect to node ${_NODE}" >& 2
    if [[ -z "${_SPOC_FORCE}" ]]
    then
        exit 1
    fi
%end

# C-SPOC Initialization and Verification
%include cl_path.cel
%include cl_init.cel

%define USER_NOT_FOUND 2

user=${_CMD_ARGS##* }

if [[ -z "${_SPOC_FORCE}" ]]
then
    #
    # Check if user exists across all cluster nodes
    #
    %try_parallel _NODE _TARGET_NODES silent_err silent_out
        lsuser $user
    %except USER_NOT_FOUND
        nls_msg -l $cspoc_tmp_log ${_MSET} 1 "${_CMD_NAME}: User ${user} does not exist on node ${_NODE}" ${_CMD_NAME} ${user} ${_NODE} >& 2
        _RETCODE=1
    %end
    %end
fi

# If user does not exist on a node, exit 1
if [[ ${_RETCODE} -ne 0 ]]
then
    exit ${_RETCODE}
fi

# Run chuser across the cluster nodes
%try_parallel _NODE _TARGET_NODES
    chuser $_CMD_ARGS
%others
    # If chuser returned an error on any node, exit 1
    _RETCODE=1
%end
%end

exit ${_RETCODE}
-i inputfile           Uses inputfile for input, or stdin by default.
-o outputfile          Uses outputfile for output, or stdout by default.
-I IncludeDirectory    Uses the specified directory name to locate .cel files (execution plans). You can specify multiple -I options. The cl_init.cel and cl_path.cel files are installed in /usr/es/sbin/cluster/samples/cspoc; the IncludeDirectory should normally be specified.
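For example, a celpp invocation assembled from these options (the output file name is illustrative) might be:

celpp -i cl_chuser.cel -o cl_chuser -I /usr/es/sbin/cluster/samples/cspoc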
This command converts the cl_chuser.cel file into a ksh script.
3. Enter the following command to make the generated ksh script executable:
chmod +x outputfile
You can now invoke the C-SPOC command. If you change a command's execution plan after converting it, repeat the preceding steps to generate a new ksh script, and then make the script executable. Note that the preceding steps can be included in a Makefile.
The Cluster Manager daemon (clstrmgrES)
The Cluster Information Program daemon (clinfoES)
The Cluster Communication daemon (clcomdES)
The clstrmgrES, clinfoES and clcomd daemons are controlled by the System Resource Controller (SRC).
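Because these daemons run under SRC control, you can check their status with the standard lssrc command, for example:

lssrc -s clstrmgrES
lssrc -s clinfoES
lssrc -s clcomdES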
clcomd Daemon
The clcomd daemon is also under control of the SRC. To start a trace of this daemon, use the AIX traceson command and specify the clcomd subsystem.
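For example, assuming the subsystem is registered with the SRC under the name clcomd as described above, tracing can be turned on and off with:

traceson -l -s clcomd     # -l requests a long (detailed) trace
tracesoff -s clcomd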
6. Indicate whether you want a short or long trace event in the Trace Type field. A short trace contains terse information. For the clstrmgrES daemon, a short trace produces messages only when topology events occur. A long trace contains detailed information on time-stamped events.
7. Press Enter to enable the trace. SMIT displays a panel that indicates that tracing for the specified process is enabled.
4. Enter values as necessary into the remaining fields and press Enter. SMIT displays a panel that indicates that the trace session has started.
To generate a trace report:
1. Enter: smit hacmp
2. In SMIT, select Problem Determination Tools > HACMP Trace Facility > Start/Stop/Report Tracing of HACMP for AIX Services > Generate a Trace Report. A dialog box prompts you for a destination, either a filename or a printer.
3. Indicate the destination and press Enter. SMIT displays the Generate a Trace Report panel.
4. Enter the trace IDs of the daemons whose events you want to include in the report in the IDs of events to INCLUDE in Report field.
5. Press F4 to see a list of the trace IDs. (Press Ctrl-V to scroll through the list.) Move the cursor to the first daemon whose events you want to include in the report and press F7 to select it. Repeat this procedure for each event that you want to include in the report. When you are done, press Enter. The values that you selected are displayed in the IDs of events to INCLUDE in Report field. The HACMP daemons have the following trace IDs:

clstrmgrES    910
clinfoES      911
6. Enter values as necessary in the remaining fields and press Enter.
7. When the information is complete, press Enter to generate the report. The output is sent to the specified destination.
For an example of a trace report, see the following Sample Trace Report section.
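If the AIX trace facility is driving these panels (as the hook-style trace IDs above suggest), an equivalent report can typically also be generated from the command line with trcrpt; a sketch, with an arbitrary output file name:

trcrpt -d 910,911 -o /tmp/hacmp_trace.rpt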
Sample Trace Report
The sample trace report for the clinfo daemon begins with a TRACE ON entry (hook ID 001, channel 0, timestamped Fri Mar 10 13:01:38 1995). Each subsequent entry (hook ID 011) lists the process name, HACMP for AIX:clinfo, and the traced activity: entering and exiting functions such as broadcast_map_request, skew_delay, dump_valid_nodes, and service_context; waiting for events; and reported values such as Cluster ID: -1 and Time Expired: -1.
Licensees of this program who wish to have information about it for the purpose of enabling: (i) the exchange of information between independently created programs and other programs (including this one) and (ii) the mutual use of the information which has been exchanged, should contact:

IBM Corporation
Dept. LRAS / Bldg. 003
11400 Burnet Road
Austin, TX 78758-3493
U.S.A.

Such information may be available, subject to appropriate terms and conditions, including in some cases, payment of a fee.

The licensed program described in this document and all licensed material available for it are provided by IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement or any equivalent agreement between us.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.
Index
+-*/
38, 93 /etc/hosts file check before starting cluster 97 listing IP labels 22 loopback and localhosts as aliases 126 missing entries for netmon.cf 115 /etc/locks file 103 /etc/netsvc.conf file editing for nameserving 128 /etc/rc.net script checking the status of 85 /etc/syslogd file redirecting output 130 /usr becomes too full 132 /usr/es/sbin/cluster/cl_event_summary.txt 132 /usr/es/sbin/cluster/clinfo daemon Clinfo 73 /usr/es/sbin/cluster/clstrmgrES daemon 179 Cluster Manager 179 /usr/es/sbin/cluster/cspoc/clencodearg utility 174 /usr/es/sbin/cluster/etc/clhosts file invalid hostnames/addresses 126 on client 126 updating IP labels and addresses 126 /usr/es/sbin/cluster/etc/rhosts troubleshooting 93 /usr/es/sbin/cluster/events/utils directory 138 /usr/es/sbin/cluster/server.status 118 /usr/es/sbin/cluster/snapshots/clsnapshot.log 38 /usr/es/sbin/cluster/utilities/celpp utility 167 /usr/es/sbin/cluster/utilities/clRGinfo command example output 162 /usr/es/sbin/cluster/utilities/clRGmove command 154 /usr/es/sbin/cluster/utilities/clruncmd command 22 /usr/es/sbin/cluster/utilities/clsnapshot utility clsnapshot 17 /usr/es/sbin/cluster/utilities/cltopinfo command 75 /usr/es/sbin/rsct/bin/hatsdmsinfo command 123 /var/ha/log/grpglsm log file 38 /var/ha/log/grpsvcs recommended use 39 /var/ha/log/topsvcs 39 /var/hacmp/adm/history/cluster.mmddyyyy file 41 /var/hacmp/clverify/clverify.log file 20, 42 /var/hacmp/log/clconfigassist.log 42 /var/hacmp/log/clutils.log 21, 42 /var/hacmp/log/cspoc.log file recommended use 40 /var/hacmp/log/emuhacmp.out file 41 message format 54 understanding messages 54 viewing its contents 55 /var/hacmp/log/hacmp.out file 41 correcting sparse content 128 first node up gives network error message recommended use 41 troubleshooting TCP/IP 85 understanding messages 44 /var/spool/lpd/qdir 34 /var/spool/qdaemon 34
AIX network interface disconnected from HACMP IP label recovery 112 application monitoring troubleshooting 132 applications fail on takeover node 118 inaccessible to clients 126 troubleshooting 72 ARP cache flushing 125 arp command 88 checking IP address conflicts 85 assigning persistent IP labels 87 ATM arp command 89 LAN emulation troubleshooting 111 troubleshooting 110 automatic cluster configuration monitoring configuring 21 automatic error notification failure 104
CDE hangs after IPAT on HACMP startup 128 CEL Command Execution Language 167
CEL guide CEL constructs 173 writing execution plans 167 CEL plan example 170 celpp 167 converting execution plan to ksh script 176 cfgmgr command unwanted behavior in cluster 121 changing network modules 30 checking cluster services and Processes clcheck_server 74 cluster snapshot file 76 HACMP cluster 73 logical volume definitions 83 shared file system definitions 84 shared volume group definitions 81 volume group definitions 79 checking cluster configuration with Online Planning Worksheets 16 cl_activate_fs utility 147 cl_activate_vgs utility 147 cl_convert not run due to failed installation 94 cl_convert utility 94 cl_deactivate_fs utility 148 cl_deactivate_vgs utility 149 cl_disk_available utility 138 cl_echo utility 151 cl_fs2disk utility 139 cl_get_disk_vg_fs_pvids utility 139 cl_init.cel file 168 cl_is_array utility 140 cl_is_scsidisk utility 140 cl_log utility 151 cl_lsfs command checking shared file system definitions 84 cl_lslv command checking logical volume definitions 83 cl_lsvg command checking shared volume group definitions 81 cl_nfskill command 150 unmounting NFS file systems 103 cl_path .cel file 169 cl_raid_vg utility 141 cl_rsh remote shell command 159 cl_scsidiskreset command fails and writes errors to /var/hacmp/log/hacmp.out file 103 cl_scsidiskreset utility 141 cl_scsidiskrsrv utility 142 cl_swap_HPS_IP_address utility 146 cl_swap_HW_address utility 152 cl_swap_IP_address utility 152 cl_sync_vgs utility 142
cl_unswap_HW_address 153 clcheck_server checking cluster services and processes 74 clcmod.log 42 clcmoddiag.log 39 clcomd logging 68 tracing 179 troubleshooting 93 clcomd.log file 68 clcomddiag.log 68 clcomdES and clstrmgrES fail to start newly installed AIX nodes 97 clearing SSA disk fence registers 23 clhosts file editing on client nodes 126 clients cannot access applications 126 connectivity problems 125 not able to find clusters 126 Clinfo checking the status of 73 exits after starting 96 not reporting that a node is down 127 not running 126 restarting to receive traps 96 trace ID 181 tracing 179 clRGinfo command 160 reference page 160 clRGmove utility syntax 154 clsetenvgrp script used for serial processing 64 clsnapshot utility 17, 76 clsnapshot.log file 31 clstat utility finding clusters 126 clstrmgrES and clinfoES daemons user-level applications controlled by SRC 179 cluster checking configuration with cluster snapshot utility 17 troubleshooting configuration 74 tuning performance parameters 29 cluster configuration automatic cluster verification 20 cluster events emulating 24 resetting customizations 32 cluster history log file message format and content 52 Cluster Information Program Clinfo 179
cluster log files redirecting 68 Cluster Manager cannot process CPU cycles 130 checking the status of 73 hangs during reconfiguration 97 trace ID 181 tracing 179 troubleshooting common problems 96, 97 will not start 96 cluster security troubleshooting configuration 116 cluster services not starting 68 starting on a node after a DARE 98 Cluster SMUX Peer checking the status of 73 failure 126 trace ID 181 tracing 179 cluster snapshot checking during troubleshooting 76 creating prior to resetting tunable values 31 files 77 information saved 76 ODM data file 77 using to check configuration 17 cluster verification utility automatic cluster configuration monitoring 20 checking a cluster configuration 74 tasks performed 74 troubleshooting a cluster configuration 96 cluster.log file message format 42 recommended use 39 viewing its contents 43 cluster.mmddyyyy file cluster history log 52 recommended use 39 clverify log file logging to console 20 clverify.log file 42 collecting data from HACMP clusters 16 Command Execution Language CEL guide 167 command-line arguments encoding/decoding C-SPOC commands 169
commands arp 85, 88, 89 cl_convert 94 cl_nfskill 103 cl_rsh 159 cl_scsidiskreset 103 clRGinfo 160 clRGmove 154 clruncmd 22, 102 cltopinfo 75 configchk 97 df 84 dhb_read 91 diag 90, 94 errpt 90 fsck 103 ifconfig 85, 88 lsattr 90 lsdev 91 lsfs 84 lslv 82 lspv 81, 82 lssrc 85 lsvg 79 mount 83 netstat 85 ping 85, 87 snap -e 16 varyonvg 117 communications daemon tracing 179 config_too_long message 129 configchk command returns an Unknown Host message 97 configuration files merging during installation 94, 95 configuring automatic cluster configuration monitoring checking with snapshot utility 17 log file parameters 67 network modules 29 runtime parameters 49 configuring cluster restoring saved configurations 95 conversion failed installation 94 cron jobs making highly available 33 C-SPOC checking shared file systems 84 checking shared logical volumes 83 checking shared volume groups 81 C-SPOC commands creating 167 encoding/decoding arguments 169 C-SPOC scripts using CEL 167
cspoc.log file message format 53 viewing its contents 54 custom scripts print queues 34 samples 33 customizing cluster log files 68
disks troubleshooting 90 Distributed SMIT (DSMIT) unpredictable results 107 domain merge error message displayed 120 handling node isolation 120 dynamic reconfiguration emulating 28 lock 134
daemon.notice output redirecting to /usr/tmp/snmpd.log 130 daemons clinfo exits after starting 96 monitoring 73 trace IDs 181 tracing 179 deadman switch avoiding 122 cluster performance tuning 29 definition 29 fails due to TCP traffic 121 releasing TCP traffic 121 time to trigger 123 tuning virtual memory management 122 debug levels setting on a node 67 dependent resource groups processing 64 df command 83 checking filesystem space 84 dhb_read testing a disk heartbeating network 92 dhb_read command 91 diag command checking disks and adapters 90 testing the system unit 94 diagnosing problems recommended procedures 13 discovery 68 disk adapters troubleshooting 90 disk enclosure failure detection disk heartbeating 93 disk fencing job type 61 disk heartbeating disk enclosure failure detection 93 failure detection 92 troubleshooting 91 disk heartbeating network testing 92 diskhb networks failure detection 92 troubleshooting 91
emulating cluster events 24 enabling I/O pacing 122 Enterprise Storage System (ESS) automatic error notification 104 error messages console display of 15 errors mail notification of 14 errpt command 52, 90 event duration time customizing 129 event emulator log file 41 event preamble example 44 event summaries cl_event_summary.txt file too large 132 examples with job types 58 reflecting resource groups parallel processing 57 reflecting resource groups serial processing 57 resource group information does not display 132 sample hacmp.out contents 47 viewing 49 event_error event 41, 44 events changing custom events processing 98 displaying event summaries 49 emulating dynamic reconfiguration 28 emulating events 24 event_error 41, 44 processing replicated resource groups 64 unusual events 109 execution plan converting to ksh script 176 exporting volume group information 102
failure detection rate changing 30 changing to avoid DMS timeout 121 failures non-ip network, network adapter or node failures 114 fallback timer example in hacmp.out event summary 48 Fast Failover Detection decreasing fallover time using 92 Fast Failure Detection using disk heartbeating to reduce node failure detection time. 92 file systems change not recognized by lazy update 105 failure to unmount 134 mount failures 134 troubleshooting 83 flushing ARP cache 125 forced varyon failure 102 fsck command Device open failed message 103 fuser command using in scripts 134
HAGEO network issues after uninstallation 115 hardware address swapping message appears after node_up_local fails heartbeating over IP Aliases checking 89 HWAT gigabit Ethernet adapters fail 114
I/O pacing enabling 122 tuning 29, 122 ifconfig command 85, 88 initiating a trace session 180 installation issues cannot find filesystem at boot-time 94 unmerged configuration files 94 IP address listing in arp cache 88 IP address takeover applications fail on takeover node 118
generating trace report 182 Geo_Primary problem after uninstall 115 Geo_Secondary problem after uninstall 115 gigabit Ethernet adapters fail with HWAT 114
job types examples in the event summaries 58 fencing 61 parallel processing of resource groups 58
HACMP troubleshooting components 73 HACMP Configuration Database security changes for HACMP 99 HACMP for AIX commands syntax conventions 137 hacmp.out event summary example for settling time 47 fall back timer example 48 hacmp.out file displaying event summaries 49 message formats 47 selecting verbose script output 49 hacmp.out log file event summaries 47 setting output format 67
LANG variable 131 lazy update file system changes not recognized 105 lock set by dynamic reconfiguration 134 log WebSmit 38 log files /var/hacmp/log/clutils.log 21, 42 /var/hacmp/log/emuhacmp.out 41, 54 /var/hacmp/log/hacmp.out file 41, 44 changing parameters on a node 67 clcomd 42 cluster.log file 39, 42 cluster.mmddyyyy 39, 52 collecting for problem reporting 55 recommended use 38 redirecting 68 system error log 38, 51 types of 37 with cluster messages 37 logical volume manager (LVM) troubleshooting 79 logical volumes troubleshooting 82
lsattr command 90 lsdev command for SCSI disk IDs 90 lsfs command 83, 84 lslv command for logical volume definitions 82 lspv command 81 checking physical volumes 81 for logical volume names 82 lssrc command checking the inetd daemon status 85 checking the portmapper daemon status 85 lsvg command 79 checking volume group definitions 79 LVM troubleshooting 79
cannot communicate on ATM Classic IP 110 cannot communicate on ATM LANE 111 Token-Ring thrashes 107 unusual events when simple switch not supported 109 NIC failure switched networks 106 NIM process of RSCT being blocked 115 nodes troubleshooting cannot communicate with other nodes 107 configuration problems 98 dynamic node removal affects rejoining 98 non-IP networks failure detection 92
O
mail used for event notification 14 maxfree 122 migration ODM security changes 99 PSSP File Collections issue 100 minfree 122 monitoring applications troubleshooting 132 monitoring cluster configuration cluster verification automatic monitor mount command 83 listing filesystems 83
Object Data Manager (ODM) 102 updating 102 ODM see Object Data Manager Online Planning Worksheets 16
netmon.cf 115 netstat command network interface and node status 85 NetView deleted or extraneous objects in map 131 network troubleshooting network failure after MAU reconnect 108 will not reintegrate when reconnecting bus 108 network error message 113 network modules changing or showing parameters 30 configuring 29 failure detection parameters 30 networks diskhb 91 Ethernet 90 modify Geo networks definition 115 reintegration problem 108 single adapter configuration 115 Token-Ring 90, 119 troubleshooting 90
parallel processing tracking resource group activity 57 partitioned cluster avoiding 107 PCI network card recovering from failure 91, 108 permissions on HACMP ODM files 99 persistent node IP label 87 physical volumes troubleshooting 81 ping command 87 checking node connectivity 85 flushing the ARP cache 126 print queues custom script 34 making highly available 34 problem reporting 22 process_resources event script 58 PSSP File Collections migration issue 100
refreshing the cluster communications daemon 93 replicated resources event processing 64 resetting HACMP tunable values 31 Resource Group Migration utilities 154 resource group recovery on node_up troubleshooting 45 resource groups monitoring status and location clRMupdate and clRGinfo commands 160 processed serially unexpectedly 133 processing messages in hacmp.out 46 tracking parallel and serial processing in hacmp.out 56 rg_move event multiple resource groups 133 rhosts troubleshooting 93 RS/6000 SP system See SP 146 RSCT command dhb_read 91
scripts activating verbose mode 49 making print queues highly available 34 recovering from failures 22 sample custom scripts 33 setting debug levels 67 tape_resource_start_example 159 SCSI devices troubleshooting 90 scsidiskutil utility 143 security ODM changes that may affect upgrading HACMP 99 selective fallover not triggered by loss of quorum 119 serial processing resource groups 64 serial processing of resource groups tracking in event summaries 57 server.status file (see /usr/sbin/cluster/server.status) 118 service IP labels listed in /etc/hosts file 22 setting I/O Pacing 29 syncd frequency rate 30 settling time example in hacmp.out event summary 47 single-adapter networks and netmon configuration file 115
sites processing resource groups 64 SMIT help fails to display with F1 131 open on remote node 28 snap command 16 snapshot checking cluster snapshot file 76 definition 17 SP Switch 146 SP Utilities 146 SSA disk fence registers clearing 23 ssa_clear utility 144 ssa_clear_all utility 145 ssa_configure utility 145 ssa_fence utility 143 stabilizing a node 22 starting cluster services on a node after a DARE status icons displaying 135 stopping HACMP with unmanaged resource groups 15 switched networks NIC failure 106 syncd setting frequency rate 30 syntax conventions HACMP for AIX commands 137 system components checking 72 investigating 71 system error log file message formats 51 recommended use 38 understanding its contents 51 system panic invoked by deadman switch 130
tape_resource_start_example script 159 tape_resource_stop_example script 159 target mode SCSI failure to reintegrate 108 TCP traffic releasing deadman switch 121 TCP/IP services troubleshooting 85 Token-Ring network thrashes 107 node failure detection takes too long 119 topsvcs daemon messages on interface states 114
tracing HACMP for AIX daemons disabling using SMIT 181 enabling tracing using SMIT 180 generating a trace report using SMIT 182 initiating a trace session 180 overview 179 sample trace report 183 specifying a trace report format 181 specifying a trace report output file 182 specifying content of trace report 182 starting a trace session using SMIT 181 stopping a trace session using SMIT 182 trace IDs 181 using SMIT 180 triggering deadman switch 123 troubleshooting AIX operating system 90 applications 72 clcomd errors 93 cluster communications 116 cluster configuration 74 cluster security configuration 116 disk heartbeating 91 Ethernet networks 90 file systems 83 HACMP components 73 heartbeating over IP Aliases 89 LVM entities 79 networks 90 recommended procedures 13 resource group processing 56 rhosts 93 SCSI disks and adapters 90 snap -e command 16 solving common problems 71 system hardware 94 TCP/IP subsystem 85 Token-Ring networks 90 volume groups 79 VPN 116 TTY baud rate changing 30 tty baud rate 112 tuning I/O pacing 29 virtual memory management 122 tuning parameters cluster performance 29 resetting 31
upgrading pre- and post-event scripts 98 utilities cl_activate_fs 147 cl_activate_vgs 147 cl_deactivate_fs 148 cl_deactivate_vgs 149 cl_disk_available 138 cl_echo 151 cl_fs2disk 139 cl_getdisk_vg_fs_pvids 139 cl_is_array 140 cl_is_scsidisk 140 cl_log 151 cl_nfskill 150 cl_raid_vg 141 cl_scdiskreset 141 cl_scsidiskrsrv 142 cl_swap_HPS_IP_address 146 cl_swap_HW_address 152 cl_swap_IP_address 152 cl_sync_vgs 142 cl_unswap_HW_address 153 clsnapshot 17, 76 clstat 126 cluster verification 74 C-SPOC (see also C-SPOC utility) checking shared file systems 84 checking shared logical volumes 83 checking shared vgs 81 event emulation 24 scripts 137 scsidiskutil 143 ssa_clear 144 ssa_clear_all 145 ssa_configure 145 ssa_fence 143 stop_clmarketdemo 159
varyon checking passive or active mode 80 varyonvg command fails during takeover 117 fails if volume group varied on 101 troubleshooting 117 verbose script output activating 49 viewing cluster.log file 43 cspoc.log file 54 emuhacmp.out log file 55 event summaries 49 virtual memory management tuning deadman switch 122
VLAN troubleshooting 106 vmstat command 122 vmtune command 122 volume groups checking varyon state 80 disabling autovaryon at boot troubleshooting 79 VPN troubleshooting 117