Sk62570 - How To Troubleshoot Failovers in ClusterXL - Advanced Guide
Sk62570 - How To Troubleshoot Failovers in ClusterXL - Advanced Guide
Solution ID sk62570
Product ClusterXL, Cluster - 3rd party
Version All
Platform / Model All
Date Created 21-abr-2011
Last Modied 20-mar-2016
Solution
This article contains the most common reasons for fail-over in ClusterXL.
In addition, this article provides the list of les and outputs that should be collected for each case of fail-over.
This article is an advanced guide made to complement sk56202 - How to troubleshoot failovers in ClusterXL.
In addition, refer to sk93306 - ATRG: ClusterXL R6x and R7xand to sk92723 - Cluster apping prevention.
2. Policy installation
Example
Interface is declared as "Down" because there is a problem with CCP packets (either Inbound, or Outbound, or both directions) As a result, the local
member counts less interfaces than required.
Related solutions:
sk92723 - Cluster apping prevention
sk93454 - Increasing ClusterXL dead timeout
sk43984 - Interface apping when cluster interfaces are connected through several switches
outputs:
[Expert@HostName]# cphaprob state
[Expert@HostName]# cphaprob -a if
[Expert@HostName]# cphaprob -ia list
[Expert@HostName]# ifconfig -a
[Expert@HostName]# fw getifs
[Expert@HostName]# fw ctl iflist
[Expert@HostName]# netstat -ni
[Expert@HostName]# tcpdump -ni <PROBLEMATIC_IF> arp
[Expert@HostName]# tcpdump -ni <PROBLEMATIC_IF> icmp
les:
/var/log/messages*
/var/log/dmesg
/var/log/boot.log
$FWDIR/boot/modules/fwkern.conf
$FWDIR/conf/discntd.if
CPinfo file from each cluster member - collected with latest version of CPinfo utility per sk92739
CPinfo file from MGMT server - collected with latest version of CPinfo utility per sk92739
additional information:
export of log entries from SmartView Tracker that contain "cluster_info" in column "Information"
debug:
[Expert@HostName]# fw ctl debug 0
[Expert@HostName]# fw ctl debug -buf 32000
[Expert@HostName]# fw ctl debug -m cluster + if mac pnote stat conf timer ccp
[Expert@HostName]# fw ctl kdebug -T -f > /var/log/debug.txt
replicate the problem
press CTRL+C
[Expert@HostName]# fw ctl debug 0
B. Synchronization
Example
Full Sync did not succeed after this machine tried to join the cluster - clocks on members are not synchronized, problem with SIC certicates of the
members, policy was unloaded from the member (sk36320).
outputs:
[Expert@HostName]# cphaprob state
[Expert@HostName]# cphaprob -ia list
[Expert@HostName]# cpwd_admin list
[Expert@HostName]# ps auxwf
[Expert@HostName]# cpstat -f policy fw
[Expert@HostName]# fw ctl pstat
[Expert@HostName]# ls -l $FWDIR/state/__tmp/FW1
[Expert@HostName]# ls -l $FWDIR/state/local/FW1
[Expert@HostName]# cpvinfo $FWDIR/bin/fwd
on VSX : [Expert@HostName]# ls -l $FWDIR/state/__tmp/VSX
on VSX : [Expert@HostName]# ls -l $FWDIR/state/local/VSX
les:
/var/log/messages*
/var/log/dmesg
/var/log/boot.log
$FWDIR/log/fwd.elg*
$CPDIR/log/cpd.elg*
$CPDIR/log/cpwd.elg
CPinfo file from each cluster member - collected with latest version of CPinfo utility per sk92739
CPinfo file from MGMT server - collected with latest version of CPinfo utility per sk92739
additional information:
export of log entries from SmartView Tracker that contain "cluster_info" in column "Information"
debug:
debug of FWD daemon only (sk86321)
debug of Full Sync (sk37029 , sk37030 , sk65103)
C. Filter
Example
Policy was not loaded successfully.
Policy was unloaded (sk36320).
There was some problem in FWD daemon.
outputs:
[Expert@HostName]# cphaprob state
[Expert@HostName]# cphaprob -ia list
[Expert@HostName]# cpwd_admin list
[Expert@HostName]# ps auxwf
[Expert@HostName]# cpstat -f policy fw
[Expert@HostName]# fw ctl pstat
[Expert@HostName]# cpvinfo $FWDIR/bin/fwd
[Expert@HostName]# ls -l $FWDIR/state/__tmp/FW1
[Expert@HostName]# ls -l $FWDIR/state/local/FW1
on VSX : [Expert@HostName]# ls -l $FWDIR/state/__tmp/VSX
on VSX : [Expert@HostName]# ls -l $FWDIR/state/local/VSX
les:
/var/log/messages*
/var/log/dmesg
/var/log/boot.log
$FWDIR/log/fwd.elg*
$CPDIR/log/cpwd.elg
CPinfo file from each cluster member - collected with latest version of CPinfo utility per sk92739
CPinfo file from MGMT server - collected with latest version of CPinfo utility per sk92739
additional information:
export of log entries from SmartView Tracker that contain "cluster_info" in column "Information"
debug:
[Expert@HostName]# fw ctl debug 0
[Expert@HostName]# fw ctl debug -buf 32000
[Expert@HostName]# fw ctl debug -m fw + filter
[Expert@HostName]# fw ctl debug -m cluster + conf stat pnote if mac subs
[Expert@HostName]# fw ctl kdebug -T -f > /var/log/debug.txt
replicate the problem
press CTRL+C
[Expert@HostName]# fw ctl debug 0
D. CPHAD
Note: Does not exist on VSX cluster R6x.
Example
High load on machine's CPU interferes with reports from CPHAD pnote (e.g., policy installation on Nokia cluster - sk36647).
Hotx for CPHAMCSET was installed only on one of the members.
Different timeout values for pnote CPHAD and for pnote FWD (sk43172).
outputs:
[Expert@HostName]# cphaprob state
[Expert@HostName]# cphaprob -ia list
[Expert@HostName]# ps auxwf
[Expert@HostName]# cpstat -f policy fw
[Expert@HostName]# fw ctl pstat
[Expert@HostName]# cpvinfo $FWDIR/bin/cphamcset
[Expert@HostName]# top
[Expert@HostName]# vmstat -n 1 20
les:
/var/log/messages*
/var/log/dmesg
/var/log/boot.log
$FWDIR/conf/cphaprob.conf
$FWDIR/conf/cpha_global_pnotes.conf
CPinfo file from each cluster member - collected with latest version of CPinfo utility per sk92739
CPinfo file from MGMT server - collected with latest version of CPinfo utility per sk92739
additional information:
export of log entries from SmartView Tracker that contain "cluster_info" in column "Information"
debug:
[Expert@HostName]# fw ctl debug 0
[Expert@HostName]# fw ctl debug -buf 32000
[Expert@HostName]# fw ctl debug -m cluster + conf stat pnote subs
[Expert@HostName]# fw ctl kdebug -T -f > /var/log/debug.txt
replicate the problem
press CTRL+C
[Expert@HostName]# fw ctl debug 0
E. FWD
Example
FWD daemon is crashing.
High load on machine's CPU interferes with reports from FWD pnote (e.g., policy installation on Nokia cluster - sk36647).
Hotx for FWD was installed only on one of the members.
Different timeout values for pnote CPHAD and for pnote FWD (sk43172).
outputs:
[Expert@HostName]# cphaprob state
[Expert@HostName]# cphaprob -ia list
[Expert@HostName]# cpwd_admin list
[Expert@HostName]# ps auxwf
[Expert@HostName]# cpstat -f policy fw
[Expert@HostName]# fw ctl pstat
[Expert@HostName]# ls -l $FWDIR/state/__tmp/FW1
[Expert@HostName]# ls -l $FWDIR/state/local/FW1
[Expert@HostName]# cpvinfo $FWDIR/bin/fwd
[Expert@HostName]# top
[Expert@HostName]# vmstat -n 1 20
on VSX : [Expert@HostName]# ls -l $FWDIR/state/__tmp/VSX
on VSX : [Expert@HostName]# ls -l $FWDIR/state/local/VSX
les:
/var/log/messages*
/var/log/dmesg
/var/log/boot.log
$FWDIR/log/fwd.elg*
$CPDIR/log/cpwd.elg
$FWDIR/conf/cphaprob.conf
$FWDIR/conf/cpha_global_pnotes.conf
CPinfo file from each cluster member - collected with latest version of CPinfo utility per sk92739
CPinfo file from MGMT server - collected with latest version of CPinfo utility per sk92739
additional information:
export of log entries from SmartView Tracker that contain "cluster_info" in column "Information"
debug:
[Expert@HostName]# fw ctl debug 0
[Expert@HostName]# fw ctl debug -buf 32000
[Expert@HostName]# fw ctl debug -m cluster + conf stat pnote subs
[Expert@HostName]# fw ctl kdebug -T -f > /var/log/debug.txt
replicate the problem
press CTRL+C
[Expert@HostName]# fw ctl debug 0
F. FIBMGR
Example
Trafc between FIBMGRD daemons does not pass (TCP port 2010) (sk31243).
FWD daemon (parent process) is crashing.
High load on machine's CPU interferes with reports from FIBMGR pnote.
outputs:
[Expert@HostName]# cphaprob state
[Expert@HostName]# cphaprob -ia list
[Expert@HostName]# cpwd_admin list
[Expert@HostName]# ps auxwf
[Expert@HostName]# cpstat -f policy fw
[Expert@HostName]# fw ctl pstat
[Expert@HostName]# md5sum $ADVRDIR/bin/*
[Expert@HostName]# cpvinfo $ADVRDIR/bin/*
[Expert@HostName]# tcpdump -i <SYNC_IF> tcp port 2010
[Expert@HostName]# top
[Expert@HostName]# vmstat -n 1 20
les:
/var/log/messages*
/var/log/dmesg
/var/log/boot.log
/var/log/routing_messages*
$FWDIR/log/fwd.elg*
$CPDIR/log/cpwd.elg
CPinfo file from each cluster member - collected with latest version of CPinfo utility per sk92739
CPinfo file from MGMT server - collected with latest version of CPinfo utility per sk92739
additional information:
export of log entries from SmartView Tracker that contain "cluster_info" in column "Information"
debug:
[Expert@HostName]# fw ctl debug 0
[Expert@HostName]# fw ctl debug -buf 32000
[Expert@HostName]# fw ctl debug -m cluster + conf stat pnote subs
[Expert@HostName]# fw ctl kdebug -T -f > /var/log/debug.txt
replicate the problem
press CTRL+C
[Expert@HostName]# fw ctl debug 0
G. CVPND
Example
CVPND daemon is crashing.
outputs:
[Expert@HostName]# cphaprob state
[Expert@HostName]# cphaprob -ia list
[Expert@HostName]# cpwd_admin list
[Expert@HostName]# ps auxwf
[Expert@HostName]# cpstat -f policy fw
[Expert@HostName]# fw ctl pstat
[Expert@HostName]# ls -l $FWDIR/state/__tmp/FW1
[Expert@HostName]# ls -l $FWDIR/state/local/FW1
[Expert@HostName]# cpvinfo $CVPNDIR/bin/cvpnd
[Expert@HostName]# top
[Expert@HostName]# vmstat -n 1 20
les:
/var/log/messages*
/var/log/dmesg
/var/log/boot.log
$CVPNDIR/log/cvpnd.elg
$CVPNDIR/log/httpd.log
$CVPNDIR/log/trace_log/*
$CVPNDIR/conf/httpd.conf
$CVPNDIR/conf/cvpnd.C
$CPDIR/log/cpwd.elg
CPinfo file from each cluster member - collected with latest version of CPinfo utility per sk92739
CPinfo file from MGMT server - collected with latest version of CPinfo utility per sk92739
additional information:
export of log entries from SmartView Tracker that contain "cluster_info" in column "Information"
debug:
[Expert@HostName]# fw ctl debug 0
[Expert@HostName]# fw ctl debug -buf 32000
[Expert@HostName]# fw ctl debug -m cluster + conf stat pnote subs
[Expert@HostName]# fw ctl kdebug -T -f > /var/log/debug.txt
replicate the problem
press CTRL+C
[Expert@HostName]# fw ctl debug 0
H. ROUTED
Example
ROUTED daemon is down / is not able to start.
Related solution: sk92787 - How to debug ClusterXL failovers caused by RouteD daemon on Gaia OS.
outputs:
[Expert@HostName]# cphaprob state
[Expert@HostName]# cphaprob -ia list
[Expert@HostName]# cpwd_admin list
[Expert@HostName]# ps auxwf
[Expert@HostName]# netstat -anp
[Expert@HostName]# cpstat -f policy fw
[Expert@HostName]# fw ctl pstat
[Expert@HostName]# ls -l $FWDIR/state/__tmp/FW1
[Expert@HostName]# ls -l $FWDIR/state/local/FW1
[Expert@HostName]# cpvinfo /bin/routed
[Expert@HostName]# top
[Expert@HostName]# vmstat -n 1 20
In addition:
[Expert@Cluster_Member_HostName:0]# iclid
Cluster_Member_HostName> show cluster state
les:
/var/log/messages*
/etc/routed*.conf
/var/log/routed.log
/var/log/dmesg
/var/log/boot.log
$CPDIR/log/cpwd.elg
CPinfo file from each cluster member - collected with latest version of CPinfo utility per sk92739
CPinfo file from MGMT server - collected with latest version of CPinfo utility per sk92739
additional information:
export of log entries from SmartView Tracker that contain "cluster_info" in column "Information"
debug:
[Expert@HostName]# fw ctl debug 0
[Expert@HostName]# fw ctl debug -buf 32000
[Expert@HostName]# fw ctl debug -m cluster + conf stat pnote if mac subs
[Expert@HostName]# fw ctl kdebug -T -f > /var/log/debug.txt
replicate the problem
press CTRL+C
[Expert@HostName]# fw ctl debug 0
I. TED
Example
TED daemon is down / is not able to start.
outputs:
[Expert@HostName]# cphaprob state
[Expert@HostName]# cphaprob -ia list
[Expert@HostName]# cpwd_admin list
[Expert@HostName]# ps auxwf
[Expert@HostName]# netstat -anp
[Expert@HostName]# cpstat -f policy fw
[Expert@HostName]# fw ctl pstat
[Expert@HostName]# ls -l $FWDIR/state/__tmp/FW1
[Expert@HostName]# ls -l $FWDIR/state/local/FW1
[Expert@HostName]# top
[Expert@HostName]# vmstat -n 1 20
les:
/var/log/messages*
/var/log/dmesg
/var/log/boot.log
$FWDIR/log/ted.elg*
$CPDIR/log/cpwd.elg
CPinfo file from each cluster member - collected with latest version of CPinfo utility per sk92739
CPinfo file from MGMT server - collected with latest version of CPinfo utility per sk92739
additional information:
export of log entries from SmartView Tracker that contain "cluster_info" in column "Information"
debug:
[Expert@HostName]# fw ctl debug 0
[Expert@HostName]# fw ctl debug -buf 32000
[Expert@HostName]# fw ctl debug -m cluster + conf stat pnote if mac subs
[Expert@HostName]# fw ctl kdebug -T -f > /var/log/debug.txt
replicate the problem
press CTRL+C
[Expert@HostName]# fw ctl debug 0
J. VSX
Example
Virtual Systems are in problematic state (e.g., policy installation failed).
Related solutions:
sk92812 - VSX Virtual System might be left without any policy, if installation of policy fails after running 'cpstop;cpstart' commands
sk93599 - Failover occurs randomly in VSX cluster because Critical Device 'VSX' reports its status as 'problem'
outputs:
[Expert@HostName:0]# vsx stat -v
[Expert@HostName:0]# vsx stat -l
[Expert@HostName:0]# cphaprob state
[Expert@HostName:0]# cphaprob -ia list
[Expert@HostName:0]# cpwd_admin list
[Expert@HostName:0]# ps auxwf
[Expert@HostName:0]# top
[Expert@HostName:0]# vmstat -n 1 20
[Expert@HostName:0]# ifconfig -a
[Expert@HostName:0]# netstat -anp
[Expert@HostName:0]# ls -l $FWDIR/state/__tmp/FW1
[Expert@HostName:0]# ls -l $FWDIR/state/__tmp/VSX
[Expert@HostName:0]# ls -l $FWDIR/state/local/FW1
[Expert@HostName:0]# ls -l $FWDIR/state/local/VSX
The following outputs have to be collected from context of each VS / VR:
[Expert@HostName:0]# ifconfig -a
[Expert@HostName:0]# netstat -anp
[Expert@HostName:0]# ls -l $FWDIR/state/__tmp/FW1
[Expert@HostName:0]# ls -l $FWDIR/state/local/FW1
les:
This log le has to be collected from context of VS0 and of each VS / VR:
$FWDIR/log/fwk.elg*
/var/log/messages*
/etc/routed*.conf
/var/log/routed.log
/var/log/dmesg
/var/log/boot.log
$CPDIR/log/cpwd.elg
CPinfo file from each cluster member - collected with latest version of CPinfo utility per sk92739
CPinfo file from MGMT server - collected with latest version of CPinfo utility per sk92739
additional information:
export of log entries from SmartView Tracker that contain "cluster_info" in column "Information"
debug:
[Expert@HostName:0]# fw ctl debug 0
[Expert@HostName:0]# fw ctl debug -buf 32000
[Expert@HostName:0]# fw ctl debug -m cluster + conf stat pnote if mac
[Expert@HostName:0]# fw ctl kdebug -T -f > /var/log/debug.txt
replicate the problem
press CTRL+C
[Expert@HostName:0]# fw ctl debug 0
K. Instances
Example
Mismatch between the number of CoreXL FW instances in the received CCP packet and the number of loaded CoreXL FW instances on the involved
Virtual System.
Refer to sk106912 - VSX cluster member is "Down" due to Critical Device "Instances" in "problem" state.
Need to collect the following:
outputs:
[Expert@HostName]# cphaprob state
[Expert@HostName]# cphaprob -ia list
[Expert@HostName]# fw ctl multik stat
les:
/etc/fw.boot/boot.conf
/var/log/messages*
/var/log/dmesg
/var/log/boot.log
$FWDIR/log/fwd.elg*
$FWDIR/conf/cphaprob.conf
$FWDIR/conf/cpha_global_pnotes.conf
CPinfo file from each cluster member - collected with latest version of CPinfo utility per sk92739
CPinfo file from MGMT server - collected with latest version of CPinfo utility per sk92739
additional information:
export of log entries from SmartView Tracker that contain "cluster_info" in column "Information"
debug:
[Expert@HostName]# fw ctl debug 0
[Expert@HostName]# fw ctl debug -buf 32000
[Expert@HostName]# fw ctl debug -m cluster + stat pnote ccp conf
[Expert@HostName]# fw ctl kdebug -T -f > /var/log/debug.txt
replicate the problem
press CTRL+C
[Expert@HostName]# fw ctl debug 0
L. Customer pnotes
Example
outputs:
[Expert@HostName]# cphaprob state
[Expert@HostName]# cphaprob -ia list
les:
/var/log/messages*
/var/log/dmesg
/var/log/boot.log
$FWDIR/log/fwd.elg*
$FWDIR/conf/cphaprob.conf
$FWDIR/conf/cpha_global_pnotes.conf
CPinfo file from each cluster member - collected with latest version of CPinfo utility per sk92739
CPinfo file from MGMT server - collected with latest version of CPinfo utility per sk92739
additional information:
export of log entries from SmartView Tracker that contain "cluster_info" in column "Information"
debug:
[Expert@HostName]# fw ctl debug 0
[Expert@HostName]# fw ctl debug -buf 32000
[Expert@HostName]# fw ctl debug -m cluster + stat pnote subs conf
[Expert@HostName]# fw ctl kdebug -T -f > /var/log/debug.txt
replicate the problem
press CTRL+C
[Expert@HostName]# fw ctl debug 0
Example
Customer changes the priorities of the members.
In HA Active Up mode, the Standby member is not under high load, therefore it installs the policy faster and becomes Active.
outputs:
[Expert@HostName]# cphaprob state
[Expert@HostName]# cphaprob -ia list
[Expert@HostName]# cpwd_admin list
[Expert@HostName]# ps auxwf
[Expert@HostName]# cpstat -f policy fw
[Expert@HostName]# fw ctl pstat
les:
/var/log/messages*
/var/log/dmesg
/var/log/boot.log
$FWDIR/log/fwd.elg*
$CPDIR/log/cpwd.elg
CPinfo file from each cluster member - collected with latest version of CPinfo utility per sk92739
CPinfo file from MGMT server - collected with latest version of CPinfo utility per sk92739
additional information:
export of log entries from SmartView Tracker that contain "cluster_info" in column "Information"
debug:
[Expert@HostName]# fw ctl debug 0
[Expert@HostName]# fw ctl debug -buf 32000
[Expert@HostName]# fw ctl debug -m cluster + conf stat pnote if mac subs
[Expert@HostName]# fw ctl kdebug -T -f > /var/log/debug.txt
replicate the problem
press CTRL+C
[Expert@HostName]# fw ctl debug 0
Example
Active / Pivot member changed its state to "Down" (e.g., after 'clusterXL_admin down' command), and sent a CCP My_State packet. As a result, Standby
member will change its state to Active.
outputs:
[Expert@HostName]# cphaprob state
[Expert@HostName]# cphaprob -ia list
[Expert@HostName]# cpwd_admin list
[Expert@HostName]# fw ctl pstat
les:
/var/log/messages*
/var/log/dmesg
/var/log/boot.log
CPinfo file from each cluster member - collected with latest version of CPinfo utility per sk92739
CPinfo file from MGMT server - collected with latest version of CPinfo utility per sk92739
additional information:
export of log entries from SmartView Tracker that contain "cluster_info" in column "Information"
debug:
[Expert@HostName]# fw ctl debug 0
[Expert@HostName]# fw ctl debug -buf 32000
[Expert@HostName]# fw ctl debug -m cluster + ccp conf stat pnote if mac subs
[Expert@HostName]# fw ctl kdebug -T -f > /var/log/debug.txt
replicate the problem
press CTRL+C
[Expert@HostName]# fw ctl debug 0
1994-2017 Check Point Software Technologies Ltd. All rights reserved.
Copyright | Privacy Policy