0% found this document useful (0 votes)
3K views8 pages

Sk62570 - How To Troubleshoot Failovers in ClusterXL - Advanced Guide

Sk62570 - How to Troubleshoot Failovers in ClusterXL - Advanced Guide

Uploaded by

Pepe Lopez
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
3K views8 pages

Sk62570 - How To Troubleshoot Failovers in ClusterXL - Advanced Guide

Sk62570 - How to Troubleshoot Failovers in ClusterXL - Advanced Guide

Uploaded by

Pepe Lopez
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 8

SUPPORT CENTER USER CENTER / PARTNER MAP THREAT PREVENTION RESOURCES MY ACCOUNT

PRODUCTS / SOLUTIONS SUPPORT / SERVICES PARTNERS COMPANY BLOG

Support Center > Search Results > SecureKnowledge Details

Search Support Center

How to troubleshoot failovers in ClusterXL - Advanced Guide


Rate This My Favorites Email Print

Solution ID sk62570
Product ClusterXL, Cluster - 3rd party
Version All
Platform / Model All
Date Created 21-abr-2011
Last Modied 20-mar-2016

Solution
This article contains the most common reasons for fail-over in ClusterXL.
In addition, this article provides the list of les and outputs that should be collected for each case of fail-over.

This article is an advanced guide made to complement sk56202 - How to troubleshoot failovers in ClusterXL.

In addition, refer to sk93306 - ATRG: ClusterXL R6x and R7xand to sk92723 - Cluster apping prevention.

Fail-over might be caused by one of three major reasons:

1. Critical Device (Pnote) reported its state as 'problem'


Interface Active Check
Synchronization
Filter
CPHAD
FWD
FIBMGR
CVPND
ROUTED
TED
VSX
Instances
Customer pnotes

2. Policy installation

3. CCP packet of type 'My_State' was deliberately sent by a member

Each of the above reasons is described below:

1. Fail-over caused by Pnotes


A. Interface Active Check

Example
Interface is declared as "Down" because there is a problem with CCP packets (either Inbound, or Outbound, or both directions) As a result, the local
member counts less interfaces than required.
Related solutions:
sk92723 - Cluster apping prevention
sk93454 - Increasing ClusterXL dead timeout
sk43984 - Interface apping when cluster interfaces are connected through several switches

Need to check the following:


sk33781 - Performance analysis for Security Gateway NGX R65 / R7x
sk98348 - Best Practices - Security Gateway Performance

Need to collect the following:

outputs:
[Expert@HostName]# cphaprob state
[Expert@HostName]# cphaprob -a if
[Expert@HostName]# cphaprob -ia list
[Expert@HostName]# ifconfig -a
[Expert@HostName]# fw getifs
[Expert@HostName]# fw ctl iflist
[Expert@HostName]# netstat -ni
[Expert@HostName]# tcpdump -ni <PROBLEMATIC_IF> arp
[Expert@HostName]# tcpdump -ni <PROBLEMATIC_IF> icmp

les:
/var/log/messages*
/var/log/dmesg
/var/log/boot.log
$FWDIR/boot/modules/fwkern.conf
$FWDIR/conf/discntd.if
CPinfo file from each cluster member - collected with latest version of CPinfo utility per sk92739
CPinfo file from MGMT server - collected with latest version of CPinfo utility per sk92739

additional information:
export of log entries from SmartView Tracker that contain "cluster_info" in column "Information"

debug:
[Expert@HostName]# fw ctl debug 0
[Expert@HostName]# fw ctl debug -buf 32000
[Expert@HostName]# fw ctl debug -m cluster + if mac pnote stat conf timer ccp
[Expert@HostName]# fw ctl kdebug -T -f > /var/log/debug.txt
replicate the problem
press CTRL+C
[Expert@HostName]# fw ctl debug 0

B. Synchronization
Example
Full Sync did not succeed after this machine tried to join the cluster - clocks on members are not synchronized, problem with SIC certicates of the
members, policy was unloaded from the member (sk36320).

Need to collect the following:

outputs:
[Expert@HostName]# cphaprob state
[Expert@HostName]# cphaprob -ia list
[Expert@HostName]# cpwd_admin list
[Expert@HostName]# ps auxwf
[Expert@HostName]# cpstat -f policy fw
[Expert@HostName]# fw ctl pstat
[Expert@HostName]# ls -l $FWDIR/state/__tmp/FW1
[Expert@HostName]# ls -l $FWDIR/state/local/FW1
[Expert@HostName]# cpvinfo $FWDIR/bin/fwd
on VSX : [Expert@HostName]# ls -l $FWDIR/state/__tmp/VSX
on VSX : [Expert@HostName]# ls -l $FWDIR/state/local/VSX

les:
/var/log/messages*
/var/log/dmesg
/var/log/boot.log
$FWDIR/log/fwd.elg*
$CPDIR/log/cpd.elg*
$CPDIR/log/cpwd.elg
CPinfo file from each cluster member - collected with latest version of CPinfo utility per sk92739
CPinfo file from MGMT server - collected with latest version of CPinfo utility per sk92739

additional information:
export of log entries from SmartView Tracker that contain "cluster_info" in column "Information"

debug:
debug of FWD daemon only (sk86321)
debug of Full Sync (sk37029 , sk37030 , sk65103)

C. Filter

Example
Policy was not loaded successfully.
Policy was unloaded (sk36320).
There was some problem in FWD daemon.

Need to collect the following:

outputs:
[Expert@HostName]# cphaprob state
[Expert@HostName]# cphaprob -ia list
[Expert@HostName]# cpwd_admin list
[Expert@HostName]# ps auxwf
[Expert@HostName]# cpstat -f policy fw
[Expert@HostName]# fw ctl pstat
[Expert@HostName]# cpvinfo $FWDIR/bin/fwd
[Expert@HostName]# ls -l $FWDIR/state/__tmp/FW1
[Expert@HostName]# ls -l $FWDIR/state/local/FW1
on VSX : [Expert@HostName]# ls -l $FWDIR/state/__tmp/VSX
on VSX : [Expert@HostName]# ls -l $FWDIR/state/local/VSX

les:
/var/log/messages*
/var/log/dmesg
/var/log/boot.log
$FWDIR/log/fwd.elg*
$CPDIR/log/cpwd.elg
CPinfo file from each cluster member - collected with latest version of CPinfo utility per sk92739
CPinfo file from MGMT server - collected with latest version of CPinfo utility per sk92739
additional information:
export of log entries from SmartView Tracker that contain "cluster_info" in column "Information"

debug:
[Expert@HostName]# fw ctl debug 0
[Expert@HostName]# fw ctl debug -buf 32000
[Expert@HostName]# fw ctl debug -m fw + filter
[Expert@HostName]# fw ctl debug -m cluster + conf stat pnote if mac subs
[Expert@HostName]# fw ctl kdebug -T -f > /var/log/debug.txt
replicate the problem
press CTRL+C
[Expert@HostName]# fw ctl debug 0

D. CPHAD
Note: Does not exist on VSX cluster R6x.

Example
High load on machine's CPU interferes with reports from CPHAD pnote (e.g., policy installation on Nokia cluster - sk36647).
Hotx for CPHAMCSET was installed only on one of the members.
Different timeout values for pnote CPHAD and for pnote FWD (sk43172).

Need to collect the following:

outputs:
[Expert@HostName]# cphaprob state
[Expert@HostName]# cphaprob -ia list
[Expert@HostName]# ps auxwf
[Expert@HostName]# cpstat -f policy fw
[Expert@HostName]# fw ctl pstat
[Expert@HostName]# cpvinfo $FWDIR/bin/cphamcset
[Expert@HostName]# top
[Expert@HostName]# vmstat -n 1 20

les:
/var/log/messages*
/var/log/dmesg
/var/log/boot.log
$FWDIR/conf/cphaprob.conf
$FWDIR/conf/cpha_global_pnotes.conf
CPinfo file from each cluster member - collected with latest version of CPinfo utility per sk92739
CPinfo file from MGMT server - collected with latest version of CPinfo utility per sk92739

additional information:
export of log entries from SmartView Tracker that contain "cluster_info" in column "Information"

debug:
[Expert@HostName]# fw ctl debug 0
[Expert@HostName]# fw ctl debug -buf 32000
[Expert@HostName]# fw ctl debug -m cluster + conf stat pnote subs
[Expert@HostName]# fw ctl kdebug -T -f > /var/log/debug.txt
replicate the problem
press CTRL+C
[Expert@HostName]# fw ctl debug 0

E. FWD

Example
FWD daemon is crashing.
High load on machine's CPU interferes with reports from FWD pnote (e.g., policy installation on Nokia cluster - sk36647).
Hotx for FWD was installed only on one of the members.
Different timeout values for pnote CPHAD and for pnote FWD (sk43172).

Need to collect the following:

outputs:
[Expert@HostName]# cphaprob state
[Expert@HostName]# cphaprob -ia list
[Expert@HostName]# cpwd_admin list
[Expert@HostName]# ps auxwf
[Expert@HostName]# cpstat -f policy fw
[Expert@HostName]# fw ctl pstat
[Expert@HostName]# ls -l $FWDIR/state/__tmp/FW1
[Expert@HostName]# ls -l $FWDIR/state/local/FW1
[Expert@HostName]# cpvinfo $FWDIR/bin/fwd
[Expert@HostName]# top
[Expert@HostName]# vmstat -n 1 20
on VSX : [Expert@HostName]# ls -l $FWDIR/state/__tmp/VSX
on VSX : [Expert@HostName]# ls -l $FWDIR/state/local/VSX

les:
/var/log/messages*
/var/log/dmesg
/var/log/boot.log
$FWDIR/log/fwd.elg*
$CPDIR/log/cpwd.elg
$FWDIR/conf/cphaprob.conf
$FWDIR/conf/cpha_global_pnotes.conf
CPinfo file from each cluster member - collected with latest version of CPinfo utility per sk92739
CPinfo file from MGMT server - collected with latest version of CPinfo utility per sk92739

additional information:
export of log entries from SmartView Tracker that contain "cluster_info" in column "Information"

debug:
[Expert@HostName]# fw ctl debug 0
[Expert@HostName]# fw ctl debug -buf 32000
[Expert@HostName]# fw ctl debug -m cluster + conf stat pnote subs
[Expert@HostName]# fw ctl kdebug -T -f > /var/log/debug.txt
replicate the problem
press CTRL+C
[Expert@HostName]# fw ctl debug 0

F. FIBMGR

Example
Trafc between FIBMGRD daemons does not pass (TCP port 2010) (sk31243).
FWD daemon (parent process) is crashing.
High load on machine's CPU interferes with reports from FIBMGR pnote.

Need to collect the following:

outputs:
[Expert@HostName]# cphaprob state
[Expert@HostName]# cphaprob -ia list
[Expert@HostName]# cpwd_admin list
[Expert@HostName]# ps auxwf
[Expert@HostName]# cpstat -f policy fw
[Expert@HostName]# fw ctl pstat
[Expert@HostName]# md5sum $ADVRDIR/bin/*
[Expert@HostName]# cpvinfo $ADVRDIR/bin/*
[Expert@HostName]# tcpdump -i <SYNC_IF> tcp port 2010
[Expert@HostName]# top
[Expert@HostName]# vmstat -n 1 20

les:
/var/log/messages*
/var/log/dmesg
/var/log/boot.log
/var/log/routing_messages*
$FWDIR/log/fwd.elg*
$CPDIR/log/cpwd.elg
CPinfo file from each cluster member - collected with latest version of CPinfo utility per sk92739
CPinfo file from MGMT server - collected with latest version of CPinfo utility per sk92739

additional information:
export of log entries from SmartView Tracker that contain "cluster_info" in column "Information"

debug:
[Expert@HostName]# fw ctl debug 0
[Expert@HostName]# fw ctl debug -buf 32000
[Expert@HostName]# fw ctl debug -m cluster + conf stat pnote subs
[Expert@HostName]# fw ctl kdebug -T -f > /var/log/debug.txt
replicate the problem
press CTRL+C
[Expert@HostName]# fw ctl debug 0

G. CVPND

Example
CVPND daemon is crashing.

Need to collect the following:

outputs:
[Expert@HostName]# cphaprob state
[Expert@HostName]# cphaprob -ia list
[Expert@HostName]# cpwd_admin list
[Expert@HostName]# ps auxwf
[Expert@HostName]# cpstat -f policy fw
[Expert@HostName]# fw ctl pstat
[Expert@HostName]# ls -l $FWDIR/state/__tmp/FW1
[Expert@HostName]# ls -l $FWDIR/state/local/FW1
[Expert@HostName]# cpvinfo $CVPNDIR/bin/cvpnd
[Expert@HostName]# top
[Expert@HostName]# vmstat -n 1 20

les:
/var/log/messages*
/var/log/dmesg
/var/log/boot.log
$CVPNDIR/log/cvpnd.elg
$CVPNDIR/log/httpd.log
$CVPNDIR/log/trace_log/*
$CVPNDIR/conf/httpd.conf
$CVPNDIR/conf/cvpnd.C
$CPDIR/log/cpwd.elg
CPinfo file from each cluster member - collected with latest version of CPinfo utility per sk92739
CPinfo file from MGMT server - collected with latest version of CPinfo utility per sk92739

additional information:
export of log entries from SmartView Tracker that contain "cluster_info" in column "Information"

debug:
[Expert@HostName]# fw ctl debug 0
[Expert@HostName]# fw ctl debug -buf 32000
[Expert@HostName]# fw ctl debug -m cluster + conf stat pnote subs
[Expert@HostName]# fw ctl kdebug -T -f > /var/log/debug.txt
replicate the problem
press CTRL+C
[Expert@HostName]# fw ctl debug 0

H. ROUTED

Example
ROUTED daemon is down / is not able to start.
Related solution: sk92787 - How to debug ClusterXL failovers caused by RouteD daemon on Gaia OS.

Need to collect the following:

outputs:
[Expert@HostName]# cphaprob state
[Expert@HostName]# cphaprob -ia list
[Expert@HostName]# cpwd_admin list
[Expert@HostName]# ps auxwf
[Expert@HostName]# netstat -anp
[Expert@HostName]# cpstat -f policy fw
[Expert@HostName]# fw ctl pstat
[Expert@HostName]# ls -l $FWDIR/state/__tmp/FW1
[Expert@HostName]# ls -l $FWDIR/state/local/FW1
[Expert@HostName]# cpvinfo /bin/routed
[Expert@HostName]# top
[Expert@HostName]# vmstat -n 1 20

In addition:
[Expert@Cluster_Member_HostName:0]# iclid
Cluster_Member_HostName> show cluster state

les:
/var/log/messages*
/etc/routed*.conf
/var/log/routed.log
/var/log/dmesg
/var/log/boot.log
$CPDIR/log/cpwd.elg
CPinfo file from each cluster member - collected with latest version of CPinfo utility per sk92739
CPinfo file from MGMT server - collected with latest version of CPinfo utility per sk92739

additional information:
export of log entries from SmartView Tracker that contain "cluster_info" in column "Information"

debug:
[Expert@HostName]# fw ctl debug 0
[Expert@HostName]# fw ctl debug -buf 32000
[Expert@HostName]# fw ctl debug -m cluster + conf stat pnote if mac subs
[Expert@HostName]# fw ctl kdebug -T -f > /var/log/debug.txt
replicate the problem
press CTRL+C
[Expert@HostName]# fw ctl debug 0

I. TED
Example
TED daemon is down / is not able to start.

Need to collect the following:

outputs:
[Expert@HostName]# cphaprob state
[Expert@HostName]# cphaprob -ia list
[Expert@HostName]# cpwd_admin list
[Expert@HostName]# ps auxwf
[Expert@HostName]# netstat -anp
[Expert@HostName]# cpstat -f policy fw
[Expert@HostName]# fw ctl pstat
[Expert@HostName]# ls -l $FWDIR/state/__tmp/FW1
[Expert@HostName]# ls -l $FWDIR/state/local/FW1
[Expert@HostName]# top
[Expert@HostName]# vmstat -n 1 20

les:
/var/log/messages*
/var/log/dmesg
/var/log/boot.log
$FWDIR/log/ted.elg*
$CPDIR/log/cpwd.elg
CPinfo file from each cluster member - collected with latest version of CPinfo utility per sk92739
CPinfo file from MGMT server - collected with latest version of CPinfo utility per sk92739

additional information:
export of log entries from SmartView Tracker that contain "cluster_info" in column "Information"

debug:
[Expert@HostName]# fw ctl debug 0
[Expert@HostName]# fw ctl debug -buf 32000
[Expert@HostName]# fw ctl debug -m cluster + conf stat pnote if mac subs
[Expert@HostName]# fw ctl kdebug -T -f > /var/log/debug.txt
replicate the problem
press CTRL+C
[Expert@HostName]# fw ctl debug 0

J. VSX

Example
Virtual Systems are in problematic state (e.g., policy installation failed).

Related solutions:
sk92812 - VSX Virtual System might be left without any policy, if installation of policy fails after running 'cpstop;cpstart' commands
sk93599 - Failover occurs randomly in VSX cluster because Critical Device 'VSX' reports its status as 'problem'

Need to collect the following:

outputs:
[Expert@HostName:0]# vsx stat -v
[Expert@HostName:0]# vsx stat -l
[Expert@HostName:0]# cphaprob state
[Expert@HostName:0]# cphaprob -ia list
[Expert@HostName:0]# cpwd_admin list
[Expert@HostName:0]# ps auxwf
[Expert@HostName:0]# top
[Expert@HostName:0]# vmstat -n 1 20
[Expert@HostName:0]# ifconfig -a
[Expert@HostName:0]# netstat -anp
[Expert@HostName:0]# ls -l $FWDIR/state/__tmp/FW1
[Expert@HostName:0]# ls -l $FWDIR/state/__tmp/VSX
[Expert@HostName:0]# ls -l $FWDIR/state/local/FW1
[Expert@HostName:0]# ls -l $FWDIR/state/local/VSX
The following outputs have to be collected from context of each VS / VR:
[Expert@HostName:0]# ifconfig -a
[Expert@HostName:0]# netstat -anp
[Expert@HostName:0]# ls -l $FWDIR/state/__tmp/FW1
[Expert@HostName:0]# ls -l $FWDIR/state/local/FW1

les:
This log le has to be collected from context of VS0 and of each VS / VR:
$FWDIR/log/fwk.elg*

/var/log/messages*
/etc/routed*.conf
/var/log/routed.log
/var/log/dmesg
/var/log/boot.log
$CPDIR/log/cpwd.elg
CPinfo file from each cluster member - collected with latest version of CPinfo utility per sk92739
CPinfo file from MGMT server - collected with latest version of CPinfo utility per sk92739

additional information:
export of log entries from SmartView Tracker that contain "cluster_info" in column "Information"

debug:
[Expert@HostName:0]# fw ctl debug 0
[Expert@HostName:0]# fw ctl debug -buf 32000
[Expert@HostName:0]# fw ctl debug -m cluster + conf stat pnote if mac
[Expert@HostName:0]# fw ctl kdebug -T -f > /var/log/debug.txt
replicate the problem
press CTRL+C
[Expert@HostName:0]# fw ctl debug 0

K. Instances

Example
Mismatch between the number of CoreXL FW instances in the received CCP packet and the number of loaded CoreXL FW instances on the involved
Virtual System.
Refer to sk106912 - VSX cluster member is "Down" due to Critical Device "Instances" in "problem" state.
Need to collect the following:

outputs:
[Expert@HostName]# cphaprob state
[Expert@HostName]# cphaprob -ia list
[Expert@HostName]# fw ctl multik stat

les:
/etc/fw.boot/boot.conf
/var/log/messages*
/var/log/dmesg
/var/log/boot.log
$FWDIR/log/fwd.elg*
$FWDIR/conf/cphaprob.conf
$FWDIR/conf/cpha_global_pnotes.conf
CPinfo file from each cluster member - collected with latest version of CPinfo utility per sk92739
CPinfo file from MGMT server - collected with latest version of CPinfo utility per sk92739

additional information:
export of log entries from SmartView Tracker that contain "cluster_info" in column "Information"

debug:
[Expert@HostName]# fw ctl debug 0
[Expert@HostName]# fw ctl debug -buf 32000
[Expert@HostName]# fw ctl debug -m cluster + stat pnote ccp conf
[Expert@HostName]# fw ctl kdebug -T -f > /var/log/debug.txt
replicate the problem
press CTRL+C
[Expert@HostName]# fw ctl debug 0

L. Customer pnotes

Example

Customer used $FWDIR/bin/clusterXL_admin , $FWDIR/bin/clusterXL_monitor_ips , $FWDIR/bin/clusterXL_monitor_process


script(s).
Customer registered a pnote manually.

Need to collect the following:

outputs:
[Expert@HostName]# cphaprob state
[Expert@HostName]# cphaprob -ia list

les:
/var/log/messages*
/var/log/dmesg
/var/log/boot.log
$FWDIR/log/fwd.elg*
$FWDIR/conf/cphaprob.conf
$FWDIR/conf/cpha_global_pnotes.conf
CPinfo file from each cluster member - collected with latest version of CPinfo utility per sk92739
CPinfo file from MGMT server - collected with latest version of CPinfo utility per sk92739

additional information:
export of log entries from SmartView Tracker that contain "cluster_info" in column "Information"

debug:
[Expert@HostName]# fw ctl debug 0
[Expert@HostName]# fw ctl debug -buf 32000
[Expert@HostName]# fw ctl debug -m cluster + stat pnote subs conf
[Expert@HostName]# fw ctl kdebug -T -f > /var/log/debug.txt
replicate the problem
press CTRL+C
[Expert@HostName]# fw ctl debug 0

2. Fail-over caused by policy installation

Example
Customer changes the priorities of the members.
In HA Active Up mode, the Standby member is not under high load, therefore it installs the policy faster and becomes Active.

Need to collect the following:

outputs:
[Expert@HostName]# cphaprob state
[Expert@HostName]# cphaprob -ia list
[Expert@HostName]# cpwd_admin list
[Expert@HostName]# ps auxwf
[Expert@HostName]# cpstat -f policy fw
[Expert@HostName]# fw ctl pstat

les:
/var/log/messages*
/var/log/dmesg
/var/log/boot.log
$FWDIR/log/fwd.elg*
$CPDIR/log/cpwd.elg
CPinfo file from each cluster member - collected with latest version of CPinfo utility per sk92739
CPinfo file from MGMT server - collected with latest version of CPinfo utility per sk92739

additional information:
export of log entries from SmartView Tracker that contain "cluster_info" in column "Information"

debug:
[Expert@HostName]# fw ctl debug 0
[Expert@HostName]# fw ctl debug -buf 32000
[Expert@HostName]# fw ctl debug -m cluster + conf stat pnote if mac subs
[Expert@HostName]# fw ctl kdebug -T -f > /var/log/debug.txt
replicate the problem
press CTRL+C
[Expert@HostName]# fw ctl debug 0

3. Fail-over caused by CCP 'My_State' packet

Example

Active / Pivot member changed its state to "Down" (e.g., after 'clusterXL_admin down' command), and sent a CCP My_State packet. As a result, Standby
member will change its state to Active.

Need to collect the following:

outputs:
[Expert@HostName]# cphaprob state
[Expert@HostName]# cphaprob -ia list
[Expert@HostName]# cpwd_admin list
[Expert@HostName]# fw ctl pstat

les:
/var/log/messages*
/var/log/dmesg
/var/log/boot.log
CPinfo file from each cluster member - collected with latest version of CPinfo utility per sk92739
CPinfo file from MGMT server - collected with latest version of CPinfo utility per sk92739

additional information:
export of log entries from SmartView Tracker that contain "cluster_info" in column "Information"

debug:
[Expert@HostName]# fw ctl debug 0
[Expert@HostName]# fw ctl debug -buf 32000
[Expert@HostName]# fw ctl debug -m cluster + ccp conf stat pnote if mac subs
[Expert@HostName]# fw ctl kdebug -T -f > /var/log/debug.txt
replicate the problem
press CTRL+C
[Expert@HostName]# fw ctl debug 0

Give us Feedback Please rate this document [1=Worst,5=Best]

Enter your comment here Submit


Comment


1994-2017 Check Point Software Technologies Ltd. All rights reserved.
Copyright | Privacy Policy

You might also like