Checkpoint Firewall Health Check
11 May 2011
© 2011 Check Point Software Technologies Ltd. All rights reserved. This product and related documentation are protected by copyright and distributed under licensing restricting their use, copying, distribution, and decompilation. No part of this product or related documentation may be reproduced in any form or by any means without prior written authorization of Check Point. While every precaution has been taken in the preparation of this book, Check Point assumes no responsibility for errors or omissions. This publication and features described herein are subject to change without notice.
RESTRICTED RIGHTS LEGEND: Use, duplication, or disclosure by the government is subject to restrictions as set forth in subparagraph (c)(1)(ii) of the Rights in Technical Data and Computer Software clause at DFARS 252.227-7013 and FAR 52.227-19.
TRADEMARKS: Refer to the Copyright page (http://www.checkpoint.com/copyright.html) for a list of our trademarks. Refer to the Third Party copyright notices (http://www.checkpoint.com/3rd_party_copyright.html) for a list of relevant copyrights and third-party licenses.
Important Information
Latest Software
We recommend that you install the most recent software release to stay up-to-date with the latest functional improvements, stability fixes, security enhancements and protection against new and evolving attacks.
Latest Documentation
The latest version of this document is at: http://supportcontent.checkpoint.com/documentation_download?ID=12143
For additional technical information, visit the Check Point Support Center (http://supportcenter.checkpoint.com).
Revision History
Date        Description
5/9/2011    First release of this document
Feedback
Check Point is engaged in a continuous effort to improve its documentation. Please help us by sending your comments (subject: Feedback on How To Perform A SecurePlatform Firewall Health Check).
Contents
Important Information
How To Perform a SecurePlatform Firewall Health Check
  Before You Start
Performing a SecurePlatform Firewall Health Check
  Health Check Action Severity Guide
  Section 1 Physical Platform Checks
    Date, System Uptime and Clock
    Disk Space
    Physical RAM and Swap Space
    Memory Usage
    CPU Usage
    Interface Errors
    Fragmentation
    Checking dmesg and the Messages File
  Section 2 Firewall Application Checks
    Processes
    Capacity Optimization
    ClusterXL and State Synchronization
    SecureXL
    Aggressive Ageing
    HFA Patching
Completing the Procedure
Verifying
Supported Versions
All supported Check Point versions including R70+
Supported OS
SecurePlatform versions 2.4 and 2.6
Supported Appliances
All supported SecurePlatform Appliances
Example output:
Zulu# uptime
09:46:34 up 124 days, 9:40, 1 user, load average: 0.36, 0.19, 0.14
A low uptime normally indicates that the firewall has been administratively rebooted, but it may also be the result of a self-reboot, for example after a panic.
Low uptime: if you suspect the uptime is less than it should be, check the /var/log/messages file for the reason for the last reboot.
For state synchronization between cluster members to function properly, the clocks on the cluster members must be set to within 1 minute of each other. The best means of achieving this is to use NTP. To check the time, use the uptime command. (The time command does not show seconds.) If there is a discrepancy of more than 10 seconds between the cluster members, there may be an issue with NTP. Examine the /var/log/messages file to determine if the NTP server updates are working properly:
Sep 6 16:46:42 Zulu ntpdate[6291]: step time server 10.225.227.57 offset 1.633937 sec
Sep 6 16:48:52 Shaka ntpdate[28347]: no server suitable for synchronization found
Manually adjust the time and date if required, and fix the NTP configuration or network issue. To have accurate timestamps on the logs, it is recommended that non-clustered firewalls are also configured to use NTP to synchronise their clock to an NTP reference clock.
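The log check described above can be partly automated. The following is a minimal sketch, not part of the original procedure: it assumes the ntpdate lines look like the two samples shown, and it treats any ntpdate line without an offset value as a failed update. The function name summarize_ntpdate is hypothetical.

```python
import re

# Sample lines copied from the /var/log/messages excerpt in this section.
SAMPLE_LOG = """\
Sep 6 16:46:42 Zulu ntpdate[6291]: step time server 10.225.227.57 offset 1.633937 sec
Sep 6 16:48:52 Shaka ntpdate[28347]: no server suitable for synchronization found
"""

# Assumed syslog layout: month, day, time, host, "ntpdate[pid]:", message.
LINE_RE = re.compile(r"^\w{3}\s+\d+\s+[\d:]+\s+(\S+)\s+ntpdate\[\d+\]:\s+(.*)$")
OFFSET_RE = re.compile(r"offset (-?\d+\.\d+) sec")


def summarize_ntpdate(log_text):
    """Return (host, synced, offset_seconds) for each ntpdate log line."""
    results = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # not an ntpdate line
        host, message = match.groups()
        offset = OFFSET_RE.search(message)
        if offset:
            results.append((host, True, float(offset.group(1))))
        else:
            # Assumption: no offset in the message means the update failed.
            results.append((host, False, None))
    return results


for host, synced, offset in summarize_ntpdate(SAMPLE_LOG):
    print(host, "synced" if synced else "NOT synced", offset)
```

On the sample lines, Zulu is reported as synced and Shaka as not synced, which matches the reading given in the text.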
Disk Space
The disk space usage can be examined using the command:
df -k
Example output:
[Expert@Zulu]# df -k
Filesystem   1K-blocks     Used Available Use% Mounted on
/dev/sda5       600832   187800    382512  33% /
none            600832   187800    382512  33% /dev/pts
/dev/sda1       147766    10124    130013   8% /boot
/dev/sda7      1541680   930324    533044  64% /opt
none           2045688        0   2045688   0% /dev/shm
/dev/sda6      1541680   593844    869524  41% /sysimg
/dev/sda8     27024000  5472984  20178264  22% /var
[Expert@Zulu]#
In the above example, all partitions are under 70% usage.
If a partition has a use% that is more than 70% but less than 90%, or if the use% is 90% or more, see if the partition can be cleaned up to free up disk space:
/var/opt/CPsuite-RXX/fw1/log may be filled with old log files if the firewall has been logging locally.
/var/log may have old messages files.
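The 70% and 90% thresholds above can be applied to saved df -k output with a short script. This is a sketch under stated assumptions: the labels "ok", "warning" and "critical" and the function name classify_partitions are illustrative and do not come from the original severity guide.

```python
# Sample copied from the df -k example in this section.
SAMPLE_DF = """\
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda5 600832 187800 382512 33% /
none 600832 187800 382512 33% /dev/pts
/dev/sda1 147766 10124 130013 8% /boot
/dev/sda7 1541680 930324 533044 64% /opt
none 2045688 0 2045688 0% /dev/shm
/dev/sda6 1541680 593844 869524 41% /sysimg
/dev/sda8 27024000 5472984 20178264 22% /var
"""


def classify_partitions(df_text, warn=70, critical=90):
    """Map each mount point to (use_percent, level).

    Thresholds follow the text: more than 70% needs attention,
    90% or more is the higher severity. Level names are hypothetical.
    """
    results = {}
    for line in df_text.strip().splitlines()[1:]:  # skip the header row
        cols = line.split()
        use = int(cols[4].rstrip("%"))
        mount = cols[5]
        if use >= critical:
            level = "critical"
        elif use > warn:
            level = "warning"
        else:
            level = "ok"
        results[mount] = (use, level)
    return results


for mount, (use, level) in classify_partitions(SAMPLE_DF).items():
    print(f"{mount}: {use}% {level}")
```

Every partition in the sample is classified as ok, in line with the statement that all partitions are under 70% usage.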
The total column shows the amount of RAM installed in the system (2GB in the above example) and the amount of disk space allocated for swap space (4GB). The amount of swap space is normally automatically set to twice the size of the physical memory, with 4 GB being the maximum. The used column indicates how much RAM and swap space are being used. The free column indicates how much RAM and swap space are available. In the above example output the used column indicates <1 GB of RAM is being used and no swap space is being used.
If for some reason the amount of free RAM becomes low, the appliance will start to preserve free RAM by swapping out the contents of the memory to the hard disk (swap space). Performance will be sub-optimal if swap space is being used, due to the time and resources spent writing to and reading from the hard disk.
Example output:
[Expert@Zulu]# free -k -t
             total       used       free     shared    buffers     cached
Mem:       2055120    1897424     157696          0      98732     697688
-/+ buffers/cache:    1101004     954116
Swap:      4192912     735980    3456932
Total:     6248032    2633404    3614628
[Expert@Zulu]#
Swap space usage may indicate not enough memory is installed in the appliance. The kernel is 32 bit and can use up to 4GB. It is recommended to upgrade the memory if less than 4GB of RAM are installed. For further information about the amount of RAM that is supported by SecurePlatform refer to: sk22343: What is the maximum memory supported by SecurePlatform?
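Whether swap is in use can be read directly from the Swap: row of the free output. The snippet below is a minimal sketch that assumes the column layout of the example in this section (total, used, free); the function name swap_usage is hypothetical.

```python
# Sample copied from the free -k -t example in this section.
SAMPLE_FREE = """\
             total       used       free     shared    buffers     cached
Mem:       2055120    1897424     157696          0      98732     697688
-/+ buffers/cache:    1101004     954116
Swap:      4192912     735980    3456932
Total:     6248032    2633404    3614628
"""


def swap_usage(free_text):
    """Return (total_kb, used_kb) taken from the 'Swap:' row."""
    for line in free_text.splitlines():
        cols = line.split()
        if cols and cols[0] == "Swap:":
            return int(cols[1]), int(cols[2])
    raise ValueError("no 'Swap:' row found")


total_kb, used_kb = swap_usage(SAMPLE_FREE)
if used_kb > 0:
    print(f"swap in use: {used_kb} KB of {total_kb} KB")
else:
    print("no swap in use")
```

For the sample output the script reports that swap is in use, which is the condition the text associates with insufficient installed memory.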
Memory Usage
The firewall's memory usage can be examined by using the command:
fw ctl pstat
The output of this command is extensive and can be difficult to understand, as not all of it is intuitive. The statistics that need to be checked to ensure memory is healthy are:
hash kernel memory (hmem)
system kernel memory (smem)
kernel memory (kmem)
Example output:
[Expert@Zulu]# fw ctl pstat | more
Machine Capacity Summary:
  Memory used: 7% (128MB out of 1638MB) - below low watermark
  Concurrent Connections: 21% (43253 out of 199900) - below low watermark
  Aggressive Aging is not active
Hash kernel memory (hmem) statistics:
  Total memory allocated: 142606336 bytes in 34782 4KB blocks using 34 pools
  Initial memory allocated: 20971520 bytes (Hash memory extended by 121634816 bytes)
  Memory allocation limit: 335544320 bytes using 512 pools
  Total memory bytes used: 39254196 unused: 103352140 (72.47%) peak: 133739228
  Total memory blocks used: 10335 unused: 24447 (70%) peak: 32795
  Allocations: 3375437074 alloc, 0 failed alloc, 3375001310 free
System kernel memory (smem) statistics:
  Total memory bytes used: 188577580 peak: 227270504
  Blocking memory bytes used: 1958392 peak: 2205256
  Non-Blocking memory bytes used: 186619188 peak: 225065248
  Allocations: 979925174 alloc, 0 failed alloc, 979924513 free, 0 failed free
Kernel memory (kmem) statistics:
  Total memory bytes used: 84876956 peak: 177110948
  Allocations: 3375820431 alloc, 0 failed alloc, 3375384380 free, 0 failed free
  External Allocations: 0 for packets, 31589936 for SXL
In the above example there are no hmem, smem or kmem failed allocations.
Presence of hmem failed allocations indicates that the hash kernel memory was full. This is not a serious memory problem but indicates there is a configuration problem. The value assigned to the hash memory pool (either manually, or automatically by changing the number of concurrent connections in the Capacity Optimization section of a firewall) determines the size of the hash kernel memory. If a low hmem limit was configured, it leads to improper usage of the OS memory. See Capacity Optimization in the Firewall Health Checks section for further information.
Presence of smem failed allocations indicates that the OS memory was exhausted or there are large non-sleep allocations. This is symptomatic of a memory shortage.
If there are failed smem allocations and the memory is less than 2 GB, upgrading to 2GB may fix the problem. Decreasing the TCP end timeout and decreasing the number of concurrent connections can also help reduce memory consumption.
Presence of kmem failed allocations means that some applications did not get memory. This is usually an indication of a memory problem, most commonly a memory shortage. (The natural limit is 2GB, since the kernel is 32-bit.) A memory shortage sometimes indicates a memory leak. To troubleshoot a memory shortage, stop the load and let connections close. If the memory consumption returns to normal, you are not dealing with a memory leak. Such a shortage might happen when traffic volumes are too high for the device capacity. If the memory shortage happens after a change in the system or the environment, undo the change and check whether kmem memory consumption goes down.
For optimum performance there should not be any failed memory allocations.
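The failed-allocation counters for hmem, smem and kmem can be pulled out of saved fw ctl pstat output. The following is a sketch only: it assumes the section headings and "Allocations:" lines are worded as in the example in this section, and the function name failed_allocations is hypothetical.

```python
import re

# Abridged from the fw ctl pstat example in this section.
SAMPLE_PSTAT = """\
Hash kernel memory (hmem) statistics:
  Allocations: 3375437074 alloc, 0 failed alloc, 3375001310 free
System kernel memory (smem) statistics:
  Allocations: 979925174 alloc, 0 failed alloc, 979924513 free, 0 failed free
Kernel memory (kmem) statistics:
  Allocations: 3375820431 alloc, 0 failed alloc, 3375384380 free, 0 failed free
  External Allocations: 0 for packets, 31589936 for SXL
"""


def failed_allocations(pstat_text):
    """Return {'hmem': n, 'smem': n, 'kmem': n} failed allocation counts."""
    failures = {}
    section = None
    for line in pstat_text.splitlines():
        header = re.search(r"\((hmem|smem|kmem)\) statistics:", line)
        if header:
            section = header.group(1)
            continue
        # "External Allocations" has no "failed alloc" field, so it is skipped.
        match = re.search(r"Allocations:.*?(\d+) failed alloc", line)
        if match and section:
            failures[section] = int(match.group(1))
    return failures


for pool, count in failed_allocations(SAMPLE_PSTAT).items():
    print(pool, "failed allocations:", count)
```

Any non-zero count points to the corresponding explanation in the text (hmem: configuration, smem and kmem: memory shortage).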
CPU Usage
CPU usage on single and multicore platforms can be checked with the command:
top
The example top output from a badly optimized multi-core system labels the CPU fields as follows:
%us  Time spent running non-kernel code (user)
%sy  Time spent running kernel code (system)
%ni  Nice time
%id  Time spent idle
%wa  Time spent waiting for I/O
%hi  Hardware interrupt time
%si  Software interrupt time
%st  Steal time (involuntary wait time)
The idle value (%id) shows how busy the appliance is. If the value is 0, the CPU is maxed out. With the firewall under load, examine the idle column (%id) for each CPU and determine if core usage is spread out evenly.
In the above example the core usage is uneven; some cores are maxed out while other cores are mostly idle. The core allocation (sim affinity) may require tuning to optimize the usage of the cores and improve the performance. For information on core tuning, refer to: sk33250: Automatic SIM Affinity on Multi-Core CPU Systems
The CPU usage is broken down into:
High CPU in user time (%us) indicates that some daemon process is consuming high CPU; security server processes like fwssd and in.ahttpd have been offenders in the past. (Figure out which process it is from the output of ps or top.)
High CPU usage in system (%sy) indicates that the Check Point kernel (traffic being inspected by Check Point or SmartDefense) is consuming CPU. Certain configurations in SmartDefense and Web-Intelligence can cause this to occur by disabling SecureXL templating or completely disabling SecureXL acceleration.
High CPU in wait time (%wa) occurs when the CPU was idle due to the system waiting for an outstanding disk I/O request to complete. This indicates your system is probably low on physical memory and is swapping out memory (paging)*. The CPU is not actually busy if this number is spiking; the CPU is blocked from doing any useful work while waiting for an I/O event to complete.
A high value against software interrupt (%si) indicates that there is probably a high load of traffic on the appliance. The interface errors (netstat -i) should be examined to see if this is a cause for concern.
* The occurrence of paging can be determined by running vmstat -n 5 5 and checking the swapped in (si) and swapped out (so) statistics. Disregard the first line as it is an average value since the appliance started.
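The vmstat check in the footnote can be sketched as follows. The sample text is hypothetical (this document does not include vmstat output), and the script assumes a header row that contains the si and so column names. The first data row is skipped because it is the average since boot.

```python
# Hypothetical vmstat output, for illustration only; not taken from a real appliance.
SAMPLE_VMSTAT = """\
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0 735980 157696  98732 697688    2    3    10    12  110   200  5  2 92  1
 0  0 735980 157000  98732 697700    0    0     0     8  105   190  3  1 96  0
 2  1 736100 150000  98700 697000   40  120   300   500  150   400 10  5 60 25
"""


def paging_samples(vmstat_text):
    """Return [(si, so), ...] for every sample after the first data row."""
    rows = [line.split() for line in vmstat_text.strip().splitlines()]
    # Locate the header row by the column names rather than by position.
    header_index = next(
        i for i, cols in enumerate(rows) if "si" in cols and "so" in cols
    )
    header = rows[header_index]
    si_col, so_col = header.index("si"), header.index("so")
    data = rows[header_index + 1:]
    # Skip the first data row: it is the average since the appliance started.
    return [(int(r[si_col]), int(r[so_col])) for r in data[1:]]


samples = paging_samples(SAMPLE_VMSTAT)
print("paging observed" if any(si or so for si, so in samples) else "no paging")
```

Non-zero si or so values in the remaining samples suggest the appliance is paging, which is the situation the %wa paragraph describes.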
Interface Errors
Interface statistics are displayed using the command:
netstat -i
Example output:
[Expert@Zulu]# netstat -i
Iface   MTU Met      RX-OK RX-ERR RX-DRP RX-OVR      TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0   1500   0   29597525      0      0      0   42570398      0      0      0 BMRU
eth1   1500   0 1032315302      0   3976      0 1615311511      0      0      0 BMRU
eth2   1500   0 1624715902      0  12111      0 1025019332      0      0      0 BMRU
eth6   1500   0   26828076      0      0      0  477906370      0      0      0 BMRU
lo    16436   0    5922470      0      0      0    5922470      0      0      0 LRU
[Expert@Zulu]#
In the above example, the RX-DRP values indicate that the appliance is dropping packets at the network. This is not ideal, but as a percentage of received packets the number of RX-DRP packets is insignificant and can therefore be disregarded as a source of concern. If the ratio is higher than 0.5%, attention is required.
The RX and TX columns show how many packets have been received or transmitted error-free (RX-OK/TX-OK) or damaged (RX-ERR/TX-ERR); how many were dropped (RX-DRP/TX-DRP); and how many were lost because of an overrun (RX-OVR/TX-OVR).
RX-ERR/TX-ERR errors usually indicate a mismatch in duplex setting or MTU size, bad cabling, or possibly a faulty interface card. Check the switch settings and fix the speed and duplex settings if there is a mismatch, check the cabling, and try a spare interface.
RX-DRP implies the appliance is dropping packets at the network. If the ratio of RX-DRP to RX-OK is greater than 0.5%, attention is required, as it is a sign that the firewall does not have enough FIFO memory buffer (descriptors) to hold the packets while waiting for a free interrupt to process them. When the FIFO buffer is full the appliance will drop new packets as it does not have any spare buffer to hold them. A possible solution is to use Link Aggregation or tune the driver by increasing the descriptors, see: sk25921: Tuning Intel PRO/1000 family NICs driver parameters for maximal throughput
TX-DRP usually indicates that there is a downstream issue and the firewall has to drop the packets as it is unable to put them on the wire fast enough. Increasing the bandwidth through link aggregation or introducing flow control may be a possible solution to this problem.
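The RX-DRP to RX-OK ratio can be computed per interface from saved netstat -i output and compared with the 0.5% threshold. This is a minimal sketch; the function name rx_drop_ratios is hypothetical, and the script assumes the header row carries the column names shown in the example.

```python
# Sample copied from the netstat -i example in this section.
SAMPLE_NETSTAT = """\
Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0 1500 0 29597525 0 0 0 42570398 0 0 0 BMRU
eth1 1500 0 1032315302 0 3976 0 1615311511 0 0 0 BMRU
eth2 1500 0 1624715902 0 12111 0 1025019332 0 0 0 BMRU
eth6 1500 0 26828076 0 0 0 477906370 0 0 0 BMRU
lo 16436 0 5922470 0 0 0 5922470 0 0 0 LRU
"""

THRESHOLD_PERCENT = 0.5  # threshold stated in the text


def rx_drop_ratios(netstat_text):
    """Return {interface: RX-DRP as a percentage of RX-OK}."""
    lines = netstat_text.strip().splitlines()
    header = lines[0].split()
    ok_col, drp_col = header.index("RX-OK"), header.index("RX-DRP")
    ratios = {}
    for line in lines[1:]:
        cols = line.split()
        rx_ok = int(cols[ok_col])
        if rx_ok:  # avoid dividing by zero on unused interfaces
            ratios[cols[0]] = 100.0 * int(cols[drp_col]) / rx_ok
    return ratios


for iface, pct in rx_drop_ratios(SAMPLE_NETSTAT).items():
    flag = "ATTENTION" if pct > THRESHOLD_PERCENT else "ok"
    print(f"{iface}: {pct:.6f}% {flag}")
```

For the sample, eth1 and eth2 are the only interfaces with drops, and both ratios are far below 0.5%, consistent with the conclusion in the text.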
Fragmentation
Excessive fragmentation will have a detrimental impact on the firewall's performance. When packets are fragmented by the network, the kernel may receive them out of order. The kernel has to wait until it has received all the fragments before it can re-assemble them and then inspect the re-assembled packet. Fragmented traffic cannot be accelerated by the performance pack (SecureXL).
To examine the level of fragmentation, run the following command:
fw ctl pstat
Find the section in the output for fragmentation and, if there is fragmentation, examine the expired and failures values.
Example fw ctl pstat fragmentation output (truncated):
Fragments: 130963 fragments, 64066 packets, 2337 expired, 0 short, 4 large, 304 duplicates, 0 failures
Expired denotes how many fragments were expired because the firewall failed to reassemble them within a 20-second time frame or because, due to memory exhaustion, they could no longer be kept in memory.
Failures denotes the number of fragmented packets that were received but could not be successfully re-assembled.
The number of failures should be viewed in context with the amount of fragmentation occurring and relative to the total packet throughput (netstat -i). The values in pstat are cumulative, and large values may actually be small relative to the total packet throughput. However, if there is a significant number against failures, the cause of the issue should be traced to determine if there is a way to mitigate it.
In the above example output, 1.8% of the fragments that were received had to be expired by the firewall, but as there were no failures it implies that the fragments were subsequently re-transmitted and successfully re-assembled by the firewall, so no packets were lost.
If the source of fragmentation is external there is little that can be done to alleviate the problem but if it is internal, reducing the mtu size on the offending server may resolve the problem.
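The 1.8% figure quoted above is the expired count divided by the fragment count. The sketch below reproduces that arithmetic from the Fragments line; the function name fragment_stats is hypothetical and the parsing assumes the line is worded as in the example.

```python
import re

# Line copied from the fw ctl pstat fragmentation example in this section.
SAMPLE_FRAGMENTS = (
    "Fragments: 130963 fragments, 64066 packets, 2337 expired, 0 short, "
    "4 large, 304 duplicates, 0 failures"
)


def fragment_stats(line):
    """Return a dict of the counters found on a 'Fragments:' line."""
    # Each counter is written as "<number> <name>", for example "2337 expired".
    return {name: int(value) for value, name in re.findall(r"(\d+) (\w+)", line)}


stats = fragment_stats(SAMPLE_FRAGMENTS)
expired_pct = 100.0 * stats["expired"] / stats["fragments"]
print(f"expired: {expired_pct:.1f}% of fragments, failures: {stats['failures']}")
```

With the sample line, 2337 expired out of 130963 fragments rounds to 1.8%, and the failures counter is 0.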
From time to time other messages of a similar nature may appear in dmesg, the /var/log/messages file and on the console. It is always a good idea to research the message in the Check Point Secure Knowledge if you are unsure of the meaning. For further information see: sk33219: Critical error messages and logs
Processes
A list of processes running on the firewall can be displayed with the following commands:
top
ps auxw
Use the top command to check if any process is hogging CPU or Memory and to see if there are any Zombie processes.
Example output:
[Expert@Zulu]# top
09:46:44 up 24 days, 9:40, 1 user, load average: 0.30, 0.19, 0.14
55 processes: 50 sleeping, 2 running, 3 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total   15.0%    0.0%    1.0%  10.0%    24.0%    0.0%  150.0%
           cpu00    7.0%    0.0%    0.0%   0.0%     1.0%    0.0%   92.0%
           cpu01    8.0%    0.0%    1.0%  10.0%    23.0%    0.0%   58.0%
Mem: 4091376k av, 1390028k used, 2701348k free, 0k shrd, 90864k buff
     786476k active, 140320k inactive
Swap: 4192944k av, 0k used, 4192944k free 278224k cached

  PID USER  PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND
 1526 root   25   0 97280  95M 11396 R    15.8  2.3  2590m   1 fw
    1 root   15   0   512  512   452 S     0.0  0.0   0:17   0 init
    2 root   RT   0     0    0     0 SW    0.0  0.0   0:00   0 migration
    3 root   RT   0     0    0     0 SW    0.0  0.0   0:00   1 migration
    4 root   15   0     0    0     0 SW    0.0  0.0   0:00   1 keventd
    5 root   34  19     0    0     0 SWN   0.0  0.0   0:00   0 ksoftirqd
    6 root   34  19     0    0     0 SWN   0.0  0.0   0:00   1 ksoftirqd
    9 root   25   0     0    0     0 SW    0.0  0.0   0:00   1 bdflush
    7 root   15   0     0    0     0 SW    0.0  0.0   0:10   0 kswapd
    8 root   15   0     0    0     0 SW    0.0  0.0   0:12   0 kscand
   10 root   15   0     0    0     0 SW    0.0  0.0   0:14   0 kupdated
   17 root   25   0     0    0     0 SW    0.0  0.0   0:00   0 scsi_eh_0
   22 root   15   0     0    0     0 SW    0.0  0.0   0:14   0 kjournald
   90 root   25   0     0    0     0 SW    0.0  0.0   0:00   1 khubd
The above example output indicates there are 3 zombie processes but there are no resource hogging processes. The Zombie processes should be identified to see if there is any cause for action.
Use ps auxw | more to examine the value in the START column of the init process, then check the START column of the cpd, fwd and vpnd processes and other daemons to see if they have restarted since the last boot. Identify any Zombie processes. Example output:
[Expert@Zulu]# ps auxw | more
USER   PID %CPU %MEM    VSZ    RSS TTY STAT START    TIME COMMAND
root     1  0.0  0.0   1524    512 ?   S    Jun13    0:17 init
root   731  0.0  0.0   1524    476 ?   S    Jun13    0:00 klogd -x -c 1
root  1174  0.0  0.0   3040   1348 ?   S    Jun13    0:00 /usr/sbin/sshd -4
root  1212  0.0  0.0   1572    620 ?   S    Jun13    0:00 crond
root  1265  0.0  0.0   2724    904 ?   S    Jun13    0:00 /bin/sh /opt/spwm/bin/cpwmd_wd
root  1269  0.0  0.1  34412   7348 ?   S    Jun13    0:18 cpwmd -D -app SPLATWebUI
root  1389  0.0  0.1   7948   4608 ?   S    Jun13    0:00 /opt/CPshrd-R65/bin/cprid
root  1402  0.0  0.0   9120   3908 ?   S    Jun13    2:30 /opt/CPshrd-R65/bin/cpwd
root  1416  0.2  4.9 331348 204012 ?   S    Jun13   88:42 cpd
root  1526  7.3  2.3 422392  97280 ?   S    Jun13 2590:42 fwd
root  1578  0.0  1.6 220252  66864 ?   S    Jun13    0:42 in.asessiond 0
root  1579  0.0  1.6 220220  66800 ?   S    Jun13    0:43 in.aufpd 0
root  1580  0.1  1.7 240988  69844 ?   S    Jun13   57:51 vpnd 0
root  1586  0.2  0.1  11508   6172 ?   S    Jun13   95:09 dtlsd 0
root  1680  0.0  2.0 273760  82716 ?   S    Jun13   15:20 rtmd
No daemons in the ps auxw output have restarted. A daemon process that has restarted does not necessarily indicate a fault, because somebody may have restarted it, for example by performing cpstop;cpstart. Normally the cause of a process restart can be determined by looking at the /var/log/messages file or by examining the daemon's error log file (cpd.elg, fwd.elg, vpnd.elg etc).
In the above example of top output there were 3 Zombie processes. Zombie processes do not consume resources but should not be present. Check the process list to identify the Zombie (STAT: Z) processes and determine if action is required.
[Expert@Zulu]# ps auxw | more
USER   PID %CPU %MEM  VSZ  RSS TTY   STAT START TIME COMMAND
root 18374  0.5  0.0 4680 1932 ttyp0 S    09:46 0:00 cpinfo -n -z
root 18399  0.0  0.0    0    0 ttyp0 Z    09:46 0:00 [cpprod_util
root 18403  0.2  0.0    0    0 ttyp0 Z    09:46 0:00 [cpprod_util
root 18413  0.4  0.0    0    0 ttyp0 Z    09:46 0:00 [cpprod_util
The process cpprod_util was called by a process used by CPinfo to gather Ethernet stats. The Zombie process is also marked defunct which means the same as Zombie. A defunct or Zombie process is a process that has finished but still depends on a parent which is still alive. After the completion and termination of the parent process these Zombie processes should terminate and no longer be shown in the process list. If the Zombie processes are still there after completion of the CPinfo, killing the parent process will be required to remove them from the process list. Sometimes Zombie processes are the result of an error in the daemon coding. For example if a Zombie vpnd process is seen there is a hotfix for it, refer to: sk33941: "Zombie" vpnd process
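Zombie processes can be picked out of saved ps auxw output by looking for a STAT value that starts with Z. The following is a minimal sketch using the example rows from this section; the function name find_zombies is hypothetical.

```python
# Sample copied from the ps auxw example in this section
# (the COMMAND column is truncated in the source).
SAMPLE_PS = """\
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 18374 0.5 0.0 4680 1932 ttyp0 S 09:46 0:00 cpinfo -n -z
root 18399 0.0 0.0 0 0 ttyp0 Z 09:46 0:00 [cpprod_util
root 18403 0.2 0.0 0 0 ttyp0 Z 09:46 0:00 [cpprod_util
root 18413 0.4 0.0 0 0 ttyp0 Z 09:46 0:00 [cpprod_util
"""


def find_zombies(ps_text):
    """Return [(pid, command)] for every row whose STAT starts with 'Z'."""
    lines = ps_text.strip().splitlines()
    header = lines[0].split()
    pid_col, stat_col = header.index("PID"), header.index("STAT")
    zombies = []
    for line in lines[1:]:
        # COMMAND may contain spaces, so split into at most 11 fields.
        cols = line.split(None, 10)
        if cols[stat_col].startswith("Z"):
            zombies.append((int(cols[pid_col]), cols[10]))
    return zombies


for pid, command in find_zombies(SAMPLE_PS):
    print(pid, command)
```

The three cpprod_util rows are reported, and the cpinfo parent row is not.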
Capacity Optimization
The maximum number of concurrent connections that a firewall can handle is configured in the Capacity Optimization section of the firewall or cluster object. Under normal circumstances it is recommended to use the automatic hash table size and memory pool configuration when increasing or decreasing the maximum number of concurrent connections (default 25,000).
To check what value the maximum number of concurrent connections has been set to, either check the setting in the GUI firewall/cluster object or run the following command on the firewall:
fw tab -t connections | grep limit
Example output:
[Expert@Zulu]# fw tab -t connections | grep limit
dynamic, id 8158, attributes: keep, sync, aggressive aging, expires 25, refresh, limit 100000, hashsize 534288, kbuf 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31, free function c0b98510 0, post sync handler c0b9a370
The number (100000) directly after limit is the maximum value as set in the Capacity Optimization page on the firewall or cluster object (GUI).
To check the number of concurrent connections (#VALS) and the peak value (#PEAK), use the following command on the firewall:
fw tab -t connections -s
Example output:
[Expert@Zulu]# fw tab -t connections -s
HOST       NAME          ID    #VALS  #PEAK  #SLINKS
localhost  connections   8158  23055  77921  29141
[Expert@Zulu]#
The values that we are interested in are the limit and peak values. Ensure that there is about 15-20% headroom before Aggressive Ageing is activated, so that there is adequate spare capacity in the connections table to cope with an increase in connections. If necessary, change the value in the Capacity Optimization section on the firewall object and push the policy to make it effective. Greatly over-prescribing the maximum concurrent connections is not recommended as it can lead to inefficient use of memory.
In the above example, a maximum of 100,000 concurrent connections has been set in the Capacity Optimization section for the firewall and the peak number of connections (#PEAK) was 77,921 over the last 124 days (uptime). The headroom above the #PEAK is too low because Aggressive Ageing, with its default threshold of 80%, will be activated at 80,000 connections. Increase the concurrent connections limit to around 120,000 connections to give between 15-20% headroom before Aggressive Ageing becomes active.
If NAT is performed on the module, check the fwx_cache table using the command:
fw tab -t fwx_cache -s
Example output:
[Expert@Zulu]# fw tab -t fwx_cache -s
HOST       NAME        ID    #VALS  #PEAK  #SLINKS
localhost  fwx_cache   8116  10000  10000  0
[Expert@Zulu]#
In the above example, the value of #PEAK is equal to 10,000, which indicates that the NAT cache table (default 10,000) was full at some time. (#VALS equal to 10,000 indicates that the NAT cache table is still full.) For improved NAT cache performance, the size of the NAT cache should be increased or the time that entries are held in the table decreased. For further information see:
sk21834: How to modify the values of the properties related to the NAT cache table
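The headroom reasoning in the Capacity Optimization example can be written out as a small calculation. This sketch assumes that "headroom" means the gap between #PEAK and the Aggressive Ageing trigger point, expressed as a percentage of the trigger point; that interpretation and the function name ageing_headroom_percent are assumptions, not definitions from the original text.

```python
def ageing_headroom_percent(limit, peak, threshold=0.80):
    """Percentage gap between the peak and the Aggressive Ageing trigger.

    limit:     maximum concurrent connections (Capacity Optimization)
    peak:      #PEAK from 'fw tab -t connections -s'
    threshold: Aggressive Ageing activation point (default 80% per the text)
    """
    trigger = limit * threshold
    return 100.0 * (trigger - peak) / trigger


# Values from the example: limit 100,000 and #PEAK 77,921.
current = ageing_headroom_percent(100000, 77921)
# The limit of around 120,000 recommended in the text.
proposed = ageing_headroom_percent(120000, 77921)

print(f"headroom at 100000: {current:.1f}%")
print(f"headroom at 120000: {proposed:.1f}%")
```

Under this interpretation the current limit leaves under 3% headroom, while a limit of 120,000 lands inside the 15-20% range recommended in the text.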
[Expert@Shaka]# cphaprob -a if
eth1c0 non sync(non secured)
eth2c0 non sync(non secured)
eth3c0 non sync(non secured)
eth4c0 sync(secured), broadcast
Virtual cluster interfaces: 3
eth1c0 192.168.1.1
eth2c0 192.168.2.1
eth3c0 10.1.1.1
[Expert@Shaka]#
In the above example, interface eth4c0 has been configured on both cluster members for state sync but the sync mode is inconsistent: one is using multicast and the other broadcast mode. Ensure the cluster members use the same mode. (The default mode is multicast.) The following document explains how to change between broadcast and multicast mode: sk20576: How to set ClusterXL Control Protocol (CCP) in broadcast mode in ClusterXL
Use the cphaprob state command to check if state sync is up and running. The local and remote state synchronization IP addresses should be displayed and their state should be shown as Active on the HA Master and Standby on the HA Backup. In a load-sharing cluster the state should be shown as Active on both the local and remote firewalls.
Example output - HA:
[Expert@Zulu]# cphaprob state
Cluster Mode: New High Availability (Active Up)
Number    Unique Address    Assigned Load    State
                            100%             Active
                            0%               Standby
In an HA cluster configuration (above), one member should be Active and the other Standby.
Example output - Load-Sharing:
[Expert@Dingaan]# cphaprob state
Cluster Mode: New High Availability (Active Up)
Number    Unique Address    Assigned Load    State
                            50%              Active
                            50%              Active
Example output - HA or Load-Sharing:
[Expert@Zulu]# cphaprob state
Cluster Mode: New High Availability (Active Up)
Number    Unique Address    Assigned Load    State
                            100%             Active
Remote cluster partner is missing! If the remote partner is not shown, it will usually be due to one of the following:
There is no network connectivity between the members of the cluster on the state sync network.
The partner does not have state synchronization enabled.
One partner is using broadcast mode and the other is using multicast mode.
One of the monitored processes has an issue, such as no policy loaded.
The partner firewall is down.
Example output - HA or Load-Sharing:
[Expert@Zulu]# cphaprob state
Cluster Mode: New High Availability (Active Up)
Number    Unique Address    Assigned Load    State
                            100%             Active
                            0%               Ready
Partner is in the Ready state. If one of the partners is in the Ready state, it indicates that there is an issue with state synchronization. The Ready state is normally caused by another member of the cluster running a higher version of code or HFA, as would happen, for example, during an upgrade. This state is also seen when CoreXL has been configured to use a different number of cores on the individual cluster members. For further information see: sk42096: Cluster member with CoreXL is in 'Ready' state
The Ready state can also occur if a cluster member receives state synchronization traffic from a different cluster that is using the same mac magic number and the other cluster is running a higher version of code. For further information see: sk36913: Connecting several clusters on the same network
Example output - HA or Load-Sharing:
[Expert@Zulu]# cphaprob state
Cluster Mode: New High Availability (Active Up)
Number    Unique Address    Assigned Load    State
                            100%             Active
                            0%               Down
A remote cluster member in the Down state indicates that there is either a problem on the remote member or the state synchronization network between the cluster members is broken. To investigate why a member shows itself to be locally Down, use the cpstat ha -f all | more command on the firewall that shows Down. This command displays the Problem Notification Table and the state of health of the monitored processes.
Example output (truncated):
[Expert@Zulu]# cpstat ha -f all | more
Problem Notification table
-------------------------------------------------
|Name           |Status |Priority|Verified|Descr|
-------------------------------------------------
|Synchronization|OK     |       0|    3383|     |
|Filter         |OK     |       0|    3383|     |
|cphad          |OK     |       0|       0|     |
|fwd            |OK     |       0|       0|     |
-------------------------------------------------
All monitored processes have the OK status.
Example output (truncated):
[Expert@Shaka]# cpstat ha -f all | more
Problem Notification table
-------------------------------------------------
|Name           |Status |Priority|Verified|Descr|
-------------------------------------------------
|Synchronization|problem|       0|    3383|     |
|Filter         |problem|       0|    3383|     |
|cphad          |OK     |       0|       0|     |
|fwd            |OK     |       0|       0|     |
-------------------------------------------------
State synchronization is in a problem state because the policy is unloaded on this cluster member. Installing the policy will fix this issue.
Alternatively, the cphaprob list command displays the same information plus some additional details.
Example output:
[Expert@Zulu]# cphaprob list
Registered Devices:
Device Name: Synchronization
Registration number: 0
Timeout: none
Current state: OK
Time since last report: 12139.6 sec
Device Name: Filter
Registration number: 1
Timeout: none
Current state: OK
Time since last report: 12124.5 sec
Device Name: cphad
Registration number: 2
Timeout: 5 sec
Current state: OK
Time since last report: 0.6 sec
Device Name: fwd
Registration number: 3
Timeout: 5 sec
Current state: OK
Time since last report: 0.6 sec
All monitored processes are shown as OK.
Assuming that state synchronization on the cluster is healthy, use the following command to check if the state tables are synchronized:
fw tab -t connections -s
Execute the command on both cluster members simultaneously and compare the values of #VALS. The values on both firewalls should be similar if the state synchronization mechanism is working, unless a lot of delayed notification is in use.
Example output:
[Expert@Zulu]# fw tab -t connections -s
HOST       NAME
localhost  connections
[Expert@Zulu]#
[Expert@Shaka]# fw tab -t connections -s
HOST       NAME
localhost  connections
[Expert@Shaka]#
The #PEAK may be different depending on the uptime and when the last peak number of connections occurred. The #VALS on an HA pair should always be similar.
Examine the output of the sync section of fw ctl pstat.

Example output:

    Sync:
        Version: new
        Status: Able to Send/Receive sync packets
        Sync packets sent:
         total : 13880231, retransmitted : 5, retrans reqs : 524, acks : 70
        Sync packets received:
         total : 692409645, were queued : 720, dropped by net : 517
         retrans reqs : 5, received 43019 acks
         retrans reqs for illegal seq : 0
         dropped updates as a result of sync overload: 0
        Callback statistics: handled 42940 cb, average delay : 1, max delay : 4
If the dropped by net counter has incremented, some sync packets have been lost and the cause needs to be investigated. For further information, refer to:

    sk34476: Explanation of Sync section in the output of fw ctl pstat command
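Whether the counter has incremented can only be judged by comparing two readings taken some time apart. The following is a minimal sketch, assuming that two captures of the fw ctl pstat output are available as text and that the counter appears as "dropped by net : N", as in the example in this document. The function names are hypothetical.

```python
import re
from typing import Optional

# Matches the counter as it appears in the Sync section, e.g.
# "dropped by net : 517".
_DROPPED_BY_NET = re.compile(r"dropped by net\s*:\s*(\d+)")


def dropped_by_net(pstat_output: str) -> Optional[int]:
    """Extract the 'dropped by net' counter, or None if it is absent."""
    match = _DROPPED_BY_NET.search(pstat_output)
    return int(match.group(1)) if match else None


def sync_drops_increased(earlier: str, later: str) -> bool:
    """Return True if the counter grew between two captures."""
    before = dropped_by_net(earlier)
    after = dropped_by_net(later)
    if before is None or after is None:
        raise ValueError("no 'dropped by net' counter found in the output")
    return after > before
```

A result of True means that sync packets were lost during the interval between the two captures.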
SecureXL
For optimum gateway performance, three conditions should hold: SecureXL is enabled; the SmartDefense, Web Intelligence or IPS options that are enforced do not interfere with SecureXL; and the rulebase is ordered carefully so that as much templating as possible is performed. For further information, refer to:

    sk42401: Factors that adversely affect performance in SecureXL

The following command can be used to confirm that SecureXL is turned on and that the creation of templates has not been disabled:

    fwaccel stat

Example output showing that SecureXL is turned on and templating is enabled:

    [Expert@Zulu]# fwaccel stat
    Accelerator Status : on
    Accept Templates : on
    Accelerator Features : Accounting, NAT, Cryptography, Routing,
                           HasClock, Templates, VirtualDefrag, GenerateIcmp,
                           IdleDetection, Sequencing, TcpStateDetect,
                           AutoExpire, DelayedNotif, McastRouting, WireMode
    Cryptography Features : Tunnel, UDPEncapsulation, MD5, SHA1, NULL,
                            3DES, DES, AES-128, AES-256, ESP, LinkSelection,
                            DynamicVPN, NatTraversal, EncRouting
    [Expert@Zulu]#
Note: SecureXL is incompatible with FloodGate and will be disabled if FloodGate is active.
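The two fields that matter in the fwaccel stat output can be checked programmatically. The following is a minimal sketch, assuming the output has been captured as text and uses the "Field : value" layout shown in this document. The function name securexl_ok is hypothetical.

```python
def securexl_ok(fwaccel_stat_output: str) -> bool:
    """Return True if the accelerator and accept templates are both on.

    Assumes the 'Field : value' layout of the fwaccel stat output
    shown in this document.
    """
    fields = {}
    for line in fwaccel_stat_output.splitlines():
        # Split on the first colon; lines without one are ignored.
        key, separator, value = line.partition(":")
        if separator:
            fields[key.strip()] = value.strip()
    return (
        fields.get("Accelerator Status") == "on"
        and fields.get("Accept Templates") == "on"
    )


SAMPLE_STAT = """\
Accelerator Status : on
Accept Templates : on
Accelerator Features : Accounting, NAT, Cryptography, Routing
"""

print(securexl_ok(SAMPLE_STAT))
```

The check returns False if either field is missing or has any value other than on, so a truncated capture is treated as a failure rather than a pass.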
The following command can be used to examine the SecureXL statistics and to understand how well SecureXL is configured and performing:

    fwaccel stats

Examine the output of fwaccel stats:

Check that templates are being created. This number rises and falls as templates are created and expire.

Examine the ratio of F2F packets to accelerated packets. For best performance the firewall should be accelerating the majority of the packets; the number of packets being forwarded to the firewall (F2F) should be minimal.
In a healthy result, templates are being formed and the number of F2F packets is small relative to the number of accelerated packets.
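The ratio can be expressed as the share of all packets that were forwarded to the firewall. The following is a minimal sketch that takes the two packet counters as plain integers read from the fwaccel stats output. The field names in that output are not reproduced in this section, so no parsing is attempted here. The function names are hypothetical, and the 50% cut-off simply encodes "the majority of the packets" from the text above.

```python
def f2f_share(accel_packets: int, f2f_packets: int) -> float:
    """Return the fraction of packets that were forwarded to the firewall."""
    total = accel_packets + f2f_packets
    if total == 0:
        return 0.0  # no traffic counted yet
    return f2f_packets / total


def mostly_accelerated(accel_packets: int, f2f_packets: int) -> bool:
    """Return True if more than half of the packets were accelerated."""
    return f2f_share(accel_packets, f2f_packets) < 0.5
```

With made-up counters of 900 accelerated packets and 100 F2F packets, the F2F share is 10% and mostly_accelerated returns True.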
Aggressive Ageing
Aggressive Aging helps manage the connections table capacity and memory consumption of the firewall to increase durability and stability, allowing the gateway machine to handle large amounts of unexpected traffic, especially during a Denial of Service attack.

Aggressive Aging uses short timeouts called aggressive timeouts. When a connection is idle for more than its aggressive timeout, it is marked as "eligible for deletion". When the connections table or memory consumption reaches a user-defined threshold (the highwater mark), Aggressive Aging begins to delete connections that are eligible for deletion until memory consumption or connections capacity falls back to the desired level.

The user-defined thresholds are set in the GUI for the specific protection enforced by the firewall (SmartDefense > Network Security > Denial of Service > Aggressive Ageing).
To check the state of Aggressive Ageing on the firewall, use the fw ctl pstat command.

Example output:

    [Expert@Zulu]# fw ctl pstat | grep Aggressive
    Aggressive Ageing is not active
    [Expert@Zulu]#

The above output indicates that Aggressive Ageing has been set in SmartDefense to Protect, but the thresholds that make it aggressively close connections that are eligible for deletion have not been reached.
If Aggressive Aging has been set in SmartDefense to Inactive, the output will say that Aggressive Ageing is disabled:

    [Expert@Zulu]# fw ctl pstat | grep Aggressive
    Aggressive Ageing is disabled
    [Expert@Zulu]#

If Aggressive Aging is in Detect mode, the output will say that it is in monitor only:

    [Expert@Zulu]# fw ctl pstat | grep Aggressive
    Aggressive Ageing is in monitor only
    [Expert@Zulu]#
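The three status lines map directly onto the three SmartDefense settings, which makes the check easy to script. The following is a minimal sketch, assuming the relevant line from fw ctl pstat has been captured as text. The function name and the returned labels are hypothetical.

```python
def aggressive_ageing_mode(pstat_line: str) -> str:
    """Map the fw ctl pstat status line to the SmartDefense setting.

    Only the three messages quoted in this document are recognised.
    """
    text = pstat_line.lower()
    if "aggressive ageing is disabled" in text:
        return "Inactive"
    if "aggressive ageing is in monitor only" in text:
        return "Detect"
    if "aggressive ageing is not active" in text:
        # Protect is configured but the thresholds have not been reached.
        return "Protect (thresholds not reached)"
    return "unknown"
```

Any other message, including the one shown when the thresholds have been reached, falls through to "unknown" because its wording is not given in this document.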
There were some issues with the Aggressive Ageing mechanism which are fixed in R65 HFA_50: Improved SecureXL notifications to the firewall resolve a connectivity issue that occurs when the Sequence Verifier is enabled together with the Aggressive Aging mechanism. Implementation: An immediate workaround is to disable either the Sequence Verifier or the Aggressive Aging mechanism.
HFA Patching
Use the fwm ver and fw ver -k commands to inspect the patching on the management station and the firewall modules. Check that the HFA level on the module is the same as, or lower than, the HFA level on the Provider-1 management station (HFA_50 in this example). The firewall module must never be patched with a higher version than the management station. Ensure that patching on the cluster members is identical.

Example output:

Provider-1 Management:

    [Expert@Manager]# fwm ver
    This is Check Point SmartCenter Server NGX (R65) HFA_50, Hotfix 650 - Build 011
    Installed Plug-ins: Connectra NGX R62CM
    [Expert@Manager]#
Cluster:

    [Expert@Zulu]# fw ver -k
    This is Check Point VPN-1(TM) & FireWall-1(R) NGX (R65) HFA_40, Hotfix 640 - Build 091
    kernel: NGX (R65) HFA_40, Hotfix 640 - Build 091
    [Expert@Zulu]#

    [Expert@Shaka]# fw ver -k
    This is Check Point VPN-1(TM) & FireWall-1(R) NGX (R65) HFA_40, Hotfix 640 - Build 091
    kernel: NGX (R65) HFA_40, Hotfix 640 - Build 091
    [Expert@Shaka]#

The versions on the clustered firewalls (HFA_40) are identical and are not above the Provider-1 version (HFA_50).

Although the patching in the above example is consistent, it is out of date. Check Point always recommends applying the latest HFA and Security Hotfixes on the SmartCenter and firewall modules. The latest HFAs and Security Hotfix release notes are available on the Check Point website: http://www.checkpoint.com/downloads/latest/hfa/index.html
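The two patching rules (identical levels on all cluster members, and no member above the management station) can be checked from the captured version strings. The following is a minimal sketch, assuming the outputs of fwm ver and fw ver -k contain an HFA_NN token as in the examples above. It compares only the HFA level and ignores the hotfix and build numbers. The function names are hypothetical.

```python
import re


def hfa_level(version_output: str) -> int:
    """Extract the first HFA level (for example 40 from 'HFA_40')."""
    match = re.search(r"HFA_(\d+)", version_output)
    if match is None:
        raise ValueError("no HFA level found in version output")
    return int(match.group(1))


def patching_ok(management_output: str, member_outputs: list) -> bool:
    """Return True if members match each other and none exceeds management."""
    management = hfa_level(management_output)
    members = [hfa_level(output) for output in member_outputs]
    identical = len(set(members)) == 1
    return identical and all(level <= management for level in members)


MANAGER = "This is Check Point SmartCenter Server NGX (R65) HFA_50, Hotfix 650 - Build 011"
MEMBER = "This is Check Point VPN-1(TM) & FireWall-1(R) NGX (R65) HFA_40, Hotfix 640 - Build 091"

print(patching_ok(MANAGER, [MEMBER, MEMBER]))
```

A True result only confirms that the levels are consistent. It says nothing about whether they are current.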
CPinfo Package:
For troubleshooting purposes, Check Point TAC will require a CPinfo taken from the firewall and from the SmartCenter Server or CMA. Ensure that the CPinfo package build is higher than 911000023 so that the full set of diagnostics from the appliance can be gathered successfully. CPinfo version 911000023 often hangs while gathering the firewall's connection tables and produces truncated output, so it should be replaced with the latest version.

The version installed on the appliance can be determined by running the following command:

    cpvinfo /opt/CPinfo-10/bin/cpinfo |grep Build

Example output:

    [Expert@Zulu]# cpvinfo /opt/CPinfo-10/bin/cpinfo |grep Build
    Build number = 911000023
    [Expert@Zulu]#

The above version is problematic and should be upgraded. The most up-to-date version of CPinfo can be downloaded using the following link:

    sk30567: The CPinfo utility
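The build comparison can be automated once the cpvinfo output has been captured. The following is a minimal sketch, assuming the output contains a line of the form "Build number = N", as in the example above. The function names are hypothetical.

```python
import re

# Build identified in this document as producing truncated output.
PROBLEM_BUILD = 911000023


def cpinfo_build(cpvinfo_output: str) -> int:
    """Extract the build number from the cpvinfo output."""
    match = re.search(r"Build number\s*=\s*(\d+)", cpvinfo_output)
    if match is None:
        raise ValueError("no build number found in cpvinfo output")
    return int(match.group(1))


def needs_upgrade(cpvinfo_output: str) -> bool:
    """Return True unless the build is higher than the problematic one."""
    return cpinfo_build(cpvinfo_output) <= PROBLEM_BUILD


print(needs_upgrade("Build number = 911000023"))
```

Builds at or below 911000023 are flagged because the text above asks for a build higher than that number.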
Verifying
After fixing any problems that were identified as serious or requiring attention, repeat the health check to confirm that all of the checks now pass.