High Availability Troubleshooting Guide
June 2021
AGENDA
1. Split Brain
● Can happen due to a failure of the HA1 port on either firewall in the HA pair
● Can happen if the HA1 path is unavailable (patch panel issue, switch/router failure, or network flapping in the HA1 path)
2. Config sync Issues
● Can happen if a commit fails on the peer device (passive unit); this could be due to multiple commit jobs running at the same time or a daemon failure.
3. Session Sync Issues
● Not very common; usually happens if the HA2 link is down and the session table is not synchronized between the nodes
4. HA Link Monitoring vs. Path Monitoring Failure
● Can be configured as an HA failover condition
● Failover happens when any or all of the configured links/paths are unreachable, depending on the configured failure condition
5. HA Link Flapping
● HA1, HA2 and HA3 links can flap because of hardware or SFP issues
● This flapping may cause traffic issues and HA failover
6. Upgrade Issues
● Can happen if the HA nodes are two major PAN-OS releases apart (e.g., primary FW on 8.1 while secondary FW is upgraded to 9.1).
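The "any or all" failure condition in item 4 can be illustrated with a small sketch. This is not PAN-OS code; the function name `monitor_failed` and the interface names are hypothetical, and the sketch only models the decision of whether a monitored group counts as failed.

```python
from typing import Dict


def monitor_failed(reachable: Dict[str, bool], condition: str = "any") -> bool:
    """Return True if the monitored group should be treated as failed.

    reachable maps a monitored link or path name to whether it is up.
    condition is "any" (fail when at least one item is down) or
    "all" (fail only when every item is down).
    """
    if condition not in ("any", "all"):
        raise ValueError("condition must be 'any' or 'all'")
    if not reachable:
        return False  # nothing is monitored, so nothing can fail
    down = [not up for up in reachable.values()]
    return any(down) if condition == "any" else all(down)


# Hypothetical link group: one of two monitored interfaces is down.
links = {"ethernet1/1": True, "ethernet1/2": False}
any_result = monitor_failed(links, "any")   # failover condition met
all_result = monitor_failed(links, "all")   # failover condition not met
```

With the "any" condition a single failed link is enough to trigger failover, while "all" tolerates partial failures, which is why the chosen condition matters when analyzing an unexpected failover.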
Split Brain - Overview
● Palo Alto Networks firewalls use a private heartbeat link (HA1) to monitor the health and status of each node in a high availability cluster
● Split brain occurs when the private link goes down but the cluster nodes are still up
● Each node believes that the other is no longer functioning and attempts to start services that the other is running. In some instances the link may not be down, but heartbeats may be missed due to high load on the dataplane
● Split brain conditions occur when HA members can no longer communicate with each
other to exchange HA monitoring information
● Each HA member will assume the other member is in a non-functional state and will
take over as the Active (A/P) or Active-Primary (A/A)
● High Availability Overview
● Common Issues in High Availability
● Troubleshooting Approach
● Key Points for Escalation
Split Brain - Troubleshooting Approach
● System Logs
● ha_agent.log
● DP-Monitor Logs
● MARVIN
Note: For HA issues, be sure to always get data from BOTH peers as issues may be on either device.
[Screenshots: ha_agent.log excerpts and system log excerpts]
Split Brain - Troubleshooting Approach
Resolution
● To prevent split brain due to missed heartbeats, select the Heartbeat Backup option when configuring HA.
● With this option, the firewalls use the management ports as a backup path for heartbeat and hello messages. The option is in the WebUI under Device > High Availability > General > Election Settings
● Verify whether the HA1 link is directly connected or runs through a patch panel
● If it runs through a patch panel, verify connectivity by bypassing the patch panel with a direct connection
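The effect of Heartbeat Backup on split brain can be sketched as a toy model. This is only an illustration under simplifying assumptions: both firewalls are alive, the HA1 link may be down, and the function names (`peer_declared_down`, `resulting_states`, `is_split_brain`) are hypothetical rather than anything PAN-OS exposes.

```python
from typing import Tuple


def peer_declared_down(ha1_up: bool, backup_enabled: bool, mgmt_up: bool = True) -> bool:
    """A peer is declared down only if no heartbeat path still works."""
    if ha1_up:
        return False
    if backup_enabled and mgmt_up:
        return False  # heartbeats still arrive over the management port
    return True


def resulting_states(ha1_up: bool, backup_enabled: bool, mgmt_up: bool = True) -> Tuple[str, str]:
    """Return (original active state, original passive state).

    Both nodes are assumed to be healthy; only the heartbeat paths vary.
    """
    if peer_declared_down(ha1_up, backup_enabled, mgmt_up):
        # The passive node assumes its peer is dead and takes over,
        # while the active node keeps running: split brain.
        return ("active", "active")
    return ("active", "passive")


def is_split_brain(states: Tuple[str, str]) -> bool:
    return states == ("active", "active")


without_backup = resulting_states(ha1_up=False, backup_enabled=False)
with_backup = resulting_states(ha1_up=False, backup_enabled=True)
```

In this model an HA1 outage alone produces two active nodes, while the same outage with Heartbeat Backup enabled leaves the pair in its normal active/passive state as long as the management path is reachable.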
Key Points For Escalation
● Verify the physical connectivity of all HA links to confirm that flapping is not caused by a physical issue
● Check the system logs, DP-monitor logs, ha_agent logs and MARVIN logs to identify and understand the issue
● Check whether there is a known issue for HA split brain and, if so, mention that Jira issue in the escalation template
● Cite the single best-matching Jira issue instead of listing 4-5 different Jira issues
● Ask an SME if there are any questions or if debug commands need approval
Config Sync Issues - Overview
● In both HA Active/Passive and Active/Active, config sync issues are very common and can have multiple causes. The most common cause is multiple commit jobs running at the same time or a daemon failure.
● Config sync is an automated process that runs when config changes are committed on the active firewall. Once the config is committed on the active/active-primary firewall, it is pushed to the passive/active-secondary firewall; if a prior job is in progress there, config sync waits until that job completes
● In certain cases config synchronization may fail; when it does, tail ms.log and ha_agent.log to review the errors
● Certain configuration settings are not synchronized between peers in either Active/Passive or Active/Active
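The "config sync waits for a prior job" behavior can be pictured as a first-in, first-out job queue on the peer. The sketch below is a simplified model, not the actual management-plane implementation; the `PeerJobQueue` class, the job names and the version numbers are all hypothetical.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class PeerJobQueue:
    """Toy model of the job queue on the passive/active-secondary peer."""

    def __init__(self) -> None:
        self.jobs: Deque[Tuple[str, Optional[int]]] = deque()
        self.config_version = 0  # version currently applied on the peer

    def enqueue(self, kind: str, version: Optional[int] = None) -> None:
        """Queue a job; an 'ha-sync' job carries the config version to apply."""
        self.jobs.append((kind, version))

    def step(self) -> Optional[str]:
        """Finish the oldest job and return its kind, or None if idle."""
        if not self.jobs:
            return None
        kind, version = self.jobs.popleft()
        if kind == "ha-sync" and version is not None:
            self.config_version = version
        return kind


peer = PeerJobQueue()
peer.enqueue("commit")          # a local job is already in progress
peer.enqueue("ha-sync", 42)     # sync from the active peer must wait
first = peer.step()             # the commit finishes first
version_after_first = peer.config_version
second = peer.step()            # only now is the synced config applied
```

This also suggests why several overlapping commit jobs are a common cause of sync problems: every queued job delays the sync behind it.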
Config sync Issues - Troubleshooting Approach
● Try restarting the mgmtsrvr and devsrvr processes, running a commit force, and running a manual HA sync.
● If the above steps don't resolve the issue, check the config audit for a specific part of the config that differs between the peers; try deleting that part and then recommitting.
● As further troubleshooting, try committing locally on the passive/active-secondary FW and then reinitiate the config sync.
● As another troubleshooting step, the config can be synced from the passive/active-secondary to the active/active-primary. This overwrites the config changes made on the active/active-primary FW, so consult with the customer before performing this step.
Key Points For Escalation
● Check ms.log, ha_agent logs, the config audit (in the GUI) and MARVIN logs to identify and understand the issue
● Check whether there is a known issue for config sync and try its workaround; if that doesn't help, mention that Jira ID in the escalation template for reference
● Cite the single best-matching Jira issue instead of listing 4-5 different Jira issues
● Ask an SME if there are any questions or if debug commands need approval
Session Sync Issues - Overview
● Session sync issues are mostly observed when the HA2 or HA3 links are down or flapping.
● Session sync can fail if DP resource usage is very high, due either to heavy utilization or to a potential memory leak
● ICMP/host sessions are not synchronized between an HA A/P pair
● ICMP/host sessions, multicast sessions and BFD session information are not synchronized between Active-Primary and Active-Secondary
Note: A host session is a session terminated on one of the firewall interfaces, such as an ICMP session pinging one of the firewall interfaces or a GP tunnel.
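Before treating a missing session on the peer as a fault, it helps to check whether that session type is expected to sync at all. The lookup below encodes only the exceptions listed above; the function name `is_session_synced` and the type labels are hypothetical, and treating every unlisted session type as synchronized is an assumption of this sketch.

```python
# Session types that are NOT synchronized, per HA mode (from the list above).
NOT_SYNCED = {
    "active/passive": {"icmp", "host"},
    "active/active": {"icmp", "host", "multicast", "bfd"},
}


def is_session_synced(session_type: str, ha_mode: str) -> bool:
    """Return False for session types that are expected not to sync.

    Assumption: any session type not listed as an exception is synced.
    """
    mode = ha_mode.lower()
    if mode not in NOT_SYNCED:
        raise ValueError("ha_mode must be 'active/passive' or 'active/active'")
    return session_type.lower() not in NOT_SYNCED[mode]


multicast_ap = is_session_synced("multicast", "active/passive")
multicast_aa = is_session_synced("multicast", "active/active")
```

A multicast session missing on the active-secondary peer is therefore expected behavior in A/A, whereas the same observation in A/P would be worth investigating.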
Session sync Issues - Troubleshooting Approach
● Verify the physical connectivity of all HA links to confirm there is no Layer 1 issue
● Check ms.log, ha_agent logs, and MARVIN logs to identify and understand the issue
● Check whether there is a known issue for session sync and try its workaround; if that doesn't help, mention that Jira ID in the escalation template for reference
● Cite the single best-matching Jira issue instead of listing 4-5 different Jira issues
● Ask an SME if there are any questions or if debug commands need approval
HA LINKS:
● Control Link - HA1: The HA1 link is used to exchange hellos, heartbeats, and HA state information, and for management plane sync of routing and User-ID information. The firewalls also use this link to synchronize configuration changes with their peer. The HA1 link is a Layer 3 link and requires an IP address
● Data Link - HA2: The HA2 link is used to synchronize sessions, forwarding tables, IPSec security
associations and ARP tables between firewalls in an HA pair. Data flow on the HA2 link is always
unidirectional (except for the HA2 keep-alive); it flows from the active or active-primary firewall to the
passive or active-secondary firewall. The HA2 link is a Layer 2 link, and it uses ether type 0x7261 by
default
● Backup Links - HA1/HA2 backup links: Provide redundancy for the HA1 and the HA2 links. In-band
ports can be used for backup links for both HA1 and HA2 connections when dedicated backup links are
not available
● Packet-Forwarding Link - HA3: In addition to HA1 and HA2 links, an active/active deployment also
requires a dedicated HA3 link. The firewalls use this link for forwarding packets to the peer during
session setup and asymmetric traffic flow. The HA3 link is a Layer 2 link that uses MAC-in-MAC
encapsulation. It does not support Layer 3 addressing or encryption
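The link roles above can be summarized as a small lookup structure, which is handy when deciding which link to inspect for a given symptom. This is a sketch built only from the descriptions in this list; the `HaLink` class and the helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class HaLink:
    name: str
    role: str
    layer: int                  # OSI layer the link operates at
    carries: Tuple[str, ...]    # what the link transports


HA_LINKS: Dict[str, HaLink] = {
    "HA1": HaLink("HA1", "control", 3, (
        "hellos", "heartbeats", "HA state", "routing sync",
        "User-ID sync", "configuration sync")),
    "HA2": HaLink("HA2", "data", 2, (
        "sessions", "forwarding tables",
        "IPSec security associations", "ARP tables")),
    "HA3": HaLink("HA3", "packet forwarding", 2, (
        "session setup packets", "asymmetric traffic")),
}


def links_required(ha_mode: str) -> List[str]:
    """HA3 is needed only in active/active deployments."""
    if ha_mode.lower() == "active/active":
        return ["HA1", "HA2", "HA3"]
    return ["HA1", "HA2"]


def links_carrying(item: str) -> List[str]:
    """Return the names of the links that transport the given item."""
    return [name for name, link in HA_LINKS.items() if item in link.carries]


suspects = links_carrying("sessions")
```

For example, a session sync problem points at HA2, while missed heartbeats or config sync failures point at HA1.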
NOTE: When an HA failover occurs from the active to the passive firewall, some ports on the previously active firewall go down and then come back up.
HA Link Flapping - Troubleshooting Approach
● HA_Agent Logs
● System Logs
● Brdagent logs
● DP-monitor Logs
[Screenshots: system log, brdagent log, HA log and DP-monitor log excerpts]
HA Link Flapping - Troubleshooting Approach
● Verify the physical connectivity of all HA links to confirm that flapping is not caused by a physical issue
● Check the system logs, DP-monitor logs, ha_agent logs and brdagent logs to identify and understand the issue
● Check whether there is a known issue for HA link flapping and, if so, mention that Jira issue in the escalation template
● Cite the single best-matching Jira issue instead of listing 4-5 different Jira issues
● Ask an SME if there are any questions or if debug commands need approval
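When reviewing logs for HA link flapping, it can help to quantify what "flapping" means. The sketch below counts link state transitions inside a sliding time window. The event format (timestamp in seconds, state string), the 60-second window and the threshold of 3 transitions are all assumptions for illustration; they are not PAN-OS log fields or defaults.

```python
from typing import List, Sequence, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, "up" or "down")


def transitions(events: Sequence[Event]) -> List[float]:
    """Return the timestamps at which the link state changed."""
    times: List[float] = []
    prev = None
    for ts, state in events:
        if prev is not None and state != prev:
            times.append(ts)
        prev = state
    return times


def is_flapping(events: Sequence[Event], window_seconds: float = 60,
                threshold: int = 3) -> bool:
    """True if at least `threshold` state changes fall within one window.

    Events must be sorted by timestamp.
    """
    times = transitions(events)
    start = 0
    for end in range(len(times)):
        # Shrink the window until it spans at most window_seconds.
        while times[end] - times[start] > window_seconds:
            start += 1
        if end - start + 1 >= threshold:
            return True
    return False


# Hypothetical HA2 link history extracted from logs on one peer.
ha2_events = [(0, "up"), (10, "down"), (12, "up"), (20, "down"), (25, "up")]
flapping = is_flapping(ha2_events)
```

Running the same check against events from both peers makes it easier to tell a single link bounce from sustained flapping.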
Upgrade Issues - Overview
● Upgrade issues can happen if the HA nodes are two major PAN-OS releases apart (e.g., primary FW on 8.1 while secondary FW is upgraded to 9.1)
● It is extremely important to follow the upgrade process to avoid traffic and failover issues
● The Preemptive option must be disabled before the upgrade to prevent unwanted failovers
● Ensure that the customer has console access in case there are any access issues
● The recommended upgrade steps for both HA A/P and A/A are on the next slide
● If there is an issue with the PAN-OS upgrade, try reinstalling it via the CLI using:
➢ request system software check
➢ request system software download version <version>
➢ request system software install version <version>
● Make sure content versions are not too old, as that would cause the upgrade to fail.
● Validate that the platform being upgraded supports the target PAN-OS version
● Validate that valid support licenses are enabled on the firewall
● Validate that the recommended upgrade path on the next slide is being followed
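The release-gap check between HA peers can be sketched as follows. The ordered list of feature releases is an assumption made for this example and would need to be maintained by hand; the helper names are hypothetical, and treating a gap of at most one feature release as acceptable is inferred from the 8.1 versus 9.1 example above rather than taken from an official rule.

```python
from typing import List

# Assumed order of PAN-OS feature releases; extend as needed.
FEATURE_RELEASES: List[str] = ["8.0", "8.1", "9.0", "9.1", "10.0", "10.1"]


def feature_release(version: str) -> str:
    """Reduce a full version such as '9.1.8' to its feature release '9.1'."""
    parts = version.split(".")
    if len(parts) < 2:
        raise ValueError(f"unexpected version format: {version!r}")
    return ".".join(parts[:2])


def release_gap(version_a: str, version_b: str) -> int:
    """Number of feature releases separating the two versions."""
    index_a = FEATURE_RELEASES.index(feature_release(version_a))
    index_b = FEATURE_RELEASES.index(feature_release(version_b))
    return abs(index_a - index_b)


def peers_compatible(version_a: str, version_b: str) -> bool:
    """Assumption: peers at most one feature release apart are acceptable."""
    return release_gap(version_a, version_b) <= 1


gap = release_gap("8.1.19", "9.1.8")   # 8.1 -> 9.0 -> 9.1
```

In this model the 8.1 and 9.1 pair from the example is two releases apart and is flagged, which suggests upgrading through the intermediate release rather than skipping it.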
Upgrade Issue - Troubleshooting Approach
● Check ms.log, DP-monitor logs, ha_agent logs and brdagent logs to identify and understand the issue
● Check whether there is a known issue for the upgrade and, if so, mention that Jira issue in the escalation template
● Cite the single best-matching Jira issue instead of listing 4-5 different Jira issues
● Ask an SME if there are any questions or if debug commands need approval