High Availability Troubleshooting Guide

Kashan Naqvi, Naoman Malik


Palo Alto Networks GCS: Performance & Stability SME

June 2021
AGENDA

● High Availability Overview


● Common Issues in High Availability
● Troubleshooting Approach
● Key Points for Escalation
High Availability Overview

● What is High Availability Active/Passive mode?


● The active device processes all traffic while the passive device stays in standby mode, ready to take over if a
failure is observed on the active device
● In the event of a hardware or software disruption on the active firewall, the passive firewall becomes active
automatically without loss of service
● The active device continuously synchronizes its configuration over HA1/HA1-backup and its session information over
HA2/HA2-backup with the passive device
● Whether dedicated HA ports exist depends on the platform; some platforms do not have dedicated HA ports
● Active/Passive mode is supported with the Virtual-Wire, Layer 2 and Layer 3 interface modes
● The passive device doesn’t pass traffic, but its data-plane interfaces can be up or down depending on the configuration

● What is High Availability Active/Active mode?


● The active-primary and active-secondary firewalls both process traffic simultaneously
● Active/Active is supported only in Virtual-Wire and Layer 3 modes
● Can accommodate asymmetric routing
● Allows dynamic routing protocols (OSPF, BGP) to maintain active status across both peers
● Requires an additional dedicated HA3 link
● The HA3 link is used as the packet-forwarding link for session setup and asymmetric traffic handling
Common Issues

1. Split Brain
● Can happen due to a failure of the HA1 port on either firewall in the HA pair
● Can happen if the HA1 path isn’t available (patch panel issue, or network connectivity issues such as switch/router
failures or network flapping in the HA1 path)
2. Config Sync Issues
● Can happen if the commit on the peer device (passive unit) fails; this could be due to multiple commit jobs
running at the same time or a daemon failure
3. Session Sync Issues
● Not very common; usually happens if the HA2 link is down and the session table is not synchronized
between the nodes
4. HA Link Monitoring / Path Monitoring Failure
● Link monitoring and path monitoring can be configured as HA failover conditions
● Failover happens when any or all of the configured links/paths are unreachable, depending on the configured failure condition
5. HA Link Flapping
● HA1, HA2 and HA3 links can flap because of hardware or SFP issues
● This flapping may cause traffic issues and HA failover
6. Upgrade Issues
● Can happen if there is a difference of two major PAN-OS releases between the HA nodes (ex: primary FW on 8.1 while the secondary
FW is upgraded to 9.1)
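
The "any or all" failure condition in item 4 can be illustrated with a minimal Python sketch. The function name and the input dictionary are hypothetical and only model the decision logic; this is not how PAN-OS implements it.

```python
def monitor_failed(reachable: dict[str, bool], condition: str) -> bool:
    """Decide whether a monitoring group has failed.

    reachable maps each monitored link or path to True when it is reachable.
    condition is "any" (fail when at least one target is down) or
    "all" (fail only when every target is down).
    """
    if condition not in ("any", "all"):
        raise ValueError(f"unknown failure condition: {condition!r}")
    down = [not up for up in reachable.values()]
    if not down:
        # Nothing is monitored, so nothing can trigger a failover.
        return False
    return any(down) if condition == "any" else all(down)


# Hypothetical monitored interfaces: one of the two is down.
status = {"ethernet1/1": True, "ethernet1/2": False}
print(monitor_failed(status, "any"))  # one link down is enough
print(monitor_failed(status, "all"))  # both links would have to be down
```

With the "any" condition a single unreachable link or path triggers failover, which is more sensitive; "all" tolerates partial failures.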
Split Brain - Overview

● Palo Alto Networks uses a private heartbeat link (HA1) to monitor the health and status
of each node in a high availability cluster
● Split-brain occurs when the private link goes down, but the cluster nodes are still up
● Each node believes that the other is no longer functioning and attempts to start
services that the other is running. In some instances the link may not be down, but due
to high load on the dataplane, heartbeats may be missed
● Split brain conditions occur when HA members can no longer communicate with each
other to exchange HA monitoring information
● Each HA member will assume the other member is in a non-functional state and will
take over as the Active (A/P) or Active-Primary (A/A)
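
The split-brain mechanism described above can be sketched as a toy model in Python. The class, the threshold of three missed heartbeats and the node names are assumptions made for illustration; they do not reflect PAN-OS timers or internals.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Toy HA peer; the threshold below is illustrative, not a PAN-OS default."""
    name: str
    state: str  # "active" or "passive"
    missed_heartbeats: int = 0

    def tick(self, heartbeat_received: bool, max_missed: int = 3) -> None:
        """Process one heartbeat interval."""
        if heartbeat_received:
            self.missed_heartbeats = 0
            return
        self.missed_heartbeats += 1
        # After enough silence the node assumes its peer is dead and takes over.
        if self.missed_heartbeats >= max_missed:
            self.state = "active"


def is_split_brain(a: Node, b: Node) -> bool:
    """Both peers claiming the active role at once is the split-brain signature."""
    return a.state == "active" and b.state == "active"


def simulate(link_up_per_tick: list[bool]) -> bool:
    """Run both peers over a sequence of HA1 link states; True means split brain."""
    fw1 = Node("fw1", "active")
    fw2 = Node("fw2", "passive")
    for link_up in link_up_per_tick:
        # Both nodes stay healthy; only the heartbeat path between them fails.
        fw1.tick(heartbeat_received=link_up)
        fw2.tick(heartbeat_received=link_up)
    return is_split_brain(fw1, fw2)


print(simulate([True, True, True]))          # link healthy
print(simulate([True, False, False, False])) # link lost for three intervals
```

Note that neither node fails in this model; only the heartbeat path does, which is exactly why both end up active.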
Split Brain - Troubleshooting Approach

Logs to check in case of split brain

● System Logs
● ha_agent.log
● DP-Monitor Logs
● MARVIN

Commands to check the HA status:


> show high-availability all
> show interface all

Note: For HA issues, be sure to always get data from BOTH peers as issues may be on either device.
[Screenshots: ha_agent.log excerpts and System Logs excerpt]
Split Brain - Troubleshooting Approach
Resolution
● To prevent split-brain due to missed heartbeats, the Heartbeat Backup option should be
selected when configuring HA.
● The firewalls will use the management ports to provide a backup path for heartbeat and
hello messages. The option is found on the WebUI under Device > High Availability >
General > Election Settings
● Verify whether the HA1 link is directly connected or passes through a patch panel
● If a patch panel is in the path, verify connectivity by bypassing it with a direct connection
Key Points For Escalation

● Verify the physical connectivity of all HA links to confirm that flapping is not caused by a physical
issue

● System logs, DP-monitor logs, ha_agent logs and MARVIN logs need to be checked to identify and
understand the issue

● Check whether there is a known issue for HA split brain and mention that Jira issue in the escalation
template

● Give the single best-matching Jira issue instead of listing 4-5 different Jira issues

● Ask the SME if there are any questions or to get approval for debug commands

14 | © 2020 Palo Alto Networks, Inc. All rights reserved.


Config sync Issues - Overview

● In both HA Active/Passive and Active/Active, config sync issues are very common and can
happen for multiple reasons. The most common causes are multiple commit jobs
running at the same time or a daemon failure.
● Config sync is an automated process that runs when config changes are
committed on the active firewall. Once the config is committed on the active/active-primary
firewall, it is pushed to the passive/active-secondary firewall; if a prior job is in progress there,
config sync waits until that job completes
● In certain cases config synchronization may fail, so we need to tail ms.log and
ha_agent.log to review the errors
● Some configuration settings are not synchronized between peers in Active/Passive and Active/Active
Config sync Issues - Troubleshooting Approach

● Confirm that “Enable Config Sync” is checked under Device > High Availability >
General > Setup.
● Start by checking for pending jobs; once they have completed, reinitiate
the config sync with > request high-availability sync-to-remote running-config
● If the config sync fails, check ms.log (under mp-log) and look for errors.
● Try restarting the mgmtsrvr and devsrvr processes, then run a commit force and a manual HA sync.
● If the above steps don’t resolve the issue, check the config audit to see whether a
specific part of the config differs; try deleting that part and then recommitting.
● As further troubleshooting, try committing locally on the passive/active-secondary FW and
then reinitiate the config sync.
● As a troubleshooting step we can also sync the config from the passive/active-secondary
to the active/active-primary. This step removes the config changes made on the
active/active-primary FW, so consult with the customer before performing it.
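
When the config audit is not conclusive, the same idea (finding the specific part of the config that differs between peers) can be sketched offline in Python with the standard `difflib` module. This assumes both configurations have already been exported as text; the function name and the sample lines are placeholders, not real firewall output.

```python
import difflib


def config_diff(active_cfg: str, passive_cfg: str) -> list[str]:
    """Return only the lines that differ between two exported configs.

    Lines prefixed with "-" exist only on the active peer,
    lines prefixed with "+" exist only on the passive peer.
    """
    diff = list(
        difflib.unified_diff(
            active_cfg.splitlines(),
            passive_cfg.splitlines(),
            fromfile="active",
            tofile="passive",
            lineterm="",
            n=0,  # no context lines, changed lines only
        )
    )
    # Skip the two file-header lines, then keep additions and removals.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


# Placeholder config lines for illustration only.
active = "set a 1\nset b 2\nset c 3"
passive = "set a 1\nset b 9\nset c 3"
for line in config_diff(active, passive):
    print(line)
```

An empty result means the two exports are identical, in which case the sync failure is more likely a job or daemon problem than a config difference.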
Key Points For Escalation

● ms.log, ha_agent logs, the config audit (in the GUI) and MARVIN logs need to be checked to
identify and understand the issue

● Check whether there is a known issue for config sync and try the potential
workaround; if that doesn’t help, mention that Jira ID in the escalation template for
reference

● Give the single best-matching Jira issue instead of listing 4-5
different Jira issues

● Ask the SME if there are any questions or to get approval for debug commands



Session sync Issues - Overview

● Session sync issues are mostly observed when the HA2 or HA3 links are down or flapping.
● Session sync failure can also happen if DP resource usage is very high, due either to
heavy utilization or to a potential memory leak
● ICMP/host sessions don’t sync between an HA A/P pair
● ICMP/host sessions, multicast sessions and BFD session information don’t sync
between Active-Primary and Active-Secondary

Note: A host session is a session terminated on one of the firewall interfaces, such as an ICMP session pinging
one of the firewall interfaces or a GP tunnel.
Session sync Issues - Troubleshooting Approach

● Review brdagent.log, system.log and ha_agent.log to check whether
any link failure was observed.
● Validate that there is no layer 1 issue. If there is a patch panel, try removing
it and connecting the HA links directly.
● Check the resources in DP-monitor.log. If they are high, troubleshoot that first; once
usage is reduced, check whether the HA links are still flapping.
● Check the signal strength of the fiber optics on the HA links.
● Run the command > show high-availability interface ha* to review the config
and validate that the assigned IPs are correct.
● Try a commit force on both units (only do this once config sync has completed)
Key Points For Escalation

● Verify the physical connectivity of all HA links to confirm there is no layer-1 issue

● If there is a DP resource issue, include the analysis of DP-monitor.log

● ms.log, ha_agent logs and MARVIN logs need to be checked to identify and understand
the issue

● Check whether there is a known issue for session sync and try the potential
workaround; if that doesn’t help, mention that Jira ID in the escalation template for
reference

● Give the single best-matching Jira issue instead of listing 4-5
different Jira issues

● Ask the SME if there are any questions or to get approval for debug commands



HA Link Flapping - Overview

HA LINKS:

● Control Link - HA1: The HA1 link is used to exchange hellos, heartbeats, and HA state information, and
management plane sync for routing, and User-ID information. The firewalls also use this link to
synchronize configuration changes with its peer. The HA1 link is a Layer 3 link and requires an IP
address
● Data Link - HA2: The HA2 link is used to synchronize sessions, forwarding tables, IPSec security
associations and ARP tables between firewalls in an HA pair. Data flow on the HA2 link is always
unidirectional (except for the HA2 keep-alive); it flows from the active or active-primary firewall to the
passive or active-secondary firewall. The HA2 link is a Layer 2 link, and it uses ether type 0x7261 by
default
● Backup Links - HA1/HA2 backup links: Provide redundancy for the HA1 and the HA2 links. In-band
ports can be used for backup links for both HA1 and HA2 connections when dedicated backup links are
not available
● Packet-Forwarding Link - HA3: In addition to HA1 and HA2 links, an active/active deployment also
requires a dedicated HA3 link. The firewalls use this link for forwarding packets to the peer during
session setup and asymmetric traffic flow. The HA3 link is a Layer 2 link that uses MAC-in-MAC
encapsulation. It does not support Layer 3 addressing or encryption

NOTE: When there is an HA failover from the active to the passive firewall, some ports on the previously active
firewall go down and then come back up.
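
As a quick reference, the link roles above can be captured in a small Python lookup. The class and function names are hypothetical; the layer and role values are taken from the descriptions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HALink:
    """One HA link type as described on this slide."""
    name: str
    role: str
    layer: int  # OSI layer the link operates at (default transport for HA2)


HA_LINKS = {
    "HA1": HALink("HA1", "control link: hellos, heartbeats, HA state, config sync", 3),
    "HA2": HALink("HA2", "data link: sessions, forwarding tables, IPSec SAs, ARP tables", 2),
    "HA3": HALink("HA3", "packet forwarding for session setup and asymmetric flows", 2),
}


def required_links(mode: str) -> list[str]:
    """Return the HA links a deployment mode needs (backup links not included)."""
    if mode == "active-passive":
        return ["HA1", "HA2"]
    if mode == "active-active":
        # Active/active additionally requires the dedicated HA3 link.
        return ["HA1", "HA2", "HA3"]
    raise ValueError(f"unknown HA mode: {mode!r}")


for name in required_links("active-active"):
    link = HA_LINKS[name]
    print(f"{link.name} (Layer {link.layer}): {link.role}")
```

Knowing which links a mode depends on helps narrow down which flapping link can explain a given symptom.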
HA Link Flapping - Troubleshooting Approach

Logs to check in case of HA Flapping

● HA_Agent Logs
● System Logs
● Brdagent logs
● DP-monitor Logs
[Screenshots: SYSTEM LOGS, BRDAGENT LOG, HA LOGS, DP-Monitor Logs]
HA Link Flapping - Troubleshooting Approach

● CLI Commands to verify HA link flapping


admin@PA-2(active-primary)> show high-availability ?
> all Show high-availability information
> control-link Show control-link statistic information
> dataplane-status Show dataplane runtime status
> flap-statistics Show high-availability preemptive/non-functional flap statistics
> interface Show high-availability interface information
KEY POINTS FOR ESCALATION

● Verify the physical connectivity of all HA links to confirm that flapping is not caused
by a physical issue

● System logs, DP-monitor logs, ha_agent logs and brdagent logs need to be checked to
identify and understand the issue

● Check whether there is a known issue for HA link flapping and mention that Jira
issue in the escalation template

● Give the single best-matching Jira issue instead of listing 4-5
different Jira issues

● Ask the SME if there are any questions or to get approval for debug commands



Upgrade Issue - Overview

● Upgrade issues can happen if there is a difference of two major PAN-OS releases between
the HA nodes (ex: primary FW on 8.1 while the secondary FW is upgraded to 9.1)
● It is extremely important to follow the upgrade process to avoid traffic and failover
issues
● The preemptive option has to be disabled before the upgrade to prevent
unwanted failovers
● Ensure that the customer has console access in case there are any access issues
● The recommended upgrade steps for both HA A/P and A/A are given in the
troubleshooting approach
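
The release-difference check in the first bullet can be sketched in Python. The ordered release list below is an assumption and is not exhaustive; verify it against the PAN-OS release notes before relying on it. The function names are hypothetical.

```python
# Assumed ordering of PAN-OS feature releases; extend and verify as needed.
FEATURE_RELEASES = ["8.0", "8.1", "9.0", "9.1", "10.0", "10.1", "10.2", "11.0", "11.1"]


def feature_release(version: str) -> str:
    """Reduce a full version such as '9.1.3' or '8.1.14-h2' to its feature release."""
    major, minor = version.split(".")[:2]
    return f"{major}.{minor}"


def release_gap(version_a: str, version_b: str) -> int:
    """Number of feature releases separating two PAN-OS versions."""
    index_a = FEATURE_RELEASES.index(feature_release(version_a))
    index_b = FEATURE_RELEASES.index(feature_release(version_b))
    return abs(index_a - index_b)


def peers_within_one_release(version_a: str, version_b: str) -> bool:
    """True when the HA peers are at most one feature release apart."""
    return release_gap(version_a, version_b) <= 1


# The example from the slide: primary on 8.1, secondary upgraded to 9.1.
print(release_gap("8.1.14", "9.1.3"))
print(peers_within_one_release("8.1.14", "9.1.3"))
```

A version that is missing from the list raises `ValueError`, which is preferable to silently reporting a wrong gap.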

34 | © 2018, Palo Alto Networks. All Rights Reserved.


Upgrade Issue - Troubleshooting Approach

● If there is an issue with the PAN-OS upgrade, try reinstalling it via the CLI:
➢ request system software check
➢ request system software download version <version>
➢ request system software install version <version>
● Make sure content versions are not too old, as that can cause the upgrade to fail.
● Validate that the platform being upgraded supports the target PAN-OS version
● Validate that valid support licenses are enabled on the firewall
● Validate that the recommended upgrade path on the next slide is being followed
Upgrade Issue - Troubleshooting Approach

● HA Active/Passive PAN-OS upgrade procedure:


1. Suspend the passive device
2. Upgrade the passive device to the new PAN-OS release
3. Make the passive device functional
4. Wait for state synchronization to complete
5. Suspend the active device, which forces the passive device to become active
6. Repeat steps 2 and 3 to upgrade the previously active device
7. If preemption is configured, the currently passive device will revert to active once state synchronization is
complete
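
The active/passive procedure above can be walked through with a small Python simulation that checks one invariant: exactly one firewall is active after every step. This is a toy model with hypothetical device names; it assumes preemption is disabled and does not call any firewall API.

```python
import copy


def upgrade_active_passive(old: str, new: str) -> list:
    """Walk an active/passive pair through the upgrade and log each state."""
    pair = {
        "fw-a": {"state": "active", "version": old},
        "fw-b": {"state": "passive", "version": old},
    }
    history = []

    def record(step: str) -> None:
        # Invariant: exactly one firewall is active after every step.
        active = [name for name, fw in pair.items() if fw["state"] == "active"]
        if len(active) != 1:
            raise RuntimeError(f"{step}: expected one active firewall, got {active}")
        history.append((step, copy.deepcopy(pair)))

    def upgrade(name: str, label: str) -> None:
        pair[name]["version"] = new
        record(f"upgrade {label}")
        pair[name]["state"] = "passive"
        record(f"make {label} functional")

    pair["fw-b"]["state"] = "suspended"
    record("suspend passive")
    upgrade("fw-b", "passive")
    record("wait for state synchronization")
    # Suspending the active device forces the upgraded peer to take over.
    pair["fw-a"]["state"] = "suspended"
    pair["fw-b"]["state"] = "active"
    record("suspend active (failover)")
    upgrade("fw-a", "previously active")
    return history


for step, state in upgrade_active_passive("9.0", "9.1"):
    print(step, state)
```

At the end the roles are swapped: the originally passive device is active on the new release, which matches the procedure when preemption is not configured.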

● HA Active/Active PAN-OS upgrade procedure:


● Suspend ACTIVE-SECONDARY.
● Verify that ACTIVE-PRIMARY is now passing all traffic.
● Download and install PAN-OS on ACTIVE-SECONDARY.
● Reboot ACTIVE-SECONDARY to complete installation.
● Verify ACTIVE-SECONDARY is passing traffic in a functional state.
● Suspend ACTIVE-PRIMARY.
● Verify that ACTIVE-SECONDARY is now passing all traffic.
● Download and install PAN-OS on ACTIVE-PRIMARY.
● Restart ACTIVE-PRIMARY to complete installation.
● Verify ACTIVE-PRIMARY is passing traffic in a functional state.
● Verify that both firewalls are passing traffic correctly as an Active/Active pair.
KEY POINTS FOR ESCALATION

● ms.log, DP-monitor logs, ha_agent logs and brdagent logs need to be checked to
identify and understand the issue

● Check whether there is a known issue for the upgrade problem and mention that Jira
issue in the escalation template

● Give the single best-matching Jira issue instead of listing 4-5
different Jira issues

● Ask the SME if there are any questions or to get approval for debug commands



QUESTIONS ?
slack channel #tac-performance-and-stability-sme
Debug Quick Reference - Internal
➢ debug high-availability on <error|warn|info|debug|dump>
➢ debug high-availability off
➢ debug high-availability show
➢ debug high-availability internal-dump
➢ debug high-availability dataplane-status
➢ debug high-availability ha3-bit-shift enable
➢ debug high-availability ha3-bit-shift disable
➢ debug high-availability ha3-bit-shift local-mode
➢ debug high-availability ha3-bit-shift ha-mode
➢ debug high-availability path-monitor-groups status
➢ debug high-availability path-monitor-groups show candidate
➢ debug high-availability path-monitor-groups show committing
➢ debug high-availability path-monitor-groups show committed
➢ debug high-availability path-monitor-groups delete virtual-router <value> destination-group-name <value> destination-ip <ip/netmask>
➢ debug high-availability path-monitor-groups add global-failure-condition <any|all> virtual-router <value> virtual-router-enable <yes|no> virtual-router-failure-condition <any|all> ping-interval <200-60000> ping-count <3-10> destination-group-name <value> destination-ip <ip/netmask> destination-group-enable <yes|no> destination-group-failure-condition <any|all>
➢ debug device-server dump idmgr high-availability state
➢ debug dhcpd high-availability ignore-config-sync yes
➢ debug dhcpd high-availability ignore-config-sync no
➢ debug user-id dump idmgr high-availability state
➢ debug high-availability-agent on debug
➢ debug high-availability-agent on internal-dump
➢ debug dataplane internal vif link
➢ debug high-availability ha3-bit-shift enabled/disabled/local-mode/ha-mode
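
The ping-interval and ping-count ranges shown in the path-monitor-groups command can be sanity-checked before use with a short Python sketch. The function name is hypothetical. Treating the interval as milliseconds, and estimating detection time as interval times count, are both assumptions for illustration.

```python
def path_monitor_detection_ms(ping_interval: int, ping_count: int) -> int:
    """Validate path-monitor values and estimate the time to declare a failure.

    Ranges come from the command syntax above: ping-interval 200-60000,
    ping-count 3-10. The unit (milliseconds) and the interval * count
    estimate are assumptions, not documented behavior.
    """
    if not 200 <= ping_interval <= 60000:
        raise ValueError("ping-interval must be between 200 and 60000")
    if not 3 <= ping_count <= 10:
        raise ValueError("ping-count must be between 3 and 10")
    return ping_interval * ping_count


print(path_monitor_detection_ms(200, 10))
print(path_monitor_detection_ms(1000, 3))
```

Short intervals with low counts detect failures faster but are more likely to trigger failover on brief packet loss.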
