Understanding and Avoiding Split Brain Scenarios in SAP HANA Environments

avoiding splitbrain
Reading Time: 4 minutes

Understanding and Avoiding Split Brain Scenarios

Split brain. Most readers of our blogs will have heard the term, in the computing context that is, yet we cannot help but imagine the chaos that would result if someone had two brains, both equally in control at the same time. Failover clusters comprise two or more nearly identical nodes connected to a private network. Application operations are conducted on the primary node and, in the event of a failure, operation a secondary (or tertiary) node is brought online to assume the primary node role and run the application. Network communication links among the cluster nodes are essential for clustering software to determine (a) the operational status (health) of the application environment and (b) which cluster is the primary node.

What is a Failover Cluster Split Brain Scenario?

In a failover cluster split brain scenario, communication between failover cluster nodes is broken. When this happens, the standby node may promote itself to become an active node because it believes the active node has failed. This results in both nodes becoming ‘active’. As a result, data integrity and consistency is compromised as data on both nodes would be changing independently.

While split brain can happen in any clustered environment, in an SAP HANA environment, there are two types of split-brain scenarios which may occur for an SAP HANA resource hierarchy if appropriate steps are not taken to avoid them.

  • HANA Resource Split Brain: The HANA resource is Active (ISP) on multiple cluster nodes. This situation is typically caused by a temporary network outage affecting the communication paths between cluster nodes.
  • SAP HANA System Replication Split Brain: The HANA resource is Active (ISP) on the primary node and Standby (OSU) on the backup node, but the database is running and registered as the primary replication site on both nodes. This situation is typically caused by either a failure to stop the database on the previous primary node during failover, having Autostart enabled for the database, or a database administrator manually running “hdbnsutil -sr_takeover” on the secondary replication site outside of the clustering software environment.

Avoiding Split Brain Issues in an SAP HANA Resource Hierarchy

Recommendations for avoiding or resolving each type of split-brain scenario in the SIOS LifeKeeper
(SIOS Protection Suite) clustering environment are given below. While in a split-brain scenario, a message similar to the following is logged and broadcast to all open consoles every quickCheck interval (default 2 minutes) until the issue is resolved.

EMERG:hana:quickCheck:HANA-SPS_HDB00:136363:WARNING: A temporary communication failure has occurred between servers hana2-1 and hana2-2. Manual intervention is required in order to minimize the risk of data loss. To resolve this situation, please take one of the following resource hierarchies out of service: HANA-SPS_HDB00 on hana2-1 or HANA-SPS_HDB00 on hana2-2. The server that the resource hierarchy is taken out of service on will become the secondary SAP HANA System Replication site.

Recommendations for resolution:

  1. Investigate the database on each cluster node to determine which instance contains the most up-to-date or relevant data. This determination must be made by a qualified database administrator who is familiar with the data.
  2. The HANA resource on the node containing the data that needs to be retained will remain Active (ISP) in LifeKeeper, and the HANA resource hierarchy on the node that will be re-registered as the secondary replication site will be taken entirely out of service in LifeKeeper. Right-click on each leaf resource in the HANA resource hierarchy on the node where the hierarchy should be taken out of service and click Out of Service
  3. Once the SAP HANA resource hierarchy has been successfully taken out of service, LifeKeeper will re-register the Standby node as the secondary replication site during the next quickCheck interval (default 2 minutes). Once replication resumes, any data on the Standby node which is not present on the Active node will be lost. Once the Standby node has been re-registered as the secondary replication site, the SAP HANA hierarchy has returned to a highly available state.

SAP HANA System Replication Split Brain Resolution

While in this split-brain scenario, a message similar to the following is logged and broadcast to all open consoles every quick. Check interval (default 2 minutes) until the issue is resolved.

EMERG:hana:quickCheck:HANA-SPS_HDB00:136364:WARNING: SAP HANA database HDB00 is running and registered as primary master on both hana2-1 and hana2-2. Manual intervention is required in order to minimize the risk of data loss. To resolve this situation, please stop database instance HDB00 on hana2-2 by running the command ‘su – spsadm -c “sapcontrol -nr 00 -function Stop”’ on that server. Once stopped, it will become the secondary SAP HANA System Replication site.

Recommendations for resolution:

  1. Investigate the database on each cluster node to determine whether important data exists on the Standby node which does not exist on the Active node. If important data has been committed to the database on the Standby node while in the split-brain state, the data will need to be manually copied to the Active node. This determination must be made by a qualified database administrator who is familiar with the data.
  2. Once any missing data has been copied from the database on the Standby node to the Active node, stop the database on the Standby node by running the command given in the LifeKeeper warning message:

    su – adm -c “sapcontrol -nr <Inst#> -function Stop”
    where is the lower-case SAP System ID for the HANA installation and <Inst#> is the instance number for the HDB instance (e.g., the instance number, for instance, HDB00 is 00)

  3. Once the database has been successfully stopped, LifeKeeper will re-register the Standby node as the secondary replication site during the next quickCheck interval (default 2 minutes). Once replication resumes, any data on the Standby node which is not present on the Active node will be lost. Once the Standby node has been re-registered as the secondary replication site, the SAP HANA hierarchy has returned to a highly available state.

Being aware of common split-brain scenarios and taking these steps to mitigate them can save you time and protect data integrity.

Want to learn more about how SIOS LifeKeeper can protect your SAP HANA environment?


Recent Posts

SIOS Technology Expands Support in Linux Product Release

We’re excited to announce expanded support for the SIOS LifeKeeper for Linux 9.9.0 release, including: These newly supported configurations are fully compatible with […]

Read More

Achieving High Availability in the Retail Industry

Even minor drops in availability with retail applications can cause a substantial amount of loss of revenue and loss of business in the […]

Read More

Storage Considerations for Resizing Your Highly Available Cluster

When I was a Marine serving with a Tank Battalion, I remember that we’d all prepared ourselves to hear “FIRE IN THE HOLE” […]

Read More