Larry (not his real name) was a SIOS customer who had previously deployed a replication solution for high availability and disaster recovery (HA/DR). When he launched a PoC to test a two-node replication solution for Linux, using SIOS LifeKeeper and DataKeeper replication, his top priority was protecting data integrity. Larry’s PoC test list covered the standard items: database start and stop, migrating the database to the backup node, maintenance activities, and server failover, to name a few. Larry was adamant that the solution be capable of both fast switchover (i.e., graceful migration) and fast failover (i.e., sudden, forced migration) of applications, databases, storage, and services from one server to another. But he was even more insistent that such activities must not cause data loss.
Protect Data Integrity By Avoiding Split Brain
In addition to these standard tests, Larry added specific tests to try to force a “split brain” scenario. Split brain is a condition that occurs when members of a cluster are unable to communicate with each other, but are in a running and operable state, and subsequently take ownership of common resources simultaneously. In effect, you have two bus drivers fighting for the steering wheel. Because of its destructive nature, split brain can cause data loss or data corruption and is best avoided by using a mechanism to determine which node should remain active (driving the bus) and which node(s) should stop writing to disk.
While split brain scenarios are relatively uncommon in clusters that use quorum and quorum-plus-witness capabilities, the difficulty of split brain resolution grows quickly with every node added to the cluster configuration. In a multitarget configuration with three or more nodes, clustering software not only has to orchestrate a failover to the correct node, it also has to automatically switch replication from the new primary node to the tertiary node to maintain DR protection while arbitrating properly between nodes. In other clustering solutions, those complex actions have to be manually scripted, then manually updated in the event of a failover and again to restore normal operation, and it only gets harder when a split brain occurs.
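To make the quorum idea concrete, here is a minimal sketch of majority-based arbitration with a witness. It is a generic illustration, not the LifeKeeper implementation: the function names and the node labels `A`, `B`, and `W` are assumptions made up for this example.

```python
def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """A partition may stay active only if it holds a strict majority of votes."""
    return reachable_votes > total_votes // 2


def may_stay_active(node, reachable, all_voters):
    """Decide whether `node` may keep writing after a network partition.

    `all_voters` is every cluster node plus any witness (hypothetical labels).
    `reachable` is the set of voters this node can still contact.
    """
    # Count this node's own vote plus every known voter it can still reach.
    votes = len((set(reachable) | {node}) & set(all_voters))
    return has_quorum(votes, len(all_voters))


# Two nodes plus a witness: the link between A and B is down,
# but A can still reach the witness W.
voters = {"A", "B", "W"}
print(may_stay_active("A", {"W"}, voters))   # True: A holds 2 of 3 votes
print(may_stay_active("B", set(), voters))   # False: B holds 1 of 3 votes

# Without a witness, neither side of a two-node partition has a majority.
print(may_stay_active("A", set(), {"A", "B"}))  # False
```

With the witness in place, exactly one side keeps driving the bus. Without it, a strict-majority rule stops both nodes, which avoids split brain at the cost of availability.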
Due to the features and improvements in the SIOS LifeKeeper and SAP HANA Application Recovery Kit (ARK), Larry had difficulty introducing a split brain scenario. However, when he was able to finally contrive one, he benefited greatly from understanding the logic that the SIOS products used to protect his data. Larry realized the high level of sophistication designed into the data protection provided by SIOS clustering software. He selected SIOS LifeKeeper.
The SIOS HANA Multitarget Automation Difference
Scenarios like Larry’s are just one of nine reasons SIOS’ HANA multitarget automation is a bigger deal than you think. Here are all nine:
- Enhanced Protection
SIOS’ solution simplifies the protection of a HANA database resource in a multitarget scenario. Wizard-based options quickly detect the current configuration and precisely add the information to the LifeKeeper configuration. Error detection is both concise and informative to help users resolve any issues and subsequently save time.
- Streamlined Administration
Natalie (not her real name) was responsible for a HANA multinode configuration. When servers failed or required maintenance, Natalie relied on a variety of scripts and tools to perform the required actions. This, however, was not scalable. After moving to SIOS LifeKeeper, Natalie and her team had a simple UI to perform all core tasks, such as stopping and restarting HANA and HANA system replication. Additionally, if disaster strikes, the team can use the single, simplified SIOS UI instead of searching for the latest runbook, finding a copy of the right scripts, or calling Natalie at 2 AM.
- Simplified Monitoring
SIOS’ intuitive status reports in the UI provided the team with a quick way to determine the replication status. Using a single tool, versus a collection of monitoring boards and homemade scripts, simplifies administration and saves time.
- Automated Recovery
Some HANA HSR solutions are capable of performing a failover of the HANA replication between two nodes. However, an administrator often has to re-register the replication after a system failover. In the case of three or more nodes, will the administrator understand how to update the registration on the third or fourth nodes? Will they remember to use sync and async appropriately? The SIOS solution, capable of handling three or even four nodes for multitarget replication, will seamlessly automate the registration of target nodes after a failure.
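The bookkeeping an administrator would otherwise do by hand after a failover can be sketched as follows. This is an illustration only, not how LifeKeeper decides anything: the `Node` type, the `plan_reregistration` helper, the site labels, and the rule "same site as the new primary gets sync, any other site gets async" are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    site: str  # hypothetical data center or region label


def plan_reregistration(nodes, new_primary, failed=()):
    """Return {node_name: replication_mode} for each surviving secondary.

    Assumed rule for this sketch: secondaries in the new primary's site
    replicate with "sync"; secondaries elsewhere replicate with "async".
    """
    primary = next(n for n in nodes if n.name == new_primary)
    plan = {}
    for n in nodes:
        if n.name == new_primary or n.name in failed:
            continue  # the primary is the source; failed nodes cannot register yet
        plan[n.name] = "sync" if n.site == primary.site else "async"
    return plan


nodes = [Node("A", "dc1"), Node("B", "dc1"), Node("C", "dc2")]

# Node A crashes and B takes over: only C must be pointed at B.
print(plan_reregistration(nodes, "B", failed={"A"}))  # {'C': 'async'}

# Once A is repaired, it rejoins as a sync secondary of B.
print(plan_reregistration(nodes, "B"))  # {'A': 'sync', 'C': 'async'}
```

Even in this toy form, every failover changes which nodes need re-registering and with which mode, which is exactly the step that is easy to get wrong at 2 AM.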
- Flexibility and Scalability
The ability to protect a HANA cluster in two-, three-, or four-node combinations means that customers have the flexibility to dial up their level of both availability and disaster recovery. Two-node customers, with quorum, can provide availability protection against a disaster and handle maintenance activities with near-zero downtime by leveraging the HANA takeover with handshake feature. Customers deploying three nodes can dial up additional disaster recovery functionality by deploying the third node with async replication in a different data center or region. For added benefit, three-node customers can deploy a fourth node, with storage quorum, to enable high availability and disaster recovery in the event of an entire data center loss.
- Data Protection
Let’s go back to Larry’s issue. He was running HANA on primary node A with multitarget replication to Nodes B and C. What happens when your manual efforts end in disaster? Which node was the primary? Were things in sync when node A crashed? How do I avoid bringing up the wrong node? In addition to adding support for three or more nodes in a multi-target HSR configuration, the new HANA ARK includes additional admin tools to help in the event of a disaster or unfortunate split brain event.
The HANA_DATA_OUT_OF_SYNC_<tag> flag prevents users from accidentally restoring the database on the wrong system. The HANA_LAST_OWNER_<tag> flag helps administrators know when an action was taken on the primary system while standby nodes were not in sync. This flag tells the administrator that this node was the last owner and should be where replication is resumed. HANA_DATA_CONSISTENCY_UNKNOWN_<tag> helps SIOS to automatically resolve and restore replication when all communications between standbys were temporarily lost and then restored. When used with best practices, quorum deployment, and proper tuning, these tools allow administrators like Larry to avoid split brains and recover safely if and when they occur.
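As a rough mental model of how such flags can guard a restore, consider the sketch below. The flag names come from the description above, but everything else is hypothetical: the `safe_restore_candidates` helper, the resource tag `HDB`, and the decision rule are assumptions for illustration and do not reproduce the HANA ARK's actual logic.

```python
def safe_restore_candidates(flags_by_node, tag):
    """List nodes where restoring the database looks safe, best choice first.

    `flags_by_node` maps a node name to the set of flags present on it.
    Assumed rule: skip any node flagged out of sync; prefer the last owner.
    """
    out_of_sync = f"HANA_DATA_OUT_OF_SYNC_{tag}"
    last_owner = f"HANA_LAST_OWNER_{tag}"

    candidates = [
        node for node, flags in flags_by_node.items() if out_of_sync not in flags
    ]
    # Stable sort: nodes carrying the last-owner flag move to the front.
    candidates.sort(key=lambda node: last_owner not in flags_by_node[node])
    return candidates


# Node A was primary when B and C fell behind (hypothetical tag "HDB").
flags = {
    "B": {"HANA_DATA_OUT_OF_SYNC_HDB"},
    "C": {"HANA_DATA_OUT_OF_SYNC_HDB"},
    "A": {"HANA_LAST_OWNER_HDB"},
}
print(safe_restore_candidates(flags, "HDB"))  # ['A']
```

In this scenario, the only node the sketch offers is A, which answers the questions "Which node was the primary?" and "How do I avoid bringing up the wrong node?" without relying on anyone's memory.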
- Reporting, Performance and Disaster Recovery
Of course, the true benefit of multitarget is in the extra nodes and the functionality that these nodes unlock. Using three nodes in the same data center can unlock the potential for more reporting via the logreplay_readaccess parameter, while still maintaining a node at a DR site. In addition, SIOS’ support for different replication modes gives users the option to have sync nodes and async nodes for better performance across data centers (or regions).
- Continuous Testing
How often does your team test homemade scripts? How often is your runbook reviewed with respect to configuration, administration, and 2 AM scenarios? The HANA multitarget solution was continuously tested by SIOS engineers, QA, and Customer Experience experts during development, and it continues to be tested and validated for HANA failover and recovery processes with each release and update.
- Extensive Documentation
Some time ago, our team worked with a customer on cluster administration. While his predecessor was very knowledgeable about their environment, staff promotions and reorganization had left many IT folks responsible for systems they knew little about. When asked about runbooks and documentation of their configuration, the customer was unable to find details from the previous team or previous administrators. In addition to rock-solid automation, administration, monitoring, recovery, and data protection, the SIOS multitarget solution includes detailed, easy-to-use documentation about the implementation, operation, and management of a HANA multitarget system controlled by LifeKeeper.
Leveraging SIOS’ total solution means that customers can benefit from consistent, timely monitoring and detection, fast, reliable and efficient recovery, and a fully automated solution that guarantees high availability and disaster recovery protection. Contact us for more information on SAP HANA multitarget automation.
-By Cassius Rhue, VP Customer Experience