Working in a role rooted in software engineering, system administration, and customer support gives one a unique opportunity to see a wide variety of configurations and issues. Such a role also offers perspective on users’ needs, pain points, and concerns that someone working in a purely engineering role might never be exposed to.
After almost 5 years on the support team, I have noticed patterns across the teams with which I have worked. When called to help on various configurations, I have also been able to draw parallels between different use cases and root causes. As a result, there is a foundation that I like to ensure is set when it is time to begin collaborating with a new team. Setting this foundation means ensuring that administration practices facilitate working optimally with an HA/DR suite, and that teams know how to design for High Availability and how to leverage the utilities beyond the software on their systems to achieve success. This foundation can be crucial to ensuring a team knows how to meet or exceed its operational standards. It seemed appropriate to summarize the common questions and their answers as a resource for those who are new to High Availability but interested in implementing a solution, or who simply want to change to a new High Availability solution. Whether you are a student just starting to study system administration/systems engineering, or a veteran software engineer who has been asked to expand the scope of your role to include system architecture planning, the points below can aid in your journey to get the most out of a high availability/disaster recovery suite.
Without further ado, the questions below summarize the common talking points I have seen in my role, and will help make your search for understanding key concepts and finding a fitting solution easier.
What is Disaster Recovery and what does it entail?
Disaster recovery, when coupled with high availability, works to optimize the recovery time objective (RTO) – how long a service can be inaccessible before being restored – and the recovery point objective (RPO) – how much recent data you can stand to lose when restoring from a backup.
The recovery time objective describes how long a system can be down and still fit within an operational standard. Commonly, the related availability target is phrased as a percentage: the familiar “five nines of uptime” refers to 99.999% uptime, or a maximum of roughly 5 minutes and 15 seconds of downtime per year. The recovery point objective is a bit more complicated: it describes the amount of data that can be lost while still falling within operational standards. For example, if a system cannot lose any data in the wake of a disaster, then that is called “Zero RPO”. It can be helpful to think of the systems existing on a timeline, and the recovery point objective as the answer to the following question: “If systems experience a disaster, how far back in the system timeline can you ‘rewind’ and still meet operational standards?”
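The arithmetic behind an uptime percentage can be checked with a short sketch. The function name below is illustrative rather than part of any product, and the calculation assumes a 365-day year:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap, 365-day year


def max_downtime_minutes_per_year(uptime_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an uptime target."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    # The downtime budget is the share of the year NOT covered by the target.
    return MINUTES_PER_YEAR * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        budget = max_downtime_minutes_per_year(target)
        print(f"{target}% uptime allows about {budget:.2f} minutes of downtime per year")
```

Each additional “nine” shrinks the downtime budget by a factor of ten, which is why the step from four nines to five nines usually cannot be met with manual recovery alone.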
How does Disaster Recovery differ from traditional approaches to weathering outages?
Traditionally, without a highly available infrastructure, an environment experiencing a disaster may face a lengthy recovery time. Systems need to be restored, issues may need to be resolved, and applications must be started by administrators. Depending on the severity of the issue, it could take hours or more to get back up and running. Teams must work efficiently and communicate tightly to ensure service is restored without mistakes, lest they risk additional delay in returning to operation. Additionally, the data lost during this sort of outage could be significant. If backups were not taken recently, or if copies of up-to-date data are not accessible, then teams could be relying on data that has gone “stale” and experience operational setbacks on an organizational scale due to the loss of critical data. To look at things from a customer perspective, how long are you willing to wait for access to an online service when you need it? As a customer, how accepting are you if an online storefront loses record of your transactions?
When a highly available infrastructure, a means to mirror storage, and a means to orchestrate the high availability are introduced, the factors influencing RTO and RPO are all optimized, and a disaster can be weathered with far more grace. A highly available infrastructure is redundant, so a standby system is available to take over operation. Further, the orchestrator – software to manage the clustered environment – is able to systematically start services on a standby system with greater responsiveness, reliability, and efficiency than manual intervention can achieve. As a result, the recovery time is reduced, and rather than taking hours to recover from a disaster, it can take mere minutes or less.
Another facet of highly available infrastructure is the redundancy of data. Disks can be “mirrored”, meaning that disks attached to different systems all receive the exact same data in real time. As a result, the data available on the aforementioned standby system can be an exact copy, effectively maintaining a backup of the data as it was immediately before a disaster occurred. In turn, when the orchestrator moves operations to the standby system, applications resume from the most current state of the data possible, achieving a near-zero recovery point objective.
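The “rewind” question can be made concrete with a minimal sketch that compares the age of the most recent usable copy of the data against an RPO. The function names, timestamps, and the five-minute RPO are all hypothetical values chosen for illustration, not measurements from any real environment:

```python
from datetime import datetime, timedelta


def data_loss_window(last_good_copy: datetime, disaster: datetime) -> timedelta:
    """Span of time whose writes exist only on the failed system."""
    if last_good_copy > disaster:
        raise ValueError("last_good_copy cannot be later than the disaster")
    return disaster - last_good_copy


def meets_rpo(last_good_copy: datetime, disaster: datetime, rpo: timedelta) -> bool:
    """True if the data lost by rewinding to last_good_copy fits within the RPO."""
    return data_loss_window(last_good_copy, disaster) <= rpo


if __name__ == "__main__":
    disaster = datetime(2025, 3, 25, 14, 30, 0)       # hypothetical outage time
    nightly_backup = datetime(2025, 3, 25, 1, 0, 0)   # hypothetical backup time
    mirrored_write = disaster - timedelta(seconds=1)  # hypothetical last mirrored write
    rpo = timedelta(minutes=5)                        # hypothetical objective

    print("Nightly backup meets RPO:", meets_rpo(nightly_backup, disaster, rpo))
    print("Mirrored disk meets RPO:", meets_rpo(mirrored_write, disaster, rpo))
```

With these example values the nightly backup would lose 13.5 hours of data while the mirrored copy would lose about a second, which is the gap that real-time mirroring is meant to close.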
What are the most common mistakes organizations make when designing high availability disaster recovery (HADR) strategies, and how can they avoid them?
One of the most common missteps observed is the lack of a QA/testing environment. The SIOS Customer Experience team has responded to multiple cases in which organizations attempted application or operating system patching, upgrades, or just routine maintenance and experienced issues due to inadequate planning or an unfortunate incompatibility. The environment then suffers downtime, and a maintenance procedure turns into a recovery procedure. This introduces delays, complications, and the potential for a spiraling issue within a production environment.
By far, the biggest recommendation that can be offered to organizations is to create a one-to-one copy of the production environment that operates in a quality assurance capacity. Every procedure that needs to occur on production should first go through a “dress rehearsal” in the QA environment. This gives organizations the freedom to exercise the planned operations and make improvements without risking the productive capacity of their infrastructure. Practicing operations in a safe, low-stakes environment reduces the risk that teams encounter an unexpected issue in production and have to go “off script” to respond quickly and correctly while under pressure. If a problem happens in the QA environment, support teams can be contacted and the issue can be investigated safely, insulated from business operations. This greatly improves the chance that solutions are found and implemented in a controlled, planned, and effective manner.
The aforementioned benefit of the QA environment is important for any organization; however, as organizations adopt more complex maintenance strategies, the existence of this test environment becomes all the more important. The test environment not only facilitates smoother upgrade procedures but also allows companies to mitigate risk when adopting maintenance models that introduce complexity in return for improved system availability during maintenance activities. In any scenario, testing the maintenance plan in a QA environment, improving the plan based on findings from the “dress rehearsal”, and applying the experience gained from this practice enables organizations to manage production systems while minimizing the risk of encountering issues.
What is the importance of eliminating single points of failure?
Another common obstacle arises from having a “weakest link” in the architecture that does not benefit from the degree of planning that other facets of the environment receive. This is best described with an example. The SIOS Customer Experience team once worked with a customer who had designed extensively around keeping SAP applications running in their environment and was very well insulated from issues affecting the systems running those applications. Unfortunately, this customer invested much of the planning effort into protecting their applications and did not afford that same effort to other aspects of their environment. All of the systems relied on a single internal DNS system that resolved hosts within their private network. When an issue occurred on that DNS system, name resolution was no longer available and the whole environment experienced significant issues. Effectively, the effort placed into protecting their SAP applications did not help their environment weather the issue, simply because the DNS was a “weak link” that all of the other systems relied upon to function properly. When planning environments, it is crucial to step back and look at the bigger picture, paying attention to the weakest links that show up in an architecture. Improving the weakest links raises the ability of the entire environment to weather a disaster.
For organizations relying heavily on cloud services, how can they protect against zone or region-wide disasters?
Protecting against zone or region-wide disasters can be done by distributing resources geographically. For example, an organization might host its primary application server in the US-East region. Then, to be protected against an outage affecting the US-East region, it hosts standby systems in a “Disaster Recovery Site” that is far away from the US-East region, perhaps the US-West region. While this does introduce some additional steps to ensure cross-region communication, the effort is worthwhile because it provides protection against zone and region-wide levels of disaster. A total outage of the cloud provider’s US-East region can be withstood by bringing applications in service in the US-West region. Protection against outages that occur in a specific region doesn’t need to be complicated, and ensuring a Disaster Recovery site exists to assume operations will improve application availability and data redundancy in production environments.
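One way to look for weak links during planning is to list each component, how many redundant instances it has, and what depends on it. The sketch below is only an illustration of that review: the component names and the inventory format are hypothetical, and it checks direct dependencies only.

```python
from typing import Dict, List, Set


def find_single_points_of_failure(
    instance_counts: Dict[str, int],
    dependencies: Dict[str, List[str]],
) -> Dict[str, Set[str]]:
    """Map each non-redundant component to the services that depend on it."""
    weak_links: Dict[str, Set[str]] = {}
    for service, required in dependencies.items():
        for component in required:
            # Components missing from the inventory count as zero instances,
            # so they are flagged as well.
            if instance_counts.get(component, 0) < 2:
                weak_links.setdefault(component, set()).add(service)
    return weak_links


if __name__ == "__main__":
    # Hypothetical inventory loosely modeled on the example above.
    instance_counts = {"sap-app-server": 2, "database": 2, "internal-dns": 1}
    dependencies = {
        "sap-app-server": ["database", "internal-dns"],
        "database": ["internal-dns"],
    }
    for component, dependents in find_single_points_of_failure(
        instance_counts, dependencies
    ).items():
        print(f"{component} has no redundancy but is required by: {sorted(dependents)}")
```

In this example the clustered application and database pass the check, while the single DNS instance is flagged because both of them rely on it.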
How do you recommend organizations balance the complexity and cost of implementing robust HA/DR strategies with the need for business agility?
There is a common assumption that HA/DR solutions are complex, expensive, or both. In the face of this assumption, it is essential to keep a strong perspective on the stakes at hand. Systems are operational for some business purpose, and this translates into the production of revenue. When systems are down due to an outage, there is much more cost than just the lost revenue. Without an HA/DR strategy in place, an outage requires employees to actively troubleshoot the issue, adding a cost in employee-hours to the cost of downtime, perhaps even at hours when employees are not well-rested and prepared to do their best work. In addition, there is a lingering collateral cost in the interruption of regular duties and the slowdown that comes when employees have to switch tasks to resolve production issues and then switch back to their regular duties. Even further, there are reputational costs that could lead to missed opportunities for revenue. For instance, what comes to mind if you think of “CrowdStrike”? Even if the name doesn’t immediately bring to mind the issues and related bad press that CrowdStrike experienced in July of 2024, at the time of writing this (March 25th, 2025), their stock price has only just returned to the level it was at before the issue on July 19th, 2024. When weighing the opportunity cost of configuring an HA/DR solution, these factors can vastly change the analysis. Commonly, SIOS customers find that the implementation of an HA/DR solution saves them money in the long run. Additionally, backed by decades of improvement and iteration on the HA/DR offerings from SIOS Technology, configuring such a solution is more approachable and less complex than ever.
If there are factors at play that still bring concern over the complexity of introducing an HA/DR solution to a production environment, SIOS Technology has professional services offerings that can help to train teams, perform installation and configuration activities, or simply validate existing configurations. With these opportunities, bringing High Availability into a system architecture is not only less complex than it has ever been, but it can be implemented faster than ever before. Finally, for organizations concerned about complexity due to unique configurations or trying to reach the absolute maximum utility of an HA/DR solution, our world-class support team is available to help bring any implementation to its full potential.
How do SIOS Technology’s solutions play a role in helping organizations implement the disaster recovery approach that you advocate for?
SIOS Technology’s solutions can meet all of the aspects addressed previously. To recount some of them:
Modern approaches to disaster recovery are adopted by way of our LifeKeeper and DataKeeper products, which together we call SIOS Protection Suite. Whether on Linux or Windows, these products provide cluster-wide orchestration of resources to ensure a quick and efficient response to disasters while also ensuring data is replicated and available on standby systems. LifeKeeper monitors applications for faults and communicates between nodes to ensure systems are valid targets for application recovery. DataKeeper replicates data in real time to ensure standby systems are able to inherit applications in the event of an issue and continue operation on the latest available data. Hand in hand, these products work to minimize the length of time applications are down and minimize the loss of data in the event of a disaster.
These products also integrate fully within your environment. There are mechanisms to provide efficient networking control so clients can always resolve the connection to the application servers. The solutions at play will not only monitor applications or specific components of a system, but also an entire system and environment. Through the use of “quorum” functionality, environments are monitored at a “big picture” level to ensure applications are restored on the correct systems and data is protected. There are protections in place for a myriad of disaster scenarios, so SIOS Protection Suite is able to respond appropriately.
SIOS Protection Suite is also able to work across regions, providing the protection we discussed against zone or region-level disasters. Applications can be migrated across regions, and data can be replicated across regions with the same ease as it can be replicated within the same region. Additionally, environments can be multi-tiered. Multiple nodes can be hosted in the primary region and act as either active or standby systems, providing fast responsiveness to system-level issues, while a disaster recovery site in a different region can also be maintained to ensure there is protection from region-level disasters with the same speed and efficacy of protection.
Finally, the SIOS Protection Suite product benefits from decades of real-world use. It has been put through its paces in a wide range of scenarios and deployment configurations, and has benefited from years of ease-of-use improvements. As a result, this is a solution that is flexible, easily adopted, and fits seamlessly into production environments. Much of the complexity of designing and configuring an HA/DR solution is avoided by adopting SIOS Protection Suite and enjoying the benefits of a rich development history with countless improvements, coupled with the world-class support team that is available to help with any questions or concerns that may arise. In addition, there are opportunities to undergo collaborative installation or validation procedures for SIOS Protection Suite offerings, ensuring your environment is ready for whatever the world can throw at it. For teams that need strongly experienced staff and want to get the most out of SIOS Protection Suite and its components, SIOS also offers training engagements in which teams work with our staff to understand the components at play and have an active discussion, building the deep understanding staff need to hit the ground running and implement the solution to its highest potential.
Protect your business from downtime and data loss—request a demo or start your free trial to see SIOS in action.
Author: Philip Merry, CX – Software Engineer at SIOS Technology Corp.