Keys to Success for Protecting Business-Critical Applications
High Availability and Disaster Recovery have to cover a broad range of use cases. There are as many use cases as there are organizations, far exceeding the capabilities of any single High Availability and Disaster Recovery solution to provide out-of-the-box support for every scenario. While many common applications have a wide array of High Availability and Disaster Recovery solutions available, more specific use cases limit the selection available to protect business-critical applications.
Of course, LifeKeeper cannot cover every use case out of the box. LifeKeeper, however, provides a versatile and flexible framework that can be adapted to a wide range of use cases to remedy this limitation. While powerful, this framework can appear complex to an outsider. This blog is here to help give a leg up when starting to conceptualize a Generic Application Recovery Kit for your specific use case.
Related Blogs and Background Reading Recommendations
This blog assumes familiarity with the LifeKeeper Resource Hierarchy framework and LifeKeeper Clustering in general. For background on these topics, the blogs listed below provide useful context. This blog also builds upon a previous blog, linked below, about one way to close the gap between possible use cases and supported protection mechanisms: the “Quick Service Protection Application Recovery Kit” (QSP ARK) within LifeKeeper.
- Linux Clustering / Windows Clustering (Writing credits to Ms. Hoagland, Vice President of SIOS Global Sales and Marketing, and the SIOS Marketing Team)
- Application Intelligence in Relation to High Availability (Writing credits to Ms. Hendricks-Sinke, Senior Software Engineer at SIOS)
- Resource actions and background on The Generic Application Recovery Kit (Writing credits to Mr. Birmingham, Senior Technical Evangelist)
- Choosing Between GenApp and QSP: Tailoring High Availability for Your Critical Applications (Writing credits to Ms. Hendricks-Sinke, Senior Software Engineer at SIOS)
This blog, however, will explore the options available when the QSP ARK cannot meet the demands for High Availability and Disaster Recovery for a particular application or use case.
Conceptualizing Applications and Defining the Approach
Asking the Smallest Question About Application Health
System administration and software engineering are both fields with lots of nuance. There can be so many different elements behind a question that a simple, straightforward answer can be difficult to obtain. In conversation, this can be easily navigated. In code, complex answers are difficult to accommodate. Asking the “smallest” question is the practice of targeting an inquiry to the smallest element possible, while ensuring that the answer has clearly defined criteria.
“Is the application running?” This is a “big” question; it may require a verbose answer. Yes, the application is running, but it is not responding. Yes, the application is running, but it is running on that other system – not the one you’re talking about. The criteria of the answer are ambiguous, and the answer is nuanced – a level of detail that developers would rather not have to handle.
“Is the application’s process running, and is the application actively responding to queries?”
Though longer to say, it is a smaller question. It clearly defines the conditions under which the answer is yes or no. While this change is an improvement, it is not yet the “smallest” question. It falls victim to the pitfall of asking “Are both X and Y true?” A yes or no answer cannot give the level of detail needed to determine the truth of X and Y independently. The smallest question requires specificity; it must provide full insight into the status of the smallest element of the greater whole. “Is the application’s process running on the desired system?” That is a small question, and in this case it is the smallest question. Keep in mind that there might be multiple “smallest” questions. In this example, “Is the application responding to queries?” would also qualify.
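The difference between one combined answer and two small answers can be sketched in a few lines of Python. This is only an illustration, not part of LifeKeeper: the names SmallAnswers, combined_answer, and explain are hypothetical, and the two boolean values stand in for real checks against an application.

```python
from typing import NamedTuple


class SmallAnswers(NamedTuple):
    """One yes/no answer per smallest question (hypothetical names)."""
    process_running: bool
    responding_to_queries: bool


def combined_answer(answers: SmallAnswers) -> bool:
    # "Are both X and Y true?" collapses two facts into a single bit.
    return answers.process_running and answers.responding_to_queries


def explain(answers: SmallAnswers) -> str:
    # Keeping the answers separate shows which element needs attention.
    if not answers.process_running:
        return "process is not running"
    if not answers.responding_to_queries:
        return "process is running but not responding"
    return "process is running and responding"


hung = SmallAnswers(process_running=True, responding_to_queries=False)
stopped = SmallAnswers(process_running=False, responding_to_queries=False)

# Both situations produce the same combined answer...
print(combined_answer(hung), combined_answer(stopped))
# ...but the separate answers tell them apart.
print(explain(hung))
print(explain(stopped))
```

A hung application and a stopped application both answer "no" to the combined question, yet they likely call for different responses. Only the separate answers make that distinction visible.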
While questions can be broken down almost indefinitely, there is a limit. Asking “the smallest question” comes with the implication of asking “the smallest question that still provides useful, actionable information”. Asking “Am I on the train to Philadelphia?” is sufficient; going further to ask “Am I on the train to Philadelphia, and which direction is Philadelphia?” provides more information, but that extra information is not actionable. I cannot change the direction of the train. I know from the answer to “Am I on the train to Philadelphia?” whether I need to call work to tell my boss that I will be late.
Though clear in this example, this is less obvious when developing for a generic application. Throughout the process of protecting the generic application, one must still keep perspective on the bigger picture. This, like anything else, is a skill – with practice and collaboration comes the ability to determine when a question is the smallest question, and when further nuance stops providing additional useful information.
Broad questions that have been broken into smaller, specific, and targeted inquiries about individual elements are the basis on which Generic Application Recovery Kits are built. Each “big question” can be answered through a composite of the answers provided for each of the elements implicated within.
Once the questions are broken into their smallest elements, the information that needs to be relayed becomes significantly clearer. Knowing the information needed, the remaining work in developing a Generic Application Recovery Kit is all a matter of how to get the information that is needed from the information that is provided. One must work with the information that is given.
Working with Application APIs and LifeKeeper APIs
Often, applications provide Graphical User Interfaces (GUIs) to display information or show changes that occurred to the application. While fantastic for human-driven use, this is less useful when the administration is being done by an application. GUIs are for people to use, and applications (short of a massive amount of programming effort and unnecessary complexity) are not equipped to interface with the GUI of another application as a human would. For the purposes of LifeKeeper and a Generic Application Resource, the exchange of information between the Generic Application Recovery Kit’s action scripts and the application being protected must be done through an Application Programming Interface, or “API”.
LifeKeeper provides its own API for interacting with LifeKeeper, the hierarchy, and the resources within the hierarchy. In the case of LifeKeeper’s API, the command-line utilities contained within the product are the easiest to use in a Generic Application. As a general recommendation, only the command-line utilities outlined in LifeKeeper Product Documentation (Linux Commands Documentation / Windows Commands Documentation) should be used. Even with this recommendation, these commands should be used with care and attention to detail to ensure that unintended actions are not taken.
Of course, LifeKeeper is not the only factor in the Generic Application. The application being protected must also present an API that the action scripts can use to achieve the desired outcomes. Developing a Generic Application Recovery Kit therefore requires knowledge of the protected application’s API and of how to use that API within the action scripts that compose the Generic Application Recovery Kit.
Using Return Codes and Output Streams in Recovery Scripts
Whether it is the API for LifeKeeper or the protected application, information will be primarily output in two ways:
- Return Codes
- Output Streams (sometimes called “STDOUT/STDERR output” or just “terminal output”)
How Return Codes Help Determine Success or Failure
Return codes, in the broadest sense, provide a quick way to see if a utility succeeded or failed. Typically (in the context of a shell environment), a return code of 0 indicates success, while a nonzero return code indicates failure.
Depending on the application, the exact value of the return code may give more insight into the error encountered. Often, the outcome of actions performed via an application’s API can be surmised simply by checking the return code.
In more nuanced cases, it may be that the return code is simply used to tell the program which course of action to take following a call to the application’s API. Return codes are especially useful when dealing with utilities that concern the state of some underlying element.
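As a minimal sketch of this idea, the Python below runs a command, ignores its output, and decides what to do next from the return code alone. Everything specific here is an assumption made for illustration: the helper names run_check and next_action are hypothetical, the meaning given to return code 3 is invented, and the commands are Python one-liners standing in for a real application utility. A real utility documents its own return codes.

```python
import subprocess
import sys


def run_check(command: list[str]) -> int:
    """Run a command and return its return code, discarding its output."""
    completed = subprocess.run(
        command,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,  # a nonzero return code is information, not an exception
    )
    return completed.returncode


def next_action(return_code: int) -> str:
    # Hypothetical mapping; consult the utility's documentation for real codes.
    if return_code == 0:
        return "nothing to do"
    if return_code == 3:
        return "start the application"
    return "report a failure"


# Stand-in commands: Python one-liners that exit with a chosen code.
ok = run_check([sys.executable, "-c", "raise SystemExit(0)"])
stopped = run_check([sys.executable, "-c", "raise SystemExit(3)"])

print(ok, next_action(ok))
print(stopped, next_action(stopped))
```

The point of the sketch is that no text parsing is needed: when a utility reports state through its return code, the calling script can branch on a single integer.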
How Output Streams Provide More Detailed Application Information
Output Streams, while more complex to make use of in a program, are sometimes necessary for information exchange or to verify outcomes. If running a utility to get the system’s hostname, the return code alone will not indicate what that hostname is; it will only indicate whether the utility was successful in retrieving it. The hostname itself has to be read from the output stream. In some cases, an API utility may return a successful return code if the requested information was obtained, but that information still has to be evaluated for validity based on the circumstances.
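The hostname case might look something like the following Python sketch. It is not taken from LifeKeeper: the helper names read_output and is_expected_host are hypothetical, the node names are made up, and a Python one-liner stands in for a real hostname utility so that the example runs anywhere.

```python
import subprocess
import sys
from typing import Optional


def read_output(command: list[str]) -> Optional[str]:
    """Return the command's trimmed STDOUT, or None if the command failed."""
    completed = subprocess.run(
        command, capture_output=True, text=True, check=False
    )
    if completed.returncode != 0:
        return None  # the return code reports failure; there is no value to read
    return completed.stdout.strip()


def is_expected_host(reported: Optional[str], expected: str) -> bool:
    # A successful return code is not enough; the value must still be validated.
    return reported is not None and reported.lower() == expected.lower()


# Stand-in for a hostname utility (hypothetical node names).
fake_hostname = [sys.executable, "-c", "print('node-a')"]
reported = read_output(fake_hostname)

print(reported)
print(is_expected_host(reported, "node-a"))
print(is_expected_host(reported, "node-b"))
```

Here the return code answers “did the utility work?”, while the output stream answers “what did it find?”. Whether that finding is acceptable is a separate decision that the script makes for itself.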
Whether using return codes or output streams, developing a Generic Application requires the use of the information at hand. When thinking of ways to achieve resource actions (outlined in the next section) or determine information about an application or LifeKeeper resource, try to think in terms of return codes and output streams, not GUIs. It can be helpful to imagine trying to convey information over the phone. That is to say, information is best communicated, actions are best defined, and scenarios are best handled when the inputs to a utility are described exactly as they are to be provided and its outputs are described exactly as they are reported.
Building a Foundation for Generic Application Protection
This section kept the strategies very conceptual. These strategies lay the foundation for thinking about an application through the responses provided to questions and actions issued upon that application. Going forward, the approach will become more specific to LifeKeeper and the process of creating a Generic Application Recovery Kit. In the meantime, these strategies develop like any other skill, through practice. Whether in technical communication, writing procedures, or any other capacity in which you find yourself, practicing these conceptualization strategies will pay off not only in the short term but over time as well.
Need help protecting a business-critical application that does not fit a standard high availability model? SIOS can help you evaluate your environment and determine the right LifeKeeper approach. Request a demo today.
Author: Philip Merry, L3 Support Engineer at SIOS Technology Corp.