The Danger of "Turn It Off, Turn It Back On Again" Thinking in High Availability


“Turn it off, turn it back on again.” Anyone who has troubleshot any kind of computer issue has heard this advice. It is notorious for being the most common tech fix, and for making anyone look like a master IT troubleshooter. The problem is that it is rarely a real solution; it just happens to make most symptoms go away. By turning it off and back on again, we get back up and running quickly, but we never find out what the problem was in the first place.

Why “Turn It Off and Back On Again” Is Risky in High Availability Systems

In the world of high availability, however, “turn it off” can itself be a huge problem. For companies whose critical infrastructure must stay up, even minutes of downtime can be costly. Because of this, in tech support at SIOS we don’t often give this notorious piece of advice, but we do have our own version.

Many who have called SIOS tech support about a Windows DataKeeper mirroring issue will have been told to run the command “cleanupmirror.” In the right situation, this is an excellent command for quickly getting someone out of a major problem. It deletes the mirror configuration and any remnants of it, so that we can recreate the mirror fresh, free from whatever problem plagued it previously. Note that this does not remove any data, just the replication between the systems.

The command does not require downtime, but it does mean that the systems are not highly available until the mirror finishes resyncing. This is one of our go-to troubleshooting steps in support, but like “turn it off, turn it back on again,” it can sometimes hide a more serious underlying issue, and it can sometimes be overkill. 
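For readers who haven’t used it, this operation goes through DataKeeper’s EMCMD command-line utility. The sketch below uses hypothetical server names (SERVER1, SERVER2) and a hypothetical mirrored volume E:; check the DataKeeper documentation for the exact procedure for your version:

```cmd
:: Run against each system in the mirror; this removes the mirror
:: configuration and its remnants for volume E:, but not the data.
emcmd SERVER1 cleanupmirror E
emcmd SERVER2 cleanupmirror E

:: The mirror is then recreated fresh (via the DataKeeper GUI or
:: EMCMD), after which a full resync of the volume begins.
```

Those last two comment lines are the catch: recreating the mirror means resyncing it, which is why the systems are not highly available again until that resync completes.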

Today, I want to talk about one such case: running cleanupmirror got the customer out of an immediate problem, but it almost made us miss a fairly serious issue, one that could affect a wide range of customers yet had an easy workaround and fix.

A Real-World DataKeeper Mirroring Issue During Migration

When the support team joined, the customer had already been troubleshooting for quite some time, and they were starting to panic. They were running their final switchover tests as part of a migration when DataKeeper mirroring started having issues. At that point their critical infrastructure was down, and they were worried it would start affecting their business. It was a high-stress situation, but the support engineers did an excellent job: balancing the pressure and the need for a quick, sound fix, they ran the tried-and-true “cleanupmirror” command and recreated the mirror in working order. They got the customer out of a bind, and everybody moved on. Fortunately, they also asked the customer to send in logs, “for good measure.”


The logs on this case were somewhat confusing. They indicated that a volume had been resized, yet on the call the customer had said they had not performed any resizing. Customers sometimes leave out important details, so at first we assumed that was the case here, but the resize didn’t make sense. The change in size was very small, and it happened to all volumes at the same moment as the first switchover. No one resizes terabyte-scale drives by subtracting less than a gigabyte from each of them, all at once, perfectly in sync with a switchover, so we looked deeper. It turned out that the target drives were slightly larger than the source drives, and our product had an issue in how it handled mismatched drive sizes.

Identifying the Root Cause Prevented Repeat Downtime

Once we figured this out, we realized that all that was needed to resolve the issue was to continue the mirror: a common, quick, and easy operation that would have completely fixed the problem in seconds, with no days-long resync before high availability was restored. And once we had found the issue, it was a quick and easy fix to implement for the next product version.
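For comparison with the heavier cleanupmirror path, continuing a paused mirror is a one-line EMCMD operation. Again a sketch with hypothetical names (SERVER1 as the mirror source, volume E:); consult the EMCMD reference for your release:

```cmd
:: Run on the source system; resumes replication on the existing
:: mirror rather than deleting and rebuilding it, so only the
:: outstanding changes need to be sent instead of a full resync.
emcmd SERVER1 continuemirror E
```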

It turned out that the customer had a unique migration scenario, which required them to make the targets slightly larger, because matching up the sizes was impossible. They still had several systems left to migrate, and if we had left the case at “cleanupmirror,” they would have run into this issue every time. Because we found the root cause, we were able to give them a quick and easy workaround, and an even quicker preventative measure they could take before executing the first switchover. We were also able to publish a solution, so that the next customer who ran into this would be able to solve it in minutes.

Why Root Cause Analysis Matters in High Availability

So, what is the big problem with “turn it off, turn it back on again”? It hides the root cause. Does that mean you should never use it? No; it is still some of the best tech advice there is. Often you don’t really need to know the root cause, and turning it off and back on again gets you out of a pinch quickly.

The important part for an IT professional is this: when you aren’t in a pinch and can afford some time to investigate first, you should. And when you can’t, go back later and look at the logs to see if you can figure out what happened.

So, please, turn it off and turn it back on to your heart’s content. Be the magician who solved that one problem in minutes, and leave everyone wondering how you did that. But… every once in a while… take some time to go back and figure out why you needed to turn it off and back on again… and consider the possibility that there could have been an even easier solution.

To learn more about how SIOS DataKeeper and high availability solutions can help you avoid hidden issues like this, request a demo from our team today.

Author: Carter Chandler CX Associate, Software Engineer at SIOS Technology
