If you are planning to implement a high availability cluster to protect SQL Server, you need to understand your Windows Server cluster quorum options. Here are some frequently asked questions:
Q: If we have a three-node on-premises cluster running Windows Server 2012 R2, is it possible to put another node in the Azure cloud and fail over to that Azure node?
A: Yes. You are creating a four-node cluster: three nodes on premises and one node in the cloud. If you're running a SQL Server failover cluster instance (FCI) on premises, you can replicate its storage to the cloud with SIOS DataKeeper, and the fourth node becomes part of the same failover cluster instance. If you're leveraging Always On Availability Groups (AOAG), as of SQL Server 2016 you are allowed to have three replicas with automatic failover. The fourth node, in the cloud, would be an asynchronous replica for disaster recovery. In that configuration there won't be any automatic failover to that node, but you can manually recover to it.

Note that regardless of whether you're using an FCI or AOAG, all four of your nodes will participate in the same cluster, and by default they will all be part of the quorum configuration. Because it is a four-node cluster, you have to have a witness. You can leverage the Azure cloud witness, so your witness would reside in Azure. That gives you a very high level of availability: even if the entire Azure cloud as a whole went away, you would only lose two votes. You would still have three votes on premises, so your on-premises cluster would continue to work. Dynamic quorum would kick in and adjust your five votes down to three, so you still have a very high level of availability. A more likely scenario isn't that the entire Azure cloud goes away, but that a fiber line outside your building gets cut and you lose connectivity to the Internet. That is a costly scenario for the business, but you would still have a very highly available configuration on premises; even that failure wouldn't affect your local availability.

Now, if you don't want to leverage the Azure cloud witness, or you can't use it because you are still running Windows Server 2012 R2, your best bet is to put a file share witness on premises. Use another small server to host a file share witness. That gives you four of your cluster's five votes on premises (three nodes plus the witness). If you lose the entire site in a catastrophic failure, node four in the cloud will still be there. Of course, it's not going to fail over automatically because it can't achieve quorum. Having lost four out of five votes, you would use an easy, well-documented process to force quorum on that surviving cloud node: essentially you restart the cluster service with the /fq (force quorum) switch, and you'd be back up and running pretty quickly.
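To make that concrete, here is a minimal PowerShell sketch of those quorum operations, assuming the FailoverClusters module is installed; the storage account, access key, share path, and node name are placeholders. The cloud witness option requires Windows Server 2016 or later.

    # Windows Server 2016 or later: use an Azure storage account as the cloud witness.
    # "mystorageacct" and the access key are placeholders for your own values.
    Set-ClusterQuorum -CloudWitness -AccountName "mystorageacct" -AccessKey "<storage-account-key>"

    # Windows Server 2012 R2 alternative: point the cluster at an on-premises file share witness.
    Set-ClusterQuorum -FileShareWitness "\\WITNESS-SRV\ClusterWitness"

    # After a site loss takes out the majority of votes, force quorum on the surviving cloud node.
    # Run this locally on that node; it is equivalent to restarting the cluster service with /fq.
    Start-ClusterNode -Name "AZURENODE1" -FixQuorum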
Q: If you fail over from an on-premises SAN-based cluster to a node in the cloud, is the SAN storage just for the local machines? Does the cloud node have a separate set of storage, or is the cloud server actually able to access the SAN storage?
A: If you're using FCI, then all three on-premises nodes are going to be attached to the local SAN. If you were keeping everything on premises, you would have a few replication options, but if your fourth node is in the cloud, there's no way for you to use any sort of SAN-based replication solution such as EMC SRDF. That's not possible when you're considering the cloud, because you have no access to the storage, and chances are you're not using the same brand of storage anyway. You have to use software-based replication such as SIOS DataKeeper. In this scenario, your three nodes all share a single disk on premises (your SAN), and SIOS DataKeeper simultaneously replicates the data to that fourth node, keeping all of your storage synchronized. Your fourth node may have locally attached storage; if it's in the Azure cloud, it may use premium disks, or in AWS it may use EBS Provisioned IOPS volumes. The DataKeeper software is storage-agnostic, so it will replicate between any storage types. Let's say you have a SAN and your cluster uses an E drive for your log files and an F drive for a terabyte of data files. You should have an E drive and an F drive on that fourth node, each sized to match (the F drive a terabyte, for example), so that DataKeeper can mirror to that node.
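Before creating the mirrors, it can help to confirm that every node presents the same drive letters with matching capacity. A small PowerShell sketch, assuming PowerShell remoting is enabled and using placeholder node names:

    # List the E and F volumes on each node so you can confirm matching letters and sizes.
    $nodes = "SQLNODE1", "SQLNODE2", "SQLNODE3", "AZURENODE1"
    Invoke-Command -ComputerName $nodes -ScriptBlock {
        Get-Volume -DriveLetter E, F |
            Select-Object @{n = "Node"; e = { $env:COMPUTERNAME }}, DriveLetter,
                          @{n = "SizeGB"; e = { [math]::Round($_.Size / 1GB) }}
    } | Sort-Object Node, DriveLetter | Format-Table -AutoSize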
Q: Can you fail over and fail back when you have a cluster with both on-premises SAN-based nodes and nodes in the cloud?
A: Yes. If you use FCI and SIOS DataKeeper software, SIOS integrates with failover clustering, so when the DR server comes online it becomes the source of the mirror. Anything written on the DR node will be replicated back to the on-premises cluster automatically. Suppose you had a massive power failure in your primary data center and it was going to be offline for a long time. You could enact a disaster recovery plan where you configure your DR node to come online automatically. With SIOS DataKeeper, even if you choose asynchronous replication, you still have an excellent recovery point objective (RPO). If your recovery time objective (RTO) is more important than your RPO, then configure automatic failover to the DR node; if you do that, you accept that in the event of a disaster you could potentially lose a few writes. If you don't want automatic failover, you can configure a manual failover as well, so you control when to pull the switch and bring the node online.

When your DR node comes online, you can run your SQL Server there for a day, a week, or a year. SIOS DataKeeper tracks those writes in an intent log, which is a small, fixed-size file. When the primary data center comes back online, DataKeeper detects it automatically and initiates a partial resync, so only the blocks that have changed are sent over. Once the mirror is back in sync, the administrator can bring the cluster back online in the primary data center, and operation will switch over automatically to your primary node as part of failover clustering.
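One way to keep the DR node as a manual-only failover target in an FCI is to remove it from the possible owners of the SQL Server resource and only add it back when you invoke the DR plan. A hedged sketch, with placeholder resource and node names (your role may contain additional resources, such as IP addresses and replicated volumes, that need the same treatment):

    # Keep the cloud node out of the possible owners so the cluster never fails over to it on its own.
    Get-ClusterResource -Name "SQL Server (MSSQLSERVER)" |
        Set-ClusterOwnerNode -Owners SQLNODE1, SQLNODE2, SQLNODE3

    # When you decide to invoke the DR plan, add it back and move the role manually.
    Get-ClusterResource -Name "SQL Server (MSSQLSERVER)" |
        Set-ClusterOwnerNode -Owners SQLNODE1, SQLNODE2, SQLNODE3, AZURENODE1
    Move-ClusterGroup -Name "SQL Server (MSSQLSERVER)" -Node AZURENODE1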
Q: If you have multiple nodes in a cluster, are there limitations with clustering in terms of RAM and CPU, or is it based on the Windows operating system and the version of SQL Server?
A: You should size each cluster node similarly. Before Windows Server 2008, to be supported, your cluster had to be built with components on Microsoft's hardware compatibility list. With Windows Server 2008, Microsoft replaced the hardware compatibility list with a cluster validation tool. When you build a cluster, you run this tool, and it looks at each node, as well as your networking, storage, service pack levels, and a variety of other things, and it calls out configuration options or hardware issues of concern. It will give you warnings or failures.
To evaluate warnings, you have to understand what they're warning you about. For instance, in most cloud configurations you're only using one network interface card, so the tool will call that out as a potential single point of failure. We know that in virtual environments that virtual NIC is usually backed by multiple physical NICs, so it is not really a single point of failure. There is no set rule that you must have a certain amount of RAM or a particular CPU, as long as you pass cluster validation and you're confident your hardware can meet the needs of your application. Make sure your machine is powerful enough to run the workload. Don't make node one a powerful 16-core server running a very important SQL Server workload and then make your backup server an old laptop.
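You can also run that same validation from PowerShell. A quick sketch with placeholder node names; the -Ignore switch is useful when a warning, such as the absence of shared storage in a replicated configuration, is expected:

    # Run full cluster validation and review the generated report for warnings and failures.
    Test-Cluster -Node SQLNODE1, SQLNODE2, SQLNODE3, AZURENODE1

    # Skip the storage tests when the nodes intentionally use replicated rather than shared storage.
    Test-Cluster -Node SQLNODE1, SQLNODE2, SQLNODE3, AZURENODE1 -Ignore "Storage"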
Ideally, your secondary node should be the same type of server. Just realize that you're going to have to run that workload at some point, and there is no point in failing over to a node that can't handle it. If you are running in a cloud environment, you can "cheat" on the sizing a bit to save on infrastructure costs. The best thing about the cloud is that you can provision a smaller instance than you would in a physical server environment. You are paying per core in the cloud; although that DR server has to be running, you don't have to pay for 16 cores 24/7, 365 days a year. You can scale it down to the minimum you need for a successful failover. In the event of a disaster, it only takes a couple of minutes to increase the size of your cloud node. You can go from a D1 instance in Azure to a G5 instance, as long as it stays within your same family: you just open up the Azure console, choose the G5 size, and it will reboot the server. When you have a disaster, another couple of minutes is typically acceptable, and now you're only paying for the larger instance size when you actually need it.
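For illustration, resizing the DR VM with the Az PowerShell module might look like the sketch below; the resource group, VM name, and target size are placeholders, and the VM restarts as part of the resize.

    # Check which sizes this VM can be resized to, then scale it up for the failover workload.
    Get-AzVMSize -ResourceGroupName "sql-dr-rg" -VMName "AZURENODE1"

    $vm = Get-AzVM -ResourceGroupName "sql-dr-rg" -Name "AZURENODE1"
    $vm.HardwareProfile.VmSize = "Standard_G5"
    Update-AzVM -ResourceGroupName "sql-dr-rg" -VM $vm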
– Dave Bermingham, SIOS Technology Corp. Technical Evangelist; Microsoft Cloud and Datacenter Management MVP