Microsoft pushes Failover Clustering further with every release, and its additions in Windows Server 2016 are extensive and very exciting! In this blog post, I'll take you through these cool new features and show you how they will improve your cluster's functionality.
For those of you who are just learning about Failover Clustering: it's a Windows Server feature that lets you group multiple servers, even across different sites, into a cluster to provide redundancy for your working environment. By redundancy, I mean mere seconds of downtime, even when your sites are hundreds of miles apart. Although you get monitoring tools for your whole environment, the system is largely self-sufficient: it monitors and balances your environment on its own, and should any deviation or disaster occur, it performs the necessary operations automatically within the parameters you configure beforehand.
Deployment and upgrade
The setup of a cluster is pretty much the same as before, so you can check Andrew's blog, How to deploy Failover Cluster on Windows Server 2012 R2, for detailed deployment information. The main difference today is that you can now create a Failover Cluster not only with nodes in multiple domains, but also without any domain at all, simply by adding the nodes to a workgroup. If you're already rocking a 2012 R2 cluster, the dreaded upgrade is no longer an issue! Upgrading a 2012 R2 cluster to 2016 is now easily done with the newest feature, Cluster Operating System Rolling Upgrade. Check out this awesome guide by Clint Wyckoff that shows, step by step, just how easy and user-friendly this upgrade is compared to the full cluster rebuild required before, which was far from convenient. Now onto the new features!
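As a rough sketch of the new options, a workgroup (non-domain) cluster is created by pointing New-Cluster at a DNS administrative access point, and a rolling upgrade is finalized by raising the cluster functional level. The node names, cluster name and IP address below are placeholders:

```powershell
# Create a cluster from workgroup nodes — no Active Directory required.
# "WGCluster", "Node1"/"Node2" and the address are hypothetical examples.
New-Cluster -Name WGCluster -Node Node1, Node2 `
    -AdministrativeAccessPoint DNS -StaticAddress 192.168.1.50

# After a rolling upgrade, once every node runs Windows Server 2016,
# commit the cluster to the 2016 functional level (this step is irreversible):
Update-ClusterFunctionalLevel
```

Until Update-ClusterFunctionalLevel is run, the cluster keeps operating at the 2012 R2 level, so you can still roll nodes back if something goes wrong mid-upgrade.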
Cloud Witness

Nothing is more important than ensuring proper business continuity; it's what clusters are for. Failover Clustering uses the concept of a quorum to decide whether the cluster should keep running when some nodes become unavailable: if a majority of "votes" cannot be achieved, the cluster stops. There are three conventional quorum configurations available, depending on your setup:
- Node majority without a witness
- Node majority and a witness (disk or file share)
- No majority, with only a disk witness
To keep a cluster running, a majority of votes must be maintained. For greater availability, a file share witness is usually deployed on a site separate from the main (or main and DR) sites. But not everyone can build an additional site just to host a quorum witness, and this is where Cloud Witness steps in. With Cloud Witness, you don't even need to run a VM in the cloud: the cluster simply uses an Azure Blob Storage account for its vote. This is far less expensive than building an additional site with its own maintenance costs.
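Setting this up comes down to a single cmdlet once you have an Azure Storage account; the account name and key below are placeholders you'd replace with your own:

```powershell
# Reconfigure the quorum to use an Azure Storage account as the Cloud Witness.
# "mystorageaccount" and the access key are hypothetical — use your own values.
Set-ClusterQuorum -CloudWitness -AccountName "mystorageaccount" `
    -AccessKey "<storage-account-access-key>"
```

The witness stores only a small blob used for arbitration, so the Azure cost is negligible compared to running any kind of witness VM or site.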
Site-aware Failover Clusters
Site-aware Failover Clusters add site awareness to your stretched cluster: when a node fails on a specific site, the remaining operational and accessible nodes on the same site are used for failover, rather than sending roles over to another site. You can also specify a preferred site (usually your production site) that is prioritized for placement of new VMs and in quorum decisions. All in all, configuring sites brings a handful of benefits, such as fine-tuned heartbeat parameters between sites, plus storage and failover affinity, all of which make failover behavior and resource usage far more efficient and predictable.
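A minimal sketch of configuring sites might look like this; the site and node names are hypothetical:

```powershell
# Define two sites and place the cluster nodes into them (names are examples):
New-ClusterFaultDomain -Name "Primary" -Type Site -Description "Main datacenter"
New-ClusterFaultDomain -Name "DR" -Type Site -Description "DR datacenter"
Set-ClusterFaultDomain -Name Node1 -Parent "Primary"
Set-ClusterFaultDomain -Name Node2 -Parent "DR"

# Prefer the production site for VM placement and quorum tie-breaking:
(Get-Cluster).PreferredSite = "Primary"
```

With the preferred site set, the cluster favors "Primary" for placement and, in a split between equally weighted sites, keeps that site running.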
In addition to site awareness, there are fault domains. By default, each node is already treated as its own fault domain. Now, new levels of fault tolerance are introduced with chassis, rack and site domains. This lets you granularly ensure that your data is preserved and safe at each of these levels in case of a disaster. While this comes at the price of additional hardware and some more planning, you get a truly fault-tolerant environment. The new Storage Spaces Direct (S2D) fully utilizes this functionality to achieve its level of resiliency. You also get extended monitoring capabilities with the Health Service feature, which is enabled by default with S2D.
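The fault domain levels nest into a hierarchy, which you can build with the same cmdlets; all the names below are hypothetical:

```powershell
# Build a site > rack > node hierarchy (all names are example placeholders):
New-ClusterFaultDomain -Name "SiteA" -Type Site
New-ClusterFaultDomain -Name "Rack01" -Type Rack
Set-ClusterFaultDomain -Name "Rack01" -Parent "SiteA"
Set-ClusterFaultDomain -Name "Node1" -Parent "Rack01"

# Inspect the resulting hierarchy:
Get-ClusterFaultDomain
```

S2D then uses this hierarchy to spread data copies across racks or sites rather than just across nodes.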
Figure 1: Representation of fault domains levels (Source: Microsoft).
Virtual Machine Load Balancing
Despite the name, VM load balancing is more of a resource management feature that makes sure every node is evenly loaded with roles. You have a choice of three aggressiveness levels, with thresholds ranging from 60% to 80% load on a node, beyond which balancing kicks in. Balancing runs automatically at set intervals, always on the lookout for imbalance in the system. And should a new node join the cluster, it is detected and integrated appropriately, taking load off the other nodes.
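Both the trigger and the aggressiveness are exposed as cluster properties; as a sketch, the values shown below reflect my understanding of the documented settings:

```powershell
# AutoBalancerMode: 0 = disabled, 1 = balance only when a node joins,
# 2 = balance on join and periodically (the default).
(Get-Cluster).AutoBalancerMode = 2

# AutoBalancerLevel: 1 = Low (act when a node exceeds ~80% load),
# 2 = Medium (~70%), 3 = High (~60%).
(Get-Cluster).AutoBalancerLevel = 3
```

Level 3 is the most aggressive choice, moving roles as soon as a node crosses the 60% threshold.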
The coolest thing is that all these great tools — site-awareness, fault domains and VM load balancing — are automatic. Once you set everything up exactly how you want it, the system will, for the most part, monitor and handle any non-catastrophic disruption itself.
Figure 2: Load Balancing aggressiveness levels (Source: Microsoft).
Storage Replica

To bind separate sites together and make them truly DR-ready, Microsoft has introduced the Storage Replica feature, an in-house tool that provides block-level replication between sites. There are two types of replication available: synchronous and asynchronous. Synchronous replication is for sites that are relatively close (around a 100-mile radius) with enough connection speed to accommodate continuous replication and zero data loss. Asynchronous replication is for sites stretched between distant cities, where some data loss is acceptable. Storage Replica uses the SMB3 protocol for data transfer, and with simplified SMB Multichannel and multi-NIC cluster networks, you can implement even more network redundancy and throughput.
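As a sketch of the workflow, you'd first validate the topology and then create a replication partnership; the server names, replication group names and drive letters below are all placeholders:

```powershell
# Validate the proposed replication topology and produce a report
# (server names, volumes and paths are hypothetical examples):
Test-SRTopology -SourceComputerName SR-SRV01 -SourceVolumeName D: `
    -SourceLogVolumeName E: -DestinationComputerName SR-SRV02 `
    -DestinationVolumeName D: -DestinationLogVolumeName E: `
    -DurationInMinutes 30 -ResultPath C:\Temp

# Create an asynchronous block-level replication partnership between the sites:
New-SRPartnership -SourceComputerName SR-SRV01 -SourceRGName RG01 `
    -SourceVolumeName D: -SourceLogVolumeName E: `
    -DestinationComputerName SR-SRV02 -DestinationRGName RG02 `
    -DestinationVolumeName D: -DestinationLogVolumeName E: `
    -ReplicationMode Asynchronous
```

Switching -ReplicationMode to Synchronous gives you the zero-data-loss behavior described above, provided the inter-site link can keep up.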
There is so much more to each of these features. I plan to cover the most important and interesting of them in future posts. Let me know in the comments below which features you're most interested in!