Using Amazon FSx for SQL Server Failover Cluster Instances
Intro
If you are considering deploying your own Microsoft SQL Server instances in Amazon EC2, you have some decisions to make regarding the resiliency of the solution. Sure, AWS will offer you a 99.99% SLA on your compute resources if you deploy two or more instances across different availability zones. But don't be fooled: there are a lot of other factors you need to consider when calculating your true application availability. I recently blogged about how to calculate your application availability in the cloud. You probably should have a quick read of that article before you move on.
When it comes to ensuring your Microsoft SQL Server instance is highly available, it really comes down to two basic choices: Always On Availability Group (AG) or SQL Server Failover Cluster Instance (FCI). If you are reading this article, I'm assuming you are well aware of both of these options and are seriously considering using a SQL Server FCI instead of a SQL Server Always On AG.
Benefits of a Microsoft SQL Server Failover Cluster Instance
The following list summarizes what AWS says are the benefits of a SQL Server FCI:
Challenges with FCI in the Cloud
Of course, the challenge with building an FCI that spans availability zones is the lack of the shared storage device that is normally required when building a SQL Server FCI. Because the nodes of the cluster are distributed across multiple datacenters, a traditional SAN is not a viable option for shared storage. That leaves us with two choices for cluster storage: third-party storage resources like SIOS DataKeeper or the new Amazon FSx. Let's take a look at what you need to know before you make your choice.
Buyer Beware!
Before you decide to use FSx, you must take the following limitations into consideration.
Service Level Agreement
As I wrote in how to calculate your application availability, your overall application SLA is only as good as your weakest link. In this case, the FSx SLA of 99.9% is your weakest link!
Normally 99.99% availability represents the starting point of what is considered "highly available". This is what AWS promises you for your compute resources when two or more are deployed in different availability zones.
In case you didn't know the difference between three nines and four nines...
- 99.9% availability allows for 43.83 minutes of downtime per month
- 99.99% availability allows for only 4.38 minutes of downtime per month
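For anyone who wants to check those figures or try other SLA values, here is a minimal sketch of the arithmetic. It assumes an average month of 365.25 / 12 days, which is what the 43.83 and 4.38 minute figures imply; the function name is just for illustration.

```python
# Average month length in minutes: 365.25 days / 12 months * 24 h * 60 min
MINUTES_PER_MONTH = 365.25 / 12 * 24 * 60


def monthly_downtime_minutes(availability_pct: float) -> float:
    """Downtime allowed per average month for a given availability percentage."""
    return (1 - availability_pct / 100) * MINUTES_PER_MONTH


for pct in (99.9, 99.99):
    print(f"{pct}% -> {monthly_downtime_minutes(pct):.2f} minutes/month")
# 99.9% -> 43.83 minutes/month
# 99.99% -> 4.38 minutes/month
```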
By hosting your cluster storage on FSx you are effectively negating the benefit of your 99.99% compute availability, leaving you with just a 99.9% overall application availability.
In contrast, with a solution like SIOS DataKeeper, you would have to experience a simultaneous failure of two EBS volumes in two different availability zones before you experienced downtime. Assuming a single EBS volume has a 99.9% SLA, the statistical probability that at least one of the EBS volumes will be online at any given time is 99.9999%.
1 - (.001 * .001) = .999999
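The same reasoning can be sketched in a few lines of Python. This is only a rough model: it assumes component failures are independent, which real outages do not always honor, and the helper names are made up for this example. Redundant components are combined "in parallel" (at least one must be up), while dependencies are combined "in series" (all must be up).

```python
def parallel(*availabilities: float) -> float:
    """Availability when at least one of the redundant components must be up."""
    all_down = 1.0
    for a in availabilities:
        all_down *= (1 - a)
    return 1 - all_down


def series(*availabilities: float) -> float:
    """Availability when every component in the chain must be up."""
    all_up = 1.0
    for a in availabilities:
        all_up *= a
    return all_up


# Two mirrored EBS volumes, each assumed to have a 99.9% SLA
ebs_pair = parallel(0.999, 0.999)

# 99.99% compute depending on 99.9% FSx storage
compute_on_fsx = series(0.9999, 0.999)

print(f"Mirrored EBS pair: {ebs_pair:.6f}")
print(f"Compute on FSx:    {compute_on_fsx:.6f}")
```

Under the independence assumption, the series result lands slightly below the 99.9% of the weakest link, so treating the weakest link as the overall figure is, if anything, a little generous.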
Costs
Assuming the dismal SLA of FSx didn't scare you away, let's take a close look at the costs associated with the solution compared to the SIOS DataKeeper solution. Your costs will vary greatly depending upon your requirements, but once you determine the amount of storage, the speed and the latency you hope to achieve, AWS has a handy calculator that you can use to compare the solutions. In the example below I provisioned what I consider to be pretty typical of what I see in the real world. Of course, if you opt for the DataKeeper and EBS solution you have to add the cost of DataKeeper to the solution, but even the most expensive pay-as-you-go option ($0.50 * 2 * 730 = $730) still puts the solution at ~60% less than a comparable FSx solution.
Storage Location
When configuring FSx for high availability, you will want to enable multi-AZ support. By enabling multi-AZ, you effectively have a "preferred" AZ and a "standby" AZ. When you deploy your SQL Server FCI nodes you will want to distribute those nodes across the same AZs.
Now in normal situations, you will want to make sure the active cluster node resides in the same AZ as the preferred FSx storage node. This is to minimize the distance and latency to your storage, but also to minimize the costs associated with data transfer across AZs. As specified in the FSx price guide, "Standard data transfer fees apply for inter-AZ or inter-region access to file systems."
Unfortunately, there is currently no way to tie both the storage and compute together, such that if one or the other fails, the other fails over as well to minimize the latency and to ensure no additional costs are incurred for accessing the data. Currently the cost for transferring data across AZs, both ingress and egress, is $0.01/GB.
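To get a feel for what a mismatch could cost, here is a back-of-the-envelope sketch. It assumes each gigabyte that crosses AZs is billed $0.01 on the way out of one AZ and $0.01 on the way into the other, and the 5,000 GB monthly I/O volume is a made-up figure. Plug in your own numbers and check current AWS pricing.

```python
CROSS_AZ_RATE_PER_GB = 0.01  # USD per GB, per direction (rate quoted above)


def monthly_cross_az_cost(gb_transferred: float, directions_billed: int = 2) -> float:
    """Estimated monthly charge when SQL Server and FSx run in different AZs.

    directions_billed=2 assumes both the sending and the receiving side
    of each transfer are charged.
    """
    return gb_transferred * CROSS_AZ_RATE_PER_GB * directions_billed


# Hypothetical workload: 5,000 GB of database I/O per month
print(f"${monthly_cross_az_cost(5000):.2f} per month")
```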
Without keeping a close eye on the state of your FSx and SQL Server FCI, you may not even be aware that they are running in different availability zones until additional latency is noticed or until you get an unexpected data transfer charge at the end of the month.
In contrast, in a configuration that uses SIOS DataKeeper, the storage failover is part of the SQL Server FCI recovery, ensuring that the storage always fails over with the SQL Server instance. This ensures SQL Server is always reading and writing to the EBS volumes that are directly attached to the active node.
Controlling Failover
In an FSx multi-AZ configuration there is a preferred availability zone and a standby availability zone. Should the FSx file server in the preferred availability zone experience a failure, the file server in the standby AZ will take over. AWS reports that this recovery takes about 30 seconds.
Unfortunately, a 30-second failure of the storage could also cause the SQL Server isAlive resource check to fail if it runs at the same time that a storage failover is occurring. It's a little hit or miss here, as the isAlive check is scheduled to run every 60 seconds, so you may miss the outage window or you may not.
To make matters worse, FSx multi-AZ has automatic failback enabled, meaning that for every unplanned failover of FSx, you also have to deal with an unplanned failback, doubling your unplanned downtime. In contrast, when a SQL Server FCI experiences an unplanned failover you would typically either just leave it running on the secondary or schedule a failback after hours or during the next maintenance period.
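To put rough numbers on that "hit or miss", here is a simplified estimate. It assumes the storage outage starts at a random moment relative to the isAlive schedule, that the check is instantaneous, and that a single check landing inside the outage window is enough to fail the resource. Real cluster health-check behavior is more nuanced, so treat this as an illustration only.

```python
def hit_probability(outage_seconds: float, check_interval_seconds: float) -> float:
    """Chance that a periodic check lands inside one outage window."""
    return min(1.0, outage_seconds / check_interval_seconds)


# One ~30 second storage failover against a 60 second isAlive interval
single = hit_probability(30, 60)

# Failover followed by the automatic failback: two independent windows
either = 1 - (1 - single) ** 2

print(f"One failover:           {single:.0%}")   # 50%
print(f"Failover plus failback: {either:.0%}")   # 75%
```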
If you WANT to initiate a planned switchover of the FSx file server, there is no easy button or command to run to cause a switchover. Instead, there is a workaround where you have to change the throughput capacity of the FSx file system, which will cause the FSx service to fail over to the standby node. To move it back, you will once again have to change the throughput to cause a failback.
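For those who would rather script the workaround, here is a minimal sketch using the boto3 FSx client. The file system ID and throughput values are placeholders, the helper name is invented for this example, and whether a given change actually produces the switchover described above should be verified in a test environment first. The function takes the client as a parameter so it can be exercised without AWS credentials.

```python
def request_throughput_change(fsx_client, file_system_id: str, new_throughput_mbps: int) -> int:
    """Ask FSx to change throughput capacity; returns the previous value (MB/s).

    fsx_client is expected to behave like boto3.client("fsx").
    """
    described = fsx_client.describe_file_systems(FileSystemIds=[file_system_id])
    current = described["FileSystems"][0]["WindowsConfiguration"]["ThroughputCapacity"]

    if new_throughput_mbps == current:
        raise ValueError("New throughput must differ from the current value")

    fsx_client.update_file_system(
        FileSystemId=file_system_id,
        WindowsConfiguration={"ThroughputCapacity": new_throughput_mbps},
    )
    return current


# Example usage (placeholder ID, requires boto3 and AWS credentials):
#   import boto3
#   previous = request_throughput_change(boto3.client("fsx"), "fs-0123456789abcdef0", 64)
#   ...later, to trigger the move back, change the throughput again:
#   request_throughput_change(boto3.client("fsx"), "fs-0123456789abcdef0", previous)
```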
SQL Server Analysis Services Cluster Not Supported With FSx
If you want to include SSAS in your cluster, I'm afraid you will not be able to use FSx. The How to Cluster SQL Server Analysis Services white paper clearly states that SMB cannot be used and that cluster drives with drive letters must be used. In contrast, the DataKeeper Volume resource presents itself as a clustered disk and can be used with SSAS.
Network Saturation
When using SMB storage, every read and write has to go across the network. All of this traffic competes with client access traffic and counts towards the overall EC2 instance network utilization. Each EC2 instance size has a cap on how much network bandwidth is allocated to that instance. The bigger the instance size, the more network bandwidth is allocated to the instance. You must be sure that the combination of storage traffic and client traffic does not exceed the network bandwidth allocated to your EC2 instance type. In some scenarios you may be forced to increase your instance size to accommodate the extra traffic associated with using SMB storage. To complicate matters, network throughput on some EC2 instance sizes is not guaranteed. It is only specified as "up to" a certain amount, meaning the instance is capped at that amount for sure, but at certain times it might have access to something less than the specified maximum.
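A quick sanity check along those lines might look like the sketch below. All of the bandwidth figures are hypothetical; substitute the limit published for your instance type and throughput you have actually measured. For instance sizes rated "up to" a value, it is safer to plug in a sustained figure lower than the advertised maximum.

```python
def network_headroom_gbps(instance_limit_gbps: float,
                          storage_gbps: float,
                          client_gbps: float) -> float:
    """Bandwidth left over after SMB storage traffic and client traffic."""
    return instance_limit_gbps - (storage_gbps + client_gbps)


# Hypothetical instance capped at 5 Gbps, with 3.5 Gbps of peak
# storage I/O over SMB and 2 Gbps of peak client traffic
headroom = network_headroom_gbps(5.0, storage_gbps=3.5, client_gbps=2.0)

if headroom < 0:
    print(f"Over budget by {-headroom:.1f} Gbps: consider a larger instance size")
else:
    print(f"{headroom:.1f} Gbps of headroom")
```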
Summary
While FSx certainly can make sense for typical SMB uses like Windows user files and other non-critical services, the less than stellar SLA of only 99.9% falls far short of the 99.99% SLA commonly considered the baseline for high availability. The reason you build a SQL Server FCI that spans availability zones is to achieve a 99.99% availability SLA. As soon as you attach FSx storage as a dependency to the cluster, your 99.99% SLA goes out the window and you now have an SLA for your cluster of 99.9%, or almost 44 minutes of downtime per month.