Building a production-ready Hyper-V Cluster on the cheap

My most recent project was to improve the resiliency and uptime of a server farm consisting of ~18 virtual servers spread across two standalone Hyper-V hosts. The objective was to convert the two hosts (Dell PowerEdge T710 servers) into a 2-node Hyper-V cluster as inexpensively as possible while still maintaining solid performance and leaving headroom for moderate growth.

The workloads being virtualized consist of four Exchange 2010 servers (a 2-node DAG and two CAS/HT servers running NLB), a couple of file servers, and a hodgepodge of domain controllers, SQL, and web servers, with a few Citrix XenApp servers thrown in (most of the Citrix servers are physical boxes). The network consists of three distinct internal networks brokered by Juniper firewalls, named DMZ, Trust, and Vault.

When it comes to clustering on 2008/2008 R2, the storage solutions that get the most publicity are Fibre Channel and iSCSI. Both offer the ability to connect multiple nodes and, when configured properly, can perform very well. However, both require quite a bit of hardware and tweaking to do so – including a redundant set of FC or GigE switches (to avoid a single point of failure) and, in the case of iSCSI, further network tuning for performance. There is a third shared-storage option that just doesn't get talked about very much: shared SAS.

Shared SAS

Serial Attached SCSI is the successor to the older parallel SCSI interface we all knew and loved (and whose command set is still used in SAS, FC, and even SATA). SAS performs well (even the older 3Gb/s spec) and is relatively inexpensive, and while parallel SCSI is no longer supported for shared-storage clusters under Server 2008 and beyond, shared SAS is fully supported!

SAN budgets can vary wildly depending on the target application, storage types, and fancy features such as snapshots and thin provisioning, but even the cheapest redundant solutions quickly get you into the thousands of dollars, if not tens of thousands. You can set up software-based solutions such as StarWind or SANmelody on commodity hardware you have lying around (and I have done this with great success for lab work), but building those into redundant SANs requires expensive licenses that can rival the cost of turnkey solutions.

eBay to the rescue

Ultimately we decided on purchasing a used Dell MD3000 shared-SAS SAN that still had a year of 4-hour-response warranty left on it. The 3U MD3000 is still sold by Dell despite being upstaged by its 6Gb/s brother, the MD32xx. While the newer model can only hold twelve 3.5″ drives, the MD3000 can handle fifteen. The real sweetener on this find was that the SAN came with eight 450GB 15k RPM SAS drives. To buy this MD3000 redundant configuration new with the eight disks and warranty would have cost us over $20,000. All told (including the four SAS 5/E adapters and cables) we paid under $6,000. When next July rolls around we will have to re-up on the warranty coverage, but that is still a hefty savings (I know some of you are used to working with SAN budgets that exceed our entire annual operating budget, but this was big savings for us!).

Networking

After storage is nailed down, the next big pain point of clustering (especially with Hyper-V) tends to be figuring out the best networking configuration. Spending a lot of money and effort on a cluster to improve uptime does little good if something as simple as a failed Ethernet cable or switch port can bring down a VM. Strangely enough, this is exactly the case – a NIC or switch failure that breaks connectivity for a VM is not something a Hyper-V cluster can detect and work around. So your options are either to script elaborate routines that check network connectivity and migrate the affected virtual machines, or to turn to NIC teaming.

The Sorrowful Saga of NIC Teaming and Hyper-V

To the surprise of many, Windows has never had a native way to aggregate NICs. Microsoft has always left teaming solutions (and support of them) to the hardware vendors. Both Broadcom and Intel have stepped in, with varying levels of success. Typically teaming is very easy to set up, but virtualization can introduce some quirks that need to be understood and worked around. At this time, both Broadcom and Intel offer support for teaming under Hyper-V. As always, however, your mileage may vary. I have read way too many horror stories about both vendors' solutions, but as of this writing Intel seems to have the edge and has worked out the kinks (revision 15 PROSet drivers and later). Intel also offers a special teaming mode called Virtual Machine Load Balancing (VMLB), which is very handy, providing both load balancing and failover at the VM level.

Our T710s came with an integrated quad-port Broadcom NIC and a PCIe quad-port Intel NIC. The cluster teaming plan called for more ports to achieve full redundancy, so each server was outfitted with an additional quad-port Intel NIC ($200 each on eBay), as well as another single-port Intel PCIe NIC, for a lucky total of 13 ports per node.

What could you possibly need all those NIC ports for?

The 2008 R2 revision of Hyper-V clustering introduces a number of important network-based features and services that increase the need for dedicated GigE interfaces. The one most people have heard about is Live Migration – Microsoft's answer to VMware's vMotion. Live Migration allows you to seamlessly transition a running virtual machine from one cluster node to another. This is obviously incredibly handy for managing server load, as well as for shifting workloads off of a physical host so it can be taken offline for hardware or software maintenance. Production-ready Live Migration should have a dedicated GigE NIC (10GigE works great too if you have deep pockets).
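As a quick aside for later: once the cluster is built (the full procedure is near the end of this post), a live migration can be kicked off from PowerShell as well as from the Failover Cluster management tool. Here is a minimal sketch using the FailoverClusters module that ships with 2008 R2 – the VM and node names are placeholders, not our actual machine names:

```powershell
# Load the failover clustering cmdlets (included with Server 2008 R2)
Import-Module FailoverClusters

# Live migrate a clustered VM to the other node.
# "FILE01" and "HV-NODE2" are placeholder names for this example.
Move-ClusterVirtualMachineRole -Name "FILE01" -Node "HV-NODE2"
```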

Next is Cluster Shared Volumes (CSV) – a handy feature in 2008 R2 that allows you to store multiple virtual machines on a single LUN (trying to provision per-VM LUNs is a headache I don't even want to think about). In our case, the use of CSVs allowed us to take our two physical disk arrays on the SAN (an 8-drive RAID 10 and a 5-drive RAID 5) and format each as a single LUN. I won't go into too much detail here on CSV as there are a million blogs out there with great information. CSV should also have a dedicated NIC available to it, as CSV can redirect storage I/O over the network if a node loses its connection to the shared storage (there must be a performance penalty for this, but I suppose it is good enough to limp along until you can live migrate the workloads to another fully functional node – which is why LM and CSV should have separate NICs to work with).
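I did the CSV setup from the Failover Cluster management tool (see the procedure below), but for reference it can also be done from PowerShell once the cluster exists. A rough sketch, assuming the default wizard-assigned disk names – treat the names as placeholders:

```powershell
Import-Module FailoverClusters

# On 2008 R2, CSV is a one-time opt-in per cluster. This sets the cluster's
# EnableSharedVolumes property; the same thing can be done by ticking the
# box in Failover Cluster Manager.
(Get-Cluster).EnableSharedVolumes = "Enabled"

# Promote the clustered disks backing the two arrays to CSVs.
# "Cluster Disk 1"/"Cluster Disk 2" are the default wizard-assigned names.
Add-ClusterSharedVolume -Name "Cluster Disk 1"
Add-ClusterSharedVolume -Name "Cluster Disk 2"

# The volumes then appear under C:\ClusterStorage\ on every node.
Get-ClusterSharedVolume
```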

The nodes themselves need a NIC for basic network connectivity and administration, so now we're up to three NIC ports per node and we haven't even talked about the virtual machines yet! As previously mentioned, the cluster is pointless if a failed NIC or switch port can bring down a VM, so we decided to implement Intel NIC teaming using the VMLB mode. We carved up our two Intel quad-port NICs into three teams (one for each of our aforementioned networks): TeamV for Vault (4 ports), TeamT for Trust (2 ports), and TeamD for DMZ (2 ports). This setup provides at least two ports per network to guard against both NIC and switch failure, and avoids the need for configuring VLANs (which is a whole other mess for another blog post). We do have VLANs configured on our Cisco switches, but only for trunking purposes – I like to avoid complicating the Hyper-V configuration, as it is far easier to train administration staff without having them learn the ins and outs of networking at that level. If you prefer, you could certainly cable all ports into one giant team and use VLANs to separate out the traffic. At any rate, we're now up to 11 NIC ports used (LM, CSV, host administration, and 8 teamed VM ports).
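A small aside on those dedicated (non-teamed) ports: the two crossover links get static, non-routed addresses with no gateway and no DNS (see step 2 of the procedure below), which lets the cluster wizard classify them as cluster-only networks later on. A minimal sketch using netsh from an elevated PowerShell prompt on node 1 – the interface names are placeholders for whatever those ports are called in Network Connections, and node 2 would get the matching .2 addresses:

```powershell
# Assign private addresses to the crossover links dedicated to Live
# Migration and CSV traffic. "LiveMigration" and "CSV" are placeholder
# interface names; no default gateway or DNS is configured on purpose.
netsh interface ipv4 set address name="LiveMigration" static 10.10.10.1 255.255.255.0
netsh interface ipv4 set address name="CSV" static 10.10.11.1 255.255.255.0
```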

Not all VMs need be highly available

While most of our VMs are fairly critical and will be made highly available across the cluster, certain machines are not – either because they just aren't as important, or because the services they provide offer high availability in other ways. For example, each of our nodes has a domain controller Hyper-V guest configured to run from local storage (not part of the cluster). As long as domain clients are configured with both DCs in their DNS settings, they can deal with one or the other going down for a while. Similarly, each of our nodes has an Exchange 2010 mailbox server (running in a DAG) and an Exchange 2010 CAS/HT server. This way, the failure of an entire physical node does not derail DC or Exchange services. It also allows us to play around with fancy storage options for the Exchange servers, but I will save that for another post…

I mention Exchange because we use Windows Network Load Balancing to provide high availability across our two CAS servers – and NLB causes port flooding on switches (by design). By dedicating one of our NIC ports to each node's CAS server, we are able to limit the port flooding on the Cisco switches to only the ports those dedicated NICs plug into. There are likely more sophisticated solutions to this problem, but this is what we came up with (we had to use those spare Broadcom ports for SOMETHING!).

Our count is now up to 12 NIC ports, but I happened to have a couple of spare Intel PCIe NICs lying around, so I figured it wouldn't hurt to add one more. This allows me to set up a Broadcom team to provide switch redundancy for each node's admin IP, which could always be reconfigured to take over in the event of one of the primary cluster links going down – so 13 it is! Below is a picture of the color-coded NIC cabling (the two short orange cables are crossovers for the LM and CSV links) – I have not attacked it with zip ties yet! You can also see each node's two SAS adapters/cables in the photo.

Putting it all together

Now that the hardware part of things was all set, it was time to adjust the software side.  The procedure basically broke down like this:

  1. Join each node to the domain (they were previously not domain members to avoid dependence on the virtual domain controllers), configuring DNS to point first to the other node’s virtual DC and second to an offsite domain controller with a reliable WAN link.
  2. Install the latest NIC drivers and configure the teams.  Configure the private networks as desired (for example, I have my crossover networks set to 10.10.10.x and 10.10.11.x – see the addressing sketch earlier in this post).
  3. Remove all the VHDs from the VMs to be made highly available, then export their configurations and copy the VHDs to cluster storage (this step can be done in many ways, but I prefer manually copying the VHDs instead of having them tied to the Hyper-V export process; it saves time if you end up having to redo the import for some reason).
  4. Delete the existing virtual networks and create new ones using the NIC team interfaces, ensuring that all virtual networks to be used by highly available VMs are named identically across the nodes.
  5. Install the Failover Clustering feature and use its administration tool to run the Cluster Validation wizard – hopefully this will pass!  (A rough PowerShell sketch covering several of these steps follows this list.)
  6. Go ahead and create the cluster, and the wizard should be smart enough to set up the proper cluster networks (three of them in my case: the two privately addressed interfaces and the admin interface) – the virtual networks have nothing bound to them other than the Hyper-V virtual switch protocol, so they are ignored by the cluster wizard.
  7. Add any storage disks that were not automatically added by the wizard (the wizard picked up all three of mine).
  8. Using the Failover Cluster management tool, enable CSVs and turn the appropriate storage disks into CSVs.
  9. Using the Hyper-V management tool, import the exported VM configurations onto the CSVs as desired, copy or move the VHDs into place, re-attach them in each VM's configuration, and specify the appropriate virtual network(s).
  10. In Failover Clustering management, configure a new service or application, choose Virtual Machine, and select all the VMs you want to make highly available – they should pass validation!
  11. Fire up the VMs, and then attempt to Live Migrate one or two over to the other node.  This worked like a charm for me…  The only quirk I ran into was a Live Migration failure after editing a VM's settings from the Hyper-V management tool – it would appear that any settings changes need to be made from within the Failover Clustering tool, not the Hyper-V tool, to avoid this problem!
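For the command-line inclined, here is a rough PowerShell sketch of steps 5, 6, 7, and 10 using the FailoverClusters module in 2008 R2. The cluster, node, and VM names (and the static address) are placeholders rather than our real ones, and the CSV setup and VM imports were still done from the GUI as described above:

```powershell
Import-Module FailoverClusters

# Step 5: run the full validation suite against both nodes
Test-Cluster -Node "HV-NODE1","HV-NODE2"

# Step 6: create the cluster with a static administration address
# (placeholder address - use whatever fits your admin network)
New-Cluster -Name "HV-CLUSTER" -Node "HV-NODE1","HV-NODE2" -StaticAddress 192.168.1.50

# Step 7: add any eligible shared disks the wizard missed
Get-ClusterAvailableDisk | Add-ClusterDisk

# Step 10: make the imported VMs highly available
# ("FILE01" and "SQL01" are placeholder VM names)
Add-ClusterVirtualMachineRole -VirtualMachine "FILE01"
Add-ClusterVirtualMachineRole -VirtualMachine "SQL01"

# Step 11 can then be exercised with Move-ClusterVirtualMachineRole,
# as shown in the Live Migration sketch earlier in the post.
```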

The entire procedure took roughly 13 hours, though a decent chunk of that time was spent simply waiting for large VHDs to copy and dealing with unexpected surprises, such as the fact that the MD3000 shipped with the wrong rail kit (despite having the proper part # on the box) – nothing a pair of pliers wasn't able to fix…  The only single point of failure left is the Ethernet drop the colo provides, and in the near future we may add a second drop to remove that last potential failure point.

Thanks for reading!

12 thoughts on “Building a production-ready Hyper-V Cluster on the cheap”


  1. Hi,

    Do you think it may be possible to build a 4-node cluster with an MD32xx SAS array?
    Dual SAS HBAs connected to the dual 4-port controllers?
    Thanks!

    • Yes, it is possible, however you don’t get redundancy in that case…

      Update: rereading this and yes, I believe it is possible with the md32xx since it has 4 ports per controller – so you could have 4 redundantly connected hosts…

      • In production now for over a year: a 4-node cluster (2 SAS HBAs per node) on a dual-controller MD3200 SAS (with an MD1200 expansion, 10 TB total). Tested multipath failover, controller failure/online updates, … absolutely no problems. Stable, fast and very cheap!

  2. Does VMLB work across multiple switches without loss when failing over? The docs state it protects against switch port failure, but they don’t mention switch redundancy the way the SFT docs do – SFT mentions it explicitly and doesn’t require stacking.

    • My understanding is that yes, a switch failure would cause the downed VMs to find a new home on an active adapter.

    • Hi Stephen, I’m not quite following your question – shared storage arrays such as the MD3xxx are designed to make storage simultaneously available to multiple nodes…

  3. UPDATE: Our clusters are now running on Windows Server 2012, which makes life a lot easier… Native NIC teaming is available directly in the OS, and Microsoft has finally fixed the ridiculous import/export fiasco, so you can easily register any VM just by pointing to its configuration folder. We also upgraded to a Dell MD3200 chassis which, while it can only hold 12 drives, can service 4 fully redundant hosts and is also significantly faster (6Gb/s SAS instead of 3Gb/s, and improved controllers).

      • Yes, works just fine with the MD3000. Load the software for the MD3200 and it happily administers both arrays under Win 2012.
