
Wednesday, May 27, 2009

When your cluster goes “oops,” Using RecoverCMS

First, a quick note:  I’m posting this one from Windows Live Writer on Windows 7 RC1, which I’m happy to say is remarkably stable and much faster overall than Vista.  I’d recommend it wholeheartedly!

Funny story: I once had a client who swore that clustering was enough protection for their messaging environment, until an outage took out their entire cluster at once, causing them to be down for about a week. Now, that's not the funny part, but what caused the outage is somewhat hilarious; more on that later.

Exchange 2003 and earlier had a pretty straightforward method for recovering an entire MSCS cluster if one had failed on you. You built one or more nodes of a brand new cluster, created an Exchange Virtual Server (EVS) resource group with the same parameters (names, IPs, etc.) as the production system had, and Exchange would do the rest.

With Exchange 2007, the rules changed significantly, leaving many cluster users confused as to how the system now works if they suffer a cataclysmic failure of the production cluster. Adding both Single Copy Cluster (SCC) and Cluster Continuous Replication (CCR) to the mix just makes things more confusing, so Microsoft created a new recovery method for Exchange 2007 clusters. Called RecoverCMS, the system is really a setup task rather than a true failover system, but since your failover system just went belly-up, that's not a bad thing.

If your Recovery Time Objectives are flexible enough to handle some downtime if an entire cluster fails, then you can leverage this system to get back up and running, either at the original production site or at a new location. There are some definite limits to what you can do with it, which I'll explain later, but the basics of how it works are pretty simple.

Step one is to rebuild, repair, or replace the original cluster hardware. If the repair works, then you're done; just restore any missing data from tape or other backup (due disclaimer, see below: I am biased on backup tools) and then resume normal operations. If you rebuild or replace completely, bring up a new server that is configured with Exchange 2007 in the Passive Cluster Node configuration.  You can find out how to do that:

Here for CCR clustering, or

Here for SCC clustering
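
Strictly as an illustration (the guides linked above have the real step-by-step, and the prerequisites differ between CCR and SCC), getting the Exchange 2007 Mailbox role binaries onto a node that has already been joined to the Windows failover cluster comes down to running Exchange setup on that node, something like:

Setup.com /mode:Install /roles:Mailbox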

During that process you will also have installed the Exchange 2007 binaries on at least one node of the cluster system, so go to the directory that has the Exchange setup files and execute the following command:

Setup.com /recoverCMS /CMSName:<name> /CMSIPaddress:<ip>

Where <name> is the name of the original EVS you're recovering and <ip> is the IP address you want the recovered system to have, which in theory is the same IP the original EVS had.
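
For example, if the failed clustered mailbox server had been named EXCLUS01 with the IP address 10.0.1.25 (both values hypothetical), the recovery command would be:

Setup.com /recoverCMS /CMSName:EXCLUS01 /CMSIPaddress:10.0.1.25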

The rest of the procedure is pretty automated, and when finished, you will have a new EVS running on your new cluster node(s) that matches the original EVS and has all the users already assigned to it.  From there, you can restore your data if it was also lost to the disaster.

There are a few things that are extremely important to be aware of before you begin:

1 – Keep in mind that /recoverCMS is designed to restore a failed cluster only.  Attempting to use it for migration or for any other purpose will result in unpredictable behavior and is not supported by MSFT.

2 – You will need to manually create the volumes that existed on the failed cluster before you run /recoverCMS; if any volumes are missing, the recovery will fail.  They don't have to be on the same physical disks or the same size, just large enough to hold the data and assigned the same drive letters the original cluster used.

3 – The System Attendant service will start and then immediately stop after you recover.  This is normal; just bring the resource back online when you're ready.

4 – Your databases are not mounted after a recovery; you must mount them manually through PowerShell or the Exchange Management Console once you're done with the restore.  The sketch after this list shows the shell commands for this and for the other post-recovery steps in items 3 and 7.

5 – Do NOT try to use this across operating system versions. If you started on Windows Server 2003, you must recover to Windows Server 2003, and 2008 to 2008; it will not work if you try to go from one to the other.

6 – While you can pre-configure many portions of this system, it will still take some time to run through a /recoverCMS procedure from start to finish, so if you need a second-stage failover, /recoverCMS isn’t the best bet.  I’m quite biased on this (see disclaimer below), but unless you can be down for a few hours if both cluster nodes fail, you might want to go with another tool to provide remote site failover in addition to SCC or CCR clustering.

7 – Finally, SCR and CCR will not automatically work with /recoverCMS.  You will need to stop SCR (Standby Continuous Replication) if it's running before you recover, and neither will resume automatically after the recovery is done.  Once you're set up in the new node configuration, re-enable CCR and SCR manually as required.
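
As promised in item 4, here's a minimal sketch of the post-recovery steps in the Exchange Management Shell. The server, storage group, and SCR target names (EXCLUS01, First Storage Group, SCRTARGET01) are placeholders I made up, and your storage group layout will differ, so treat this as an outline rather than something to paste in verbatim:

# Bring the recovered clustered mailbox server (and with it the System Attendant resource) online
Start-ClusteredMailboxServer -Identity EXCLUS01

# /recoverCMS leaves the databases dismounted; mount them once any data restores are finished
Get-MailboxDatabase -Server EXCLUS01 | Mount-Database

# Replication does not resume on its own. Check copy status, re-enable SCR targets (SCR needs SP1),
# and resume CCR copies, reseeding with Update-StorageGroupCopy first if the copy is unusable
Get-StorageGroupCopyStatus -Identity "EXCLUS01\First Storage Group"
Enable-StorageGroupCopy -Identity "EXCLUS01\First Storage Group" -StandbyMachine SCRTARGET01
Resume-StorageGroupCopy -Identity "EXCLUS01\First Storage Group"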

/RecoverCMS is a great way to restore a failed cluster system to new hardware or rebuilt hardware after a fault.  You still need to back up your data to some device outside the cluster itself, but once you have that backup /recoverCMS can get your cluster back up and running much faster than the manual methodologies used in previous versions of Exchange.

As to the funny story I mentioned at the top of the blog, this particular client was in a hardened datacenter with UPS systems, 24/7 staff, and a backup generator.  They were convinced that clustering was going to be more than enough for them.  After trying to explain that a shared-disk cluster (the only option at the time) had weak points, I finally gave up and let them be.  A few months later I got a great phone call.  Apparently, unbeknownst to the client, the datacenter crew had run all power connections through the UPS, including the generator.  The UPS was rated to handle the full power load of the datacenter on one of its two redundant circuit loops.  So far so good.  Well, this particular datacenter was in the middle of the dot-com boom (this was some time ago) and had grown exponentially in a short period of time.  What they had was well over half the full expected load on each of the two circuits, and one was failing.  So they diligently got replacement parts and moved the load over to the good circuit.  Since that load was over half the expected total, and circuit 2 was already carrying over half the load itself, they immediately overloaded the UPS, shorting it out.  The way it was explained to me, a solenoid shot through the casing of the UPS, and there was indeed a nice hole in the unit to back that up.  No one was hurt, but needless to say, the whole datacenter was offline until they replaced the UPS four days later, so they lost about one business week without anything happening to the physical cluster at all.  Just goes to show you that anything that can go wrong, will.

posted by Mike Talon
