
Tuesday, September 15, 2009

Get back to where you once belonged (Failover Cluster version)

In honor of the re-release of the Beatles catalog all over the world (games, CDs, maybe iTunes at some point), I took the title of today’s post from their song “Get Back” on the album Let It Be (Remastered).

I am, of course, going to tie this to something in Exchange; specifically, Exchange 2007 standby clustering. Standby clustering refers to using a replication engine (like native CCR or a 3rd-party product like Double-Take Availability – see disclaimer below) to place a copy of the data for the Storage Groups of the production cluster onto a secondary cluster.  Once the data is replicated, you can use the /RecoverCMS command to recreate the production Clustered Mailbox Servers (CMSs) on that secondary cluster.
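As a rough sketch of what that looks like (the CMS name and IP address here are placeholders, not values from any real environment), the recovery step is run from Setup.com on a node of the standby cluster:

```
rem Run on a node of the standby cluster that already has the
rem passive clustered mailbox role installed.
rem EXCMS01 and 192.168.10.25 are placeholder values.
Setup.com /RecoverCMS /CMSName:EXCMS01 /CMSIPAddress:192.168.10.25
```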

The solution set for bringing up the Storage Groups and CMSs on another physical cluster, set up in the same or another location, is fairly well established.  If a single node fails on a production cluster, other nodes take over the failed Storage Groups and work resumes in a very automated fashion.  If multiple nodes, or the entire cluster, fail, you use /RecoverCMS and the associated procedures to manually get everything working on another system – so long as a copy of the data exists to work from.

The problem has traditionally been best expressed by the phrase, “And then what?”

If the original cluster failed completely, the answer was simple: rebuild the systems with the same node names, but prepare them as though they were a new /RecoverCMS target.  However, if you have not lost the production systems, and they’re stable enough to be used again, you would still have to reinstall them without some additional help.  The most common reasons for this kind of outage are routine testing of the failover systems and extended power failures that generators and UPS systems can’t handle.

Microsoft does offer a command to fix this particular problem, but it is not well known or publicized.  As a matter of fact, during a recent client troubleshooting session, we had a couple of techs from Microsoft on the phone (Premier Support in this case) and they were not aware of this particular method for cluster restoration.

Once you have fixed whatever went wrong, and if your production cluster is still viable (and suitably stable for continued use), you can use a command called /ClearLocalCMS to remove the original CMS entries from the original production cluster.  Doing so is not without risk, and you should familiarize yourself with this KB article on the subject before you try it.

/ClearLocalCMS removes the CMS components from the original production nodes, cleans up Active Directory, and disables the virtual computer object for the original cluster’s CMS.  This ensures that Exchange doesn’t accidentally address the original cluster, even after the restore process begins.  Once the CMS has been cleaned up, you can restore the data using the same tools you used to get it over to the standby cluster in the first place.
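In rough form, and reusing the placeholder CMS name from the earlier example, the cleanup looks like this when run on a node of the original production cluster:

```
rem Run on a node of the original production cluster after the CMS
rem has been recovered elsewhere. This removes the local CMS resources,
rem cleans up Active Directory, and disables the old virtual computer object.
rem EXCMS01 is a placeholder value.
Setup.com /ClearLocalCMS /CMSName:EXCMS01
```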

To get back to your original servers, use the /RecoverCMS command in the opposite direction (from DR back to production) and then use the /ClearLocalCMS command to re-prepare your DR cluster for use in the next emergency.
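Putting the two together for the failback leg, again with placeholder names and addresses, the sequence is roughly:

```
rem 1. On a node of the original production cluster, bring the CMS back:
Setup.com /RecoverCMS /CMSName:EXCMS01 /CMSIPAddress:192.168.10.25

rem 2. On a node of the DR cluster, clear the now-stale CMS entries
rem    so that cluster is ready to act as the standby target again:
Setup.com /ClearLocalCMS /CMSName:EXCMS01
```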

Jumping between clusters is not an automated or easy process, but it does work correctly if you follow all the steps in both directions.  This pair of commands (/RecoverCMS and /ClearLocalCMS) lets you get back to where you once belonged, every time.


posted by Mike Talon
