beingexchanged: May 2009

Wednesday, May 27, 2009

When your cluster goes “oops,” Using RecoverCMS

First, a quick note:  I’m posting this one from Windows Live Writer on Windows 7 RC1, which I’m happy to say is remarkably stable and much faster overall than Vista.  I’d recommend it wholeheartedly!

Funny story: I once had a client who swore that clustering was enough protection for their messaging environment, until an outage took out their entire cluster at once – causing them to be down for about a week.  Now, that’s not the funny part, but what caused the outage is somewhat hilarious. More on that later.

Exchange 2003 and earlier had a pretty straightforward method for recovering an entire MSCS cluster if one had failed on you.  You built one or more nodes of a brand new cluster, created an Exchange Virtual Server (EVS) Resource Group with the same parameters (names, IPs etc.) as the production system had, and Exchange would do the rest.

With Exchange 2007, the rules changed significantly, leaving many cluster users confused as to how the system now works if they suffer a cataclysmic failure of the production cluster.  Adding both Single Copy Cluster (SCC) and Cluster Continuous Replication (CCR) to the mix just makes things more confusing, so Microsoft created a new recovery method for Exchange 2007 clusters.  Called RecoverCMS, the system is really a setup task rather than a true failover system, but since your failover system just went belly-up, that’s not a bad thing.

If your Recovery Time Objectives are flexible enough to handle some downtime if an entire cluster fails, then you can leverage this system to get back up and running, either at the original production site or at a new location.  There are some definite limits to what you can do with it, which I’ll explain later, but the basics of how it works are pretty simple.

Step one is to rebuild, repair or replace the original cluster hardware. If the repair works then you’re done: just restore any missing data from tape or other backup (due disclaimer, see below, I am biased on backup tools) and then resume normal operations. If you rebuild or replace completely, bring up a new server that is configured with Exchange 2007 in the Passive Cluster Node configuration.  You can find out how to do that:

Here for CCR clustering or,

Here for SCC Clustering

During that process you will also have installed the Exchange 2007 binaries on at least one node of the cluster system, so go to the directory that has the Exchange setup files and execute the following command:

Setup.com /recoverCMS /CMSName:<name> /CMSIPaddress:<ip>

Where <name> is the name of the EVS you’re restoring from, and <ip> is the IP address you want the recovered system to have – in theory the same IP as the original EVS had.
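If you script your rebuilds, it can help to assemble and sanity-check that command line before anyone pastes it into a console.  Here is a minimal Python sketch of that idea, not an official tool: the helper name `build_recover_cms_command` is hypothetical, it only formats the switches shown above, and it never runs setup itself.

```python
import ipaddress


def build_recover_cms_command(cms_name: str, cms_ip: str) -> str:
    """Return the Setup.com /recoverCMS command line as a string.

    Hypothetical helper: it formats only the switches shown in this post
    and checks that the inputs look sane. It does not execute anything.
    """
    if not cms_name or any(ch.isspace() for ch in cms_name):
        raise ValueError("CMS name must be non-empty and contain no whitespace")
    # ipaddress.ip_address raises ValueError if the address is malformed
    ip = ipaddress.ip_address(cms_ip)
    return f"Setup.com /recoverCMS /CMSName:{cms_name} /CMSIPaddress:{str(ip)}"


if __name__ == "__main__":
    # EVS1 and 10.0.0.25 are made-up example values
    print(build_recover_cms_command("EVS1", "10.0.0.25"))
```

Catching a mistyped IP address here is a lot cheaper than catching it halfway through a recovery.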

The rest of the procedure is pretty automated, and when finished, you will have a new EVS running on your new cluster node(s) that matches the original EVS and has all the users already assigned to it.  From there, you can restore your data if it was also lost to the disaster.

There are a few things that are extremely important to be aware of before you begin:

1 – Keep in mind that /recoverCMS is designed to restore a failed cluster only.  Attempting to use it for migration or for any other purpose will result in unpredictable behavior and is not supported by MSFT.

2 – You will need to manually create the volumes that existed on the failed cluster before you run /recoverCMS.  If volumes are missing then the recovery will fail.  They don’t have to be the same physical disk or size, just large enough to hold the data and with the same drive letters as the original cluster held.

3 – The System Attendant service will start and then immediately stop after you recover.  This is normal; just bring the resource back online when you’re ready.

4 – Your databases are not mounted after a recovery.  You must mount them manually through PowerShell or the Exchange Management Console after you’re done with the restore.

5 – Do NOT try to use this across OS versions. If you started on Windows Server 2003, you must recover to Windows Server 2003, and 2008 to 2008.   It will not work if you try to go from one to the other.

6 – While you can pre-configure many portions of this system, it will still take some time to run through a /recoverCMS procedure from start to finish, so if you need a second-stage failover, /recoverCMS isn’t the best bet.  I’m quite biased on this (see disclaimer below), but unless you can be down for a few hours if both cluster nodes fail, you might want to go with another tool to provide remote site failover in addition to SCC or CCR clustering.

7 – Finally, SCR and CCR will not automatically work with /recoverCMS.  You will need to stop SCR if it’s running before you recover, and neither will resume automatically after the recovery is done.  Once you’re set up in the new node configuration, re-enable CCR and SCR manually as required.
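A few of the items above (volumes, OS version, SCR state) are easy to turn into a pre-flight checklist you run before kicking off the recovery.  The sketch below is only an illustration in Python: the function name and its inputs are hypothetical, and you would have to gather the drive letters, OS versions and SCR state yourself from your own documentation and servers.

```python
def preflight_problems(original_drives, current_drives,
                       original_os, current_os, scr_running):
    """Return a list of human-readable problems; an empty list means go.

    All inputs are supplied by the caller (hypothetical example values):
    drive letters as strings like "E:", OS versions as strings like "2003".
    """
    problems = []

    # Item 2: every drive letter from the failed cluster must exist again
    have = {d.upper() for d in current_drives}
    missing = sorted({d.upper() for d in original_drives} - have)
    if missing:
        problems.append("missing volumes: " + ", ".join(missing))

    # Item 5: recovery must stay on the same Windows Server version
    if original_os != current_os:
        problems.append(f"OS mismatch: {original_os} -> {current_os}")

    # Item 7: SCR has to be stopped before the recovery starts
    if scr_running:
        problems.append("SCR is still running; stop it before recovery")

    return problems


if __name__ == "__main__":
    for line in preflight_problems(["E:", "F:"], ["E:"], "2003", "2008", True):
        print(line)
```

It doesn’t replace reading the list, but it does keep the easy mistakes from costing you a failed setup run.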

/RecoverCMS is a great way to restore a failed cluster system to new hardware or rebuilt hardware after a fault.  You still need to back up your data to some device outside the cluster itself, but once you have that backup /recoverCMS can get your cluster back up and running much faster than the manual methodologies used in previous versions of Exchange.

As to the funny story I mentioned at the top of the blog, this particular client was in a hardened datacenter with UPS systems, 24/7 staff and a backup generator.  They were convinced that clustering was going to be more than enough for them.  After trying to explain that a shared-disk cluster (the only option at the time) had weak points, I finally gave up and let them be.  A few months later I got a great phone call.  Apparently – unbeknownst to the client – the datacenter crew had run all power connections through the UPS – including the generator.  The UPS was rated to handle the full power load of the datacenter on 1 of its 2 redundant circuit loops.  So far so good.  Well, this particular datacenter was in the middle of the dot-com boom (this was some time ago) and had grown exponentially in a short period of time.  What they had was well over half the full expected load on each of the two circuits, and one was failing.  So they diligently got replacement parts and moved the load over to the good circuit.  Since circuit 1 was carrying over half the expected load, and circuit 2 was already carrying over half the load, they immediately overloaded the UPS, shorting it out.  The way it was explained to me, a solenoid shot through the casing of the UPS…and there was indeed a nice hole in the unit to back that up. No one was hurt, but needless to say, the whole datacenter was offline until they replaced the UPS, 4 days later, so they lost about one business week, without anything happening to the physical cluster at all.  Just goes to show you that anything that can go wrong, will.

posted by Mike Talon at 0 Comments

Wednesday, May 20, 2009

Outlook, can you hear me? Can you feel me near you?

Might be showing my age and/or taste in music with that particular title (and if you’re totally confused by it, check out this YouTube video), but I think that it’s a great way to describe an annoyance that can happen if you’re using versions of Outlook before Outlook 2007.  Since a large portion of the users of Office are on the 2003 version (and many even earlier than that), resolution to a new server during a disaster recovery event is a subject that is just as confusing as the famous rock opera I’m making use of in my title today.

When Outlook 2007 was introduced to the world with Exchange 2007, a lot was made (and rightfully so) of the new AutoDiscover features that this platform brought into the Enterprise Email marketplace.  The long and short of the AutoDiscover solution set is this:

When an Outlook 2007 client cannot find its home server – either because it is a brand new install of Outlook or because the home server has moved or been replaced – the Exchange 2007 AutoDiscover system can help Outlook 2007 find its home.  If the Outlook client can see an Exchange Server (or be directed to one by Active Directory), the Server can tell Outlook where the mailbox information for the user’s profile exists, and direct Outlook to connect to the appropriate CAS or Mailbox systems and get connected.  All the user/Admin has to do is tell Outlook the user’s email address and password, and AD with Exchange 2007 will handle it from there.  So if you’re installing Outlook for the first time, you don’t have to manually configure the Profile anymore – a great boon to Admins everywhere.

This system also kicks in if you perform Database Portability during a disaster, and have replicated the database with SCR; or have used a 3rd party disaster recovery/availability solution (see disclaimer below for all my bias information on that one =).  Once the Exchange system is responding again, AD can ferry the Outlook 2007 client to the new home for that mailbox, requiring only that the end-user close and reopen Outlook to complete the process.

However, what many folks do not realize right off the bat is that this solution set is ONLY available if you have both Outlook 2007 and Exchange 2007 as your messaging platform.  All users who need to take advantage of AutoDiscover must be using that combination of tools, and no other.  As you might expect, POP3 and IMAP systems do not AutoDiscover, but the majority of my clients were unaware that Outlook 2003 and earlier also cannot take advantage of this system, even if you have upgraded to Exchange 2007 as the messaging platform of choice.  It’s also worth noting that AutoDiscover doesn’t officially work in Exchange 2003 – no matter what Outlook version you are on.  Before I get blasted by mail on this one, I know some folks have sometimes seen it work on Outlook 2003 with Exchange 2007, but it bombs more than it works, and officially it’s not supported.  For proof, I direct you to this article by the MS Exchange Team.

Since the code to perform AutoDiscover wasn’t in Outlook 2003, users on that client software will not be able to dynamically re-link to the new Exchange server unless the original mailbox server is still responding.  If it is, then Outlook can find the new server via the original server and re-home itself.  If not, Outlook must be manually re-directed to the new server.

Of course, there are ways around this.  You could update DNS to re-direct anyone calling for “Server 1” to the IP Address of “Server 2” – effectively re-routing all client software including POP and IMAP.  Outlook 2003 will still need to be re-profiled unless you take over the Service Principal Name (SPN) of “Server 1” on “Server 2,” but it will be a smoother transition.  Using a 3rd party tool (see disclaimer below) you may have the option of automated DNS and SPN updates, which will allow even legacy Outlook clients to jump to the new server with no more intervention than is required on Outlook 2007 with Exchange 2007 – even if your whole system is on the 2003 versions of those software platforms or earlier.

So, you are not without lots of options if you have any legacy servers and/or clients – or non-MAPI clients – in your environment. You just need to be aware that the Exchange 2007/Outlook 2007 solutions for AutoDiscover services are not backward compatible, and plan accordingly.  Right now it looks like Exchange 2010 will have AutoDiscover that is backward compatible to Outlook 2007 only, so this soon-to-be-released platform is not going to solve this particular problem unless you’re planning on upgrading everything else in the environment to at least Outlook 2007 first.

Since I like to let folks continue to discover on their own, here’s a link to the White Paper from MSFT on AutoDiscover.

Finally, Update Roll-Up 8 for Exchange 2007 is out there, which can make life easier if you’re doing a fresh install of Exchange 2007 and want to get up to date with patches and fixes post SP1 quickly.  You can get it at this link.

posted by Mike Talon at 0 Comments

Thursday, May 14, 2009

Best of TechEd!

Double-Take Move (the new migration offering from Double-Take Software) won the coveted Best of TechEd award for System Management and Operations last night!  Yes, I am biased on this one (see disclaimer below), but I am really jazzed about snagging the award this year.  There were fewer attendees, and therefore fewer votes that all the nominees had to vie for.  It was a good campaign, we kept it clean, and the best product took home the prize!

Thanks to everyone who voted for us, and we’ll be ready to wow you with another spectacular product next year!

posted by Mike Talon at 0 Comments

Wednesday, May 13, 2009

Live from TechEd – first look Exchange 2010

Got a chance to talk with one of the leads over at the Exchange HA booth this year.  From what I was able to gather, 2010 will indeed have the ability to fail over a single storage group to another Exchange Server, locally or in another site; or both.  This is a welcome change from the more simplistic SCR we’re used to today for site failover and recovery. 

Based on a CCR solution set, Database Availability Groups will replicate using a modified variant of the SCR log shipping solution, but will use the CCR failover systems.  This will let recovery happen quickly to another server or another site for all Outlook 2007, AirSync, OWA and other Exchange based clients.

The lead did confirm that legacy Outlook clients will not move on their own if the primary Exchange server fails, so you’ll have to make DNS updates manually, or use a 3rd party failover solution set in conjunction with DAG solutions (see disclaimer below).

Also, the latest beta is feature complete, so what you see in the beta is what you will get in the RTM, plus or minus a few things for bug fixes, etc.

More reports as I get info.

posted by Mike Talon at 0 Comments

ClusterFunk rocked again!

And so once again this year I will put it to the management of most other companies in the world…

Tonight, the CEO of Double Take Software (along with our Director of Professional Services, Western Region Channel Manager, one of our Inside Sales Reps and a few guests) rocked out to a crowd of several hundred ClusterFunk fans.  What did your CEO do this evening?

For the third year in a row, Double-Take Software has rocked TechEd with our house band, ClusterFunk, and of course we invited a large portion of the folks visiting the conference.  We had folks from just about every vendor and a TON of our current and future clients; and we at Double-Take were thrilled to see each and every one having a great time!

Of course, the conference itself is still going on.  Day two was packed with folks going to seminars on various Microsoft technologies, and visiting the vendors in the Expo Hall too.  Your intrepid blogger got a chance to check out some of the new Windows Mobile 6.5 devices today.  So far, the interface has been updated to allow for gesture control (flick and you scroll up or down, for example).  Battery life remains unchanged though, which has always been a limiting factor for nearly every mobile device, but slightly more of an issue with WinMobile when it comes to Exchange ActiveSync technologies.  The HTC Touch Diamond2 seems to be leading the pack in terms of features and functions; however, it lacks a tactile keyboard, so it’s not for everyone.

I also got the chance to check out Virtual PC, which is finally being re-vamped and bundled with Windows 7.  More details to follow, but so far it looks like a scaled-down version of the Hyper-V technology shipping with Server 2008, which is a welcome departure from the architecture of Virtual PC in the past.  Better command and control systems and much more flexibility are both on the table for the new VPC systems.  You can be sure I’ll be asking more questions on it later in the week, and I’ll bravely try out the beta (on Windows 7’s release candidate) as soon as I can for you.

Yes, I will be taking a look at the Exchange 2010 area tomorrow. They’re located in the Office Technologies pavilion, which I did not have a chance to visit today.

More updates as the week goes on. Follow TalonNYC on Twitter or keep track of this blog for daily updates.

posted by Mike Talon at 0 Comments

Tuesday, May 12, 2009

Day one expo floor recap

Hi folks!  Day two will be starting for the DBTK Crew shortly, but day one at TechEd 2009 was great.  Pictures will be following near the end of the week.

The show floor was a little light on folks in the morning, but the evening session was quite packed.  Lots of both meeting new people and catching up with folks we saw last year.  Also got a visit from @Krewe and many of the Krewe members as well; great to finally see people we’ve only been tweeting to for the last few months.

Tonight, the ClusterFunk party kicks off at 8pm at the Conga Room, so if you’re at TechEd and haven’t picked up your wristband, we’re at booth 816.  Look for the Double-Take Bronze Sponsor banner, we’re – amazingly – right underneath it =)

More tomorrow, and keep an eye on @TalonNYC to follow events as they happen.

posted by Mike Talon at 0 Comments

Sunday, May 10, 2009

Made it to LA!

TechEd 2009 begins tomorrow, remember to follow TalonNYC on Twitter for updates.  I’ll try to post stuff here as well!  Should be a lighter turnout this year, but still a good show.

posted by Mike Talon at 0 Comments

Friday, May 8, 2009

Gearing up for TechEd

Howdy readers!  All next week I will be blogging and Twittering from the Double-Take Software booth.  So keep tuned in here for updates all through the week, and/or follow me on Twitter as TalonNYC.

I’ll try to keep everyone posted as much as I can during the event, but traditionally it has been a wild ride and a hectic schedule, so bear with me.

For those going, see you there!

posted by Mike Talon at 0 Comments

Monday, May 4, 2009

The Dread Pirate Re-Seed – Part 2

Last week, we talked about re-seeding operations between two nodes of an Active/Passive CCR Cluster.  Since both nodes will most likely be inside the same logical network (though are not required to be in Server 2008), even unexpected re-seed operations on a CCR Cluster should be relatively painless.  Though it’s a full copy to target, it is happening over a LAN, and therefore faster and less resource-intensive than a WAN copy would be.

This week, let’s look at how the game changes when you work with Standby Continuous Replication (SCR) and have to deal with re-seed operations.  Just to be clear, re-seeds are not the normal method for data protection with SCR. Normally, when a log file is closed and a new prime log (usually E00 for the first Storage Group) is created, the closed log is shipped over to a target Exchange 2007 server if SCR is enabled.  Once on the target, the log is held until the target reaches the number of logs specified in the log replay caching system – 50 by default.  After that, Exchange commits the logs to the database on the target, effectively providing SQL-like Log Shipping (or Continuous Data Mirroring, Asynchronous) between two Exchange 2007 Servers.
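To make the replay lag a bit more concrete, here is a toy Python model of the behavior just described.  It is only a sketch for building intuition, not how Exchange is implemented: the class name `ScrTarget` is made up, log generations are plain integers, and the only rule modeled is “hold the newest 50 logs, replay everything older.”

```python
from collections import deque


class ScrTarget:
    """Toy model of an SCR target that delays replay by a fixed log count."""

    def __init__(self, replay_lag_logs=50):
        self.replay_lag_logs = replay_lag_logs
        self.pending = deque()  # logs copied to the target, not yet replayed
        self.replayed = []      # generations committed to the target database

    def receive(self, generation):
        """Accept one shipped log and replay anything beyond the lag window."""
        self.pending.append(generation)
        # Keep only the newest replay_lag_logs logs unreplayed
        while len(self.pending) > self.replay_lag_logs:
            self.replayed.append(self.pending.popleft())


if __name__ == "__main__":
    target = ScrTarget()
    for gen in range(1, 61):  # ship 60 closed logs
        target.receive(gen)
    print(len(target.replayed), "replayed,", len(target.pending), "held back")
```

Ship 60 logs into this model and the first 10 are committed while the newest 50 wait their turn, which is the cushion SCR keeps between source and target.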

Whenever the two servers are not in communication with each other, there is the potential for operations to occur on the source that are not seen by the target, and vice versa.  Since this could easily result in a corrupted database, Exchange will check to ensure it knows what state the data on both servers is in before resuming normal SCR operations.  If the state is known, then SCR continues – transmitting all logs not yet on the target, committing all but the last 50 logs, and continuing on its way as normal.

If the state is unknown, a re-seed operation will need to occur.  More specifically if there is a gap in log file enumeration for some reason, you will require a re-seed.  You can see the reasons why re-seeds are required at the TechNet website listed here.
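The “gap in log file enumeration” idea is simple enough to sketch.  The Python below is just an illustration of the concept: it works on log generation numbers as plain integers (how you pull those from your own servers is up to you), and the function names are hypothetical.  Real SCR looks at more than a missing number before it demands a re-seed.

```python
def find_log_gaps(generations):
    """Return (first_missing, last_missing) ranges in a run of log generations."""
    ordered = sorted(set(generations))
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > 1:
            # Everything strictly between prev and cur never arrived
            gaps.append((prev + 1, cur - 1))
    return gaps


def needs_reseed(generations):
    """Simplified rule from this post: any gap in the sequence means re-seed."""
    return bool(find_log_gaps(generations))


if __name__ == "__main__":
    on_target = [1, 2, 3, 7, 8, 10]  # made-up example data
    print(find_log_gaps(on_target), needs_reseed(on_target))
```

If the logs that would fill the gap have already been truncated on the source, there is nothing left to ship, and a full copy is the only way back.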

The one that is of most concern to Exchange Admins is that if the two servers are not in communication during a backup window, you will have to re-seed the database before normal SCR operations can continue.  When speaking of a local SCR pair this is not a big issue, as connectivity will be much more solid than WAN performance, and even if a re-seed is required, it will be relatively fast.  But across a WAN, it is more likely that you can occasionally suffer WAN outages that do not interrupt business operations.  If these outages extend past the time of the backup window, the backup tool will most likely truncate logs committed to tape without SCR being able to transmit those changes to the target Exchange Server.

Since minor network outages are a common (though hopefully not very common) issue with modern networks, the likelihood of requiring a re-seed due to this sequence of events is something that could be considered a normal part of Exchange SCR operations.  If your databases are small, then it won’t be an issue.  For larger databases, remember that a re-seed operation is a full copy of all data files to the target device from source, which could be problematic depending on your WAN throughput.
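If you want a rough feel for how painful a WAN re-seed would be, the arithmetic is just size divided by throughput.  Here is a small Python sketch of that estimate.  The function name and the `usable_fraction` knob are my own assumptions for illustration; it ignores compression, protocol overhead and competing traffic beyond that single fudge factor.

```python
def reseed_hours(db_size_gb, wan_mbps, usable_fraction=1.0):
    """Estimate hours to copy db_size_gb gigabytes over a wan_mbps link.

    usable_fraction is the share of the pipe the copy actually gets
    (an assumed planning knob, 1.0 = the whole link).
    """
    if db_size_gb < 0 or wan_mbps <= 0 or not 0 < usable_fraction <= 1:
        raise ValueError("size must be >= 0, speed > 0, fraction in (0, 1]")
    bits_to_copy = db_size_gb * 8 * 1000 ** 3           # decimal GB to bits
    bits_per_second = wan_mbps * 1_000_000 * usable_fraction
    return bits_to_copy / bits_per_second / 3600


if __name__ == "__main__":
    # Made-up example: 200 GB database, 10 Mbps link, half the pipe available
    print(round(reseed_hours(200, 10, usable_fraction=0.5), 1), "hours")
```

A 200 GB database on a dedicated 10 Mbps link works out to roughly 44 hours of copying, and that is with the whole pipe to itself, so run the numbers for your own databases before you need them.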

So with SCR, re-seeds do become a definite issue to contend with.  Since the occurrence of re-seed operations should be limited, you may be able to keep the systems successfully in sync without too much trouble. However, if you have larger databases or smaller WAN pipes, re-seeds can create problems for your network, especially if you use a backup tool that truncates your logs, or use circular logging for any reason.

Of course, there are alternatives to SCR for WAN protection and availability (see disclaimer below, I’m biased here), so you are not out of luck in terms of WAN operations with Exchange 2007.  Locally, CCR is a spectacular choice for nearly any sized Exchange system, with the exception of very large (over 1TB) datasets that may just take too long to re-seed even over a LAN.  Remotely, if you have properly planned for re-seeds then you will also be able to successfully utilize SCR, but if you do not – or cannot – plan for these required operations, then you’ll risk a failure in your failsafe system, which is not a great situation to be in.

In short, re-seeds are a normal part of SCR operations. They should not happen every day (or even every week) but you will most likely experience them over the course of your SCR lifetime, so plan for them now.

 

Next week is TechEd 2009!  I’ll be out in Los Angeles with the Double-Take Software crew, so please stop by and say hi. I’ll also be blogging from the event, so keep an eye on this column and follow me on Twitter at http://twitter.com/talonnyc.

posted by Mike Talon at 0 Comments