beingexchanged: Can I get a Witness?

Wednesday, November 4, 2009

Can I get a Witness?

Cluster Continuous Replication (CCR) in Exchange 2007 allows the two nodes of a Distributed Failover Cluster (DFC) for Exchange to be held in different physical locations and on different physical network segments.  This is a good thing to leverage if you’re not concerned with local High Availability, but it can lead to some interesting issues if something goes wrong.  The two nodes use their quorum resources to determine which server should be in control of the cluster (and therefore assign resources accordingly), but that doesn’t help if the nodes cannot see each other due to a network failure.

One of two things could happen in the situation described so far.  The first is split brain, where both nodes think they’re in charge and both bring up Exchange resources.  This can take hours or even days of manual work to fix, so Microsoft has taken steps to prevent it: if either node can’t figure out who’s supposed to be in charge, both go offline to avoid split brain at all costs.

The second potential situation is the opposite: neither node can figure out who is in control, so both shut down.  While this doesn’t put your data in danger, it does effectively shut off your Exchange system, stopping all message flow.  Neither situation is good, but by default, if arbitration is not possible via quorum or other means, this safer outcome is the one that occurs.  Luckily, there are “other means,” specifically the File Share Witness (FSW).

The FSW is a file share (as its name implies) that both nodes can see under normal circumstances.  It must be placed on a server that isn’t part of the cluster.  Usually you find it on a file server within the environment, but be aware that the host must run Windows Server 2003 SP1 or later.  The FSW should also be placed either locally to the preferred node (the one you want to “win” in the event of arbitration) or in an independent location that can be seen from both networks where the CCR nodes reside.

A CCR cluster has only two nodes, so right off the bat, if a communication failure or other emergency triggers an arbitration event, neither node can gain a majority on its own and take over.  The FSW acts as a third resource (in effect, a tie-breaking vote) that can be polled to find out who is in control.  Both nodes will attempt to take ownership of the FSW, but due to the physical placement of the witness server, only one will succeed.  That node stays online as owner of the cluster; the other node keeps its resources from going live until the emergency has been resolved.  As you can see, placement of the FSW is critical to the overall success of this arbitration system.
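The majority arithmetic behind this is easy to sketch. The snippet below is only an illustration of the voting idea, not the actual cluster service logic; the function name `has_majority` is made up for this example.

```python
def has_majority(votes_held: int, total_votes: int) -> bool:
    """A partition may own the cluster only if it holds a strict majority."""
    return votes_held > total_votes / 2


# Two nodes, no witness: an isolated node holds 1 of 2 votes.
print(has_majority(1, 2))  # False, so neither node can take over

# Two nodes plus the FSW: 3 votes in total.
print(has_majority(2, 3))  # True, the node that also holds the witness wins
print(has_majority(1, 3))  # False, the node without the witness stays down
```

With only two votes a tie is the best an isolated node can do, which is why the witness is needed as a third vote.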

If you have only two physical locations, your best bet is to place the FSW on a server in the secondary site.  This allows the cluster to properly arbitrate to the remote site if the production site goes offline.  If you have more than two locations, you can place the FSW on a server at a third location; just make sure connectivity between that site and both CCR servers is stable and constant.  If that link is unstable to one or more sites, you can trigger accidental arbitration events when they’re not really needed.  The benefit of putting the FSW at a third site is that you can survive a link outage at either CCR node location without having to manually force one node or the other to take control (called a Force Quorum Operation).

Here’s an example of what I mean.  If you have only two sites and place the FSW at Site 2, a network link failure at Site 1 would force arbitration to Site 2, since the CCR node at Site 1 would not be able to communicate with either the node at Site 2 or the FSW hosted there.  In this scenario there may be no value in failing over to Site 2, but you would fail over automatically anyway.  If, however, the FSW is hosted at a third site that both sites can see, then a network fault between Site 1 and Site 2 would not flip everything to Site 2.  Since Site 1 is the preferred owner and can maintain control of the FSW, it stays in control of the cluster.
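To make those scenarios concrete, here is a small simulation. It is a simplified sketch under a few assumptions of my own: the node and site names are hypothetical, a link is either fully up or fully down, and when both nodes can still reach the witness the current owner is assumed to keep it. It models the reasoning above and is not how the cluster service is implemented.

```python
# Hypothetical node-to-site layout for a two-node CCR cluster.
NODES = {"node1": "site1", "node2": "site2"}


def can_reach(site_a, site_b, down_links):
    """Two sites can talk if they are the same site or their link is up."""
    return site_a == site_b or frozenset((site_a, site_b)) not in down_links


def cluster_owner(fsw_site, down_links, current_owner="node1"):
    """Return the node that stays online, or None if both must go offline."""
    down = {frozenset(link) for link in down_links}
    if can_reach("site1", "site2", down):
        # The nodes still see each other: no arbitration is needed.
        return current_owner
    # The nodes are partitioned, so each needs the witness as a second vote.
    candidates = [n for n, site in NODES.items() if can_reach(site, fsw_site, down)]
    if not candidates:
        return None  # nobody reaches the witness: both nodes shut down
    # Assumption: a current owner that still reaches the witness keeps it.
    return current_owner if current_owner in candidates else candidates[0]


# FSW at Site 2, Site 1 loses its link: the cluster fails over to Site 2.
print(cluster_owner("site2", [("site1", "site2")]))  # node2

# FSW at a third site, only the Site 1 to Site 2 link fails: Site 1 stays owner.
print(cluster_owner("site3", [("site1", "site2")]))  # node1

# FSW at a third site, Site 1 is cut off entirely: Site 2 takes over.
print(cluster_owner("site3", [("site1", "site2"), ("site1", "site3")]))  # node2
```

The second case is the one that argues for a third site: the same inter-site fault that forces a failover in the first case leaves Site 1 in control.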

You can find out a lot more about configuring FSW for Exchange 2007 via this TechNet article.  The use of FSW technology is mandatory for CCR, and will continue to be a good idea for Exchange 2010 and Database Availability Groups as well.  Learning how this technology works today will allow you to create redundant solutions that carry over to your future Exchange deployments.

posted by Mike Talon at
