beingexchanged: Exchange gets tired.

Thursday, June 4, 2009

Exchange gets tired.

I’m not talking about getting old here, but rather about exhaustion, which can happen even in newer versions of Exchange Server. Checkpoint exhaustion is a function of the ESE (Extensible Storage Engine) database engine used in Exchange, and it is designed to stop the system from corrupting itself if too many database transactions get held up in the logs by anything.

The normal sequence of events is well known for most relational databases.  Information added to a database is sent to a so-called Lazy Page Writer, which minimizes disk access by aggregating I/O operations.  While this happens, a write-ahead logging system tracks the operations so that they can be safely replayed if the Lazy Writer data is lost.  As the transactions are committed to the database itself, the log systems checkpoint them, essentially noting that they’ve been safely committed to the database.  For most versions of Exchange Server this happens every 20MB or so (that’s 20MB of log data, not database writes).
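To make that sequence concrete, here is a toy model of the log / lazy writer / checkpoint relationship. It is only a sketch: the class and method names are invented for illustration, and real ESE internals are far more involved than a few Python dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStore:
    """Toy write-ahead-logged store; NOT how ESE is actually implemented."""
    log: list = field(default_factory=list)       # every change, in order
    dirty: dict = field(default_factory=dict)     # cached changes awaiting flush
    database: dict = field(default_factory=dict)  # what is safely on disk
    checkpoint: int = 0                           # log entries known to be flushed

    def write(self, key, value):
        # Log first, then cache: the log is the recovery source of truth.
        self.log.append((key, value))
        self.dirty[key] = value

    def lazy_flush(self):
        # Aggregate cached changes into one pass over the database,
        # then advance the checkpoint past everything just flushed.
        self.database.update(self.dirty)
        self.dirty.clear()
        self.checkpoint = len(self.log)

    def crash(self):
        # Losing the cache loses nothing permanent: the log still has it.
        self.dirty.clear()

    def recover(self):
        # Replay only the log entries written after the checkpoint.
        for key, value in self.log[self.checkpoint:]:
            self.database[key] = value
        self.checkpoint = len(self.log)
```

Write two values, flush after the first, "crash" after the second, and `recover()` brings the second one back from the log alone. The gap between `checkpoint` and `len(log)` is the part that matters for the rest of this post.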

When things do not go according to schedule, the logging systems can use emergency procedures to recover data that was in the Lazy Writer but not yet committed to the database.  Some, like the normal startup process for an Exchange database, run automatically.  Others, like ESEUtil rescue operations, are done manually, and hopefully avoided entirely.  But the point at which the logs jam up so badly that the system forces emergency procedures can be more difficult to explain.

The short answer is that Exchange can get up to 1,008 transaction log files ahead of the checkpoint before Checkpoint Exhaustion hits.  This is the point where Exchange can no longer just keep writing to the logs and still know that those transactions can be safely recovered even if the Lazy Writer doesn’t have the transactional data anymore.  If the system sees that more than 1,008 logs have been written without being checkpointed, it will forcibly dismount that database, which in this case means the Storage Group and its related Stores.
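The rule itself is simple arithmetic, sketched below. The function names are made up for this post, and measuring the backlog in log generations (the sequence number of each log file) is an assumption of the sketch; the only number taken from above is 1,008.

```python
MAX_CHECKPOINT_DEPTH = 1008  # the limit quoted in this post


def checkpoint_depth(current_generation: int, checkpoint_generation: int) -> int:
    """How far the newest log has run ahead of the checkpoint."""
    if checkpoint_generation > current_generation:
        raise ValueError("checkpoint cannot be ahead of the current log")
    return current_generation - checkpoint_generation


def must_dismount(current_generation: int, checkpoint_generation: int) -> bool:
    """True once the un-checkpointed backlog exceeds the limit."""
    depth = checkpoint_depth(current_generation, checkpoint_generation)
    return depth > MAX_CHECKPOINT_DEPTH
```

So a checkpoint sitting at generation 5,000 is fine while the current log is 6,008, and in trouble at 6,009.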

Now, when does Checkpoint Exhaustion happen?  That’s a bit trickier to explain, but there are a few definite times you might see it.

1 – The system is horrifically under-powered in terms of how fast it can get data out of the Lazy Writer and onto disk.  This is rare, but can happen if there is a failing RAID controller or other disk device.

2 – Something has placed a lock on log files or databases.  In this case, the offending application is stopping Exchange from writing to either a log or database, or at the very least stopping the checkpointing process from happening regularly. Since this means that Exchange cannot successfully checkpoint the database, logs continue to build and build without being “caught up.” This is also not common, but can occur with management tools that go awry.

3 – Backup systems don’t work as expected.  With most Exchange-aware backup tools, the backup system prevents Exchange from checkpointing during the backup operation.  Normally, this is a good thing.  Checkpoint prevention means that no new data will be officially committed while the backup is in the process of…well…backing up.  This means that a point-in-time tape or disk backup can be transactionally intact, even if it cannot maintain strict write integrity native to the backup solution.  Since the Lazy Writer and the log files both have copies of the information, as long as the block is lifted at the end of the backup, all is safe and all is well.

However, if the backup barfs and never lifts the lock, you could have a major issue.  Since no new checkpoints can be set, un-checkpointed log data starts piling up, and eventually you’ll hit the 1,008 limit and see a forced dismount happen on that Storage Group.  Luckily, restarting the Exchange services nearly always clears this problem, and since the logs still have a copy of all un-checkpointed data, you’re not going to lose anything but time.
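How much time you have after a backup hangs depends on how busy the server is. Here is a back-of-the-envelope estimate; the function name and the idea of measuring load in log generations per hour are assumptions for illustration, not anything Exchange reports in this form.

```python
import math

MAX_CHECKPOINT_DEPTH = 1008  # the limit quoted in this post


def hours_until_dismount(current_depth: int, logs_per_hour: float) -> float:
    """Rough hours left while the checkpoint is frozen and logs keep coming."""
    if logs_per_hour <= 0:
        return math.inf  # nothing is being logged, so the backlog never grows
    remaining = MAX_CHECKPOINT_DEPTH - current_depth
    return max(remaining, 0) / logs_per_hour
```

A server already 8 logs deep and generating 100 logs an hour has about 10 hours of headroom. A quiet server may survive a stuck backup for days; a busy one may not make it through the night.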

Checkpoint Exhaustion was built to solve a particularly thorny problem.  The earlier versions of Exchange Server that are still supported had a hard-coded limit of 1,024 outstanding logs before the database would fault (not just go offline).  That meant that if the Checkpoint Depth hit that level, you could lose data.  So the exhaustion safeguard lets the database shut down safely without losing track of any data, which is not great, but definitely better than data loss.

So, to avoid exhaustion, there are several things you can do.

1 – Make sure you’re not overloading your Exchange Servers.  Stay within the best-practice guidelines for your version of Microsoft Exchange, which can be found on TechNet, and be sure you periodically check for hardware issues.

2 – Try to minimize the use of applications that could lock the Exchange Server file sets.  Third-party, non-Exchange-aware indexing tools are one example.

3 – Make sure your backup runs are completing properly. One bad run that never got caught can snowball into an exhaustion situation easily.  If you do have a bad backup run, declare a maintenance window of 20-30 minutes and restart the Exchange services, or even gracefully bounce the box.
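If you want an early warning rather than a surprise dismount, one option is to watch the backlog yourself. The sketch below is hypothetical: it assumes log files are named with a base name such as "E00" followed by 5 or 8 hexadecimal digits, that you have already obtained the checkpoint generation from somewhere (it is just a parameter here), and that half the limit is a sensible place to raise an alarm.

```python
import re

# Assumed naming: base name like "E00", then 5 or 8 hex digits, then ".log".
LOG_NAME = re.compile(r"^E\d{2}([0-9A-Fa-f]{5}|[0-9A-Fa-f]{8})\.log$")
MAX_CHECKPOINT_DEPTH = 1008  # the limit quoted in this post
WARN_FRACTION = 0.5          # hypothetical alert threshold


def highest_generation(file_names):
    """Largest generation number found in the file names, or None."""
    matches = (LOG_NAME.match(name) for name in file_names)
    generations = [int(m.group(1), 16) for m in matches if m]
    return max(generations, default=None)


def backlog_status(file_names, checkpoint_generation):
    """Classify the backlog as 'ok', 'warning', 'exhausted' or 'unknown'."""
    newest = highest_generation(file_names)
    if newest is None:
        return "unknown"  # no numbered log files were recognised
    depth = newest - checkpoint_generation
    if depth > MAX_CHECKPOINT_DEPTH:
        return "exhausted"
    if depth >= MAX_CHECKPOINT_DEPTH * WARN_FRACTION:
        return "warning"
    return "ok"
```

Feed it a directory listing (for example the names returned by `os.listdir` on the log folder) on a schedule, and treat "warning" as the cue to go check whether last night's backup ever finished.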

 

Don’t forget, if you have an Exchange related topic you’d like to ask about, drop an email to miketalonnyc@gmail.com .  Next update should be on Monday or Tuesday of next week, back to my usual schedule. 

A special hello to Larry Meister at Mercury Networks in Rochester, NY, thanks for inviting me to speak at the White Hat Security Day conference.  It was a great event, as usual!  If you’re looking for a partner for your IT needs in Western NY State, Larry is your guy.

posted by Mike Talon at
