Thursday, December 15, 2011

BLOG POST: Centralising a Backup Solution for a Distributed SQL Server 2005 Infrastructure

Recently, following a number of alerts on one of my client's SQL boxes, I discovered that their backup methodology was slipshod at best.  Eight servers in total had a mixture of SIMPLE recovery and FULL recovery models, weekly, ad-hoc and transaction log backups, no backup management in place other than poorly configured SSIS packages in SSMS Maintenance plans and no method to perform point-in-time restores (or even guarantee the validity of the 'last known good' backup set).
This raised an interesting question.

What is the 'best' way to configure backups?

Arguably it's entirely down to the requirements of the customer - which boils down to two questions.  How much data can you afford to lose?  And how long can you wait for a data restore?  These two questions will clearly depend on the industry in question - a car assembly plant, for example, will require a point-in-time restore capability carried out in the shortest possible period of time.  On the other hand, 'Nan's Cake Shop' with a SQL Server Express backend to an order management application will probably be able to cope for a week or two unhindered.

My client's requirements were ambiguous - while they were able to continue without their IT for a short period in the event of data loss, the ability to restore was important.  And although they are a non-profit organisation, there would be financial and commercial implications to the loss of any database functionality.

The physical structure was one increasingly familiar to many DBAs today - VMWare using ESX.  Six of the seven affected servers were VM, with just one physical box.  All seven shared a SAN and a Cisco switch.  There were four physical ESX servers with the entry point (using VSphere) on a different server, currently the recipient of backup files from other machines and a data repository.  Microsoft's official stance is one of non-support for ESX-based VMWare, but they are coming around to the idea of virtualisation with technology such as SQL Azure, and extended support for the 'cloud' with Denali.

For my client, I chose a centralised design based on the classical 'star' formation familiar to anyone who's ever taken a networking course.  At the centre, I chose the server with the largest amount of assigned HDD space (VM, remember!) with the 'satellites' being the SQL Servers, which all share a domain.  For each satellite, I configured a Maintenance Plan comprising of six sub-plans - one, for a full local backup.  Two, for a local transaction log backup.  Three, for a maintenance cleanup for local full backup files.  Four, for the maintenance cleanup of local transaction log backup files.  Five, for the remote cleanup of the hub server full backup files.  And six, for the remote cleanup of the hub server transaction log backup files.  Any databases in SIMPLE mode went into the FULL recovery model.

Next, I scheduled each sub-plan according to purpose.  The transaction log files were scheduled for local backup every 15 minutes at 00, 15, 30, 45.  The full backups were scheduled daily at 02:00.  The cleanups were scheduled daily with a 4-day retention (at 01:00 to prevent disk use spike) for full backups, hourly with a 2-day retention for transaction log backups, daily at 03:00 with a 10-day retention for remote full backups and hourly with a 5-minute offset for remote transaction log backup cleanup (05, 20, 35, 50). 

Next, I wrote a couple of simple batch files to XCOPY /E /Y from the source satellite directories to the target hub directory, and mapped the appropriate network drives, scheduling the batch files to run daily (for full backups) and every 15 minutes with a 10-minute offset (00, 10, 20, 30, 40, 50) for transaction log files.  These jobs were to load the backup files onto the hub disk storage (which is actually on the same SAN as the satellites!)

Finally, I configured a range of alerts and operators in SQL Server on each satellite to send e-mail push notifications out should the appropriate BACKUP FAILED event be captured in the error log - these alerts will also notify on disk space consumption, which is the largest risk of my approach.  So no need for anyone to go and check these backups more than once a quarter.

Careful backup management will normally depend on interaction with your SAN admin to determine the best solution.  There are many flaws with mine - for example, the constant flow of data actually travels out into the domain (past the switch) to the domain controller where it is re-routed back to servers which share the same disks!  Although what I've put together works, it won't work for everyone and while it will scale, it will do so at the expense of bandwidth.  It also has redundant feedback loops, with every satellite responsible for tidying the hub when just one could suffice.  Others more imaginative than I will quickly point out that this system could have been administered using Policy Management, PowerShell or any one of a number of other methods.  No approach is invalid (unless it actively causes harm) but the best interests of the clients' systems should always be borne in mind.

Many DBAs in the field will find themselves dealing with a wide variety of different systems often using very different technology.  In a previous role, I have found myself dealing with Oracle 7(!) databases on an old, unwieldy Sun Solaris server whose only job was to act as a message broker.  In the very next call, I'd be dealing with a BladeServer running Windows Server 2008 R2 and supporting a dozen different MS SQL Server instances.  So the answer to the question asked earlier - what's the 'best' way of configuring a backup solution - there is no 'best' way, although there are certainly best practices and principles to be followed. 

I believe that the most important of these are scalability, reliability and fitness for purpose.

More information on configuring SQL Server backups can be found here:

No comments:

Post a Comment